linux-kernel.vger.kernel.org archive mirror
* [patch 0/6] vfsmount scaling and other bits
@ 2009-10-15  4:40 npiggin
  2009-10-15  4:40 ` [patch 1/6] fs: invalidate sb->s_bdev on remount,ro npiggin
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: npiggin @ 2009-10-15  4:40 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel

Hi Al,

Finally got around to gathering some numbers and a few comments on vfsmount
scaling. The patch looks a bit nicer now, and the numbers aren't too bad, I think.

I also broke the brlock stuff into its own header... it's a bit ugly, but I
couldn't see a nice way to do the per-CPU stuff without adding indirection and
the memory allocation of dynamic per-CPU data. Anyone like to yell at me
for this, or suggest a better approach?

I also put some other reasonable looking bits and pieces in there which have
filtered to the bottom of my vfs stack, as there are some clashes otherwise.
If you want me to stop sending them to you, could you merge them or tell me why
I'm an idiot, please?

Thanks,
Nick



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 1/6] fs: invalidate sb->s_bdev on remount,ro
  2009-10-15  4:40 [patch 0/6] vfsmount scaling and other bits npiggin
@ 2009-10-15  4:40 ` npiggin
  2009-10-15  4:40 ` [patch 2/6] fs: no games with DCACHE_UNHASHED npiggin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: npiggin @ 2009-10-15  4:40 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel

[-- Attachment #1: fs-remount-coherency.patch --]
[-- Type: text/plain, Size: 3144 bytes --]

Fixes a problem reported by "Jorge Boncompte [DTI2]" <jorge@dti2.net>
who is seeing corruption trying to snapshot a minix filesystem image.
Some filesystems modify their metadata via a path other than the bdev
buffer cache (eg. they may use a private linear mapping for their
metadata, or implement directories in pagecache, etc). Also, file
data modifications usually go to the bdev via their own mappings.

These updates are not coherent with buffercache IO (e.g. via /dev/bdev)
and never have been. However, there is a reasonable expectation that after
a mount -o remount,ro operation the buffercache will subsequently be
coherent with previous filesystem modifications.

So invalidate the bdev mappings on a remount,ro operation to provide
a coherency point.

The problem was exposed when we switched the old rd driver to brd, because old
rd did not really function like a normal block device: updates to rd via
mappings other than the buffercache would still end up in its buffercache.
But the same problem has always affected other "normal" block devices,
including loop.

Reported-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Tested-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 fs/super.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -568,7 +568,7 @@ out:
 int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 {
 	int retval;
-	int remount_rw;
+	int remount_rw, remount_ro;
 
 	if (sb->s_frozen != SB_UNFROZEN)
 		return -EBUSY;
@@ -583,9 +583,12 @@ int do_remount_sb(struct super_block *sb
 	shrink_dcache_sb(sb);
 	sync_filesystem(sb);
 
+	remount_ro = (flags & MS_RDONLY) && !(sb->s_flags & MS_RDONLY);
+	remount_rw = !(flags & MS_RDONLY) && (sb->s_flags & MS_RDONLY);
+
 	/* If we are remounting RDONLY and current sb is read/write,
 	   make sure there are no rw files opened */
-	if ((flags & MS_RDONLY) && !(sb->s_flags & MS_RDONLY)) {
+	if (remount_ro) {
 		if (force)
 			mark_files_ro(sb);
 		else if (!fs_may_remount_ro(sb))
@@ -594,7 +597,6 @@ int do_remount_sb(struct super_block *sb
 		if (retval < 0 && retval != -ENOSYS)
 			return -EBUSY;
 	}
-	remount_rw = !(flags & MS_RDONLY) && (sb->s_flags & MS_RDONLY);
 
 	if (sb->s_op->remount_fs) {
 		retval = sb->s_op->remount_fs(sb, &flags, data);
@@ -604,6 +606,14 @@ int do_remount_sb(struct super_block *sb
 	sb->s_flags = (sb->s_flags & ~MS_RMT_MASK) | (flags & MS_RMT_MASK);
 	if (remount_rw)
 		vfs_dq_quota_on_remount(sb);
+	 /* Some filesystems modify their metadata via some other path
+	    than the bdev buffer cache (eg. use a private mapping, or
+	    directories in pagecache, etc). Also file data modifications
+	    go via their own mappings. So if we try to mount readonly
+	    then copy the filesystem from bdev, we could get stale data,
+	    so invalidate it to give a best effort at coherency. */
+	if (remount_ro && sb->s_bdev)
+		invalidate_bdev(sb->s_bdev);
 	return 0;
 }
 




* [patch 2/6] fs: no games with DCACHE_UNHASHED
  2009-10-15  4:40 [patch 0/6] vfsmount scaling and other bits npiggin
  2009-10-15  4:40 ` [patch 1/6] fs: invalidate sb->s_bdev on remount,ro npiggin
@ 2009-10-15  4:40 ` npiggin
  2009-10-15  6:31   ` David Miller
  2009-10-15  7:44   ` Eric Dumazet
  2009-10-15  4:40 ` [patch 3/6] fs: dcache remove d_mounted npiggin
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: npiggin @ 2009-10-15  4:40 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel, Miklos Szeredi

[-- Attachment #1: unhashed-d_delete.patch --]
[-- Type: text/plain, Size: 4438 bytes --]

(this is in -mm)

Filesystems outside the regular namespace do not have to clear DCACHE_UNHASHED
in order to have a working /proc/$pid/fd/XXX. Nothing in proc prevents the
fd link from being used if its dentry is not in the hash.

Also, clearing DCACHE_UNHASHED does not by itself put the dentry into the
dcache hash; that only happens when the filesystem calls d_add or d_rehash.

So delete the misleading comments and needless code.

Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/anon_inodes.c |   16 ----------------
 fs/pipe.c        |   18 ------------------
 net/socket.c     |   19 -------------------
 3 files changed, 53 deletions(-)

Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c
+++ linux-2.6/fs/pipe.c
@@ -887,17 +887,6 @@ void free_pipe_info(struct inode *inode)
 }
 
 static struct vfsmount *pipe_mnt __read_mostly;
-static int pipefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
 
 /*
  * pipefs_dname() is called from d_path().
@@ -909,7 +898,6 @@ static char *pipefs_dname(struct dentry
 }
 
 static const struct dentry_operations pipefs_dentry_operations = {
-	.d_delete	= pipefs_delete_dentry,
 	.d_dname	= pipefs_dname,
 };
 
@@ -969,12 +957,6 @@ struct file *create_write_pipe(int flags
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
 	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c
+++ linux-2.6/net/socket.c
@@ -306,18 +306,6 @@ static struct file_system_type sock_fs_t
 	.kill_sb =	kill_anon_super,
 };
 
-static int sockfs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
-
 /*
  * sockfs_dname() is called from d_path().
  */
@@ -328,7 +316,6 @@ static char *sockfs_dname(struct dentry
 }
 
 static const struct dentry_operations sockfs_dentry_operations = {
-	.d_delete = sockfs_delete_dentry,
 	.d_dname  = sockfs_dname,
 };
 
@@ -377,12 +364,6 @@ static int sock_attach_fd(struct socket
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
 	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c
+++ linux-2.6/fs/anon_inodes.c
@@ -35,24 +35,11 @@ static int anon_inodefs_get_sb(struct fi
 			     mnt);
 }
 
-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * We faked vfs to believe the dentry was hashed when we created it.
-	 * Now we restore the flag so that dput() will work correctly.
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 1;
-}
-
 static struct file_system_type anon_inode_fs_type = {
 	.name		= "anon_inodefs",
 	.get_sb		= anon_inodefs_get_sb,
 	.kill_sb	= kill_anon_super,
 };
-static const struct dentry_operations anon_inodefs_dentry_operations = {
-	.d_delete	= anon_inodefs_delete_dentry,
-};
 
 /*
  * nop .set_page_dirty method so that people can use .page_mkwrite on
@@ -117,9 +104,6 @@ struct file *anon_inode_getfile(const ch
 	 */
 	atomic_inc(&anon_inode_inode->i_count);
 
-	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
 	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;




* [patch 3/6] fs: dcache remove d_mounted
  2009-10-15  4:40 [patch 0/6] vfsmount scaling and other bits npiggin
  2009-10-15  4:40 ` [patch 1/6] fs: invalidate sb->s_bdev on remount,ro npiggin
  2009-10-15  4:40 ` [patch 2/6] fs: no games with DCACHE_UNHASHED npiggin
@ 2009-10-15  4:40 ` npiggin
  2009-10-15 10:37   ` Ian Kent
  2009-10-15  4:40 ` [patch 4/6] brlock: introduce special brlocks npiggin
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: npiggin @ 2009-10-15  4:40 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel

[-- Attachment #1: fs-dcache-remove-d_mounted.patch --]
[-- Type: text/plain, Size: 8159 bytes --]

Rather than keep a d_mounted count in the dentry (which is only used to
speculatively take a look in the mount hash table if it is non-zero), set a
dentry flag instead. The flag can be cleared by checking the hash table to see
if there are any mounts left. It is not time critical because it is performed
at detach time.

This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I
might use later (and some configs have larger than 32-bit spinlocks which might
make use of the hole).

Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>:
In autofs4, when expiring direct (or offset) mounts we need to ensure that we
block user path walks into the autofs mount, which is covered by another mount.
To do this we clear the mounted status so that follows stop before walking into
the mount and are essentially blocked until the expire is completed. The
automount daemon still finds the correct dentry for the umount due to the
follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as
an inode operation for direct and offset mounts only and is called following
the lookup that stopped at the covered mount.

At the end of the expire the covering mount probably has gone away so the
mounted status need not be restored. But we need to check this and only restore
the mounted status if the expire failed.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/autofs4/expire.c    |   13 +++++++++++--
 fs/dcache.c            |    1 -
 fs/namespace.c         |   17 ++++++++++++++---
 include/linux/dcache.h |   42 +++++++++++++++++++++---------------------
 4 files changed, 46 insertions(+), 27 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -947,7 +947,6 @@ struct dentry *d_alloc(struct dentry * p
 	dentry->d_sb = NULL;
 	dentry->d_op = NULL;
 	dentry->d_fsdata = NULL;
-	dentry->d_mounted = 0;
 	INIT_HLIST_NODE(&dentry->d_hash);
 	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -467,6 +467,15 @@ static void __touch_mnt_namespace(struct
 	}
 }
 
+static void dentry_reset_mounted(struct vfsmount *mnt, struct dentry *dentry)
+{
+	if (!__lookup_mnt(mnt, dentry, 0)) {
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags &= ~DCACHE_MOUNTED;
+		spin_unlock(&dentry->d_lock);
+	}
+}
+
 static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
 {
 	old_path->dentry = mnt->mnt_mountpoint;
@@ -475,7 +484,7 @@ static void detach_mnt(struct vfsmount *
 	mnt->mnt_mountpoint = mnt->mnt_root;
 	list_del_init(&mnt->mnt_child);
 	list_del_init(&mnt->mnt_hash);
-	old_path->dentry->d_mounted--;
+	dentry_reset_mounted(old_path->mnt, old_path->dentry);
 }
 
 void mnt_set_mountpoint(struct vfsmount *mnt, struct dentry *dentry,
@@ -483,7 +492,9 @@ void mnt_set_mountpoint(struct vfsmount
 {
 	child_mnt->mnt_parent = mntget(mnt);
 	child_mnt->mnt_mountpoint = dget(dentry);
-	dentry->d_mounted++;
+	spin_lock(&dentry->d_lock);
+	dentry->d_flags |= DCACHE_MOUNTED;
+	spin_unlock(&dentry->d_lock);
 }
 
 static void attach_mnt(struct vfsmount *mnt, struct path *path)
@@ -1015,7 +1026,7 @@ void umount_tree(struct vfsmount *mnt, i
 		list_del_init(&p->mnt_child);
 		if (p->mnt_parent != p) {
 			p->mnt_parent->mnt_ghosts++;
-			p->mnt_mountpoint->d_mounted--;
+			dentry_reset_mounted(p->mnt_parent, p->mnt_mountpoint);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h
+++ linux-2.6/include/linux/dcache.h
@@ -83,14 +83,13 @@ full_name_hash(const unsigned char *name
 #ifdef CONFIG_64BIT
 #define DNAME_INLINE_LEN_MIN 32 /* 192 bytes */
 #else
-#define DNAME_INLINE_LEN_MIN 40 /* 128 bytes */
+#define DNAME_INLINE_LEN_MIN 44 /* 128 bytes */
 #endif
 
 struct dentry {
 	atomic_t d_count;
 	unsigned int d_flags;		/* protected by d_lock */
 	spinlock_t d_lock;		/* per dentry lock */
-	int d_mounted;
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
 					 * negative */
 	/*
@@ -161,30 +160,31 @@ d_iput:		no		no		no       yes
 
 /* d_flags entries */
 #define DCACHE_AUTOFS_PENDING 0x0001    /* autofs: "under construction" */
-#define DCACHE_NFSFS_RENAMED  0x0002    /* this dentry has been "silly
-					 * renamed" and has to be
-					 * deleted on the last dput()
-					 */
-#define	DCACHE_DISCONNECTED 0x0004
-     /* This dentry is possibly not currently connected to the dcache tree,
-      * in which case its parent will either be itself, or will have this
-      * flag as well.  nfsd will not use a dentry with this bit set, but will
-      * first endeavour to clear the bit either by discovering that it is
-      * connected, or by performing lookup operations.   Any filesystem which
-      * supports nfsd_operations MUST have a lookup function which, if it finds
-      * a directory inode with a DCACHE_DISCONNECTED dentry, will d_move
-      * that dentry into place and return that dentry rather than the passed one,
-      * typically using d_splice_alias.
-      */
+#define DCACHE_NFSFS_RENAMED  0x0002
+     /* this dentry has been "silly renamed" and has to be deleted on the last
+      * dput() */
+
+#define	DCACHE_DISCONNECTED	0x0004
+     /* This dentry is possibly not currently connected to the dcache tree, in
+      * which case its parent will either be itself, or will have this flag as
+      * well.  nfsd will not use a dentry with this bit set, but will first
+      * endeavour to clear the bit either by discovering that it is connected,
+      * or by performing lookup operations.   Any filesystem which supports
+      * nfsd_operations MUST have a lookup function which, if it finds a
+      * directory inode with a DCACHE_DISCONNECTED dentry, will d_move that
+      * dentry into place and return that dentry rather than the passed one,
+      * typically using d_splice_alias. */
 
 #define DCACHE_REFERENCED	0x0008  /* Recently used, don't discard. */
 #define DCACHE_UNHASHED		0x0010	
-
-#define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched by inotify */
+#define DCACHE_INOTIFY_PARENT_WATCHED 0x0020
+     /* Parent inode is watched by inotify */
 
 #define DCACHE_COOKIE		0x0040	/* For use by dcookie subsystem */
+#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080
+     /* Parent inode is watched by some fsnotify listener */
 
-#define DCACHE_FSNOTIFY_PARENT_WATCHED	0x0080 /* Parent inode is watched by some fsnotify listener */
+#define DCACHE_MOUNTED		0x0100	/* is a mountpoint */
 
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
@@ -372,7 +372,7 @@ extern void dput(struct dentry *);
 
 static inline int d_mountpoint(struct dentry *dentry)
 {
-	return dentry->d_mounted;
+	return dentry->d_flags & DCACHE_MOUNTED;
 }
 
 extern struct vfsmount *lookup_mnt(struct path *);
Index: linux-2.6/fs/autofs4/expire.c
===================================================================
--- linux-2.6.orig/fs/autofs4/expire.c
+++ linux-2.6/fs/autofs4/expire.c
@@ -276,7 +276,9 @@ struct dentry *autofs4_expire_direct(str
 		struct autofs_info *ino = autofs4_dentry_ino(root);
 		if (d_mountpoint(root)) {
 			ino->flags |= AUTOFS_INF_MOUNTPOINT;
-			root->d_mounted--;
+			spin_lock(&root->d_lock);
+			root->d_flags &= ~DCACHE_MOUNTED;
+			spin_unlock(&root->d_lock);
 		}
 		ino->flags |= AUTOFS_INF_EXPIRING;
 		init_completion(&ino->expire_complete);
@@ -499,7 +501,14 @@ int autofs4_do_expire_multi(struct super
 
 		spin_lock(&sbi->fs_lock);
 		if (ino->flags & AUTOFS_INF_MOUNTPOINT) {
-			sb->s_root->d_mounted++;
+			spin_lock(&sb->s_root->d_lock);
+			/*
+			 * If we haven't been expired away, then reset
+			 * mounted status.
+			 */
+			if (mnt->mnt_parent != mnt)
+				sb->s_root->d_flags |= DCACHE_MOUNTED;
+			spin_unlock(&sb->s_root->d_lock);
 			ino->flags &= ~AUTOFS_INF_MOUNTPOINT;
 		}
 		ino->flags &= ~AUTOFS_INF_EXPIRING;




* [patch 4/6] brlock: introduce special brlocks
  2009-10-15  4:40 [patch 0/6] vfsmount scaling and other bits npiggin
                   ` (2 preceding siblings ...)
  2009-10-15  4:40 ` [patch 3/6] fs: dcache remove d_mounted npiggin
@ 2009-10-15  4:40 ` npiggin
  2009-10-15  6:58   ` [rfc][patch 4a/6] brlock: "fast" brlocks Nick Piggin
  2009-10-19  5:25   ` [patch 4/6] brlock: introduce special brlocks Andrew Morton
  2009-10-15  4:40 ` [patch 5/6] fs: brlock vfsmount_lock npiggin
  2009-10-15  4:40 ` [patch 6/6] fs: scale mntget/mntput npiggin
  5 siblings, 2 replies; 22+ messages in thread
From: npiggin @ 2009-10-15  4:40 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel

[-- Attachment #1: kernel-introduce-brlock.patch --]
[-- Type: text/plain, Size: 3922 bytes --]

This patch introduces special brlocks. These can only be used as global
locks, and they use some preprocessor trickery to allow us to retain a more
optimal per-CPU lock implementation. We don't bother working around
lockdep yet.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 include/linux/brlock.h |  112 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)

Index: linux-2.6/include/linux/brlock.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/brlock.h
@@ -0,0 +1,112 @@
+/*
+ * Specialised big-reader spinlock. Can only be declared as global variables
+ * to avoid overhead and keep things simple (and we don't want to start using
+ * these inside dynamically allocated structures).
+ *
+ * Copyright 2009, Nick Piggin, Novell Inc.
+ */
+#ifndef __LINUX_BRLOCK_H
+#define __LINUX_BRLOCK_H
+
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <asm/atomic.h>
+
+#if defined(CONFIG_SMP) && !defined(CONFIG_LOCKDEP)
+#define DECLARE_BRLOCK(name)						\
+ DECLARE_PER_CPU(spinlock_t, name##_lock);				\
+ static inline void name##_lock_init(void) {				\
+	int i;								\
+	for_each_possible_cpu(i) {					\
+		spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		spin_lock_init(lock);					\
+	}								\
+ }									\
+ static inline void name##_rlock(void) {				\
+	spinlock_t *lock;						\
+	lock = &get_cpu_var(name##_lock);				\
+	spin_lock(lock);						\
+ }									\
+ static inline void name##_runlock(void) {				\
+	spinlock_t *lock;						\
+	lock = &__get_cpu_var(name##_lock);				\
+	spin_unlock(lock);						\
+	put_cpu_var(name##_lock);					\
+ }									\
+ extern void name##_wlock(void);					\
+ extern void name##_wunlock(void);					\
+ static inline int name##_atomic_dec_and_rlock(atomic_t *a) {		\
+	int ret;							\
+	spinlock_t *lock;						\
+	lock = &get_cpu_var(name##_lock);				\
+	ret = atomic_dec_and_lock(a, lock);				\
+	if (!ret)							\
+		put_cpu_var(name##_lock);				\
+	return ret;							\
+ }									\
+ extern int name##_atomic_dec_and_wlock__failed(atomic_t *a);		\
+ static inline int name##_atomic_dec_and_wlock(atomic_t *a) {		\
+	if (atomic_add_unless(a, -1, 1))				\
+		return 0;						\
+	return name##_atomic_dec_and_wlock__failed(a);			\
+ }
+
+#define DEFINE_BRLOCK(name)						\
+ DEFINE_PER_CPU(spinlock_t, name##_lock);				\
+ void name##_wlock(void) {						\
+	int i;								\
+	for_each_online_cpu(i) {					\
+		spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		spin_lock(lock);					\
+	}								\
+ }									\
+ void name##_wunlock(void) {						\
+	int i;								\
+	for_each_online_cpu(i) {					\
+		spinlock_t *lock;					\
+		lock = &per_cpu(name##_lock, i);			\
+		spin_unlock(lock);					\
+	}								\
+ }									\
+ int name##_atomic_dec_and_wlock__failed(atomic_t *a) {			\
+	name##_wlock();							\
+	if (!atomic_dec_and_test(a)) {					\
+		name##_wunlock();					\
+		return 0;						\
+	}								\
+	return 1;							\
+ }
+
+#else
+
+#define DECLARE_BRLOCK(name)						\
+ spinlock_t name##_lock;						\
+ static inline void name##_lock_init(void) {				\
+	spin_lock_init(&name##_lock);					\
+ }									\
+ static inline void name##_rlock(void) {				\
+	spin_lock(&name##_lock);					\
+ }									\
+ static inline void name##_runlock(void) {				\
+	spin_unlock(&name##_lock);					\
+ }									\
+ static inline void name##_wlock(void) {				\
+	spin_lock(&name##_lock);					\
+ }									\
+ static inline void name##_wunlock(void) {				\
+	spin_unlock(&name##_lock);					\
+ }									\
+ static inline int name##_atomic_dec_and_rlock(atomic_t *a) {		\
+	return atomic_dec_and_lock(a, &name##_lock);			\
+ }									\
+ static inline int name##_atomic_dec_and_wlock(atomic_t *a) {		\
+	return atomic_dec_and_lock(a, &name##_lock);			\
+ }
+
+#define DEFINE_BRLOCK(name)						\
+ spinlock_t name##_lock
+#endif
+
+#endif




* [patch 5/6] fs: brlock vfsmount_lock
  2009-10-15  4:40 [patch 0/6] vfsmount scaling and other bits npiggin
                   ` (3 preceding siblings ...)
  2009-10-15  4:40 ` [patch 4/6] brlock: introduce special brlocks npiggin
@ 2009-10-15  4:40 ` npiggin
  2009-10-15  4:40 ` [patch 6/6] fs: scale mntget/mntput npiggin
  5 siblings, 0 replies; 22+ messages in thread
From: npiggin @ 2009-10-15  4:40 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel

[-- Attachment #1: fs-vfsmount_lock-scale-2.patch --]
[-- Type: text/plain, Size: 20744 bytes --]

Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read
when performing mount hash lookups.

On a 64 core, 64 socket, 32 node Altix system (so a decent amount of
latency to remote nodes), a simple dumb mount microbenchmark
(mount --bind mnt mnt2; umount mnt2, looped 1000 times) took 6.8s before
this patch and 7.1s afterwards, about a 5% increase in time.

The number of atomics should remain the same for fastpath rlock cases, though
code will be slightly larger due to per-CPU access. Scalability will
probably not improve much in common cases yet, because other locks
get in the way. However, independent path lookups over mountpoints should
be one case where scalability is improved.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/dcache.c                |    4 -
 fs/namei.c                 |   13 +--
 fs/namespace.c             |  178 ++++++++++++++++++++++++++++-----------------
 fs/pnode.c                 |   11 ++
 fs/proc/base.c             |    4 -
 include/linux/mount.h      |    4 -
 kernel/audit_tree.c        |    6 -
 security/tomoyo/realpath.c |    4 -
 8 files changed, 141 insertions(+), 83 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1908,7 +1908,7 @@ char *__d_path(const struct path *path,
 	char *end = buffer + buflen;
 	char *retval;
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_rlock();
 	prepend(&end, &buflen, "\0", 1);
 	if (d_unlinked(dentry) &&
 		(prepend(&end, &buflen, " (deleted)", 10) != 0))
@@ -1944,7 +1944,7 @@ char *__d_path(const struct path *path,
 	}
 
 out:
-	spin_unlock(&vfsmount_lock);
+	vfsmount_runlock();
 	return retval;
 
 global_root:
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -685,15 +685,16 @@ int follow_up(struct path *path)
 {
 	struct vfsmount *parent;
 	struct dentry *mountpoint;
-	spin_lock(&vfsmount_lock);
+
+	vfsmount_rlock();
 	parent = path->mnt->mnt_parent;
 	if (parent == path->mnt) {
-		spin_unlock(&vfsmount_lock);
+		vfsmount_runlock();
 		return 0;
 	}
 	mntget(parent);
 	mountpoint = dget(path->mnt->mnt_mountpoint);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_runlock();
 	dput(path->dentry);
 	path->dentry = mountpoint;
 	mntput(path->mnt);
@@ -772,15 +773,15 @@ static __always_inline void follow_dotdo
 			break;
 		}
 		spin_unlock(&dcache_lock);
-		spin_lock(&vfsmount_lock);
+		vfsmount_rlock();
 		parent = nd->path.mnt->mnt_parent;
 		if (parent == nd->path.mnt) {
-			spin_unlock(&vfsmount_lock);
+			vfsmount_runlock();
 			break;
 		}
 		mntget(parent);
 		nd->path.dentry = dget(nd->path.mnt->mnt_mountpoint);
-		spin_unlock(&vfsmount_lock);
+		vfsmount_runlock();
 		dput(old);
 		mntput(nd->path.mnt);
 		nd->path.mnt = parent;
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -11,6 +11,8 @@
 #include <linux/syscalls.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
 #include <linux/smp_lock.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -37,12 +39,10 @@
 #define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
 #define HASH_SIZE (1UL << HASH_SHIFT)
 
-/* spinlock for vfsmount related operations, inplace of dcache_lock */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);
-
 static int event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
+static DEFINE_SPINLOCK(mnt_id_lock);
 static int mnt_id_start = 0;
 static int mnt_group_start = 1;
 
@@ -54,6 +54,16 @@ static struct rw_semaphore namespace_sem
 struct kobject *fs_kobj;
 EXPORT_SYMBOL_GPL(fs_kobj);
 
+/*
+ * vfsmount lock may be taken for read to prevent changes to the
+ * vfsmount hash, ie. during mountpoint lookups or walking back
+ * up the tree.
+ *
+ * It should be taken for write in all cases where the vfsmount
+ * tree or hash is modified or when a vfsmount structure is modified.
+ */
+DEFINE_BRLOCK(vfsmount);
+
 static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
 {
 	unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
@@ -64,18 +74,21 @@ static inline unsigned long hash(struct
 
 #define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
 
-/* allocation is serialized by namespace_sem */
+/*
+ * allocation is serialized by namespace_sem, but we need the spinlock to
+ * serialize with freeing.
+ */
 static int mnt_alloc_id(struct vfsmount *mnt)
 {
 	int res;
 
 retry:
 	ida_pre_get(&mnt_id_ida, GFP_KERNEL);
-	spin_lock(&vfsmount_lock);
+	spin_lock(&mnt_id_lock);
 	res = ida_get_new_above(&mnt_id_ida, mnt_id_start, &mnt->mnt_id);
 	if (!res)
 		mnt_id_start = mnt->mnt_id + 1;
-	spin_unlock(&vfsmount_lock);
+	spin_unlock(&mnt_id_lock);
 	if (res == -EAGAIN)
 		goto retry;
 
@@ -85,11 +98,11 @@ retry:
 static void mnt_free_id(struct vfsmount *mnt)
 {
 	int id = mnt->mnt_id;
-	spin_lock(&vfsmount_lock);
+	spin_lock(&mnt_id_lock);
 	ida_remove(&mnt_id_ida, id);
 	if (mnt_id_start > id)
 		mnt_id_start = id;
-	spin_unlock(&vfsmount_lock);
+	spin_unlock(&mnt_id_lock);
 }
 
 /*
@@ -344,7 +357,7 @@ static int mnt_make_readonly(struct vfsm
 {
 	int ret = 0;
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	mnt->mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -378,15 +391,15 @@ static int mnt_make_readonly(struct vfsm
 	 */
 	smp_wmb();
 	mnt->mnt_flags &= ~MNT_WRITE_HOLD;
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 	return ret;
 }
 
 static void __mnt_unmake_readonly(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	mnt->mnt_flags &= ~MNT_READONLY;
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 }
 
 void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
@@ -410,6 +423,7 @@ void free_vfsmnt(struct vfsmount *mnt)
 /*
  * find the first or last mount at @dentry on vfsmount @mnt depending on
  * @dir. If @dir is set return the first mount else return the last mount.
+ * vfsmount_lock must be held for read or write.
  */
 struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
 			      int dir)
@@ -439,10 +453,11 @@ struct vfsmount *__lookup_mnt(struct vfs
 struct vfsmount *lookup_mnt(struct path *path)
 {
 	struct vfsmount *child_mnt;
-	spin_lock(&vfsmount_lock);
+
+	vfsmount_rlock();
 	if ((child_mnt = __lookup_mnt(path->mnt, path->dentry, 1)))
 		mntget(child_mnt);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_runlock();
 	return child_mnt;
 }
 
@@ -451,6 +466,9 @@ static inline int check_mnt(struct vfsmo
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void touch_mnt_namespace(struct mnt_namespace *ns)
 {
 	if (ns) {
@@ -459,6 +477,9 @@ static void touch_mnt_namespace(struct m
 	}
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void __touch_mnt_namespace(struct mnt_namespace *ns)
 {
 	if (ns && ns->event != event) {
@@ -467,6 +488,9 @@ static void __touch_mnt_namespace(struct
 	}
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void dentry_reset_mounted(struct vfsmount *mnt, struct dentry *dentry)
 {
 	if (!__lookup_mnt(mnt, dentry, 0)) {
@@ -476,6 +500,9 @@ static void dentry_reset_mounted(struct
 	}
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
 {
 	old_path->dentry = mnt->mnt_mountpoint;
@@ -487,6 +514,9 @@ static void detach_mnt(struct vfsmount *
 	dentry_reset_mounted(old_path->mnt, old_path->dentry);
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 void mnt_set_mountpoint(struct vfsmount *mnt, struct dentry *dentry,
 			struct vfsmount *child_mnt)
 {
@@ -497,6 +527,9 @@ void mnt_set_mountpoint(struct vfsmount
 	spin_unlock(&dentry->d_lock);
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 static void attach_mnt(struct vfsmount *mnt, struct path *path)
 {
 	mnt_set_mountpoint(path->mnt, path->dentry, mnt);
@@ -506,7 +539,7 @@ static void attach_mnt(struct vfsmount *
 }
 
 /*
- * the caller must hold vfsmount_lock
+ * vfsmount lock must be held for write
  */
 static void commit_tree(struct vfsmount *mnt)
 {
@@ -629,40 +662,41 @@ static inline void __mntput(struct vfsmo
 void mntput_no_expire(struct vfsmount *mnt)
 {
 repeat:
-	if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
-		if (likely(!mnt->mnt_pinned)) {
-			spin_unlock(&vfsmount_lock);
-			__mntput(mnt);
-			return;
-		}
-		atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
-		mnt->mnt_pinned = 0;
-		spin_unlock(&vfsmount_lock);
-		acct_auto_close_mnt(mnt);
-		security_sb_umount_close(mnt);
-		goto repeat;
+	if (!vfsmount_atomic_dec_and_wlock(&mnt->mnt_count))
+		return;
+
+	if (likely(!mnt->mnt_pinned)) {
+		vfsmount_wunlock();
+		__mntput(mnt);
+		return;
 	}
+	atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
+	mnt->mnt_pinned = 0;
+	vfsmount_wunlock();
+	acct_auto_close_mnt(mnt);
+	security_sb_umount_close(mnt);
+	goto repeat;
 }
 
 EXPORT_SYMBOL(mntput_no_expire);
 
 void mnt_pin(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	mnt->mnt_pinned++;
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 }
 
 EXPORT_SYMBOL(mnt_pin);
 
 void mnt_unpin(struct vfsmount *mnt)
 {
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	if (mnt->mnt_pinned) {
 		atomic_inc(&mnt->mnt_count);
 		mnt->mnt_pinned--;
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 }
 
 EXPORT_SYMBOL(mnt_unpin);
@@ -945,12 +979,12 @@ int may_umount_tree(struct vfsmount *mnt
 	int minimum_refs = 0;
 	struct vfsmount *p;
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_rlock();
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		actual_refs += atomic_read(&p->mnt_count);
 		minimum_refs += 2;
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_runlock();
 
 	if (actual_refs > minimum_refs)
 		return 0;
@@ -976,10 +1010,12 @@ EXPORT_SYMBOL(may_umount_tree);
 int may_umount(struct vfsmount *mnt)
 {
 	int ret = 1;
-	spin_lock(&vfsmount_lock);
+
+	vfsmount_rlock();
 	if (propagate_mount_busy(mnt, 2))
 		ret = 0;
-	spin_unlock(&vfsmount_lock);
+	vfsmount_runlock();
+
 	return ret;
 }
 
@@ -994,13 +1030,14 @@ void release_mounts(struct list_head *he
 		if (mnt->mnt_parent != mnt) {
 			struct dentry *dentry;
 			struct vfsmount *m;
-			spin_lock(&vfsmount_lock);
+
+			vfsmount_wlock();
 			dentry = mnt->mnt_mountpoint;
 			m = mnt->mnt_parent;
 			mnt->mnt_mountpoint = mnt->mnt_root;
 			mnt->mnt_parent = mnt;
 			m->mnt_ghosts--;
-			spin_unlock(&vfsmount_lock);
+			vfsmount_wunlock();
 			dput(dentry);
 			mntput(m);
 		}
@@ -1008,6 +1045,10 @@ void release_mounts(struct list_head *he
 	}
 }
 
+/*
+ * vfsmount lock must be held for write
+ * namespace_sem must be held for write
+ */
 void umount_tree(struct vfsmount *mnt, int propagate, struct list_head *kill)
 {
 	struct vfsmount *p;
@@ -1098,7 +1139,7 @@ static int do_umount(struct vfsmount *mn
 	}
 
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	event++;
 
 	if (!(flags & MNT_DETACH))
@@ -1110,7 +1151,7 @@ static int do_umount(struct vfsmount *mn
 			umount_tree(mnt, 1, &umount_list);
 		retval = 0;
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 	if (retval)
 		security_sb_umount_busy(mnt);
 	up_write(&namespace_sem);
@@ -1217,19 +1258,19 @@ struct vfsmount *copy_tree(struct vfsmou
 			q = clone_mnt(p, p->mnt_root, flag);
 			if (!q)
 				goto Enomem;
-			spin_lock(&vfsmount_lock);
+			vfsmount_wlock();
 			list_add_tail(&q->mnt_list, &res->mnt_list);
 			attach_mnt(q, &path);
-			spin_unlock(&vfsmount_lock);
+			vfsmount_wunlock();
 		}
 	}
 	return res;
 Enomem:
 	if (res) {
 		LIST_HEAD(umount_list);
-		spin_lock(&vfsmount_lock);
+		vfsmount_wlock();
 		umount_tree(res, 0, &umount_list);
-		spin_unlock(&vfsmount_lock);
+		vfsmount_wunlock();
 		release_mounts(&umount_list);
 	}
 	return NULL;
@@ -1248,9 +1289,9 @@ void drop_collected_mounts(struct vfsmou
 {
 	LIST_HEAD(umount_list);
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	umount_tree(mnt, 0, &umount_list);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 }
@@ -1368,7 +1409,7 @@ static int attach_recursive_mnt(struct v
 			set_mnt_shared(p);
 	}
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	if (parent_path) {
 		detach_mnt(source_mnt, parent_path);
 		attach_mnt(source_mnt, path);
@@ -1382,7 +1423,8 @@ static int attach_recursive_mnt(struct v
 		list_del_init(&child->mnt_hash);
 		commit_tree(child);
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
+
 	return 0;
 
  out_cleanup_ids:
@@ -1444,10 +1486,10 @@ static int do_change_type(struct path *p
 			goto out_unlock;
 	}
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
 		change_mnt_propagation(m, type);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 
  out_unlock:
 	up_write(&namespace_sem);
@@ -1491,9 +1533,10 @@ static int do_loopback(struct path *path
 	err = graft_tree(mnt, path);
 	if (err) {
 		LIST_HEAD(umount_list);
-		spin_lock(&vfsmount_lock);
+
+		vfsmount_wlock();
 		umount_tree(mnt, 0, &umount_list);
-		spin_unlock(&vfsmount_lock);
+		vfsmount_wunlock();
 		release_mounts(&umount_list);
 	}
 
@@ -1551,9 +1594,9 @@ static int do_remount(struct path *path,
 	if (!err) {
 		security_sb_post_remount(path->mnt, flags, data);
 
-		spin_lock(&vfsmount_lock);
+		vfsmount_wlock();
 		touch_mnt_namespace(path->mnt->mnt_ns);
-		spin_unlock(&vfsmount_lock);
+		vfsmount_wunlock();
 	}
 	return err;
 }
@@ -1728,7 +1771,7 @@ void mark_mounts_for_expiry(struct list_
 		return;
 
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 
 	/* extract from the expiration list every vfsmount that matches the
 	 * following criteria:
@@ -1747,7 +1790,7 @@ void mark_mounts_for_expiry(struct list_
 		touch_mnt_namespace(mnt->mnt_ns);
 		umount_tree(mnt, 1, &umounts);
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 	up_write(&namespace_sem);
 
 	release_mounts(&umounts);
@@ -1804,6 +1847,8 @@ resume:
 /*
  * process a list of expirable mountpoints with the intent of discarding any
  * submounts of a specific parent mountpoint
+ *
+ * vfsmount_lock must be held for write
  */
 static void shrink_submounts(struct vfsmount *mnt, struct list_head *umounts)
 {
@@ -2022,9 +2067,9 @@ static struct mnt_namespace *dup_mnt_ns(
 		kfree(new_ns);
 		return ERR_PTR(-ENOMEM);
 	}
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 
 	/*
 	 * Second pass: switch the tsk->fs->* elements and mark new vfsmounts
@@ -2221,7 +2266,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 		goto out2; /* not attached */
 	/* make sure we can reach put_old from new_root */
 	tmp = old.mnt;
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	if (tmp != new.mnt) {
 		for (;;) {
 			if (tmp->mnt_parent == tmp)
@@ -2241,7 +2286,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 	/* mount new_root on / */
 	attach_mnt(new.mnt, &root_parent);
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 	chroot_fs_refs(&root, &new);
 	security_sb_post_pivotroot(&root, &new);
 	error = 0;
@@ -2257,7 +2302,7 @@ out1:
 out0:
 	return error;
 out3:
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 	goto out2;
 }
 
@@ -2304,6 +2349,8 @@ void __init mnt_init(void)
 	for (u = 0; u < HASH_SIZE; u++)
 		INIT_LIST_HEAD(&mount_hashtable[u]);
 
+	vfsmount_lock_init();
+
 	err = sysfs_init();
 	if (err)
 		printk(KERN_WARNING "%s: sysfs_init error: %d\n",
@@ -2320,15 +2367,16 @@ void put_mnt_ns(struct mnt_namespace *ns
 	struct vfsmount *root;
 	LIST_HEAD(umount_list);
 
-	if (!atomic_dec_and_lock(&ns->count, &vfsmount_lock))
+	if (!vfsmount_atomic_dec_and_rlock(&ns->count))
 		return;
 	root = ns->root;
 	ns->root = NULL;
-	spin_unlock(&vfsmount_lock);
+	vfsmount_runlock();
+
 	down_write(&namespace_sem);
-	spin_lock(&vfsmount_lock);
-	umount_tree(root, 0, &umount_list);
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wlock();
+	umount_tree(root, 0, &umount_list);
+	vfsmount_wunlock();
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 	kfree(ns);
Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c
+++ linux-2.6/fs/pnode.c
@@ -126,6 +126,9 @@ static int do_make_slave(struct vfsmount
 	return 0;
 }
 
+/*
+ * vfsmount lock must be held for write
+ */
 void change_mnt_propagation(struct vfsmount *mnt, int type)
 {
 	if (type == MS_SHARED) {
@@ -264,12 +267,12 @@ int propagate_mnt(struct vfsmount *dest_
 		prev_src_mnt  = child;
 	}
 out:
-	spin_lock(&vfsmount_lock);
+	vfsmount_wlock();
 	while (!list_empty(&tmp_list)) {
 		child = list_first_entry(&tmp_list, struct vfsmount, mnt_hash);
 		umount_tree(child, 0, &umount_list);
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_wunlock();
 	release_mounts(&umount_list);
 	return ret;
 }
@@ -290,6 +293,8 @@ static inline int do_refcount_check(stru
  * other mounts its parent propagates to.
  * Check if any of these mounts that **do not have submounts**
  * have more references than 'refcnt'. If so return busy.
+ *
+ * vfsmount lock must be held for read or write
  */
 int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
 {
@@ -347,6 +352,8 @@ static void __propagate_umount(struct vf
  * collect all mounts that receive propagation from the mount in @list,
  * and return these additional mounts in the same list.
  * @list: the list of mounts to be unmounted.
+ *
+ * vfsmount lock must be held for write
  */
 int propagate_umount(struct list_head *list)
 {
Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c
+++ linux-2.6/fs/proc/base.c
@@ -652,12 +652,12 @@ static unsigned mounts_poll(struct file
 
 	poll_wait(file, &ns->poll, wait);
 
-	spin_lock(&vfsmount_lock);
+	vfsmount_rlock();
 	if (p->event != ns->event) {
 		p->event = ns->event;
 		res |= POLLERR | POLLPRI;
 	}
-	spin_unlock(&vfsmount_lock);
+	vfsmount_runlock();
 
 	return res;
 }
Index: linux-2.6/include/linux/mount.h
===================================================================
--- linux-2.6.orig/include/linux/mount.h
+++ linux-2.6/include/linux/mount.h
@@ -13,6 +13,7 @@
 #include <linux/list.h>
 #include <linux/nodemask.h>
 #include <linux/spinlock.h>
+#include <linux/brlock.h>
 #include <asm/atomic.h>
 
 struct super_block;
@@ -90,6 +91,8 @@ static inline struct vfsmount *mntget(st
 
 struct file; /* forward dec */
 
+DECLARE_BRLOCK(vfsmount);
+
 extern int mnt_want_write(struct vfsmount *mnt);
 extern int mnt_want_write_file(struct file *file);
 extern int mnt_clone_write(struct vfsmount *mnt);
@@ -123,7 +126,6 @@ extern int do_add_mount(struct vfsmount
 
 extern void mark_mounts_for_expiry(struct list_head *mounts);
 
-extern spinlock_t vfsmount_lock;
 extern dev_t name_to_dev_t(char *name);
 
 #endif /* _LINUX_MOUNT_H */
Index: linux-2.6/kernel/audit_tree.c
===================================================================
--- linux-2.6.orig/kernel/audit_tree.c
+++ linux-2.6/kernel/audit_tree.c
@@ -758,15 +758,15 @@ int audit_tag_tree(char *old, char *new)
 			continue;
 		}
 
-		spin_lock(&vfsmount_lock);
+		vfsmount_rlock();
 		if (!is_under(mnt, dentry, &path)) {
-			spin_unlock(&vfsmount_lock);
+			vfsmount_runlock();
 			path_put(&path);
 			put_tree(tree);
 			mutex_lock(&audit_filter_mutex);
 			continue;
 		}
-		spin_unlock(&vfsmount_lock);
+		vfsmount_runlock();
 		path_put(&path);
 
 		list_for_each_entry(p, &list, mnt_list) {
Index: linux-2.6/security/tomoyo/realpath.c
===================================================================
--- linux-2.6.orig/security/tomoyo/realpath.c
+++ linux-2.6/security/tomoyo/realpath.c
@@ -96,12 +96,12 @@ int tomoyo_realpath_from_path2(struct pa
 		root = current->fs->root;
 		path_get(&root);
 		read_unlock(&current->fs->lock);
-		spin_lock(&vfsmount_lock);
+		vfsmount_rlock();
 		if (root.mnt && root.mnt->mnt_ns)
 			ns_root.mnt = mntget(root.mnt->mnt_ns->root);
 		if (ns_root.mnt)
 			ns_root.dentry = dget(ns_root.mnt->mnt_root);
-		spin_unlock(&vfsmount_lock);
+		vfsmount_runlock();
 		spin_lock(&dcache_lock);
 		tmp = ns_root;
 		sp = __d_path(path, &tmp, newname, newname_len);



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [patch 6/6] fs: scale mntget/mntput
  2009-10-15  4:40 [patch 0/6] vfsmount scaling and other bits npiggin
                   ` (4 preceding siblings ...)
  2009-10-15  4:40 ` [patch 5/6] fs: brlock vfsmount_lock npiggin
@ 2009-10-15  4:40 ` npiggin
  5 siblings, 0 replies; 22+ messages in thread
From: npiggin @ 2009-10-15  4:40 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel

[-- Attachment #1: fs-mntget-scale.patch --]
[-- Type: text/plain, Size: 11452 bytes --]

Improve scalability of mntget/mntput by using per-cpu counters protected by the
reader side of the brlock vfsmount_lock. MNT_MOUNTED in mnt_flags keeps track
of whether the vfsmount is actually attached to the tree so we can shortcut the
now-expensive refcount check in mntput (just decrement the count because we
know there must be at least one ref left).

No extra atomics in the common case, because the atomic mnt refcount is now
replaced with a per-CPU counter protected by the per-CPU lock. The code will be
bigger and more complex, however. With the previous per-CPU locking patch,
mount lookups and common-case refcounting are now per-CPU and should be ideally
scalable. Path lookups (and hence path_get/path_put) within the same vfsmount
should now be more scalable; however this will often be hidden by dcache_lock
on the final dput, and by d_lock on common path elements (eg. cwd or root
dentry).
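
For illustration only (not part of the patch): the per-CPU counting scheme
boils down to something like the user-space sketch below. The helper names
mirror the patch but the per-CPU storage is faked with a fixed array, and the
brlock that protects the slow-path sum is elided. The key property is that a
single CPU's slot may go negative (get on one CPU, put on another); only the
sum over all CPUs is meaningful, which is why summing requires excluding all
fast-path updaters.

```c
#include <assert.h>

#define NR_CPUS 4

/* Simplified stand-in for the per-CPU mnt_count: one counter per CPU. */
struct mnt_count {
	int count[NR_CPUS];
};

/* Fast path: bump only this CPU's slot -- no shared cacheline is written. */
static void inc_mnt_count(struct mnt_count *m, int cpu)
{
	m->count[cpu]++;
}

static void dec_mnt_count(struct mnt_count *m, int cpu)
{
	m->count[cpu]--;
}

/* Slow path (umount): sum every slot; in the kernel this runs with the
 * vfsmount lock held for write so no fast-path updates race with it. */
static int count_mnt_count(struct mnt_count *m)
{
	int cpu, sum = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += m->count[cpu];
	return sum;
}
```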

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/libfs.c            |    1 
 fs/namespace.c        |  151 +++++++++++++++++++++++++++++++++++++++++++++-----
 fs/pnode.c            |    4 -
 include/linux/mount.h |   34 ++++-------
 4 files changed, 152 insertions(+), 38 deletions(-)

Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c
+++ linux-2.6/fs/namespace.c
@@ -138,6 +138,61 @@ void mnt_release_group_id(struct vfsmoun
 	mnt->mnt_group_id = 0;
 }
 
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void add_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+	(*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n;
+#else
+	mnt->mnt_count += n;
+#endif
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void inc_mnt_count(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+	(*per_cpu_ptr(mnt->mnt_count, smp_processor_id()))++;
+#else
+	mnt->mnt_count++;
+#endif
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void dec_mnt_count(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+	(*per_cpu_ptr(mnt->mnt_count, smp_processor_id()))--;
+#else
+	mnt->mnt_count--;
+#endif
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
+unsigned int count_mnt_count(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+	unsigned int count = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		count += *per_cpu_ptr(mnt->mnt_count, cpu);
+	}
+
+	return count;
+#else
+	return mnt->mnt_count;
+#endif
+}
+
 struct vfsmount *alloc_vfsmnt(const char *name)
 {
 	struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -154,7 +209,13 @@ struct vfsmount *alloc_vfsmnt(const char
 				goto out_free_id;
 		}
 
-		atomic_set(&mnt->mnt_count, 1);
+#ifdef CONFIG_SMP
+		mnt->mnt_count = alloc_percpu(int);
+		if (!mnt->mnt_count)
+			goto out_free_devname;
+#else
+		mnt->mnt_count = 0;
+#endif
 		INIT_LIST_HEAD(&mnt->mnt_hash);
 		INIT_LIST_HEAD(&mnt->mnt_child);
 		INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -166,14 +227,19 @@ struct vfsmount *alloc_vfsmnt(const char
 #ifdef CONFIG_SMP
 		mnt->mnt_writers = alloc_percpu(int);
 		if (!mnt->mnt_writers)
-			goto out_free_devname;
+			goto out_free_mntcount;
 #else
 		mnt->mnt_writers = 0;
 #endif
+		preempt_disable();
+		inc_mnt_count(mnt);
+		preempt_enable();
 	}
 	return mnt;
 
 #ifdef CONFIG_SMP
+out_free_mntcount:
+	free_percpu(mnt->mnt_count);
 out_free_devname:
 	kfree(mnt->mnt_devname);
 #endif
@@ -512,6 +578,8 @@ static void detach_mnt(struct vfsmount *
 	list_del_init(&mnt->mnt_child);
 	list_del_init(&mnt->mnt_hash);
 	dentry_reset_mounted(old_path->mnt, old_path->dentry);
+	WARN_ON(!(mnt->mnt_flags & MNT_MOUNTED));
+	mnt->mnt_flags &= ~MNT_MOUNTED;
 }
 
 /*
@@ -536,6 +604,8 @@ static void attach_mnt(struct vfsmount *
 	list_add_tail(&mnt->mnt_hash, mount_hashtable +
 			hash(path->mnt, path->dentry));
 	list_add_tail(&mnt->mnt_child, &path->mnt->mnt_mounts);
+	WARN_ON(mnt->mnt_flags & MNT_MOUNTED);
+	mnt->mnt_flags |= MNT_MOUNTED;
 }
 
 /*
@@ -558,6 +628,8 @@ static void commit_tree(struct vfsmount
 	list_add_tail(&mnt->mnt_hash, mount_hashtable +
 				hash(parent, mnt->mnt_mountpoint));
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+	WARN_ON(mnt->mnt_flags & MNT_MOUNTED);
+	mnt->mnt_flags |= MNT_MOUNTED;
 	touch_mnt_namespace(n);
 }
 
@@ -652,6 +724,9 @@ static inline void __mntput(struct vfsmo
 	/*
 	 * atomic_dec_and_lock() used to deal with ->mnt_count decrements
 	 * provides barriers, so count_mnt_writers() below is safe.  AV
+	 * XXX: hmm, we no longer have an atomic_dec_and_lock, so load of
+	 * mnt_writers may be moved up into the vfsmount lock critical
+	 * section? Do we need an smp_mb()?
 	 */
 	WARN_ON(count_mnt_writers(mnt));
 	dput(mnt->mnt_root);
@@ -661,44 +736,79 @@ static inline void __mntput(struct vfsmo
 
 void mntput_no_expire(struct vfsmount *mnt)
 {
-repeat:
-	if (!vfsmount_atomic_dec_and_wlock(&mnt->mnt_count))
+	if (likely(mnt->mnt_flags & MNT_MOUNTED)) {
+		vfsmount_rlock();
+		if (unlikely(!(mnt->mnt_flags & MNT_MOUNTED))) {
+			vfsmount_runlock();
+			goto repeat;
+		}
+		dec_mnt_count(mnt);
+		vfsmount_runlock();
+
 		return;
+	}
 
+repeat:
+	vfsmount_wlock();
+	BUG_ON(mnt->mnt_flags & MNT_MOUNTED);
+	dec_mnt_count(mnt);
+	if (count_mnt_count(mnt)) {
+		vfsmount_wunlock();
+		return;
+	}
 	if (likely(!mnt->mnt_pinned)) {
 		vfsmount_wunlock();
 		__mntput(mnt);
 		return;
 	}
-	atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
+	add_mnt_count(mnt, mnt->mnt_pinned + 1);
 	mnt->mnt_pinned = 0;
 	vfsmount_wunlock();
 	acct_auto_close_mnt(mnt);
 	security_sb_umount_close(mnt);
 	goto repeat;
 }
-
 EXPORT_SYMBOL(mntput_no_expire);
 
+void mntput(struct vfsmount *mnt)
+{
+	if (mnt) {
+		/* avoid cacheline pingpong, hope gcc doesn't get "smart" */
+		if (unlikely(mnt->mnt_expiry_mark))
+			mnt->mnt_expiry_mark = 0;
+		mntput_no_expire(mnt);
+	}
+}
+EXPORT_SYMBOL(mntput);
+
+struct vfsmount *mntget(struct vfsmount *mnt)
+{
+	if (mnt) {
+		preempt_disable();
+		inc_mnt_count(mnt);
+		preempt_enable();
+	}
+	return mnt;
+}
+EXPORT_SYMBOL(mntget);
+
 void mnt_pin(struct vfsmount *mnt)
 {
 	vfsmount_wlock();
 	mnt->mnt_pinned++;
 	vfsmount_wunlock();
 }
-
 EXPORT_SYMBOL(mnt_pin);
 
 void mnt_unpin(struct vfsmount *mnt)
 {
 	vfsmount_wlock();
 	if (mnt->mnt_pinned) {
-		atomic_inc(&mnt->mnt_count);
+		inc_mnt_count(mnt);
 		mnt->mnt_pinned--;
 	}
 	vfsmount_wunlock();
 }
-
 EXPORT_SYMBOL(mnt_unpin);
 
 static inline void mangle(struct seq_file *m, const char *s)
@@ -979,12 +1089,13 @@ int may_umount_tree(struct vfsmount *mnt
 	int minimum_refs = 0;
 	struct vfsmount *p;
 
-	vfsmount_rlock();
+	/* write lock needed for count_mnt_count */
+	vfsmount_wlock();
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
-		actual_refs += atomic_read(&p->mnt_count);
+		actual_refs += count_mnt_count(p);
 		minimum_refs += 2;
 	}
-	vfsmount_runlock();
+	vfsmount_wunlock();
 
 	if (actual_refs > minimum_refs)
 		return 0;
@@ -1011,10 +1122,10 @@ int may_umount(struct vfsmount *mnt)
 {
 	int ret = 1;
 
-	vfsmount_rlock();
+	vfsmount_wlock();
 	if (propagate_mount_busy(mnt, 2))
 		ret = 0;
-	vfsmount_runlock();
+	vfsmount_wunlock();
 
 	return ret;
 }
@@ -1065,6 +1176,8 @@ void umount_tree(struct vfsmount *mnt, i
 		__touch_mnt_namespace(p->mnt_ns);
 		p->mnt_ns = NULL;
 		list_del_init(&p->mnt_child);
+		WARN_ON(!(p->mnt_flags & MNT_MOUNTED));
+		p->mnt_flags &= ~MNT_MOUNTED;
 		if (p->mnt_parent != p) {
 			p->mnt_parent->mnt_ghosts++;
 			dentry_reset_mounted(p->mnt_parent, p->mnt_mountpoint);
@@ -1096,8 +1209,16 @@ static int do_umount(struct vfsmount *mn
 		    flags & (MNT_FORCE | MNT_DETACH))
 			return -EINVAL;
 
-		if (atomic_read(&mnt->mnt_count) != 2)
+		/*
+		 * probably don't strictly need the lock here if we examined
+		 * all race cases, but it's a slowpath.
+		 */
+		vfsmount_wlock();
+		if (count_mnt_count(mnt) != 2) {
+			vfsmount_wunlock();
 			return -EBUSY;
+		}
+		vfsmount_wunlock();
 
 		if (!xchg(&mnt->mnt_expiry_mark, 1))
 			return -EAGAIN;
Index: linux-2.6/include/linux/mount.h
===================================================================
--- linux-2.6.orig/include/linux/mount.h
+++ linux-2.6/include/linux/mount.h
@@ -32,11 +32,13 @@ struct mnt_namespace;
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_WRITE_HOLD	0x200
+#define MNT_MOUNTED	0x400
 
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
 #define MNT_PNODE_MASK	0x3000	/* propagation flag mask */
 
+
 struct vfsmount {
 	struct list_head mnt_hash;
 	struct vfsmount *mnt_parent;	/* fs we are mounted on */
@@ -57,12 +59,6 @@ struct vfsmount {
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
 	int mnt_id;			/* mount identifier */
 	int mnt_group_id;		/* peer group identifier */
-	/*
-	 * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount
-	 * to let these frequently modified fields in a separate cache line
-	 * (so that reads of mnt_flags wont ping-pong on SMP machines)
-	 */
-	atomic_t mnt_count;
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	int mnt_pinned;
 	int mnt_ghosts;
@@ -71,6 +67,11 @@ struct vfsmount {
 #else
 	int mnt_writers;
 #endif
+#ifdef CONFIG_SMP
+	int *mnt_count;
+#else
+	int mnt_count;
+#endif
 };
 
 static inline int *get_mnt_writers_ptr(struct vfsmount *mnt)
@@ -82,34 +83,25 @@ static inline int *get_mnt_writers_ptr(s
 #endif
 }
 
-static inline struct vfsmount *mntget(struct vfsmount *mnt)
-{
-	if (mnt)
-		atomic_inc(&mnt->mnt_count);
-	return mnt;
-}
-
 struct file; /* forward dec */
 
 DECLARE_BRLOCK(vfsmount);
 
+extern unsigned int count_mnt_count(struct vfsmount *mnt);
+
 extern int mnt_want_write(struct vfsmount *mnt);
 extern int mnt_want_write_file(struct file *file);
 extern int mnt_clone_write(struct vfsmount *mnt);
 extern void mnt_drop_write(struct vfsmount *mnt);
+
 extern void mntput_no_expire(struct vfsmount *mnt);
+extern struct vfsmount *mntget(struct vfsmount *mnt);
+extern void mntput(struct vfsmount *mnt);
+
 extern void mnt_pin(struct vfsmount *mnt);
 extern void mnt_unpin(struct vfsmount *mnt);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
 
-static inline void mntput(struct vfsmount *mnt)
-{
-	if (mnt) {
-		mnt->mnt_expiry_mark = 0;
-		mntput_no_expire(mnt);
-	}
-}
-
 extern struct vfsmount *do_kern_mount(const char *fstype, int flags,
 				      const char *name, void *data);
 
Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c
+++ linux-2.6/fs/pnode.c
@@ -282,7 +282,7 @@ out:
  */
 static inline int do_refcount_check(struct vfsmount *mnt, int count)
 {
-	int mycount = atomic_read(&mnt->mnt_count) - mnt->mnt_ghosts;
+	int mycount = count_mnt_count(mnt) - mnt->mnt_ghosts;
 	return (mycount > count);
 }
 
@@ -294,7 +294,7 @@ static inline int do_refcount_check(stru
  * Check if any of these mounts that **do not have submounts**
  * have more references than 'refcnt'. If so return busy.
  *
- * vfsmount lock must be held for read or write
+ * vfsmount lock must be held for write
  */
 int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
 {
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -244,6 +244,7 @@ int get_sb_pseudo(struct file_system_typ
 	d_instantiate(dentry, root);
 	s->s_root = dentry;
 	s->s_flags |= MS_ACTIVE;
+	mnt->mnt_flags |= MNT_MOUNTED;
 	simple_set_mnt(mnt, s);
 	return 0;
 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 2/6] fs: no games with DCACHE_UNHASHED
  2009-10-15  4:40 ` [patch 2/6] fs: no games with DCACHE_UNHASHED npiggin
@ 2009-10-15  6:31   ` David Miller
  2009-10-15  7:44   ` Eric Dumazet
  1 sibling, 0 replies; 22+ messages in thread
From: David Miller @ 2009-10-15  6:31 UTC (permalink / raw)
  To: npiggin; +Cc: viro, linux-fsdevel, raven, torvalds, linux-kernel, mszeredi

From: npiggin@suse.de
Date: Thu, 15 Oct 2009 15:40:28 +1100

> (this is in -mm)
> 
> Filesystems outside the regular namespace do not have to clear DCACHE_UNHASHED
> in order to have a working /proc/$pid/fd/XXX. Nothing in proc prevents the
> fd link from being used if its dentry is not in the hash.
> 
> Also, it does not get put into the dcache hash if DCACHE_UNHASHED is clear;
> that depends on the filesystem calling d_add or d_rehash.
> 
> So delete the misleading comments and needless code.
> 
> Acked-by: Miklos Szeredi <mszeredi@suse.cz>
> Signed-off-by: Nick Piggin <npiggin@suse.de>

For networking bits:

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [rfc][patch 4a/6] brlock: "fast" brlocks
  2009-10-15  4:40 ` [patch 4/6] brlock: introduce special brlocks npiggin
@ 2009-10-15  6:58   ` Nick Piggin
  2009-10-15 11:05     ` Peter Zijlstra
  2009-10-19  5:25   ` [patch 4/6] brlock: introduce special brlocks Andrew Morton
  1 sibling, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2009-10-15  6:58 UTC (permalink / raw)
  To: linux-arch
  Cc: linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel,
	David Miller, Al Viro

[Not for merge. Stop reading if you're not interested in locking minutiae.]

OK, this is untested but I think the theory is right. Basically it is taking
the idea from Dave M's cool brlock optimisation stuff with one further
optimisation in that the read locker does not check the spinlock but
rather we keep another wlocked variable together inthe same cacheline per
CPU, so the read locker only has to touch one cacheline rather than 2.

This actually will reduce the number of atomics by 2 per path lookup,
however we have an smp_mb() there now which is really nasty on some
architectures (like ia64 and ppc64), and not that nice on x86 either.
We can probably do something interesting on ia64 and ppc64 so that we
take advantage of the fact rlocked and wlocked are in the same cacheline
so cache coherency (rather than memory consistency) should always provide
a strict ordering there. We still do need an acquire barrier -- but it is
a much nicer lwsync or ld.acq on ppc and ia64.

But: is the avoidance of the atomic RMW a big win? On the x86 cores I've
tested, IIRC mfence is about as costly as a locked instruction (which already
implies the full barrier)...

So long story short: it might be a small win but it is going to be very
arch specific and will require arch specific code to do the barriers and
things. The generic spinlock brlock isn't bad at all, so I'll just post
this as a curiosity for the time being.
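
To make the handshake above concrete, here is a user-space sketch of the
protocol (again not for merge, and single-threaded: helper names are made up,
NR_CPUS is fixed, __sync_synchronize() stands in for smp_mb(), and the real
code spins where this sketch just fails). The point it demonstrates is the
Dekker-style pairing: reader stores rlocked then loads wlocked, writer stores
all wlocked then loads every rlocked, with a full barrier between the store
and the load on both sides.

```c
#include <assert.h>

#define NR_CPUS 4

struct brlock_node {
	unsigned int rlocked;
	unsigned int wlocked;
};

static struct brlock_node nodes[NR_CPUS];

/* Reader: touches only its own CPU's node. Returns 0 (backs off) if a
 * writer is active; the real lock loops until the writer is gone. */
static int br_rlock(int cpu)
{
	nodes[cpu].rlocked++;
	__sync_synchronize();		/* stand-in for smp_mb() */
	if (nodes[cpu].wlocked) {
		nodes[cpu].rlocked--;
		return 0;
	}
	return 1;
}

static void br_runlock(int cpu)
{
	nodes[cpu].rlocked--;
}

/* Writer: mark every node write-locked, then check for readers.
 * Returns 0 if a reader holds the lock; the real code spins instead. */
static int br_wlock_try(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		nodes[cpu].wlocked = 1;
	__sync_synchronize();		/* wlocked stores before rlocked loads */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (nodes[cpu].rlocked)
			return 0;
	return 1;
}

static void br_wunlock(void)
{
	int cpu;

	__sync_synchronize();
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		nodes[cpu].wlocked = 0;
}
```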
 
---
 include/linux/brlock.h |   97 +++++++++++++++++++++++++++++++------------------
 1 file changed, 63 insertions(+), 34 deletions(-)

Index: linux-2.6/include/linux/brlock.h
===================================================================
--- linux-2.6.orig/include/linux/brlock.h
+++ linux-2.6/include/linux/brlock.h
@@ -13,37 +13,45 @@
 #include <asm/atomic.h>
 
 #if defined(CONFIG_SMP) && !defined(CONFIG_LOCKDEP)
+
+struct brlock_node {
+	unsigned int rlocked;
+	unsigned int wlocked;
+};
+
 #define DECLARE_BRLOCK(name)						\
- DECLARE_PER_CPU(spinlock_t, name##_lock);				\
- static inline void name##_lock_init(void) {				\
-	int i;								\
-	for_each_possible_cpu(i) {					\
-		spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		spin_lock_init(lock);					\
-	}								\
- }									\
+ DECLARE_PER_CPU_ALIGNED(struct brlock_node, name##_lock);		\
  static inline void name##_rlock(void) {				\
-	spinlock_t *lock;						\
-	lock = &get_cpu_var(name##_lock);				\
-	spin_lock(lock);						\
+	struct brlock_node *n;						\
+	n = &get_cpu_var(name##_lock);					\
+again:									\
+	n->rlocked++;							\
+	smp_mb(); /* ensure rlocked store before wlocked load */	\
+	if (unlikely(n->wlocked)) {					\
+		n->rlocked--;						\
+		while (n->wlocked)					\
+			cpu_relax();					\
+		goto again;						\
+	}								\
+	/* smp_mb() above also prevents crit stores leaking up */	\
  }									\
  static inline void name##_runlock(void) {				\
-	spinlock_t *lock;						\
-	lock = &__get_cpu_var(name##_lock);				\
-	spin_unlock(lock);						\
+	struct brlock_node *n;						\
+	smp_wmb(); /* prevent crit stores leaking out (XXX: should really be a release barrier because some architectures might leak crit loads out past the store, but I want to test x86 performance here) */				\
+	n = &__get_cpu_var(name##_lock);				\
+	n->rlocked--;							\
 	put_cpu_var(name##_lock);					\
  }									\
  extern void name##_wlock(void);					\
  extern void name##_wunlock(void);					\
  static inline int name##_atomic_dec_and_rlock(atomic_t *a) {		\
-	int ret;							\
-	spinlock_t *lock;						\
-	lock = &get_cpu_var(name##_lock);				\
-	ret = atomic_dec_and_lock(a, lock);				\
-	if (!ret)							\
-		put_cpu_var(name##_lock);				\
-	return ret;							\
+	if (likely(atomic_add_unless(a, -1, 1)))			\
+		return 0;						\
+	name##_rlock();							\
+	if (atomic_dec_and_test(a))					\
+		return 1;						\
+	name##_runlock();						\
+	return 0;							\
  }									\
  extern int name##_atomic_dec_and_wlock__failed(atomic_t *a);		\
  static inline int name##_atomic_dec_and_wlock(atomic_t *a) {		\
@@ -53,30 +61,51 @@
  }
 
 #define DEFINE_BRLOCK(name)						\
- DEFINE_PER_CPU(spinlock_t, name##_lock);				\
+ DEFINE_PER_CPU_ALIGNED(struct brlock_node, name##_lock);		\
+ static DEFINE_SPINLOCK(name##_wlock_lock);				\
+ static void name##_lock_init(void) {					\
+	int i;								\
+	for_each_possible_cpu(i) {					\
+		struct brlock_node *n;					\
+		n = &per_cpu(name##_lock, i);				\
+		n->rlocked = 0;						\
+		n->wlocked = 0;						\
+	}								\
+	spin_lock_init(&name##_wlock_lock);				\
+ }									\
  void name##_wlock(void) {						\
 	int i;								\
+	spin_lock(&name##_wlock_lock);					\
 	for_each_online_cpu(i) {					\
-		spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		spin_lock(lock);					\
+		struct brlock_node *n;					\
+		n = &per_cpu(name##_lock, i);				\
+		n->wlocked = 1;						\
 	}								\
+	smp_mb(); /* ensure wlocked store before rlocked load */	\
+	for_each_online_cpu(i) {					\
+		struct brlock_node *n;					\
+		n = &per_cpu(name##_lock, i);				\
+		while (n->rlocked)					\
+			cpu_relax();					\
+	}								\
+	smp_mb(); /* prevent crit ops leaking above rlocked check */	\
  }									\
  void name##_wunlock(void) {						\
 	int i;								\
+	smp_mb(); /* prevent crit ops leaking past wlocked = 0 */	\
 	for_each_online_cpu(i) {					\
-		spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		spin_unlock(lock);					\
+		struct brlock_node *n;					\
+		n = &per_cpu(name##_lock, i);				\
+		n->wlocked = 0;						\
 	}								\
+	spin_unlock(&name##_wlock_lock);				\
  }									\
  int name##_atomic_dec_and_wlock__failed(atomic_t *a) {			\
 	name##_wlock();							\
-	if (!atomic_dec_and_test(a)) {					\
-		name##_wunlock();					\
-		return 0;						\
-	}								\
-	return 1;							\
+	if (atomic_dec_and_test(a))					\
+		return 1;						\
+	name##_wunlock();						\
+	return 0;							\
  }
 
 #else

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 2/6] fs: no games with DCACHE_UNHASHED
  2009-10-15  4:40 ` [patch 2/6] fs: no games with DCACHE_UNHASHED npiggin
  2009-10-15  6:31   ` David Miller
@ 2009-10-15  7:44   ` Eric Dumazet
  2009-10-15  8:13     ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-10-15  7:44 UTC (permalink / raw)
  To: npiggin
  Cc: Al Viro, linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel,
	Miklos Szeredi

npiggin@suse.de wrote:
> (this is in -mm)
> 
> Filesystems outside the regular namespace do not have to clear DCACHE_UNHASHED
> in order to have a working /proc/$pid/fd/XXX. Nothing in proc prevents the
> fd link from being used if its dentry is not in the hash.
> 
> Also, it does not get put into the dcache hash if DCACHE_UNHASHED is clear;
> that depends on the filesystem calling d_add or d_rehash.
> 
> So delete the misleading comments and needless code.
> 

This was added in commit 304e61e6fbadec586dfe002b535f169a04248e49

    [PATCH] net: don't insert socket dentries into dentry_hashtable
    
    We currently insert socket dentries into the global dentry hashtable.  This
    is suboptimal because there is currently no way these entries can be used
    for a lookup().  (/proc/xxx/fd/xxx uses a different mechanism).  Inserting
    them in dentry hashtable slows dcache lookups.
    
    To let __dpath() still work correctly (ie not adding a " (deleted)") after
    dentry name, we do :
    
    - Right after d_alloc(), pretend they are hashed by clearing the
      DCACHE_UNHASHED bit.
    
    - Call d_instantiate() instead of d_add() : dentry is not inserted in
      hash table.
    
      __dpath() & friends work as intended during dentry lifetime.
    
    - At dismantle time, once dput() must clear the dentry, setting again
      DCACHE_UNHASHED bit inside the custom d_delete() function provided by
      socket code, so that dput() can just kill_it.
    
    Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Acked-by: "David S. Miller" <davem@davemloft.net>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>



Back in 2006, we had to perform this hack in order to not leak '(deleted)' in __d_path()

if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
	(prepend(&end, &buflen, " (deleted)", 10) != 0)) 
		goto Elong;

In current kernel this part became :

if (d_unlinked(dentry) &&
	(prepend(&end, &buflen, " (deleted)", 10) != 0))
		goto Elong;


So your cleanup seems good, thanks!

Acked-by: Eric Dumazet <dada1@cosmosbay.com>

* Re: [patch 2/6] fs: no games with DCACHE_UNHASHED
  2009-10-15  7:44   ` Eric Dumazet
@ 2009-10-15  8:13     ` Nick Piggin
  2009-10-15  8:29       ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2009-10-15  8:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Al Viro, linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel,
	Miklos Szeredi

On Thu, Oct 15, 2009 at 09:44:35AM +0200, Eric Dumazet wrote:
> npiggin@suse.de wrote:
> > (this is in -mm)
> > 
> > Filesystems outside the regular namespace do not have to clear DCACHE_UNHASHED
> > in order to have a working /proc/$pid/fd/XXX. Nothing in proc prevents the
> > fd link from being used if its dentry is not in the hash.
> > 
> > Also, it does not get put into the dcache hash if DCACHE_UNHASHED is clear;
> > that depends on the filesystem calling d_add or d_rehash.
> > 
> > So delete the misleading comments and needless code.
> > 
> 
> This was added in commit 304e61e6fbadec586dfe002b535f169a04248e49
> 
>     [PATCH] net: don't insert socket dentries into dentry_hashtable
>     
>     We currently insert socket dentries into the global dentry hashtable.  This
>     is suboptimal because there is currently no way these entries can be used
>     for a lookup().  (/proc/xxx/fd/xxx uses a different mechanism).  Inserting
>     them in dentry hashtable slows dcache lookups.
>     
>     To let __dpath() still work correctly (ie not adding a " (deleted)") after
>     dentry name, we do :
>     
>     - Right after d_alloc(), pretend they are hashed by clearing the
>       DCACHE_UNHASHED bit.
>     
>     - Call d_instantiate() instead of d_add() : dentry is not inserted in
>       hash table.
>     
>       __dpath() & friends work as intended during dentry lifetime.
>     
>     - At dismantle time, once dput() must clear the dentry, setting again
>       DCACHE_UNHASHED bit inside the custom d_delete() function provided by
>       socket code, so that dput() can just kill_it.
>     
>     Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>     Cc: Al Viro <viro@zeniv.linux.org.uk>
>     Acked-by: "David S. Miller" <davem@davemloft.net>
>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> 
> 
> Back in 2006, we had to perform this hack in order to not leak '(deleted)' in __d_path()
> 
> if (!IS_ROOT(dentry) && d_unhashed(dentry) &&
> 	(prepend(&end, &buflen, " (deleted)", 10) != 0)) 
> 		goto Elong;
> 
> In current kernel this part became :
> 
> if (d_unlinked(dentry) &&
> 	(prepend(&end, &buflen, " (deleted)", 10) != 0))
> 		goto Elong;
> 
> 
> So your cleanup seems good, thanks !
> 
> Acked-by: Eric Dumazet <dada1@cosmosbay.com>

Ahh, hmm, d_unlinked() is exactly the same code. OK, so I think this shows
why I'm an idiot. I didn't think about the __d_path() output. Though the
comments are still misleading, in my defence (and my test programs did not
use any anon_inode fds)...

Now both sockets and pipes define a d_dname so they are OK, but anon_inodes
does not. I think they should probably be made to just provide a d_dname
anyway so we can have the familiar format of "pseudofs:[ino]" rather than
"[pseudofs]" that we have now.

That should make this patch work for anon_inodes.c as well.


* Re: [patch 2/6] fs: no games with DCACHE_UNHASHED
  2009-10-15  8:13     ` Nick Piggin
@ 2009-10-15  8:29       ` Nick Piggin
  2009-10-15  9:13         ` Eric Dumazet
  2009-10-15 13:20         ` Matthew Wilcox
  0 siblings, 2 replies; 22+ messages in thread
From: Nick Piggin @ 2009-10-15  8:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Al Viro, linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel,
	Miklos Szeredi

On Thu, Oct 15, 2009 at 10:13:43AM +0200, Nick Piggin wrote:
> Now both sockets and pipes define a d_dname so they are OK, but anon_inodes
> does not. I think they should probably be made to just provide a d_dname
> anyway so we can have the familiar format of "pseudofs:[ino]" rather than
> "[pseudofs]" that we have now.
> 
> That should make this patch work for anon_inodes.c as well.

So what if we were to do this first? Are there any other reasons I've
missed? (btw. this changes the format of the link, if that is a problem.
Probably if we're going to do this change, we should change things like
[timerfd] to just timerfd while we're there to improve consistency
further.)
--
Pipe and socket pseudo filesystems report their file descriptor link
paths as pipe:[ino] and socket:[ino]. anon_inodefs allows subsystems
to specify a name, but it does not report an associated inode number.

Implement this with anon_inodefs_dname, in the same way as pipefs and
sockfs do.

---
 fs/anon_inodes.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c
+++ linux-2.6/fs/anon_inodes.c
@@ -45,6 +45,15 @@ static int anon_inodefs_delete_dentry(st
 	return 1;
 }
 
+/*
+ * anon_inodefs_dname() is called from d_path().
+ */
+static char *anon_inodefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
+				dentry->d_name.name, dentry->d_inode->i_ino);
+}
+
 static struct file_system_type anon_inode_fs_type = {
 	.name		= "anon_inodefs",
 	.get_sb		= anon_inodefs_get_sb,
@@ -52,6 +61,7 @@ static struct file_system_type anon_inod
 };
 static const struct dentry_operations anon_inodefs_dentry_operations = {
 	.d_delete	= anon_inodefs_delete_dentry,
+	.d_dname	= anon_inodefs_dname,
 };
 
 /*

* Re: [patch 2/6] fs: no games with DCACHE_UNHASHED
  2009-10-15  8:29       ` Nick Piggin
@ 2009-10-15  9:13         ` Eric Dumazet
  2009-10-15 13:20         ` Matthew Wilcox
  1 sibling, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2009-10-15  9:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Al Viro, linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel,
	Miklos Szeredi

Nick Piggin wrote:

> So what if we were to do this first? Are there any other reasons I've
> missed? (btw. this changes the format of the link, if that is a problem.
> Probably if we're going to do this change, we should change things like
> [timerfd] to just timerfd while we're there to improve consistency
> further.)
> --
> Pipe and socket pseudo filesystems report their file descriptor link
> paths as pipe:[ino] and socket:[ino]. anon_inodefs allows subsystems
> to specify a name, but it does not report an associated inode number.
> 
> Implement this with anon_inodefs_dname, in the same way as pipefs and
> sockfs do.
> 
> ---
>  fs/anon_inodes.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> Index: linux-2.6/fs/anon_inodes.c
> ===================================================================
> --- linux-2.6.orig/fs/anon_inodes.c
> +++ linux-2.6/fs/anon_inodes.c
> @@ -45,6 +45,15 @@ static int anon_inodefs_delete_dentry(st
>  	return 1;
>  }
>  
> +/*
> + * anon_inodefs_dname() is called from d_path().
> + */
> +static char *anon_inodefs_dname(struct dentry *dentry, char *buffer, int buflen)
> +{
> +	return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
> +				dentry->d_name.name, dentry->d_inode->i_ino);
> +}
> +
>  static struct file_system_type anon_inode_fs_type = {
>  	.name		= "anon_inodefs",
>  	.get_sb		= anon_inodefs_get_sb,
> @@ -52,6 +61,7 @@ static struct file_system_type anon_inod
>  };
>  static const struct dentry_operations anon_inodefs_dentry_operations = {
>  	.d_delete	= anon_inodefs_delete_dentry,
> +	.d_dname	= anon_inodefs_dname,
>  };
>  
>  /*

Yes, that would be fine, thanks Nick.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>


* Re: [patch 3/6] fs: dcache remove d_mounted
  2009-10-15  4:40 ` [patch 3/6] fs: dcache remove d_mounted npiggin
@ 2009-10-15 10:37   ` Ian Kent
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Kent @ 2009-10-15 10:37 UTC (permalink / raw)
  To: npiggin; +Cc: Al Viro, linux-fsdevel, Linus Torvalds, linux-kernel

On Thu, Oct 15, 2009 at 03:40:29PM +1100, npiggin@suse.de wrote:
> Rather than keep a d_mounted count in the dentry (which is only used to
> speculatively take a look in the mount hash table if it is non-zero), set a
> dentry flag instead. The flag can be cleared by checking the hash table to see
> if there are any mounts left. It is not time critical because it is performed
> at detach time.
> 
> This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I
> might use later (and some configs have larger than 32-bit spinlocks which might
> make use of the hole).
> 
> Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>:
> In autofs4, when expiring direct (or offset) mounts we need to ensure that we
> block user path walks into the autofs mount, which is covered by another mount.
> To do this we clear the mounted status so that follows stop before walking into
> the mount and are essentially blocked until the expire is completed. The
> automount daemon still finds the correct dentry for the umount due to the
> follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as
> an inode operation for direct and offset mounts only and is called following
> the lookup that stopped at the covered mount.
> 
> At the end of the expire the covering mount probably has gone away so the
> mounted status need not be restored. But we need to check this and only restore
> the mounted status if the expire failed.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>

For the autofs4 bits.

Acked-by: Ian Kent <raven@themaw.net>


* Re: [rfc][patch 4a/6] brlock: "fast" brlocks
  2009-10-15  6:58   ` [rfc][patch 4a/6] brlock: "fast" brlocks Nick Piggin
@ 2009-10-15 11:05     ` Peter Zijlstra
  2009-10-15 11:26       ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-10-15 11:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-arch, linux-fsdevel, Ian Kent, Linus Torvalds,
	linux-kernel, David Miller, Al Viro

On Thu, 2009-10-15 at 08:58 +0200, Nick Piggin wrote:
> [Not for merge. Stop reading if you're not interested in locking minutiae.]
> 
> OK, this is untested but I think the theory is right. Basically it is taking
> the idea from Dave M's cool brlock optimisation stuff with one further
> optimisation in that the read locker does not check the spinlock but
> rather we keep another wlocked variable together in the same cacheline per
> CPU, so the read locker only has to touch one cacheline rather than 2.
> 
> This actually will reduce the number of atomics by 2 per path lookup,
> however we have an smp_mb() there now which is really nasty on some
> architectures (like ia64 and ppc64), and not that nice on x86 either.
> We can probably do something interesting on ia64 and ppc64 so that we
> take advantage of the fact rlocked and wlocked are in the same cacheline
> so cache coherency (rather than memory consistency) should always provide
> a strict ordering there. We still do need an acquire barrier -- but it is
> a much nicer lwsync or st.acq on ppc and ia64.
> 
> But: is the avoidance of the atomic RMW a big win? On x86 cores I've tested
> IIRC mfence is about as costly as a locked instruction which includes the
> mfence...
> 
> So long story short: it might be a small win but it is going to be very
> arch specific and will require arch specific code to do the barriers and
> things. The generic spinlock brlock isn't bad at all, so I'll just post
> this as a curiosity for the time being.
>  

fwiw, I rather like this implementation better, and adding lockdep
annotations to this one shouldn't be hard.

* Re: [rfc][patch 4a/6] brlock: "fast" brlocks
  2009-10-15 11:05     ` Peter Zijlstra
@ 2009-10-15 11:26       ` Nick Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2009-10-15 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, linux-fsdevel, Ian Kent, Linus Torvalds,
	linux-kernel, David Miller, Al Viro

On Thu, Oct 15, 2009 at 01:05:21PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-10-15 at 08:58 +0200, Nick Piggin wrote:
> > [Not for merge. Stop reading if you're not interested in locking minutiae.]
> > 
> > OK, this is untested but I think the theory is right. Basically it is taking
> > the idea from Dave M's cool brlock optimisation stuff with one further
> > optimisation in that the read locker does not check the spinlock but
> > rather we keep another wlocked variable together in the same cacheline per
> > CPU, so the read locker only has to touch one cacheline rather than 2.
> > 
> > This actually will reduce the number of atomics by 2 per path lookup,
> > however we have an smp_mb() there now which is really nasty on some
> > architectures (like ia64 and ppc64), and not that nice on x86 either.
> > We can probably do something interesting on ia64 and ppc64 so that we
> > take advantage of the fact rlocked and wlocked are in the same cacheline
> > so cache coherency (rather than memory consistency) should always provide
> > a strict ordering there. We still do need an acquire barrier -- but it is
> > a much nicer lwsync or st.acq on ppc and ia64.
> > 
> > But: is the avoidance of the atomic RMW a big win? On x86 cores I've tested
> > IIRC mfence is about as costly as a locked instruction which includes the
> > mfence...
> > 
> > So long story short: it might be a small win but it is going to be very
> > arch specific and will require arch specific code to do the barriers and
> > things. The generic spinlock brlock isn't bad at all, so I'll just post
> > this as a curiosity for the time being.
> >  
> 
> fwiw, I rather like this implementation better, and adding lockdep
> annotations to this one shouldn't be hard.

OK, although there is nothing preventing us from using raw spinlocks
and a new lockdep object to annotate the other one, right?

The problem with this one is that, firstly, it is not suitable for a
generic implementation (see the XXX: we technically need smp_mb in the
unlock rather than smp_wmb -- I don't know if any actual CPUs do
this, but several architectures do allow stores to pass loads).

So we _really_ need to do proper acquire and release barriers both
here in the unlock and in the lock as well for it to be competitive
with the spinlocks.

smp_mb is a nasty hammer especially for release barrier because it
prevents loads from being started early until the unlock store is
visible. For acquire barrier it theoretically is not so bad, but in
practice like on powerpc they do it with an instruction that has to
apparently go out to the interconnect.


* Re: [patch 2/6] fs: no games with DCACHE_UNHASHED
  2009-10-15  8:29       ` Nick Piggin
  2009-10-15  9:13         ` Eric Dumazet
@ 2009-10-15 13:20         ` Matthew Wilcox
  2009-10-15 14:41           ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2009-10-15 13:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Al Viro, linux-fsdevel, Ian Kent, Linus Torvalds,
	linux-kernel, Miklos Szeredi

On Thu, Oct 15, 2009 at 10:29:03AM +0200, Nick Piggin wrote:
> On Thu, Oct 15, 2009 at 10:13:43AM +0200, Nick Piggin wrote:
> > Now both sockets and pipes define a d_dname so they are OK, but anon_inodes
> > does not. I think they should probably be made to just provide a d_dname
> > anyway so we can have the familiar format of "pseudofs:[ino]" rather than
> > "[pseudofs]" that we have now.
> > 
> > That should make this patch work for anon_inodes.c as well.
> 
> So what if we were to do this first? Are there any other reasons I've
> missed? (btw. this changes the format of the link, if that is a problem.
> Probably if we're going to do this change, we should change things like
> [timerfd] to just timerfd while we're there to improve consistency
> further.)
> --
> Pipe and socket pseudo filesystems report their file descriptor link
> paths as pipe:[ino] and socket:[ino]. anon_inodefs allows subsystems
> to specify a name, but it does not report an associated inode number.

I think that's because the inode number is always 0 -- there's just one
anonymous inode.

> Implement this with anon_inodefs_dname, in the same way as pipefs and
> sockfs do.
> 
> ---
>  fs/anon_inodes.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> Index: linux-2.6/fs/anon_inodes.c
> ===================================================================
> --- linux-2.6.orig/fs/anon_inodes.c
> +++ linux-2.6/fs/anon_inodes.c
> @@ -45,6 +45,15 @@ static int anon_inodefs_delete_dentry(st
>  	return 1;
>  }
>  
> +/*
> + * anon_inodefs_dname() is called from d_path().
> + */
> +static char *anon_inodefs_dname(struct dentry *dentry, char *buffer, int buflen)
> +{
> +	return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
> +				dentry->d_name.name, dentry->d_inode->i_ino);
> +}
> +
>  static struct file_system_type anon_inode_fs_type = {
>  	.name		= "anon_inodefs",
>  	.get_sb		= anon_inodefs_get_sb,
> @@ -52,6 +61,7 @@ static struct file_system_type anon_inod
>  };
>  static const struct dentry_operations anon_inodefs_dentry_operations = {
>  	.d_delete	= anon_inodefs_delete_dentry,
> +	.d_dname	= anon_inodefs_dname,
>  };
>  
>  /*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

* Re: [patch 2/6] fs: no games with DCACHE_UNHASHED
  2009-10-15 13:20         ` Matthew Wilcox
@ 2009-10-15 14:41           ` Nick Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2009-10-15 14:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Eric Dumazet, Al Viro, linux-fsdevel, Ian Kent, Linus Torvalds,
	linux-kernel, Miklos Szeredi

On Thu, Oct 15, 2009 at 07:20:45AM -0600, Matthew Wilcox wrote:
> On Thu, Oct 15, 2009 at 10:29:03AM +0200, Nick Piggin wrote:
> > On Thu, Oct 15, 2009 at 10:13:43AM +0200, Nick Piggin wrote:
> > > Now both sockets and pipes define a d_dname so they are OK, but anon_inodes
> > > does not. I think they should probably be made to just provide a d_dname
> > > anyway so we can have the familiar format of "pseudofs:[ino]" rather than
> > > "[pseudofs]" that we have now.
> > > 
> > > That should make this patch work for anon_inodes.c as well.
> > 
> > So what if we were to do this first? Are there any other reasons I've
> > missed? (btw. this changes the format of the link, if that is a problem.
> > Probably if we're going to do this change, we should change things like
> > [timerfd] to just timerfd while we're there to improve consistency
> > further.)
> > --
> > Pipe and socket pseudo filesystems report their file descriptor link
> > paths as pipe:[ino] and socket:[ino]. anon_inodefs allows subsystems
> > to specify a name, but it does not report an associated inode number.
> 
> I think that's because the inode number is always 0 -- there's just one
> anonymous inode.

OIC. Hmm, well we could give them a unique number I guess. But anyway,
for now let's just make anon_inodefs_dname print dentry->d_name.name.

> 
> > Implement this with anon_inodefs_dname, in the same way as pipefs and
> > sockfs do.
> > 
> > ---
> >  fs/anon_inodes.c |   10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > Index: linux-2.6/fs/anon_inodes.c
> > ===================================================================
> > --- linux-2.6.orig/fs/anon_inodes.c
> > +++ linux-2.6/fs/anon_inodes.c
> > @@ -45,6 +45,15 @@ static int anon_inodefs_delete_dentry(st
> >  	return 1;
> >  }
> >  
> > +/*
> > + * anon_inodefs_dname() is called from d_path().
> > + */
> > +static char *anon_inodefs_dname(struct dentry *dentry, char *buffer, int buflen)
> > +{
> > +	return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
> > +				dentry->d_name.name, dentry->d_inode->i_ino);
> > +}
> > +
> >  static struct file_system_type anon_inode_fs_type = {
> >  	.name		= "anon_inodefs",
> >  	.get_sb		= anon_inodefs_get_sb,
> > @@ -52,6 +61,7 @@ static struct file_system_type anon_inod
> >  };
> >  static const struct dentry_operations anon_inodefs_dentry_operations = {
> >  	.d_delete	= anon_inodefs_delete_dentry,
> > +	.d_dname	= anon_inodefs_dname,
> >  };
> >  
> >  /*

* Re: [patch 4/6] brlock: introduce special brlocks
  2009-10-15  4:40 ` [patch 4/6] brlock: introduce special brlocks npiggin
  2009-10-15  6:58   ` [rfc][patch 4a/6] brlock: "fast" brlocks Nick Piggin
@ 2009-10-19  5:25   ` Andrew Morton
  2009-10-19  9:49     ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-10-19  5:25 UTC (permalink / raw)
  To: npiggin; +Cc: Al Viro, linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel

On Thu, 15 Oct 2009 15:40:30 +1100 npiggin@suse.de wrote:

> +#define DECLARE_BRLOCK(name)						\

This:

> + DECLARE_PER_CPU(spinlock_t, name##_lock);				\
> + static inline void name##_lock_init(void) {				\
> +	int i;								\
> +	for_each_possible_cpu(i) {					\
> +		spinlock_t *lock;					\
> +		lock = &per_cpu(name##_lock, i);			\
> +		spin_lock_init(lock);					\
> +	}								\
> + }									\
> + static inline void name##_rlock(void) {				\
> +	spinlock_t *lock;						\
> +	lock = &get_cpu_var(name##_lock);				\
> +	spin_lock(lock);						\
> + }									\

generates a definition, not a declaration.  Hence DEFINE_BRLOCK.

</petpeeve #29>

* Re: [patch 4/6] brlock: introduce special brlocks
  2009-10-19  5:25   ` [patch 4/6] brlock: introduce special brlocks Andrew Morton
@ 2009-10-19  9:49     ` Nick Piggin
  2009-10-19 12:24       ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2009-10-19  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel

On Sun, Oct 18, 2009 at 10:25:12PM -0700, Andrew Morton wrote:
> On Thu, 15 Oct 2009 15:40:30 +1100 npiggin@suse.de wrote:
> 
> > +#define DECLARE_BRLOCK(name)						\
> 
> This:
> 
> > + DECLARE_PER_CPU(spinlock_t, name##_lock);				\
> > + static inline void name##_lock_init(void) {				\
> > +	int i;								\
> > +	for_each_possible_cpu(i) {					\
> > +		spinlock_t *lock;					\
> > +		lock = &per_cpu(name##_lock, i);			\
> > +		spin_lock_init(lock);					\
> > +	}								\
> > + }									\
> > + static inline void name##_rlock(void) {				\
> > +	spinlock_t *lock;						\
> > +	lock = &get_cpu_var(name##_lock);				\
> > +	spin_lock(lock);						\
> > + }									\
> 
> generates a definition, not a declaration.  Hence DEFINE_BRLOCK.
> 
> </petpeeve #29>

Well yes, but being a static inline, then I don't know of a better 
way. Probably just better not to pretend we are expanding a simple
declaration here, and name it something differently? (BRLOCK_HEADER(blah))?


* Re: [patch 4/6] brlock: introduce special brlocks
  2009-10-19  9:49     ` Nick Piggin
@ 2009-10-19 12:24       ` Andrew Morton
  2009-10-19 12:48         ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-10-19 12:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Al Viro, linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel

On Mon, 19 Oct 2009 11:49:09 +0200 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Oct 18, 2009 at 10:25:12PM -0700, Andrew Morton wrote:
> > On Thu, 15 Oct 2009 15:40:30 +1100 npiggin@suse.de wrote:
> > 
> > > +#define DECLARE_BRLOCK(name)						\
> > 
> > This:
> > 
> > > + DECLARE_PER_CPU(spinlock_t, name##_lock);				\
> > > + static inline void name##_lock_init(void) {				\
> > > +	int i;								\
> > > +	for_each_possible_cpu(i) {					\
> > > +		spinlock_t *lock;					\
> > > +		lock = &per_cpu(name##_lock, i);			\
> > > +		spin_lock_init(lock);					\
> > > +	}								\
> > > + }									\
> > > + static inline void name##_rlock(void) {				\
> > > +	spinlock_t *lock;						\
> > > +	lock = &get_cpu_var(name##_lock);				\
> > > +	spin_lock(lock);						\
> > > + }									\
> > 
> > generates a definition, not a declaration.  Hence DEFINE_BRLOCK.
> > 
> > </petpeeve #29>
> 
> Well yes, but being a static inline, then I don't know of a better 
> way. Probably just better not to pretend we are expanding a simple
> declaration here, and name it something differently? (BRLOCK_HEADER(blah))?

DEFINE_BRLOCK(blah)

* Re: [patch 4/6] brlock: introduce special brlocks
  2009-10-19 12:24       ` Andrew Morton
@ 2009-10-19 12:48         ` Nick Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2009-10-19 12:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, linux-fsdevel, Ian Kent, Linus Torvalds, linux-kernel

On Mon, Oct 19, 2009 at 05:24:00AM -0700, Andrew Morton wrote:
> On Mon, 19 Oct 2009 11:49:09 +0200 Nick Piggin <npiggin@suse.de> wrote:
> > > generates a definition, not a declaration.  Hence DEFINE_BRLOCK.
> > > 
> > > </petpeeve #29>
> > 
> > Well yes, but being a static inline, then I don't know of a better 
> > way. Probably just better not to pretend we are expanding a simple
> > declaration here, and name it something differently? (BRLOCK_HEADER(blah))?
> 
> DEFINE_BRLOCK(blah)

Well, I use DEFINE_BRLOCK for the .c file definitions, which include
non-static, non-inline functions, so you can't put it in a .h. So
AFAIKS you need both. Although DECLARE_BRLOCK is not strictly for
declarations because of those static inline functions, so I agree the
name is not ideal.


end of thread, other threads:[~2009-10-19 12:48 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-15  4:40 [patch 0/6] vfsmount scaling and other bits npiggin
2009-10-15  4:40 ` [patch 1/6] fs: invalidate sb->s_bdev on remount,ro npiggin
2009-10-15  4:40 ` [patch 2/6] fs: no games with DCACHE_UNHASHED npiggin
2009-10-15  6:31   ` David Miller
2009-10-15  7:44   ` Eric Dumazet
2009-10-15  8:13     ` Nick Piggin
2009-10-15  8:29       ` Nick Piggin
2009-10-15  9:13         ` Eric Dumazet
2009-10-15 13:20         ` Matthew Wilcox
2009-10-15 14:41           ` Nick Piggin
2009-10-15  4:40 ` [patch 3/6] fs: dcache remove d_mounted npiggin
2009-10-15 10:37   ` Ian Kent
2009-10-15  4:40 ` [patch 4/6] brlock: introduce special brlocks npiggin
2009-10-15  6:58   ` [rfc][patch 4a/6] brlock: "fast" brlocks Nick Piggin
2009-10-15 11:05     ` Peter Zijlstra
2009-10-15 11:26       ` Nick Piggin
2009-10-19  5:25   ` [patch 4/6] brlock: introduce special brlocks Andrew Morton
2009-10-19  9:49     ` Nick Piggin
2009-10-19 12:24       ` Andrew Morton
2009-10-19 12:48         ` Nick Piggin
2009-10-15  4:40 ` [patch 5/6] fs: brlock vfsmount_lock npiggin
2009-10-15  4:40 ` [patch 6/6] fs: scale mntget/mntput npiggin
