All of lore.kernel.org
 help / color / mirror / Atom feed
* rcu-walk and dcache scaling tree update and status
@ 2010-12-13  2:37 Nick Piggin
  2010-12-13  2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Nick Piggin @ 2010-12-13  2:37 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell,
	linux-fsdevel, linux-kernel

The vfs-scale branch has had some progress, but it is now requiring
wider testing and detailed review, particularly of the fine details of
dcache_lock lifting, and rcu-walk synchronisation details and
documentation.

Linus has suggested pretty strongly that he wants to pull this in the
next merge window (recently, that "inodes will be RCU freed in 2.6.38"
in an urelated discussion). As far as I know, that's what he's going to
do. I'd like to get this some time in linux-next to improve test
coverage (many filesystems I can't even test, so there are bound to be a
few silly crashes). Stephen, how do I arrange that?

>From my point of view, it has had nowhere near enough review,
particularly I want Al to be happy with it, filesystem changes looked at
and tested by respective fs maintainers, and anybody who is good at
concurrency. However, if Linus still wants to merge it to kick things
along, I am not going to stop him this time, because I have no known
bugs or pending changes required.

I, like everybody else, would prefer bugs or design flaws to be found
*before* it goes upstream, of course. I would be happy to spend time on
irc with reviewers (ask me offline). And if anybody has reasonable
concerns or suggestions, I will be happy to take that into account. I
will not flame anybody who reads my replies, even if it takes a while
for one or both of us to understand.

Documentation/filesystems/path-lookup.txt is a good place to start
reviewing the fun stuff. I would much appreciate review of documentation
and comments too, if anything is not clear, omitted, or not matching the
code.

Also, please keep an eye on the end result when reviewing patches.
Particularly the locking patches before dcache_lock is lifted, these are
supposed to provide a lock coverage to lift dcache_lock with minimal
complexity. They are not supposed to be nice looking code that you'd
want to run on your production box, they are supposed to be nice
changesets (from a review and verification point of view).

Git tree is here:

  git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git

Branch is:

    vfs-scale-working

Changes since last posting:
* Add a lot more comments for rcu-walk code and functions
* Fix reported d_compare vfat crash
* Incorporate review suggestions
* Make rcu-walk bail out if we have to call a security subsystem
* Fix for filesystems rewriting dentry name in-place
* Audit d_seq barrier write-side, add a few places where it was missing
* Optimised dentry memcmp

Testing:
Testing filesystems is difficult, particularly remote filesystems, and
ones without mkfs packaged in debian. I'm running ltp and xfstests among
others, but those cover a tiny portion of what you can do with the
dcache. The more testing the merrier.

I have been unable to break anything for a long time, but the race
windows can be tiny. I've been trying to insert random delays into
different parts of critical sections, and write tests specifically
targetting particular races, but that's slow going -- review of locking,
or testing on different configurations should be much more productive.

Final note:
You won't be able to reproduce the parallel path walk scalability
numbers that I've posted, because the vfsmount refcounting scalability
patch is not included. I have a new idea for that now, so I'll be asking
for comments with that soon.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)
  2010-12-13  2:37 rcu-walk and dcache scaling tree update and status Nick Piggin
@ 2010-12-13  2:42 ` Nick Piggin
  2010-12-13  3:31   ` Nick Piggin
  2010-12-13  7:25   ` Eric Dumazet
  2010-12-13  2:53 ` rcu-walk and dcache scaling tree update and status Ed Tomlinson
  2010-12-13  3:40 ` Stephen Rothwell
  2 siblings, 2 replies; 17+ messages in thread
From: Nick Piggin @ 2010-12-13  2:42 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell,
	linux-fsdevel, linux-kernel

On Mon, Dec 13, 2010 at 01:37:33PM +1100, Nick Piggin wrote:
> Final note:
> You won't be able to reproduce the parallel path walk scalability
> numbers that I've posted, because the vfsmount refcounting scalability
> patch is not included. I have a new idea for that now, so I'll be asking
> for comments with that soon.

Here is the patch I've been using, which works but has the problem
described in the changelog. But it works nicely for testing.

As I said, I have a promising approach to solving the problem.

fs: scale mntget/mntput

Improve scalability of mntget/mntput by using per-cpu counters protected by the
reader side of the brlock vfsmount_lock. If the mnt_hash field of the vfsmount
structure is attached to a list, then it is mounted which contributes to its
refcount, so the per-cpu counters need not be summed.

MNT_PSEUDO keeps track of whether the vfsmount is actually a pseudo filesystem
that will never be attached (such as sockfs).

No extra atomics in the common case because atomic mnt refcount is now replaced
with per-CPU spinlock. Code will be bigger and more complex however. With the
previous per-cpu locking patch, mount lookups and common case refcounting are
now per-cpu and should be ideally scalable. path lookups (and hence
path_get/path_put) within the same vfsmount should now be more scalable,
however this will often be hidden by dcache_lock on final dput, and d_lock on
common path elements (eg. cwd or root dentry).

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

[Note: this is not for merging. Un-attached operation (lazy umount) may not be
 uncommon and will be slowed down and actually have worse scalablilty after
 this patch. I need to think about how to do fast refcounting with unattached
 mounts.]

---
 drivers/mtd/mtdchar.c |    1 
 fs/internal.h         |    1 
 fs/libfs.c            |    1 
 fs/namespace.c        |  167 +++++++++++++++++++++++++++++++++++++++++++-------
 fs/pnode.c            |    4 -
 include/linux/mount.h |   26 +------
 6 files changed, 154 insertions(+), 46 deletions(-)

Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c	2010-12-12 03:48:57.000000000 +1100
+++ linux-2.6/fs/namespace.c	2010-12-12 03:51:52.000000000 +1100
@@ -138,6 +138,64 @@ void mnt_release_group_id(struct vfsmoun
 	mnt->mnt_group_id = 0;
 }
 
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void add_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+	(*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n;
+#else
+	mnt->mnt_count += n;
+#endif
+}
+
+static inline void set_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+	preempt_disable();
+	(*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) = n;
+	preempt_enable();
+#else
+	mnt->mnt_count = n;
+#endif
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void inc_mnt_count(struct vfsmount *mnt)
+{
+	add_mnt_count(mnt, 1);
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void dec_mnt_count(struct vfsmount *mnt)
+{
+	add_mnt_count(mnt, -1);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
+unsigned int count_mnt_count(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+	unsigned int count = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		count += *per_cpu_ptr(mnt->mnt_count, cpu);
+	}
+
+	return count;
+#else
+	return mnt->mnt_count;
+#endif
+}
+
 struct vfsmount *alloc_vfsmnt(const char *name)
 {
 	struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -154,7 +212,15 @@ struct vfsmount *alloc_vfsmnt(const char
 				goto out_free_id;
 		}
 
-		atomic_set(&mnt->mnt_count, 1);
+#ifdef CONFIG_SMP
+		mnt->mnt_count = alloc_percpu(int);
+		if (!mnt->mnt_count)
+			goto out_free_devname;
+#else
+		mnt->mnt_count = 0;
+#endif
+		set_mnt_count(mnt, 1);
+
 		INIT_LIST_HEAD(&mnt->mnt_hash);
 		INIT_LIST_HEAD(&mnt->mnt_child);
 		INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -169,7 +235,7 @@ struct vfsmount *alloc_vfsmnt(const char
 #ifdef CONFIG_SMP
 		mnt->mnt_writers = alloc_percpu(int);
 		if (!mnt->mnt_writers)
-			goto out_free_devname;
+			goto out_free_mntcount;
 #else
 		mnt->mnt_writers = 0;
 #endif
@@ -177,6 +243,8 @@ struct vfsmount *alloc_vfsmnt(const char
 	return mnt;
 
 #ifdef CONFIG_SMP
+out_free_mntcount:
+	free_percpu(mnt->mnt_count);
 out_free_devname:
 	kfree(mnt->mnt_devname);
 #endif
@@ -662,8 +730,8 @@ static inline void __mntput(struct vfsmo
 	 * to make r/w->r/o transitions.
 	 */
 	/*
-	 * atomic_dec_and_lock() used to deal with ->mnt_count decrements
-	 * provides barriers, so count_mnt_writers() below is safe.  AV
+	 * The locking used to deal with mnt_count decrement provides barriers,
+	 * so count_mnt_writers() below is safe.
 	 */
 	WARN_ON(count_mnt_writers(mnt));
 	fsnotify_vfsmount_delete(mnt);
@@ -675,45 +743,76 @@ static inline void __mntput(struct vfsmo
 void mntput_no_expire(struct vfsmount *mnt)
 {
 repeat:
-	if (atomic_add_unless(&mnt->mnt_count, -1, 1))
+	if (likely(!list_empty(&mnt->mnt_hash) ||
+				mnt->mnt_flags & MNT_PSEUDO)) {
+		br_read_lock(vfsmount_lock);
+		if (unlikely(list_empty(&mnt->mnt_hash) &&
+				(!(mnt->mnt_flags & MNT_PSEUDO)))) {
+			br_read_unlock(vfsmount_lock);
+			goto repeat;
+		}
+		dec_mnt_count(mnt);
+		br_read_unlock(vfsmount_lock);
 		return;
+	}
+
 	br_write_lock(vfsmount_lock);
-	if (!atomic_dec_and_test(&mnt->mnt_count)) {
+	dec_mnt_count(mnt);
+	if (count_mnt_count(mnt)) {
 		br_write_unlock(vfsmount_lock);
 		return;
 	}
-	if (likely(!mnt->mnt_pinned)) {
+	if (unlikely(mnt->mnt_pinned)) {
+		add_mnt_count(mnt, mnt->mnt_pinned + 1);
+		mnt->mnt_pinned = 0;
 		br_write_unlock(vfsmount_lock);
-		__mntput(mnt);
-		return;
+		acct_auto_close_mnt(mnt);
+		goto repeat;
 	}
-	atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
-	mnt->mnt_pinned = 0;
 	br_write_unlock(vfsmount_lock);
-	acct_auto_close_mnt(mnt);
-	goto repeat;
+	__mntput(mnt);
 }
 EXPORT_SYMBOL(mntput_no_expire);
 
+void mntput(struct vfsmount *mnt)
+{
+	if (mnt) {
+		/* avoid cacheline pingpong, hope gcc doesn't get "smart" */
+		if (unlikely(mnt->mnt_expiry_mark))
+			mnt->mnt_expiry_mark = 0;
+		mntput_no_expire(mnt);
+	}
+}
+EXPORT_SYMBOL(mntput);
+
+struct vfsmount *mntget(struct vfsmount *mnt)
+{
+	if (mnt) {
+		preempt_disable();
+		inc_mnt_count(mnt);
+		preempt_enable();
+	}
+	return mnt;
+}
+EXPORT_SYMBOL(mntget);
+
 void mnt_pin(struct vfsmount *mnt)
 {
 	br_write_lock(vfsmount_lock);
 	mnt->mnt_pinned++;
 	br_write_unlock(vfsmount_lock);
 }
-
 EXPORT_SYMBOL(mnt_pin);
 
 void mnt_unpin(struct vfsmount *mnt)
 {
 	br_write_lock(vfsmount_lock);
 	if (mnt->mnt_pinned) {
-		atomic_inc(&mnt->mnt_count);
+		inc_mnt_count(mnt);
 		mnt->mnt_pinned--;
 	}
 	br_write_unlock(vfsmount_lock);
 }
-
 EXPORT_SYMBOL(mnt_unpin);
 
 static inline void mangle(struct seq_file *m, const char *s)
@@ -1008,12 +1107,13 @@ int may_umount_tree(struct vfsmount *mnt
 	int minimum_refs = 0;
 	struct vfsmount *p;
 
-	br_read_lock(vfsmount_lock);
+	/* write lock needed for count_mnt_count */
+	br_write_lock(vfsmount_lock);
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
-		actual_refs += atomic_read(&p->mnt_count);
+		actual_refs += count_mnt_count(p);
 		minimum_refs += 2;
 	}
-	br_read_unlock(vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 
 	if (actual_refs > minimum_refs)
 		return 0;
@@ -1040,10 +1140,10 @@ int may_umount(struct vfsmount *mnt)
 {
 	int ret = 1;
 	down_read(&namespace_sem);
-	br_read_lock(vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	if (propagate_mount_busy(mnt, 2))
 		ret = 0;
-	br_read_unlock(vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	up_read(&namespace_sem);
 	return ret;
 }
@@ -1125,8 +1225,16 @@ static int do_umount(struct vfsmount *mn
 		    flags & (MNT_FORCE | MNT_DETACH))
 			return -EINVAL;
 
-		if (atomic_read(&mnt->mnt_count) != 2)
+		/*
+		 * probably don't strictly need the lock here if we examined
+		 * all race cases, but it's a slowpath.
+		 */
+		br_write_lock(vfsmount_lock);
+		if (count_mnt_count(mnt) != 2) {
+			br_write_lock(vfsmount_lock);
 			return -EBUSY;
+		}
+		br_write_unlock(vfsmount_lock);
 
 		if (!xchg(&mnt->mnt_expiry_mark, 1))
 			return -EAGAIN;
@@ -2350,6 +2458,12 @@ SYSCALL_DEFINE2(pivot_root, const char _
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
 	br_write_unlock(vfsmount_lock);
 	chroot_fs_refs(&root, &new);
+
+	/* Drop MNT_PSEUDO from old, add it to new. See init_mount_tree */
+	BUG_ON(!(root.mnt->mnt_flags & MNT_PSEUDO));
+	root.mnt->mnt_flags &= ~MNT_PSEUDO;
+	new.mnt->mnt_flags |= MNT_PSEUDO;
+
 	error = 0;
 	path_put(&root_parent);
 	path_put(&parent_path);
@@ -2376,6 +2490,13 @@ static void __init init_mount_tree(void)
 	mnt = do_kern_mount("rootfs", 0, "rootfs", NULL);
 	if (IS_ERR(mnt))
 		panic("Can't create rootfs");
+	/*
+	 * MNT_PSEUDO tells mnt refcounting that we're pinned, so don't
+	 * bother checking for zero references. Give one of these to root
+	 * because it isn't "attached" to the tree. See mntput().
+	 */
+	mnt->mnt_flags |= MNT_PSEUDO;
+
 	ns = create_mnt_ns(mnt);
 	if (IS_ERR(ns))
 		panic("Can't allocate initial namespace");
Index: linux-2.6/include/linux/mount.h
===================================================================
--- linux-2.6.orig/include/linux/mount.h	2010-12-12 03:27:08.000000000 +1100
+++ linux-2.6/include/linux/mount.h	2010-12-12 03:51:52.000000000 +1100
@@ -30,6 +30,7 @@ struct mnt_namespace;
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_WRITE_HOLD	0x200
+#define MNT_PSEUDO	0x400
 
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
@@ -70,19 +71,15 @@ struct vfsmount {
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
 	int mnt_id;			/* mount identifier */
 	int mnt_group_id;		/* peer group identifier */
-	/*
-	 * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount
-	 * to let these frequently modified fields in a separate cache line
-	 * (so that reads of mnt_flags wont ping-pong on SMP machines)
-	 */
-	atomic_t mnt_count;
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	int mnt_pinned;
 	int mnt_ghosts;
 #ifdef CONFIG_SMP
 	int __percpu *mnt_writers;
+	int __percpu *mnt_count;
 #else
 	int mnt_writers;
+	int mnt_count;
 #endif
 };
 
@@ -95,13 +92,6 @@ static inline int *get_mnt_writers_ptr(s
 #endif
 }
 
-static inline struct vfsmount *mntget(struct vfsmount *mnt)
-{
-	if (mnt)
-		atomic_inc(&mnt->mnt_count);
-	return mnt;
-}
-
 struct file; /* forward dec */
 
 extern int mnt_want_write(struct vfsmount *mnt);
@@ -109,18 +99,12 @@ extern int mnt_want_write_file(struct fi
 extern int mnt_clone_write(struct vfsmount *mnt);
 extern void mnt_drop_write(struct vfsmount *mnt);
 extern void mntput_no_expire(struct vfsmount *mnt);
+extern void mntput(struct vfsmount *mnt);
+extern struct vfsmount *mntget(struct vfsmount *mnt);
 extern void mnt_pin(struct vfsmount *mnt);
 extern void mnt_unpin(struct vfsmount *mnt);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
 
-static inline void mntput(struct vfsmount *mnt)
-{
-	if (mnt) {
-		mnt->mnt_expiry_mark = 0;
-		mntput_no_expire(mnt);
-	}
-}
-
 extern struct vfsmount *do_kern_mount(const char *fstype, int flags,
 				      const char *name, void *data);
 
Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c	2010-12-12 03:27:08.000000000 +1100
+++ linux-2.6/fs/pnode.c	2010-12-12 03:51:52.000000000 +1100
@@ -288,7 +288,7 @@ int propagate_mnt(struct vfsmount *dest_
  */
 static inline int do_refcount_check(struct vfsmount *mnt, int count)
 {
-	int mycount = atomic_read(&mnt->mnt_count) - mnt->mnt_ghosts;
+	int mycount = count_mnt_count(mnt) - mnt->mnt_ghosts;
 	return (mycount > count);
 }
 
@@ -300,7 +300,7 @@ static inline int do_refcount_check(stru
  * Check if any of these mounts that **do not have submounts**
  * have more references than 'refcnt'. If so return busy.
  *
- * vfsmount lock must be held for read or write
+ * vfsmount lock must be held for write
  */
 int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
 {
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h	2010-12-12 03:27:08.000000000 +1100
+++ linux-2.6/fs/internal.h	2010-12-12 03:51:52.000000000 +1100
@@ -63,6 +63,7 @@ extern int copy_mount_string(const void
 
 extern void free_vfsmnt(struct vfsmount *);
 extern struct vfsmount *alloc_vfsmnt(const char *);
+extern unsigned int count_mnt_count(struct vfsmount *mnt);
 extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
 extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *,
 				struct vfsmount *);
Index: linux-2.6/drivers/mtd/mtdchar.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdchar.c	2010-12-12 03:27:08.000000000 +1100
+++ linux-2.6/drivers/mtd/mtdchar.c	2010-12-12 03:51:52.000000000 +1100
@@ -1201,6 +1201,7 @@ static int __init init_mtdchar(void)
 static void __exit cleanup_mtdchar(void)
 {
 	unregister_mtd_user(&mtdchar_notifier);
+	mtd_inode_mnt->mnt_flags &= ~MNT_PSEUDO;
 	mntput(mtd_inode_mnt);
 	unregister_filesystem(&mtd_inodefs_type);
 	__unregister_chrdev(MTD_CHAR_MAJOR, 0, 1 << MINORBITS, "mtd");
Index: linux-2.6/arch/ia64/kernel/perfmon.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/perfmon.c	2010-12-12 03:48:57.000000000 +1100
+++ linux-2.6/arch/ia64/kernel/perfmon.c	2010-12-12 03:51:52.000000000 +1100
@@ -1553,8 +1553,10 @@ init_pfm_fs(void)
 		err = PTR_ERR(pfmfs_mnt);
 		if (IS_ERR(pfmfs_mnt))
 			unregister_filesystem(&pfm_fs_type);
-		else
+		else {
 			err = 0;
+			pfmfs_mnt->mnt_flags |= MNT_PSEUDO;
+		}
 	}
 	return err;
 }
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c	2010-12-12 03:51:50.000000000 +1100
+++ linux-2.6/fs/anon_inodes.c	2010-12-12 03:51:52.000000000 +1100
@@ -223,6 +223,7 @@ static int __init anon_inode_init(void)
 		error = PTR_ERR(anon_inode_mnt);
 		goto err_unregister_filesystem;
 	}
+	anon_inode_mnt->mnt_flags |= MNT_PSEUDO;
 	anon_inode_inode = anon_inode_mkinode();
 	if (IS_ERR(anon_inode_inode)) {
 		error = PTR_ERR(anon_inode_inode);
@@ -232,6 +233,7 @@ static int __init anon_inode_init(void)
 	return 0;
 
 err_mntput:
+	anon_inode_mnt->mnt_flags &= ~MNT_PSEUDO;
 	mntput(anon_inode_mnt);
 err_unregister_filesystem:
 	unregister_filesystem(&anon_inode_fs_type);
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2010-12-12 03:27:08.000000000 +1100
+++ linux-2.6/fs/block_dev.c	2010-12-12 03:51:52.000000000 +1100
@@ -499,6 +499,7 @@ void __init bdev_cache_init(void)
 	bd_mnt = kern_mount(&bd_type);
 	if (IS_ERR(bd_mnt))
 		panic("Cannot create bdev pseudo-fs");
+	bd_mnt->mnt_flags |= MNT_PSEUDO;
 	/*
 	 * This vfsmount structure is only used to obtain the
 	 * blockdev_superblock, so tell kmemleak not to report it.
Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c	2010-12-12 03:51:50.000000000 +1100
+++ linux-2.6/fs/pipe.c	2010-12-12 03:51:52.000000000 +1100
@@ -1285,6 +1285,7 @@ static int __init init_pipe_fs(void)
 			err = PTR_ERR(pipe_mnt);
 			unregister_filesystem(&pipe_fs_type);
 		}
+		pipe_mnt->mnt_flags |= MNT_PSEUDO;
 	}
 	return err;
 }
@@ -1292,6 +1293,7 @@ static int __init init_pipe_fs(void)
 static void __exit exit_pipe_fs(void)
 {
 	unregister_filesystem(&pipe_fs_type);
+	pipe_mnt->mnt_flags &= ~MNT_PSEUDO;
 	mntput(pipe_mnt);
 }
 
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c	2010-12-12 03:51:50.000000000 +1100
+++ linux-2.6/net/socket.c	2010-12-12 03:51:52.000000000 +1100
@@ -2375,6 +2375,8 @@ EXPORT_SYMBOL(sock_unregister);
 
 static int __init sock_init(void)
 {
+	int err;
+
 	/*
 	 *      Initialize sock SLAB cache.
 	 */
@@ -2391,8 +2393,16 @@ static int __init sock_init(void)
 	 */
 
 	init_inodecache();
-	register_filesystem(&sock_fs_type);
+
+	err = register_filesystem(&sock_fs_type);
+	if (err)
+		goto out_fs;
 	sock_mnt = kern_mount(&sock_fs_type);
+	if (IS_ERR(sock_mnt)) {
+		err = PTR_ERR(sock_mnt);
+		goto out_mount;
+	}
+	sock_mnt->mnt_flags |= MNT_PSEUDO;
 
 	/* The real protocol initialization is performed in later initcalls.
 	 */
@@ -2405,7 +2415,13 @@ static int __init sock_init(void)
 	skb_timestamping_init();
 #endif
 
-	return 0;
+out:
+	return err;
+
+out_mount:
+	unregister_filesystem(&sock_fs_type);
+out_fs:
+	goto out;
 }
 
 core_initcall(sock_init);	/* early initcall */

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: rcu-walk and dcache scaling tree update and status
  2010-12-13  2:37 rcu-walk and dcache scaling tree update and status Nick Piggin
  2010-12-13  2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin
@ 2010-12-13  2:53 ` Ed Tomlinson
  2010-12-13  2:59   ` Nick Piggin
  2010-12-13  3:40 ` Stephen Rothwell
  2 siblings, 1 reply; 17+ messages in thread
From: Ed Tomlinson @ 2010-12-13  2:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell,
	linux-fsdevel, linux-kernel

On Sunday 12 December 2010 21:37:33 Nick Piggin wrote:
> The vfs-scale branch has had some progress, but it is now requiring
> wider testing and detailed review, particularly of the fine details of
> dcache_lock lifting, and rcu-walk synchronisation details and
> documentation.
> 
> Linus has suggested pretty strongly that he wants to pull this in the
> next merge window (recently, that "inodes will be RCU freed in 2.6.38"
> in an urelated discussion). As far as I know, that's what he's going to
> do. I'd like to get this some time in linux-next to improve test
> coverage (many filesystems I can't even test, so there are bound to be a
> few silly crashes). Stephen, how do I arrange that?
> 
> From my point of view, it has had nowhere near enough review,
> particularly I want Al to be happy with it, filesystem changes looked at
> and tested by respective fs maintainers, and anybody who is good at
> concurrency. However, if Linus still wants to merge it to kick things
> along, I am not going to stop him this time, because I have no known
> bugs or pending changes required.
> 
> I, like everybody else, would prefer bugs or design flaws to be found
> *before* it goes upstream, of course. I would be happy to spend time on
> irc with reviewers (ask me offline). And if anybody has reasonable
> concerns or suggestions, I will be happy to take that into account. I
> will not flame anybody who reads my replies, even if it takes a while
> for one or both of us to understand.
> 
> Documentation/filesystems/path-lookup.txt is a good place to start
> reviewing the fun stuff. I would much appreciate review of documentation
> and comments too, if anything is not clear, omitted, or not matching the
> code.
> 
> Also, please keep an eye on the end result when reviewing patches.
> Particularly the locking patches before dcache_lock is lifted, these are
> supposed to provide a lock coverage to lift dcache_lock with minimal
> complexity. They are not supposed to be nice looking code that you'd
> want to run on your production box, they are supposed to be nice
> changesets (from a review and verification point of view).
> 
> Git tree is here:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> 
> Branch is:
> 
>     vfs-scale-working
> 
> Changes since last posting:
> * Add a lot more comments for rcu-walk code and functions
> * Fix reported d_compare vfat crash
> * Incorporate review suggestions
> * Make rcu-walk bail out if we have to call a security subsystem
> * Fix for filesystems rewriting dentry name in-place
> * Audit d_seq barrier write-side, add a few places where it was missing
> * Optimised dentry memcmp
> 
> Testing:
> Testing filesystems is difficult, particularly remote filesystems, and
> ones without mkfs packaged in debian. I'm running ltp and xfstests among
> others, but those cover a tiny portion of what you can do with the
> dcache. The more testing the merrier.
> 
> I have been unable to break anything for a long time, but the race
> windows can be tiny. I've been trying to insert random delays into
> different parts of critical sections, and write tests specifically
> targetting particular races, but that's slow going -- review of locking,
> or testing on different configurations should be much more productive.
> 
> Final note:
> You won't be able to reproduce the parallel path walk scalability
> numbers that I've posted, because the vfsmount refcounting scalability
> patch is not included. I have a new idea for that now, so I'll be asking
> for comments with that soon.

I get this when building:

security/security.c: In function 'security_inode_exec_permission':
security/security.c:520: error: 'rcu' undeclared (first use in this function)
security/security.c:520: error: (Each undeclared identifier is reported only once
security/security.c:520: error: for each function it appears in.)
make[1]: *** [security/security.o] Error 1
make: *** [security] Error 2

Missing include maybe?

Ed



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: rcu-walk and dcache scaling tree update and status
  2010-12-13  2:53 ` rcu-walk and dcache scaling tree update and status Ed Tomlinson
@ 2010-12-13  2:59   ` Nick Piggin
  2010-12-13  3:45     ` Stephen Rothwell
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2010-12-13  2:59 UTC (permalink / raw)
  To: Ed Tomlinson
  Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Al Viro,
	Stephen Rothwell, linux-fsdevel, linux-kernel

On Sun, Dec 12, 2010 at 09:53:32PM -0500, Ed Tomlinson wrote:
> On Sunday 12 December 2010 21:37:33 Nick Piggin wrote:
> > Final note:
> > You won't be able to reproduce the parallel path walk scalability
> > numbers that I've posted, because the vfsmount refcounting scalability
> > patch is not included. I have a new idea for that now, so I'll be asking
> > for comments with that soon.
> 
> I get this when building:
> 
> security/security.c: In function 'security_inode_exec_permission':
> security/security.c:520: error: 'rcu' undeclared (first use in this function)
> security/security.c:520: error: (Each undeclared identifier is reported only once
> security/security.c:520: error: for each function it appears in.)
> make[1]: *** [security/security.o] Error 1
> make: *** [security] Error 2
> 
> Missing include maybe?

Ah, missed permutation while doing a final build check. `rcu` should
just be renamed to `flags`.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)
  2010-12-13  2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin
@ 2010-12-13  3:31   ` Nick Piggin
  2010-12-13  3:43     ` Nick Piggin
  2010-12-13  7:25   ` Eric Dumazet
  1 sibling, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2010-12-13  3:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell,
	linux-fsdevel, linux-kernel

On Mon, Dec 13, 2010 at 01:42:17PM +1100, Nick Piggin wrote:
> On Mon, Dec 13, 2010 at 01:37:33PM +1100, Nick Piggin wrote:
> > Final note:
> > You won't be able to reproduce the parallel path walk scalability
> > numbers that I've posted, because the vfsmount refcounting scalability
> > patch is not included. I have a new idea for that now, so I'll be asking
> > for comments with that soon.
> 
> Here is the patch I've been using, which works but has the problem
> described in the changelog. But it works nicely for testing.
> 
> As I said, I have a promising approach to solving the problem.
> 
> fs: scale mntget/mntput
 
[...]

> [Note: this is not for merging. Un-attached operation (lazy umount) may not be
>  uncommon and will be slowed down and actually have worse scalablilty after
>  this patch. I need to think about how to do fast refcounting with unattached
>  mounts.]

So the problem this patch tries to fix is vfsmount refcount scalability.
We need to take a ref for every successful path lookup, and often
lookups are going to the same mountpoint.

(Yes this little bouncing atomic hurts, badly, even on my small 2s12c
tightly connected system on the parallel git diff workload -- because
there are other bouncing kernel cachelines in this workload).

The fundamental difficulty is that a simple refcount can never be SMP
scalable, because dropping the ref requires we check whether we are
the last reference (which implies communicating with other CPUs that
might have taken references).

We can make them scalable by keeping a local count, and checking the
global sum less frequently. Some possibilities:

- avoid checking global sum while vfsmount is mounted, because the mount
  contributes to the refcount (that is what this patch does, but it
  kills performance inside a lazy umounted subtree).

- check global sum once every time interval (this would delay mount and
  sb garbage collection, so it's probably a showstopper).

- check global sum only if local sum goes to 0 (this is difficult with
  vfsmounts because the 'get' and the 'put' can happen on different
  CPUs, so we'd need to have a per-thread refcount, or carry around the
  CPU number with the refcount, both get horribly ugly, it turns out).

My proposal is a variant / generalisation of the 1st idea, which is to
have "long" refcounts. Normal refcounts will be per-cpu difference of
incs and decs, but dropping a reference will not have to check the
global sum while "long" refcounts are elevated. If the mount is a long
refcount, then that is what this current patch essentially is.

But then I would also have cwd take the long refcount, which allows
detached operation to remain fast while there are processes working
inside the detached namespace.

Details of locking aren't completely worked out -- it's a bit more
tricky because umount can be much heavier than fork() or chdir(), so
there are some difficulties in making long refcount operations faster
(the problem is remaining race-free versus the fast mntput check, but
I think a seqcount to go with the long refcount should do the trick).


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: rcu-walk and dcache scaling tree update and status
  2010-12-13  2:37 rcu-walk and dcache scaling tree update and status Nick Piggin
  2010-12-13  2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin
  2010-12-13  2:53 ` rcu-walk and dcache scaling tree update and status Ed Tomlinson
@ 2010-12-13  3:40 ` Stephen Rothwell
  2010-12-13  3:48   ` Nick Piggin
  2 siblings, 1 reply; 17+ messages in thread
From: Stephen Rothwell @ 2010-12-13  3:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Al Viro, linux-fsdevel,
	linux-kernel, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 1068 bytes --]

Hi Nick,

On Mon, 13 Dec 2010 13:37:33 +1100 Nick Piggin <npiggin@kernel.dk> wrote:
>
> Linus has suggested pretty strongly that he wants to pull this in the
> next merge window (recently, that "inodes will be RCU freed in 2.6.38"
> in an urelated discussion). As far as I know, that's what he's going to
> do. I'd like to get this some time in linux-next to improve test
> coverage (many filesystems I can't even test, so there are bound to be a
> few silly crashes). Stephen, how do I arrange that?

Well, you ask me :-) (letting me know where the git tree is).

I this case I would like an email from Al and/or Linus (and/or maybe
Christoph?) before I add this to linux-next as you say below that
(despite being as well tested as you can do) it does require more review
and buyin from others.  Of course, if Linus is going to merge this during
the next merge window, then it really should be in linux-next as soon as
it is ready enough.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)
  2010-12-13  3:31   ` Nick Piggin
@ 2010-12-13  3:43     ` Nick Piggin
  0 siblings, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2010-12-13  3:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell,
	linux-fsdevel, linux-kernel

On Mon, Dec 13, 2010 at 02:31:10PM +1100, Nick Piggin wrote:
> (Yes this little bouncing atomic hurts, badly, even on my small 2s12c
> tightly connected system on the parallel git diff workload -- because
> there are other bouncing kernel cachelines in this workload).
          ^^^
      are no other


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: rcu-walk and dcache scaling tree update and status
  2010-12-13  2:59   ` Nick Piggin
@ 2010-12-13  3:45     ` Stephen Rothwell
  2010-12-13  3:50       ` Nick Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen Rothwell @ 2010-12-13  3:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ed Tomlinson, Linus Torvalds, Andrew Morton, Al Viro,
	linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 951 bytes --]

Hi Nick,

On Mon, 13 Dec 2010 13:59:10 +1100 Nick Piggin <npiggin@kernel.dk> wrote:
>
> On Sun, Dec 12, 2010 at 09:53:32PM -0500, Ed Tomlinson wrote:
> > 
> > I get this when building:
> > 
> > security/security.c: In function 'security_inode_exec_permission':
> > security/security.c:520: error: 'rcu' undeclared (first use in this function)
> > security/security.c:520: error: (Each undeclared identifier is reported only once
> > security/security.c:520: error: for each function it appears in.)
> > make[1]: *** [security/security.o] Error 1
> > make: *** [security] Error 2
> > 
> > Missing include maybe?
> 
> Ah, missed permutation while doing a final build check. `rcu` should
> just be renamed to `flags`.

You will, of course never put anything not build tested into a tree for
linux-next inclusion, right?  ;-)
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: rcu-walk and dcache scaling tree update and status
  2010-12-13  3:40 ` Stephen Rothwell
@ 2010-12-13  3:48   ` Nick Piggin
  2010-12-14  0:03     ` Stephen Rothwell
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2010-12-13  3:48 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Al Viro,
	linux-fsdevel, linux-kernel, Christoph Hellwig

On Mon, Dec 13, 2010 at 02:40:49PM +1100, Stephen Rothwell wrote:
> Hi Nick,
> 
> On Mon, 13 Dec 2010 13:37:33 +1100 Nick Piggin <npiggin@kernel.dk> wrote:
> >
> > Linus has suggested pretty strongly that he wants to pull this in the
> > next merge window (recently, that "inodes will be RCU freed in 2.6.38"
> > in an urelated discussion). As far as I know, that's what he's going to
> > do. I'd like to get this some time in linux-next to improve test
> > coverage (many filesystems I can't even test, so there are bound to be a
> > few silly crashes). Stephen, how do I arrange that?
> 
> Well, you ask me :-) (letting me know where the git tree is).
> 
> I this case I would like an email from Al and/or Linus (and/or maybe
> Christoph?) before I add this to linux-next as you say below that
> (despite being as well tested as you can do) it does require more review
> and buyin from others.  Of course, if Linus is going to merge this during
> the next merge window, then it really should be in linux-next as soon as
> it is ready enough.

Yes I would like an ack from Al to have it in linux-next. Note that
wouldn't have to be an ack on the patches or agreement to merge it next
window, just an "ok to get wider review and testing" type of thing.

If Linus's is intending *not* to pull it, then it's not so urgent, and I
might instead ask Andrew to host it in -mm for a while. I just don't
want to get caught with my pants down here, because there are a good
number of untested filesystems and things.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: rcu-walk and dcache scaling tree update and status
  2010-12-13  3:45     ` Stephen Rothwell
@ 2010-12-13  3:50       ` Nick Piggin
  0 siblings, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2010-12-13  3:50 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Nick Piggin, Ed Tomlinson, Linus Torvalds, Andrew Morton,
	Al Viro, linux-fsdevel, linux-kernel

On Mon, Dec 13, 2010 at 02:45:55PM +1100, Stephen Rothwell wrote:
> Hi Nick,
> 
> On Mon, 13 Dec 2010 13:59:10 +1100 Nick Piggin <npiggin@kernel.dk> wrote:
> >
> > On Sun, Dec 12, 2010 at 09:53:32PM -0500, Ed Tomlinson wrote:
> > > 
> > > I get this when building:
> > > 
> > > security/security.c: In function 'security_inode_exec_permission':
> > > security/security.c:520: error: 'rcu' undeclared (first use in this function)
> > > security/security.c:520: error: (Each undeclared identifier is reported only once
> > > security/security.c:520: error: for each function it appears in.)
> > > make[1]: *** [security/security.o] Error 1
> > > make: *** [security] Error 2
> > > 
> > > Missing include maybe?
> > 
> > Ah, missed permutation while doing a final build check. `rcu` should
> > just be renamed to `flags`.
> 
> You will, of course never put anything not build tested into a tree for
> linux-next inclusion, right?  ;-)

Right. It will be guaranteed to have been build tested with at least
one .config :)

Which reminds me, I haven't done an allmodconfig recently to see if
I've missed exports...


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)
  2010-12-13  2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin
  2010-12-13  3:31   ` Nick Piggin
@ 2010-12-13  7:25   ` Eric Dumazet
  2010-12-13  8:33     ` Nick Piggin
  2010-12-14 12:40     ` Nick Piggin
  1 sibling, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2010-12-13  7:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Al Viro, Stephen Rothwell,
	linux-fsdevel, linux-kernel

Le lundi 13 décembre 2010 à 13:42 +1100, Nick Piggin a écrit :
> + */
> +static inline void add_mnt_count(struct vfsmount *mnt, int n)

maybe name it __add_mnt_count() (should be called with preempt off)

> +{
> +#ifdef CONFIG_SMP
> +	(*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n;

__this_cpu_add(mnt->mnt_count, n);

> +#else
> +	mnt->mnt_count += n;
> +#endif
> +}

and define a preempt safe version :

static inline void __add_mnt_count(struct vfsmount *mnt, int n)
{
#ifdef CONFIG_SMP
	__this_cpu_add(mnt->mnt_count, n);
#else
	mnt->mnt_count += n;
#endif
}

static inline void add_mnt_count(struct vfsmount *mnt, int n)
{
#ifdef CONFIG_SMP
	this_cpu_add(mnt->mnt_count, n);
#else
	preempt_disable();
	mnt->mnt_count += n;
	preempt_enable();
#endif
}

> +
> +static inline void set_mnt_count(struct vfsmount *mnt, int n)
> +{
> +#ifdef CONFIG_SMP


> +	preempt_disable();
> +	(*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) = n;
> +	preempt_enable();

last 3 lines can be replaced by :

	this_cpu_write(mnt->mnt_count, n);

> +#else
> +	mnt->mnt_count = n;
> +#endif
> +}
> +

>  #ifdef CONFIG_SMP
>  	int __percpu *mnt_writers;
> +	int __percpu *mnt_count;
>  #else
>  	int mnt_writers;
> +	int mnt_count;
>  #endif

You could use a struct and one per cpu allocation to use one cache line
for both objects :

struct mnt_counters {
	int writers;
	int count;
};

...

#ifdef CONFIG_SMP
	struct mnt_counters __percpu *mnt_counters;
#else
	struct mnt_counters mnt_counters;
#endif

This would use one pointer instead of two in SMP



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)
  2010-12-13  7:25   ` Eric Dumazet
@ 2010-12-13  8:33     ` Nick Piggin
  2010-12-14 12:40     ` Nick Piggin
  1 sibling, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2010-12-13  8:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Al Viro,
	Stephen Rothwell, linux-fsdevel, linux-kernel

On Mon, Dec 13, 2010 at 6:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le lundi 13 décembre 2010 à 13:42 +1100, Nick Piggin a écrit :
>> + */
>> +static inline void add_mnt_count(struct vfsmount *mnt, int n)
>
> maybe name it __add_mnt_count() (should be called with preempt off)
>
>> +{
>> +#ifdef CONFIG_SMP
>> +     (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n;
>
> __this_cpu_add(mnt->mnt_count, n);
>
>> +#else
>> +     mnt->mnt_count += n;
>> +#endif
>> +}
>
> and define a preempt safe version :
>
> static inline void __add_mnt_count(struct vfsmount *mnt, int n)
> {
> #ifdef CONFIG_SMP
>        __this_cpu_add(mnt->mnt_count, n);
> #else
>        mnt->mnt_count += n;
> #endif
> }
>
> static inline void add_mnt_count(struct vfsmount *mnt, int n)
> {
> #ifdef CONFIG_SMP
>        this_cpu_add(mnt->mnt_count, n);
> #else
>        preempt_disable();
>        mnt->mnt_count += n;
>        preempt_enable();
> #endif
> }

That looks good, thanks.


>> +
>> +static inline void set_mnt_count(struct vfsmount *mnt, int n)
>> +{
>> +#ifdef CONFIG_SMP
>
>
>> +     preempt_disable();
>> +     (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) = n;
>> +     preempt_enable();
>
> last 3 lines can be replaced by :
>
>        this_cpu_write(mnt->mnt_count, n);

Yep, thanks.


>> +#else
>> +     mnt->mnt_count = n;
>> +#endif
>> +}
>> +
>
>>  #ifdef CONFIG_SMP
>>       int __percpu *mnt_writers;
>> +     int __percpu *mnt_count;
>>  #else
>>       int mnt_writers;
>> +     int mnt_count;
>>  #endif
>
> You could use a struct and one per cpu allocation to use one cache line
> for both objects :
>
> struct mnt_counters {
>        int writers;
>        int count;
> };
>
> ...
>
> #ifdef CONFIG_SMP
>        struct mnt_counters __percpu *mnt_counters;
> #else
>        struct mnt_counters mnt_counters;
> #endif
>
> This would use one pointer instead of two in SMP

Yes that's a good point too.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: rcu-walk and dcache scaling tree update and status
  2010-12-13  3:48   ` Nick Piggin
@ 2010-12-14  0:03     ` Stephen Rothwell
  2010-12-14  0:16       ` Stephen Rothwell
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen Rothwell @ 2010-12-14  0:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Al Viro, linux-fsdevel,
	linux-kernel, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 2381 bytes --]

Hi Nick,

On Mon, 13 Dec 2010 14:48:54 +1100 Nick Piggin <npiggin@kernel.dk> wrote:
>
> Yes I would like an ack from Al to have it in linux-next. Note that
> wouldn't have to be an ack on the patches or agreement to merge it next
> window, just an "ok to get wider review and testing" type of thing.
> 
> If Linus's is intending *not* to pull it, then it's not so urgent, and I
> might instead ask Andrew to host it in -mm for a while. I just don't
> want to get caught with my pants down here, because there are a good
> number of untested filesystems and things.

OK, since Linus seems so keen (and you fixed the reported compile error)
I will include your tree today.  If Al has objections, I can always
remove it again.

I have called the tree (for my purposes) "vfs_scale" and have just you
listed as the contact.

Thanks for adding your subsystem tree as a participant of linux-next.  As
you may know, this is not a judgment of your code.  The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window. 

You will need to ensure that the patches/commits in your tree/series have
been:
     * submitted under GPL v2 (or later) and include the Contributor's
	Signed-off-by,
     * posted to the relevant mailing list,
     * reviewed by you (or another maintainer of your subsystem tree),
     * successfully unit tested, and 
     * destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him
to fetch).  It is allowed to be rebased if you deem it necessary.

-- 
Cheers,
Stephen Rothwell 
sfr@canb.auug.org.au

Legal Stuff:
By participating in linux-next, your subsystem tree contributions are
public and will be included in the linux-next trees.  You may be sent
e-mail messages indicating errors or other issues when the
patches/commits from your subsystem tree are merged and tested in
linux-next.  These messages may also be cross-posted to the linux-next
mailing list, the linux-kernel mailing list, etc.  The linux-next tree
project and IBM (my employer) make no warranties regarding the linux-next
project, the testing procedures, the results, the e-mails, etc.  If you
don't agree to these ground rules, let me know and I'll remove your tree
from participation in linux-next.

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: rcu-walk and dcache scaling tree update and status
  2010-12-14  0:03     ` Stephen Rothwell
@ 2010-12-14  0:16       ` Stephen Rothwell
  0 siblings, 0 replies; 17+ messages in thread
From: Stephen Rothwell @ 2010-12-14  0:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Al Viro, linux-fsdevel,
	linux-kernel, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 359 bytes --]

Hi Nick,

On Tue, 14 Dec 2010 11:03:26 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> I have called the tree (for my purposes) "vfs_scale" and have just you
                                            ^^^^^^^^^
Actually "vfs-scale".

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)
  2010-12-13  7:25   ` Eric Dumazet
  2010-12-13  8:33     ` Nick Piggin
@ 2010-12-14 12:40     ` Nick Piggin
  2010-12-15  8:16       ` Andreas Dilger
  1 sibling, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2010-12-14 12:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Al Viro,
	Stephen Rothwell, linux-fsdevel, linux-kernel

OK, this ones includes your suggestions, and implements the approach I
outlined earlier. I verified it works and is scalable on detached
mounts, so I'll add it to the git branch when it survives a bit more
testing. It's the last piece of the puzzle...

fs: scale mntget/mntput

The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
  for some interval, so it's probably a showstopper for vfsmounts).

- keep a local count and only taking the global sum when local reaches 0 (this
  is difficult for vfsmounts, because we can't hold preempt off for the life of
  a reference, so a counter would need to be per-thread or tied strongly to a
  particular CPU which requires more locking).

- keep a local difference of increments and decrements, which allows us to sum
  the total difference and hence find the refcount when summing all CPUs. Then,
  keep a single integer "long" refcount for slow and long lasting references,
  and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 arch/ia64/kernel/perfmon.c |    2 
 drivers/mtd/mtdchar.c      |    2 
 fs/anon_inodes.c           |    2 
 fs/fs_struct.c             |   26 +++-
 fs/internal.h              |    1 
 fs/namei.c                 |   24 ++++
 fs/namespace.c             |  240 +++++++++++++++++++++++++++++++++++++--------
 fs/pipe.c                  |    2 
 fs/pnode.c                 |    4 
 fs/super.c                 |    2 
 include/linux/mount.h      |   53 +++------
 include/linux/path.h       |    2 
 net/socket.c               |   19 +++
 13 files changed, 282 insertions(+), 97 deletions(-)

Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c	2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/namespace.c	2010-12-14 23:33:00.000000000 +1100
@@ -138,6 +138,64 @@ void mnt_release_group_id(struct vfsmoun
 	mnt->mnt_group_id = 0;
 }
 
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void add_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+	this_cpu_add(mnt->mnt_pcp->mnt_count, n);
+#else
+	preempt_disable();
+	mnt->mnt_count += n;
+	preempt_enable();
+#endif
+}
+
+static inline void set_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+	this_cpu_add(mnt->mnt_pcp->mnt_count, n);
+#else
+	mnt->mnt_count = n;
+#endif
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void inc_mnt_count(struct vfsmount *mnt)
+{
+	add_mnt_count(mnt, 1);
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void dec_mnt_count(struct vfsmount *mnt)
+{
+	add_mnt_count(mnt, -1);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
+unsigned int count_mnt_count(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+	unsigned int count = atomic_read(&mnt->mnt_longrefs);
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count;
+	}
+
+	return count;
+#else
+	return mnt->mnt_count;
+#endif
+}
+
 struct vfsmount *alloc_vfsmnt(const char *name)
 {
 	struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -154,7 +212,17 @@ struct vfsmount *alloc_vfsmnt(const char
 				goto out_free_id;
 		}
 
-		atomic_set(&mnt->mnt_count, 1);
+#ifdef CONFIG_SMP
+		mnt->mnt_pcp = alloc_percpu(struct mnt_pcp);
+		if (!mnt->mnt_pcp)
+			goto out_free_devname;
+
+		atomic_set(&mnt->mnt_longrefs, 1);
+#else
+		mnt->mnt_count = 1;
+		mnt->mnt_writers = 0;
+#endif
+
 		INIT_LIST_HEAD(&mnt->mnt_hash);
 		INIT_LIST_HEAD(&mnt->mnt_child);
 		INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -166,13 +234,6 @@ struct vfsmount *alloc_vfsmnt(const char
 #ifdef CONFIG_FSNOTIFY
 		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
 #endif
-#ifdef CONFIG_SMP
-		mnt->mnt_writers = alloc_percpu(int);
-		if (!mnt->mnt_writers)
-			goto out_free_devname;
-#else
-		mnt->mnt_writers = 0;
-#endif
 	}
 	return mnt;
 
@@ -219,7 +280,7 @@ EXPORT_SYMBOL_GPL(__mnt_is_readonly);
 static inline void inc_mnt_writers(struct vfsmount *mnt)
 {
 #ifdef CONFIG_SMP
-	(*per_cpu_ptr(mnt->mnt_writers, smp_processor_id()))++;
+	this_cpu_inc(mnt->mnt_pcp->mnt_writers);
 #else
 	mnt->mnt_writers++;
 #endif
@@ -228,7 +289,7 @@ static inline void inc_mnt_writers(struc
 static inline void dec_mnt_writers(struct vfsmount *mnt)
 {
 #ifdef CONFIG_SMP
-	(*per_cpu_ptr(mnt->mnt_writers, smp_processor_id()))--;
+	this_cpu_dec(mnt->mnt_pcp->mnt_writers);
 #else
 	mnt->mnt_writers--;
 #endif
@@ -241,7 +302,7 @@ static unsigned int count_mnt_writers(st
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
-		count += *per_cpu_ptr(mnt->mnt_writers, cpu);
+		count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_writers;
 	}
 
 	return count;
@@ -418,7 +479,7 @@ void free_vfsmnt(struct vfsmount *mnt)
 	kfree(mnt->mnt_devname);
 	mnt_free_id(mnt);
 #ifdef CONFIG_SMP
-	free_percpu(mnt->mnt_writers);
+	free_percpu(mnt->mnt_pcp);
 #endif
 	kmem_cache_free(mnt_cache, mnt);
 }
@@ -652,9 +713,10 @@ static struct vfsmount *clone_mnt(struct
 	return NULL;
 }
 
-static inline void __mntput(struct vfsmount *mnt)
+static inline void mntfree(struct vfsmount *mnt)
 {
 	struct super_block *sb = mnt->mnt_sb;
+
 	/*
 	 * This probably indicates that somebody messed
 	 * up a mnt_want/drop_write() pair.  If this
@@ -662,8 +724,8 @@ static inline void __mntput(struct vfsmo
 	 * to make r/w->r/o transitions.
 	 */
 	/*
-	 * atomic_dec_and_lock() used to deal with ->mnt_count decrements
-	 * provides barriers, so count_mnt_writers() below is safe.  AV
+	 * The locking used to deal with mnt_count decrement provides barriers,
+	 * so count_mnt_writers() below is safe.
 	 */
 	WARN_ON(count_mnt_writers(mnt));
 	fsnotify_vfsmount_delete(mnt);
@@ -672,28 +734,113 @@ static inline void __mntput(struct vfsmo
 	deactivate_super(sb);
 }
 
-void mntput_no_expire(struct vfsmount *mnt)
+#ifdef CONFIG_SMP
+static inline void __mntput(struct vfsmount *mnt, int longrefs)
 {
-repeat:
-	if (atomic_add_unless(&mnt->mnt_count, -1, 1))
-		return;
+	if (!longrefs) {
+put_again:
+		br_read_lock(vfsmount_lock);
+		if (likely(atomic_read(&mnt->mnt_longrefs))) {
+			dec_mnt_count(mnt);
+			br_read_unlock(vfsmount_lock);
+			return;
+		}
+		br_read_unlock(vfsmount_lock);
+	} else {
+		BUG_ON(!atomic_read(&mnt->mnt_longrefs));
+		if (atomic_add_unless(&mnt->mnt_longrefs, -1, 1))
+			return;
+	}
+
 	br_write_lock(vfsmount_lock);
-	if (!atomic_dec_and_test(&mnt->mnt_count)) {
+	if (!longrefs)
+		dec_mnt_count(mnt);
+	else
+		atomic_dec(&mnt->mnt_longrefs);
+	if (count_mnt_count(mnt)) {
 		br_write_unlock(vfsmount_lock);
 		return;
 	}
-	if (likely(!mnt->mnt_pinned)) {
+	if (unlikely(mnt->mnt_pinned)) {
+		add_mnt_count(mnt, mnt->mnt_pinned + 1);
+		mnt->mnt_pinned = 0;
 		br_write_unlock(vfsmount_lock);
-		__mntput(mnt);
+		acct_auto_close_mnt(mnt);
+		goto put_again;
+	}
+	br_write_unlock(vfsmount_lock);
+	mntfree(mnt);
+}
+#else
+static inline void __mntput(struct vfsmount *mnt, int longrefs)
+{
+put_again:
+	dec_mnt_count(mnt);
+	if (likely(count_mnt_count(mnt)))
 		return;
+	br_write_lock(vfsmount_lock);
+	if (unlikely(mnt->mnt_pinned)) {
+		add_mnt_count(mnt, mnt->mnt_pinned + 1);
+		mnt->mnt_pinned = 0;
+		br_write_unlock(vfsmount_lock);
+		acct_auto_close_mnt(mnt);
+		goto put_again;
 	}
-	atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
-	mnt->mnt_pinned = 0;
 	br_write_unlock(vfsmount_lock);
-	acct_auto_close_mnt(mnt);
-	goto repeat;
+	mntfree(mnt);
+}
+#endif
+
+static void mntput_no_expire(struct vfsmount *mnt)
+{
+	__mntput(mnt, 0);
 }
-EXPORT_SYMBOL(mntput_no_expire);
+
+void mntput(struct vfsmount *mnt)
+{
+	if (mnt) {
+		/* avoid cacheline pingpong, hope gcc doesn't get "smart" */
+		if (unlikely(mnt->mnt_expiry_mark))
+			mnt->mnt_expiry_mark = 0;
+		__mntput(mnt, 0);
+	}
+}
+EXPORT_SYMBOL(mntput);
+
+struct vfsmount *mntget(struct vfsmount *mnt)
+{
+	if (mnt)
+		inc_mnt_count(mnt);
+	return mnt;
+}
+EXPORT_SYMBOL(mntget);
+
+void mntput_long(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+	if (mnt) {
+		/* avoid cacheline pingpong, hope gcc doesn't get "smart" */
+		if (unlikely(mnt->mnt_expiry_mark))
+			mnt->mnt_expiry_mark = 0;
+		__mntput(mnt, 1);
+	}
+#else
+	mntput(mnt);
+#endif
+}
+EXPORT_SYMBOL(mntput_long);
+
+struct vfsmount *mntget_long(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+	if (mnt)
+		atomic_inc(&mnt->mnt_longrefs);
+	return mnt;
+#else
+	return mntget(mnt);
+#endif
+}
+EXPORT_SYMBOL(mntget_long);
 
 void mnt_pin(struct vfsmount *mnt)
 {
@@ -701,19 +848,17 @@ void mnt_pin(struct vfsmount *mnt)
 	mnt->mnt_pinned++;
 	br_write_unlock(vfsmount_lock);
 }
-
 EXPORT_SYMBOL(mnt_pin);
 
 void mnt_unpin(struct vfsmount *mnt)
 {
 	br_write_lock(vfsmount_lock);
 	if (mnt->mnt_pinned) {
-		atomic_inc(&mnt->mnt_count);
+		inc_mnt_count(mnt);
 		mnt->mnt_pinned--;
 	}
 	br_write_unlock(vfsmount_lock);
 }
-
 EXPORT_SYMBOL(mnt_unpin);
 
 static inline void mangle(struct seq_file *m, const char *s)
@@ -1008,12 +1153,13 @@ int may_umount_tree(struct vfsmount *mnt
 	int minimum_refs = 0;
 	struct vfsmount *p;
 
-	br_read_lock(vfsmount_lock);
+	/* write lock needed for count_mnt_count */
+	br_write_lock(vfsmount_lock);
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
-		actual_refs += atomic_read(&p->mnt_count);
+		actual_refs += count_mnt_count(p);
 		minimum_refs += 2;
 	}
-	br_read_unlock(vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 
 	if (actual_refs > minimum_refs)
 		return 0;
@@ -1040,10 +1186,10 @@ int may_umount(struct vfsmount *mnt)
 {
 	int ret = 1;
 	down_read(&namespace_sem);
-	br_read_lock(vfsmount_lock);
+	br_write_lock(vfsmount_lock);
 	if (propagate_mount_busy(mnt, 2))
 		ret = 0;
-	br_read_unlock(vfsmount_lock);
+	br_write_unlock(vfsmount_lock);
 	up_read(&namespace_sem);
 	return ret;
 }
@@ -1070,7 +1216,7 @@ void release_mounts(struct list_head *he
 			dput(dentry);
 			mntput(m);
 		}
-		mntput(mnt);
+		mntput_long(mnt);
 	}
 }
 
@@ -1125,8 +1271,16 @@ static int do_umount(struct vfsmount *mn
 		    flags & (MNT_FORCE | MNT_DETACH))
 			return -EINVAL;
 
-		if (atomic_read(&mnt->mnt_count) != 2)
+		/*
+		 * probably don't strictly need the lock here if we examined
+		 * all race cases, but it's a slowpath.
+		 */
+		br_write_lock(vfsmount_lock);
+		if (count_mnt_count(mnt) != 2) {
+			br_write_lock(vfsmount_lock);
 			return -EBUSY;
+		}
+		br_write_unlock(vfsmount_lock);
 
 		if (!xchg(&mnt->mnt_expiry_mark, 1))
 			return -EAGAIN;
@@ -1815,7 +1969,7 @@ int do_add_mount(struct vfsmount *newmnt
 
 unlock:
 	up_write(&namespace_sem);
-	mntput(newmnt);
+	mntput_long(newmnt);
 	return err;
 }
 
@@ -2148,11 +2302,11 @@ static struct mnt_namespace *dup_mnt_ns(
 		if (fs) {
 			if (p == fs->root.mnt) {
 				rootmnt = p;
-				fs->root.mnt = mntget(q);
+				fs->root.mnt = mntget_long(q);
 			}
 			if (p == fs->pwd.mnt) {
 				pwdmnt = p;
-				fs->pwd.mnt = mntget(q);
+				fs->pwd.mnt = mntget_long(q);
 			}
 		}
 		p = next_mnt(p, mnt_ns->root);
@@ -2161,9 +2315,9 @@ static struct mnt_namespace *dup_mnt_ns(
 	up_write(&namespace_sem);
 
 	if (rootmnt)
-		mntput(rootmnt);
+		mntput_long(rootmnt);
 	if (pwdmnt)
-		mntput(pwdmnt);
+		mntput_long(pwdmnt);
 
 	return new_ns;
 }
@@ -2350,6 +2504,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
 	br_write_unlock(vfsmount_lock);
 	chroot_fs_refs(&root, &new);
+
 	error = 0;
 	path_put(&root_parent);
 	path_put(&parent_path);
@@ -2376,6 +2531,7 @@ static void __init init_mount_tree(void)
 	mnt = do_kern_mount("rootfs", 0, "rootfs", NULL);
 	if (IS_ERR(mnt))
 		panic("Can't create rootfs");
+
 	ns = create_mnt_ns(mnt);
 	if (IS_ERR(ns))
 		panic("Can't allocate initial namespace");
Index: linux-2.6/include/linux/mount.h
===================================================================
--- linux-2.6.orig/include/linux/mount.h	2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/include/linux/mount.h	2010-12-14 21:07:29.000000000 +1100
@@ -13,6 +13,7 @@
 #include <linux/list.h>
 #include <linux/nodemask.h>
 #include <linux/spinlock.h>
+#include <linux/seqlock.h>
 #include <asm/atomic.h>
 
 struct super_block;
@@ -46,12 +47,24 @@ struct mnt_namespace;
 
 #define MNT_INTERNAL	0x4000
 
+struct mnt_pcp {
+	int mnt_count;
+	int mnt_writers;
+};
+
 struct vfsmount {
 	struct list_head mnt_hash;
 	struct vfsmount *mnt_parent;	/* fs we are mounted on */
 	struct dentry *mnt_mountpoint;	/* dentry of mountpoint */
 	struct dentry *mnt_root;	/* root of the mounted tree */
 	struct super_block *mnt_sb;	/* pointer to superblock */
+#ifdef CONFIG_SMP
+	struct mnt_pcp __percpu *mnt_pcp;
+	atomic_t mnt_longrefs;
+#else
+	int mnt_count;
+	int mnt_writers;
+#endif
 	struct list_head mnt_mounts;	/* list of children, anchored here */
 	struct list_head mnt_child;	/* and going through their mnt_child */
 	int mnt_flags;
@@ -70,57 +83,25 @@ struct vfsmount {
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
 	int mnt_id;			/* mount identifier */
 	int mnt_group_id;		/* peer group identifier */
-	/*
-	 * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount
-	 * to let these frequently modified fields in a separate cache line
-	 * (so that reads of mnt_flags wont ping-pong on SMP machines)
-	 */
-	atomic_t mnt_count;
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	int mnt_pinned;
 	int mnt_ghosts;
-#ifdef CONFIG_SMP
-	int __percpu *mnt_writers;
-#else
-	int mnt_writers;
-#endif
 };
 
-static inline int *get_mnt_writers_ptr(struct vfsmount *mnt)
-{
-#ifdef CONFIG_SMP
-	return mnt->mnt_writers;
-#else
-	return &mnt->mnt_writers;
-#endif
-}
-
-static inline struct vfsmount *mntget(struct vfsmount *mnt)
-{
-	if (mnt)
-		atomic_inc(&mnt->mnt_count);
-	return mnt;
-}
-
 struct file; /* forward dec */
 
 extern int mnt_want_write(struct vfsmount *mnt);
 extern int mnt_want_write_file(struct file *file);
 extern int mnt_clone_write(struct vfsmount *mnt);
 extern void mnt_drop_write(struct vfsmount *mnt);
-extern void mntput_no_expire(struct vfsmount *mnt);
+extern void mntput(struct vfsmount *mnt);
+extern struct vfsmount *mntget(struct vfsmount *mnt);
+extern void mntput_long(struct vfsmount *mnt);
+extern struct vfsmount *mntget_long(struct vfsmount *mnt);
 extern void mnt_pin(struct vfsmount *mnt);
 extern void mnt_unpin(struct vfsmount *mnt);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
 
-static inline void mntput(struct vfsmount *mnt)
-{
-	if (mnt) {
-		mnt->mnt_expiry_mark = 0;
-		mntput_no_expire(mnt);
-	}
-}
-
 extern struct vfsmount *do_kern_mount(const char *fstype, int flags,
 				      const char *name, void *data);
 
Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c	2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/pnode.c	2010-12-14 21:07:29.000000000 +1100
@@ -288,7 +288,7 @@ int propagate_mnt(struct vfsmount *dest_
  */
 static inline int do_refcount_check(struct vfsmount *mnt, int count)
 {
-	int mycount = atomic_read(&mnt->mnt_count) - mnt->mnt_ghosts;
+	int mycount = count_mnt_count(mnt) - mnt->mnt_ghosts;
 	return (mycount > count);
 }
 
@@ -300,7 +300,7 @@ static inline int do_refcount_check(stru
  * Check if any of these mounts that **do not have submounts**
  * have more references than 'refcnt'. If so return busy.
  *
- * vfsmount lock must be held for read or write
+ * vfsmount lock must be held for write
  */
 int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
 {
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h	2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/internal.h	2010-12-14 21:07:29.000000000 +1100
@@ -63,6 +63,7 @@ extern int copy_mount_string(const void
 
 extern void free_vfsmnt(struct vfsmount *);
 extern struct vfsmount *alloc_vfsmnt(const char *);
+extern unsigned int count_mnt_count(struct vfsmount *mnt);
 extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
 extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *,
 				struct vfsmount *);
Index: linux-2.6/drivers/mtd/mtdchar.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdchar.c	2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/drivers/mtd/mtdchar.c	2010-12-14 21:07:29.000000000 +1100
@@ -1201,7 +1201,7 @@ static int __init init_mtdchar(void)
 static void __exit cleanup_mtdchar(void)
 {
 	unregister_mtd_user(&mtdchar_notifier);
-	mntput(mtd_inode_mnt);
+	mntput_long(mtd_inode_mnt);
 	unregister_filesystem(&mtd_inodefs_type);
 	__unregister_chrdev(MTD_CHAR_MAJOR, 0, 1 << MINORBITS, "mtd");
 }
Index: linux-2.6/arch/ia64/kernel/perfmon.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/perfmon.c	2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/arch/ia64/kernel/perfmon.c	2010-12-14 21:07:29.000000000 +1100
@@ -1542,7 +1542,7 @@ pfm_exit_smpl_buffer(pfm_buffer_fmt_t *f
  * any operations on the root directory. However, we need a non-trivial
  * d_name - pfm: will go nicely and kill the special-casing in procfs.
  */
-static struct vfsmount *pfmfs_mnt;
+static struct vfsmount *pfmfs_mnt __read_mostly;
 
 static int __init
 init_pfm_fs(void)
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c	2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/anon_inodes.c	2010-12-14 21:07:29.000000000 +1100
@@ -232,7 +232,7 @@ static int __init anon_inode_init(void)
 	return 0;
 
 err_mntput:
-	mntput(anon_inode_mnt);
+	mntput_long(anon_inode_mnt);
 err_unregister_filesystem:
 	unregister_filesystem(&anon_inode_fs_type);
 err_exit:
Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c	2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/pipe.c	2010-12-14 21:07:29.000000000 +1100
@@ -1292,7 +1292,7 @@ static int __init init_pipe_fs(void)
 static void __exit exit_pipe_fs(void)
 {
 	unregister_filesystem(&pipe_fs_type);
-	mntput(pipe_mnt);
+	mntput_long(pipe_mnt);
 }
 
 fs_initcall(init_pipe_fs);
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c	2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/net/socket.c	2010-12-14 21:07:29.000000000 +1100
@@ -2375,6 +2375,8 @@ EXPORT_SYMBOL(sock_unregister);
 
 static int __init sock_init(void)
 {
+	int err;
+
 	/*
 	 *      Initialize sock SLAB cache.
 	 */
@@ -2391,8 +2393,15 @@ static int __init sock_init(void)
 	 */
 
 	init_inodecache();
-	register_filesystem(&sock_fs_type);
+
+	err = register_filesystem(&sock_fs_type);
+	if (err)
+		goto out_fs;
 	sock_mnt = kern_mount(&sock_fs_type);
+	if (IS_ERR(sock_mnt)) {
+		err = PTR_ERR(sock_mnt);
+		goto out_mount;
+	}
 
 	/* The real protocol initialization is performed in later initcalls.
 	 */
@@ -2405,7 +2414,13 @@ static int __init sock_init(void)
 	skb_timestamping_init();
 #endif
 
-	return 0;
+out:
+	return err;
+
+out_mount:
+	unregister_filesystem(&sock_fs_type);
+out_fs:
+	goto out;
 }
 
 core_initcall(sock_init);	/* early initcall */
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c	2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/super.c	2010-12-14 21:07:29.000000000 +1100
@@ -1140,7 +1140,7 @@ static struct vfsmount *fs_set_subtype(s
 	return mnt;
 
  err:
-	mntput(mnt);
+	mntput_long(mnt);
 	return ERR_PTR(err);
 }
 
Index: linux-2.6/fs/fs_struct.c
===================================================================
--- linux-2.6.orig/fs/fs_struct.c	2010-12-14 22:01:15.000000000 +1100
+++ linux-2.6/fs/fs_struct.c	2010-12-14 22:56:27.000000000 +1100
@@ -17,11 +17,11 @@ void set_fs_root(struct fs_struct *fs, s
 	write_seqcount_begin(&fs->seq);
 	old_root = fs->root;
 	fs->root = *path;
-	path_get(path);
+	path_get_long(path);
 	write_seqcount_end(&fs->seq);
 	spin_unlock(&fs->lock);
 	if (old_root.dentry)
-		path_put(&old_root);
+		path_put_long(&old_root);
 }
 
 /*
@@ -36,12 +36,12 @@ void set_fs_pwd(struct fs_struct *fs, st
 	write_seqcount_begin(&fs->seq);
 	old_pwd = fs->pwd;
 	fs->pwd = *path;
-	path_get(path);
+	path_get_long(path);
 	write_seqcount_end(&fs->seq);
 	spin_unlock(&fs->lock);
 
 	if (old_pwd.dentry)
-		path_put(&old_pwd);
+		path_put_long(&old_pwd);
 }
 
 void chroot_fs_refs(struct path *old_root, struct path *new_root)
@@ -59,13 +59,13 @@ void chroot_fs_refs(struct path *old_roo
 			write_seqcount_begin(&fs->seq);
 			if (fs->root.dentry == old_root->dentry
 			    && fs->root.mnt == old_root->mnt) {
-				path_get(new_root);
+				path_get_long(new_root);
 				fs->root = *new_root;
 				count++;
 			}
 			if (fs->pwd.dentry == old_root->dentry
 			    && fs->pwd.mnt == old_root->mnt) {
-				path_get(new_root);
+				path_get_long(new_root);
 				fs->pwd = *new_root;
 				count++;
 			}
@@ -76,13 +76,13 @@ void chroot_fs_refs(struct path *old_roo
 	} while_each_thread(g, p);
 	read_unlock(&tasklist_lock);
 	while (count--)
-		path_put(old_root);
+		path_put_long(old_root);
 }
 
 void free_fs_struct(struct fs_struct *fs)
 {
-	path_put(&fs->root);
-	path_put(&fs->pwd);
+	path_put_long(&fs->root);
+	path_put_long(&fs->pwd);
 	kmem_cache_free(fs_cachep, fs);
 }
 
@@ -115,7 +115,13 @@ struct fs_struct *copy_fs_struct(struct
 		spin_lock_init(&fs->lock);
 		seqcount_init(&fs->seq);
 		fs->umask = old->umask;
-		get_fs_root_and_pwd(old, &fs->root, &fs->pwd);
+
+		spin_lock(&old->lock);
+		fs->root = old->root;
+		path_get_long(&fs->root);
+		fs->pwd = old->pwd;
+		path_get_long(&fs->pwd);
+		spin_unlock(&old->lock);
 	}
 	return fs;
 }
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c	2010-12-14 22:06:54.000000000 +1100
+++ linux-2.6/fs/namei.c	2010-12-14 22:07:51.000000000 +1100
@@ -448,6 +448,18 @@ void path_get(struct path *path)
 EXPORT_SYMBOL(path_get);
 
 /**
+ * path_get_long - get a long reference to a path
+ * @path: path to get the reference to
+ *
+ * Given a path increment the reference count to the dentry and the vfsmount.
+ */
+void path_get_long(struct path *path)
+{
+	mntget_long(path->mnt);
+	dget(path->dentry);
+}
+
+/**
  * path_put - put a reference to a path
  * @path: path to put the reference to
  *
@@ -461,6 +473,18 @@ void path_put(struct path *path)
 EXPORT_SYMBOL(path_put);
 
 /**
+ * path_put_long - put a long reference to a path
+ * @path: path to put the reference to
+ *
+ * Given a path decrement the reference count to the dentry and the vfsmount.
+ */
+void path_put_long(struct path *path)
+{
+	dput(path->dentry);
+	mntput_long(path->mnt);
+}
+
+/**
  * nameidata_drop_rcu - drop this nameidata out of rcu-walk
  * @nd: nameidata pathwalk data to drop
  * @Returns: 0 on success, -ECHLID on failure
Index: linux-2.6/include/linux/path.h
===================================================================
--- linux-2.6.orig/include/linux/path.h	2010-12-14 22:06:38.000000000 +1100
+++ linux-2.6/include/linux/path.h	2010-12-14 22:08:13.000000000 +1100
@@ -10,7 +10,9 @@ struct path {
 };
 
 extern void path_get(struct path *);
+extern void path_get_long(struct path *);
 extern void path_put(struct path *);
+extern void path_put_long(struct path *);
 
 static inline int path_equal(const struct path *path1, const struct path *path2)
 {

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and  dcache scaling tree update and status)
  2010-12-14 12:40     ` Nick Piggin
@ 2010-12-15  8:16       ` Andreas Dilger
  2010-12-15 10:24         ` Nick Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: Andreas Dilger @ 2010-12-15  8:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Linus Torvalds, Andrew Morton, Al Viro,
	Stephen Rothwell, linux-fsdevel, linux-kernel

On 2010-12-14, at 05:40, Nick Piggin wrote:
> +static inline void add_mnt_count(struct vfsmount *mnt, int n)
> +static inline void set_mnt_count(struct vfsmount *mnt, int n)
> +static inline void inc_mnt_count(struct vfsmount *mnt)
> +static inline void dec_mnt_count(struct vfsmount *mnt)
> +unsigned int count_mnt_count(struct vfsmount *mnt)

Minor nit - it is easier to find these related functions via tag
completion if they are of the form "mnt_count_{add,set,inc,dec}"
and it would also be more consistent and easier to understand if
you rename count_mnt_count() to mnt_count_sum() or mnt_count_read().

This also follows the kernel naming convention much more closely
(e.g. atomic_{inc,dec,add,sub,set,read}(), mutex_{init,lock,unlock}(),
etc), which is normally of the form {type}_{action}.

Yes, I see the other {inc,dec,count}_mnt_writers() functions
exist, but I'd prefer not to add more bad function names to
the kernel.  Maybe a separate patch to clean up those names is
in order...

Cheers, Andreas






^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)
  2010-12-15  8:16       ` Andreas Dilger
@ 2010-12-15 10:24         ` Nick Piggin
  0 siblings, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2010-12-15 10:24 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Nick Piggin, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Al Viro, Stephen Rothwell, linux-fsdevel, linux-kernel

On Wed, Dec 15, 2010 at 01:16:52AM -0700, Andreas Dilger wrote:
> On 2010-12-14, at 05:40, Nick Piggin wrote:
> > +static inline void add_mnt_count(struct vfsmount *mnt, int n)
> > +static inline void set_mnt_count(struct vfsmount *mnt, int n)
> > +static inline void inc_mnt_count(struct vfsmount *mnt)
> > +static inline void dec_mnt_count(struct vfsmount *mnt)
> > +unsigned int count_mnt_count(struct vfsmount *mnt)
> 
> Minor nit - it is easier to find these related functions via tag
> completion if they are of the form "mnt_count_{add,set,inc,dec}"
> and it would also be more consistent and easier to understand if
> you rename count_mnt_count() to mnt_count_sum() or mnt_count_read().
> 
> This also follows the kernel naming convention much more closely
> (e.g. atomic_{inc,dec,add,sub,set,read}(), mutex_{init,lock,unlock}(),
> etc), which is normally of the form {type}_{action}.

OK, I agree.

 
> Yes, I see the other {inc,dec,count}_mnt_writers() functions
> exist, but I'd prefer not to add more bad function names to
> the kernel.  Maybe a separate patch to clean up those names is
> in order...

I might have added those too :P Yes they should be consistent, so
I'll rename them first.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2010-12-16  2:27 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-13  2:37 rcu-walk and dcache scaling tree update and status Nick Piggin
2010-12-13  2:42 ` [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status) Nick Piggin
2010-12-13  3:31   ` Nick Piggin
2010-12-13  3:43     ` Nick Piggin
2010-12-13  7:25   ` Eric Dumazet
2010-12-13  8:33     ` Nick Piggin
2010-12-14 12:40     ` Nick Piggin
2010-12-15  8:16       ` Andreas Dilger
2010-12-15 10:24         ` Nick Piggin
2010-12-13  2:53 ` rcu-walk and dcache scaling tree update and status Ed Tomlinson
2010-12-13  2:59   ` Nick Piggin
2010-12-13  3:45     ` Stephen Rothwell
2010-12-13  3:50       ` Nick Piggin
2010-12-13  3:40 ` Stephen Rothwell
2010-12-13  3:48   ` Nick Piggin
2010-12-14  0:03     ` Stephen Rothwell
2010-12-14  0:16       ` Stephen Rothwell

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.