All of lore.kernel.org
 help / color / mirror / Atom feed
* vfsmount lock issues on very large ppc64 box
@ 2011-07-17  0:50 Anton Blanchard
  2011-07-17  1:04 ` Matthew Wilcox
  0 siblings, 1 reply; 35+ messages in thread
From: Anton Blanchard @ 2011-07-17  0:50 UTC (permalink / raw)
  To: npiggin, viro; +Cc: linux-kernel, linux-fsdevel


When compiling a kernel with make -j on a ppc64 box with 896
HW threads we spend a very large amount of time in vfsmount
lock code:

    20.85%  [k] .vfsmount_lock_local_lock
            |
            |--57.20%-- .mntput_no_expire
            |          |
            |          |--74.93%-- .fput
            |          |          |
            |          |          |--91.19%-- .filp_close
            |          |          |          |
            |          |          |          |--98.97%-- .sys_close

    14.15%  [k] .vfsmount_lock_global_lock_online
            |
            |--100.00%-- .mntput_no_expire
            |          .fput
            |          .filp_close
            |          |
            |          |--70.01%-- .put_files_struct
            |          |          .do_exit
            |          |          .do_group_exit
            |          |          .sys_exit_group
            |          |          syscall_exit
            |          |
            |           --29.99%-- .sys_close


Looking closer, all of these calls are in pipefs and sockfs.
Since we never mount either filesystem they never get a long term
reference and we always end up in the very slow write brlock path
that takes a lock for each online CPU.

Here is a quick hack that takes a long term reference on pipefs
and sockfs which fixes the problem. Any thoughts on how we should
fix it properly?

---
Signed-off-by: Anton Blanchard <anton@samba.org>

Index: linux-2.6-work/fs/pipe.c
===================================================================
--- linux-2.6-work.orig/fs/pipe.c	2011-07-17 10:21:54.695472158 +1000
+++ linux-2.6-work/fs/pipe.c	2011-07-17 10:33:31.127204731 +1000
@@ -20,6 +20,7 @@
 #include <linux/audit.h>
 #include <linux/syscalls.h>
 #include <linux/fcntl.h>
+#include "internal.h"
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -1286,11 +1287,13 @@ static int __init init_pipe_fs(void)
 			unregister_filesystem(&pipe_fs_type);
 		}
 	}
+	mnt_make_longterm(pipe_mnt);
 	return err;
 }
 
 static void __exit exit_pipe_fs(void)
 {
+	mnt_make_shortterm(pipe_mnt);
 	unregister_filesystem(&pipe_fs_type);
 	mntput(pipe_mnt);
 }
Index: linux-2.6-work/net/socket.c
===================================================================
--- linux-2.6-work.orig/net/socket.c	2011-07-17 10:21:54.685471989 +1000
+++ linux-2.6-work/net/socket.c	2011-07-17 10:33:41.247375257 +1000
@@ -2500,6 +2500,8 @@ void sock_unregister(int family)
 }
 EXPORT_SYMBOL(sock_unregister);
 
+extern void mnt_make_longterm(struct vfsmount *);
+
 static int __init sock_init(void)
 {
 	int err;
@@ -2530,6 +2532,8 @@ static int __init sock_init(void)
 		goto out_mount;
 	}
 
+	mnt_make_longterm(sock_mnt);
+
 	/* The real protocol initialization is performed in later initcalls.
 	 */
 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: vfsmount lock issues on very large ppc64 box
  2011-07-17  0:50 vfsmount lock issues on very large ppc64 box Anton Blanchard
@ 2011-07-17  1:04 ` Matthew Wilcox
  2011-07-17  8:46   ` Andi Kleen
  2011-07-18 16:41   ` vfsmount lock issues on very large ppc64 box Tim Chen
  0 siblings, 2 replies; 35+ messages in thread
From: Matthew Wilcox @ 2011-07-17  1:04 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: npiggin, viro, linux-kernel, linux-fsdevel, Andi Kleen, Tim Chen

On Sun, Jul 17, 2011 at 10:50:27AM +1000, Anton Blanchard wrote:
> Looking closer, all of these calls are in pipefs and sockfs.
> Since we never mount either filesystem they never get a long term
> reference and we always end up in the very slow write brlock path
> that takes a lock for each online CPU.
> 
> Here is a quick hack that takes a long term reference on pipefs
> and sockfs which fixes the problem. Any thoughts on how we should
> fix it properly?

I know Tim and Andi have been looking into this ... I forget what their
fix was though.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: vfsmount lock issues on very large ppc64 box
  2011-07-17  1:04 ` Matthew Wilcox
@ 2011-07-17  8:46   ` Andi Kleen
  2011-07-18  8:17     ` Eric Dumazet
  2011-07-18 16:41   ` vfsmount lock issues on very large ppc64 box Tim Chen
  1 sibling, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2011-07-17  8:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anton Blanchard, npiggin, viro, linux-kernel, linux-fsdevel,
	Andi Kleen, Tim Chen

On Sat, Jul 16, 2011 at 07:04:28PM -0600, Matthew Wilcox wrote:
> On Sun, Jul 17, 2011 at 10:50:27AM +1000, Anton Blanchard wrote:
> > Looking closer, all of these calls are in pipefs and sockfs.
> > Since we never mount either filesystem they never get a long term
> > reference and we always end up in the very slow write brlock path
> > that takes a lock for each online CPU.
> > 
> > Here is a quick hack that takes a long term reference on pipefs
> > and sockfs which fixes the problem. Any thoughts on how we should
> > fix it properly?
> 
> I know Tim and Andi have been looking into this ... I forget what their
> fix was though.

Tim posted a patch, but it wasn't applied for unknown reasons.

https://lkml.org/lkml/2011/4/13/561

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: vfsmount lock issues on very large ppc64 box
  2011-07-17  8:46   ` Andi Kleen
@ 2011-07-18  8:17     ` Eric Dumazet
  2011-07-18 15:40       ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2011-07-18  8:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Matthew Wilcox, Anton Blanchard, npiggin, viro, linux-kernel,
	linux-fsdevel, Tim Chen

Le dimanche 17 juillet 2011 à 10:46 +0200, Andi Kleen a écrit :
> On Sat, Jul 16, 2011 at 07:04:28PM -0600, Matthew Wilcox wrote:
> > On Sun, Jul 17, 2011 at 10:50:27AM +1000, Anton Blanchard wrote:
> > > Looking closer, all of these calls are in pipefs and sockfs.
> > > Since we never mount either filesystem they never get a long term
> > > reference and we always end up in the very slow write brlock path
> > > that takes a lock for each online CPU.
> > > 
> > > Here is a quick hack that takes a long term reference on pipefs
> > > and sockfs which fixes the problem. Any thoughts on how we should
> > > fix it properly?
> > 
> > I know Tim and Andi have been looking into this ... I forget what their
> > fix was though.
> 
> Tim posted a patch, but it wasn't applied for unknown reasons.
> 
> https://lkml.org/lkml/2011/4/13/561
> 

Could Tim respin his patch then ?

This problem hits even a two sockets machine.




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: vfsmount lock issues on very large ppc64 box
  2011-07-18  8:17     ` Eric Dumazet
@ 2011-07-18 15:40       ` Christoph Hellwig
  2011-07-18 15:51         ` Al Viro
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2011-07-18 15:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Matthew Wilcox, Anton Blanchard, npiggin, viro,
	linux-kernel, linux-fsdevel, Tim Chen

On Mon, Jul 18, 2011 at 10:17:08AM +0200, Eric Dumazet wrote:
> > Tim posted a patch, but it wasn't applied for unknown reasons.
> > 
> > https://lkml.org/lkml/2011/4/13/561
> > 
> 
> Could Tim respin his patch then ?

Btw, it seems like most kern_mount users want this and he's still
missing some like devtmpfs, sysfs or ipc.  What about splitting
kern_mount into a variant that gets a long-term reference for these
callers, and one that doesn't for afs/cifs/nfs automounts?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: vfsmount lock issues on very large ppc64 box
  2011-07-18 15:40       ` Christoph Hellwig
@ 2011-07-18 15:51         ` Al Viro
  2011-07-19 16:32           ` [Patch] VFS : mount lock scalability for files systems without mount point (WAS vfsmount lock issues on very large ppc64 box) Tim Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Al Viro @ 2011-07-18 15:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Dumazet, Andi Kleen, Matthew Wilcox, Anton Blanchard,
	npiggin, linux-kernel, linux-fsdevel, Tim Chen

On Mon, Jul 18, 2011 at 11:40:08AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 18, 2011 at 10:17:08AM +0200, Eric Dumazet wrote:
> > > Tim posted a patch, but it wasn't applied for unknown reasons.
> > > 
> > > https://lkml.org/lkml/2011/4/13/561
> > > 
> > 
> > Could Tim respin his patch then ?
> 
> Btw, it seems like most kern_mount users want this and he's still
> missing some like devtmpfs, sysfs or ipc.  What about splitting
> kern_mount into a variant that gets a long-term reference for these
> callers, and one that doesn't for afs/cifs/nfs automounts?

	Careful - we need to balance that on shutdown side with
mnt_make_shortterm() before the final mntput()...  Making it too
easy just on the kern_mount side will lead to easy-to-miss bugs.
For one thing, it's visible only on SMP boxen; for another there's
a lot of such internal vfsmounts (pipefs, sockets, etc.) that
are never shut down, so there'll be no easily copied examples of
what should not be forgotten on __exit side of things in obvious
places...

	FWIW, I had missed the original posting back in April - had been
without net.access then and it hadn't been resent when I came back, asked
people to resend the important ones and shitcanned the huge pile of pending
mail from the previous month...

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: vfsmount lock issues on very large ppc64 box
  2011-07-17  1:04 ` Matthew Wilcox
  2011-07-17  8:46   ` Andi Kleen
@ 2011-07-18 16:41   ` Tim Chen
  1 sibling, 0 replies; 35+ messages in thread
From: Tim Chen @ 2011-07-18 16:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anton Blanchard, npiggin, viro, linux-kernel, linux-fsdevel,
	Andi Kleen, Alexander Viro



On Sat, 2011-07-16 at 19:04 -0600, Matthew Wilcox wrote:
> On Sun, Jul 17, 2011 at 10:50:27AM +1000, Anton Blanchard wrote:
> > Looking closer, all of these calls are in pipefs and sockfs.
> > Since we never mount either filesystem they never get a long term
> > reference and we always end up in the very slow write brlock path
> > that takes a lock for each online CPU.
> > 
> > Here is a quick hack that takes a long term reference on pipefs
> > and sockfs which fixes the problem. Any thoughts on how we should
> > fix it properly?
> 
> I know Tim and Andi have been looking into this ... I forget what their
> fix was though.
> 

I've sent out a trial patch a while ago on this problem, but didn't get
a response back. Nick or Al, can you take another look at this patch now
that other folks are also running into this issue?

http://marc.info/?l=linux-fsdevel&m=130273200502778&w=2

Thanks.

Tim


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Patch] VFS : mount lock scalability for files systems without mount point   (WAS vfsmount lock issues on very large ppc64 box)
  2011-07-18 15:51         ` Al Viro
@ 2011-07-19 16:32           ` Tim Chen
  2011-07-21 20:40             ` Al Viro
  0 siblings, 1 reply; 35+ messages in thread
From: Tim Chen @ 2011-07-19 16:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Eric Dumazet, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel

On Mon, 2011-07-18 at 16:51 +0100, Al Viro wrote:

> 
> 	Careful - we need to balance that on shutdown side with
> mnt_make_shortterm() before the final mntput()...  Making it too
> easy just on the kern_mount side will lead to easy-to-miss bugs.
> For one thing, it's visible only on SMP boxen; for another there's
> a lot of such internal vfsmounts (pipefs, sockets, etc.) that
> are never shut down, so there'll be no easily copied examples of
> what should not be forgotten on __exit side of things in obvious
> places...
> 

Al,

I've respun my patch to try to address your comments. 
Any further suggestions are welcomed.

Thanks.

Tim

---------------

For a number of file systems that don't have a mount point (e.g. sockfs
and pipefs), they are not marked as long term. Therefore in
mntput_no_expire, all locks in vfs_mount lock are taken instead of just
local cpu's lock to aggregate reference counts when we release
reference to file objects.  In fact, only local lock need to have been
taken to update ref counts as these file systems are in no danger of
going away until we are ready to unregister them. 

The attached patch marks file systems using kern_mount without 
mount point as long term.  The contentions of vfs_mount lock 
is now eliminated.  Before un-registering such file system,
kern_unmount should be called to remove the long term flag and
make the mount point ready to be freed. 

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index 3f92731..708e669 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -1193,6 +1193,7 @@ static void __exit cleanup_mtdchar(void)
 {
 	unregister_mtd_user(&mtdchar_notifier);
 	mntput(mtd_inode_mnt);
+	kern_unmount(mtd_inode_mnt);
 	unregister_filesystem(&mtd_inodefs_type);
 	__unregister_chrdev(MTD_CHAR_MAJOR, 0, 1 << MINORBITS, "mtd");
 }
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index c5567cb..3b7db2b 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -234,6 +234,7 @@ static int __init anon_inode_init(void)
 
 err_mntput:
 	mntput(anon_inode_mnt);
+	kern_unmount(anon_inode_mnt);
 err_unregister_filesystem:
 	unregister_filesystem(&anon_inode_fs_type);
 err_exit:
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7aafeb8..0b686ce 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1030,6 +1030,7 @@ static int __init init_hugetlbfs_fs(void)
 static void __exit exit_hugetlbfs_fs(void)
 {
 	kmem_cache_destroy(hugetlbfs_inode_cachep);
+	kern_unmount(hugetlbfs_vfsmount);
 	unregister_filesystem(&hugetlbfs_fs_type);
 	bdi_destroy(&hugetlbfs_backing_dev_info);
 }
diff --git a/fs/namespace.c b/fs/namespace.c
index fe59bd1..ae8c358 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2386,6 +2386,29 @@ void mnt_make_shortterm(struct vfsmount *mnt)
 #endif
 }
 
+struct vfsmount *kern_mount(struct file_system_type *type)
+{
+	struct vfsmount *mnt;
+
+	mnt = kern_mount_data(type, NULL);
+	if (!IS_ERR(mnt)) {
+		/* it is a longterm mount, don't release mnt until */
+		/* we unmount before file sys is unregistered */
+		mnt_make_longterm(mnt);
+		mntget();
+	}
+	return mnt;
+}
+
+void kern_unmount(struct vfsmount *mnt)
+{
+	/* release long term mount so mount point can be released */
+	if (!IS_ERR_OR_NULL(mnt)) {
+		mnt_make_shortterm(mnt);
+		mntput();
+	}
+}
+
 /*
  * Allocate a new namespace structure and populate it with contents
  * copied from the namespace of the passed in task structure.
diff --git a/fs/pipe.c b/fs/pipe.c
index da42f7d..26552aa 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1291,6 +1291,7 @@ static int __init init_pipe_fs(void)
 
 static void __exit exit_pipe_fs(void)
 {
+	kern_unmount(pipe_mnt);
 	unregister_filesystem(&pipe_fs_type);
 	mntput(pipe_mnt);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5b9792..79f2dae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1868,7 +1868,8 @@ static inline int sb_is_dirty(struct super_block *sb)
 extern int register_filesystem(struct file_system_type *);
 extern int unregister_filesystem(struct file_system_type *);
 extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
-#define kern_mount(type) kern_mount_data(type, NULL)
+extern struct vfsmount *kern_mount(struct file_system_type *type);
+extern void kern_unmount(struct vfsmount *mnt);
 extern int may_umount_tree(struct vfsmount *);
 extern int may_umount(struct vfsmount *);
 extern long do_mount(char *, char *, char *, unsigned long, void *);
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 3545934..de7900e 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1984,6 +1984,7 @@ __initcall(init_sel_fs);
 void exit_sel_fs(void)
 {
 	kobject_put(selinuxfs_kobj);
+	kern_unmount(selinuxfs_mount);
 	unregister_filesystem(&sel_fs_type);
 }
 #endif



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [Patch] VFS : mount lock scalability for files systems without mount point   (WAS vfsmount lock issues on very large ppc64 box)
  2011-07-19 16:32           ` [Patch] VFS : mount lock scalability for files systems without mount point (WAS vfsmount lock issues on very large ppc64 box) Tim Chen
@ 2011-07-21 20:40             ` Al Viro
  2011-07-22  0:27               ` Tim Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Al Viro @ 2011-07-21 20:40 UTC (permalink / raw)
  To: Tim Chen
  Cc: Christoph Hellwig, Eric Dumazet, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel

On Tue, Jul 19, 2011 at 09:32:38AM -0700, Tim Chen wrote:
> @@ -1193,6 +1193,7 @@ static void __exit cleanup_mtdchar(void)
>  {
>  	unregister_mtd_user(&mtdchar_notifier);
>  	mntput(mtd_inode_mnt);
> +	kern_unmount(mtd_inode_mnt);

Surely you want to merge that mntput() in there...

> +void kern_unmount(struct vfsmount *mnt)
> +{
> +	/* release long term mount so mount point can be released */
> +	if (!IS_ERR_OR_NULL(mnt)) {
> +		mnt_make_shortterm(mnt);
> +		mntput();
> +	}
> +}

... and if you pass it the argument, it'll be much happier.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Patch] VFS : mount lock scalability for files systems without mount point   (WAS vfsmount lock issues on very large ppc64 box)
  2011-07-21 20:40             ` Al Viro
@ 2011-07-22  0:27               ` Tim Chen
  2011-07-23 13:24                 ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Tim Chen @ 2011-07-22  0:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Eric Dumazet, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel

On Thu, 2011-07-21 at 21:40 +0100, Al Viro wrote:
> On Tue, Jul 19, 2011 at 09:32:38AM -0700, Tim Chen wrote:
> > @@ -1193,6 +1193,7 @@ static void __exit cleanup_mtdchar(void)
> >  {
> >  	unregister_mtd_user(&mtdchar_notifier);
> >  	mntput(mtd_inode_mnt);
> > +	kern_unmount(mtd_inode_mnt);
> 
> Surely you want to merge that mntput() in there...
> 

I've now merged mntput found in file system exit code of mtdchar,
anon_inodes, and pipefs.  There wasn't any mntput originally in exit
code of hugetlbfs and selinux_fs.  I think the mntput in kern_unmount
for them should still be valid. 

> > +void kern_unmount(struct vfsmount *mnt)
> > +{
> > +	/* release long term mount so mount point can be released */
> > +	if (!IS_ERR_OR_NULL(mnt)) {
> > +		mnt_make_shortterm(mnt);
> > +		mntput();
> > +	}
> > +}
> 
> ... and if you pass it the argument, it'll be much happier.

I must be pretty brain dead in the morning and not tested the 
mntput tweaks in the final patch sent properly.  My apology.  
It's fixed now.

Thanks.

Tim

-------------

For a number of file systems that don't have a mount point (e.g. sockfs
and pipefs), they are not marked as long term. Therefore in
mntput_no_expire, all locks in vfs_mount lock are taken instead of just
local cpu's lock to aggregate reference counts when we release
reference to file objects.  In fact, only local lock need to have been
taken to update ref counts as these file systems are in no danger of
going away until we are ready to unregister them. 

The attached patch marks file systems using kern_mount without 
actual mount point as long term.  The contentions of 
vfs_mount lock is now eliminated.  Before un-registering such file
system, kern_unmount should be called to remove the long term flag and
make the mount point ready to be freed. 

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index 3f92731..f1af222 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -1192,7 +1192,7 @@ err_unregister_chdev:
 static void __exit cleanup_mtdchar(void)
 {
 	unregister_mtd_user(&mtdchar_notifier);
-	mntput(mtd_inode_mnt);
+	kern_unmount(mtd_inode_mnt);
 	unregister_filesystem(&mtd_inodefs_type);
 	__unregister_chrdev(MTD_CHAR_MAJOR, 0, 1 << MINORBITS, "mtd");
 }
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index c5567cb..4d433d3 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -233,7 +233,7 @@ static int __init anon_inode_init(void)
 	return 0;
 
 err_mntput:
-	mntput(anon_inode_mnt);
+	kern_unmount(anon_inode_mnt);
 err_unregister_filesystem:
 	unregister_filesystem(&anon_inode_fs_type);
 err_exit:
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7aafeb8..0b686ce 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1030,6 +1030,7 @@ static int __init init_hugetlbfs_fs(void)
 static void __exit exit_hugetlbfs_fs(void)
 {
 	kmem_cache_destroy(hugetlbfs_inode_cachep);
+	kern_unmount(hugetlbfs_vfsmount);
 	unregister_filesystem(&hugetlbfs_fs_type);
 	bdi_destroy(&hugetlbfs_backing_dev_info);
 }
diff --git a/fs/namespace.c b/fs/namespace.c
index fe59bd1..50d71a9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2386,6 +2386,28 @@ void mnt_make_shortterm(struct vfsmount *mnt)
 #endif
 }
 
+struct vfsmount *kern_mount(struct file_system_type *type)
+{
+	struct vfsmount *mnt;
+
+	mnt = kern_mount_data(type, NULL);
+	if (!IS_ERR(mnt)) {
+		/* it is a longterm mount, don't release mnt until */
+		/* we unmount before file sys is unregistered */
+		mnt_make_longterm(mnt);
+	}
+	return mnt;
+}
+
+void kern_unmount(struct vfsmount *mnt)
+{
+	/* release long term mount so mount point can be released */
+	if (!IS_ERR_OR_NULL(mnt)) {
+		mnt_make_shortterm(mnt);
+		mntput(mnt);
+	}
+}
+
 /*
  * Allocate a new namespace structure and populate it with contents
  * copied from the namespace of the passed in task structure.
diff --git a/fs/pipe.c b/fs/pipe.c
index da42f7d..1b7f9af 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1291,8 +1291,8 @@ static int __init init_pipe_fs(void)
 
 static void __exit exit_pipe_fs(void)
 {
+	kern_unmount(pipe_mnt);
 	unregister_filesystem(&pipe_fs_type);
-	mntput(pipe_mnt);
 }
 
 fs_initcall(init_pipe_fs);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5b9792..79f2dae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1868,7 +1868,8 @@ static inline int sb_is_dirty(struct super_block *sb)
 extern int register_filesystem(struct file_system_type *);
 extern int unregister_filesystem(struct file_system_type *);
 extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
-#define kern_mount(type) kern_mount_data(type, NULL)
+extern struct vfsmount *kern_mount(struct file_system_type *type);
+extern void kern_unmount(struct vfsmount *mnt);
 extern int may_umount_tree(struct vfsmount *);
 extern int may_umount(struct vfsmount *);
 extern long do_mount(char *, char *, char *, unsigned long, void *);
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 3545934..de7900e 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1984,6 +1984,7 @@ __initcall(init_sel_fs);
 void exit_sel_fs(void)
 {
 	kobject_put(selinuxfs_kobj);
+	kern_unmount(selinuxfs_mount);
 	unregister_filesystem(&sel_fs_type);
 }
 #endif




^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [Patch] VFS : mount lock scalability for files systems without mount point   (WAS vfsmount lock issues on very large ppc64 box)
  2011-07-22  0:27               ` Tim Chen
@ 2011-07-23 13:24                 ` Christoph Hellwig
  2011-07-25 22:39                   ` Tim Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2011-07-23 13:24 UTC (permalink / raw)
  To: Tim Chen
  Cc: Al Viro, Christoph Hellwig, Eric Dumazet, Andi Kleen,
	Matthew Wilcox, Anton Blanchard, npiggin, linux-kernel,
	linux-fsdevel

I think you actually want this done in kern_mount_data, as both
ipc and proc want long-term references as well.  I also suspect with
additional creep of container awareness more internal mounts will switch
to kern_mount_data.  Al, what do you think about simply passing the
private data argument to kern_mount and kill kern_mount_data?  It's not
like the additional argument is going to cause us any pain.


> +struct vfsmount *kern_mount(struct file_system_type *type)
> +{
> +	struct vfsmount *mnt;
> +
> +	mnt = kern_mount_data(type, NULL);
> +	if (!IS_ERR(mnt)) {
> +		/* it is a longterm mount, don't release mnt until */
> +		/* we unmount before file sys is unregistered */

Please use normal kernel comment style, e.g.

	/*
	 * This is a longterm mount, don't release mnt until we umount
	 * it just before unregister_filesystem().
	 */
	
Adding proper kerneldoc comments for the kern_mount/umount function that
explain things in more detail would also be nice.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Patch] VFS : mount lock scalability for files systems without mount point   (WAS vfsmount lock issues on very large ppc64 box)
  2011-07-23 13:24                 ` Christoph Hellwig
@ 2011-07-25 22:39                   ` Tim Chen
  2011-07-25 22:51                     ` Al Viro
  0 siblings, 1 reply; 35+ messages in thread
From: Tim Chen @ 2011-07-25 22:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, Eric Dumazet, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel

On Sat, 2011-07-23 at 09:24 -0400, Christoph Hellwig wrote:
> I think you actually want this done in kern_mount_data, as both
> ipc and proc want long-term references as well.  I also suspect with
> additional creep of container awareness more internal mounts will switch
> to kern_mount_data.  Al, what do you think about simply passing the
> private data argument to kern_mount and kill kern_mount_data?  It's not
> like the additional argument is going to cause us any pain.
> 
> 

I've updated the patch and merged mnt_make_longterm into
kern_mount_data to address your comments.  I've left the kern_mount
and kern_mount_data as separate functions for now.  Let me know
if removing kern_mount_data is preferable instead.

> Please use normal kernel comment style, e.g.
> 
> 	/*
> 	 * This is a longterm mount, don't release mnt until we umount
> 	 * it just before unregister_filesystem().
> 	 */
> 	
> Adding proper kerneldoc comments for the kern_mount/umount function that
> explain things in more detail would also be nice.
> 

I've added kerneldoc comments to kern_mount.

Thanks.

Tim

------------
For a number of file systems that don't have a mount point (e.g. sockfs
and pipefs), they are not marked as long term. Therefore in
mntput_no_expire, all locks in vfs_mount lock are taken instead of just
local cpu's lock to aggregate reference counts when we release
reference to file objects.  In fact, only local lock need to have been
taken to update ref counts as these file systems are in no danger of
going away until we are ready to unregister them. 

The attached patch marks file systems using kern_mount without 
mount point as long term.  The contentions of vfs_mount lock 
is now eliminated.  Before un-registering such file system,
kern_unmount should be called to remove the long term flag and
make the mount point ready to be freed. 

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index 3f92731..f1af222 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -1192,7 +1192,7 @@ err_unregister_chdev:
 static void __exit cleanup_mtdchar(void)
 {
 	unregister_mtd_user(&mtdchar_notifier);
-	mntput(mtd_inode_mnt);
+	kern_unmount(mtd_inode_mnt);
 	unregister_filesystem(&mtd_inodefs_type);
 	__unregister_chrdev(MTD_CHAR_MAJOR, 0, 1 << MINORBITS, "mtd");
 }
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index c5567cb..4d433d3 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -233,7 +233,7 @@ static int __init anon_inode_init(void)
 	return 0;
 
 err_mntput:
-	mntput(anon_inode_mnt);
+	kern_unmount(anon_inode_mnt);
 err_unregister_filesystem:
 	unregister_filesystem(&anon_inode_fs_type);
 err_exit:
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7aafeb8..0b686ce 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1030,6 +1030,7 @@ static int __init init_hugetlbfs_fs(void)
 static void __exit exit_hugetlbfs_fs(void)
 {
 	kmem_cache_destroy(hugetlbfs_inode_cachep);
+	kern_unmount(hugetlbfs_vfsmount);
 	unregister_filesystem(&hugetlbfs_fs_type);
 	bdi_destroy(&hugetlbfs_backing_dev_info);
 }
diff --git a/fs/namespace.c b/fs/namespace.c
index fe59bd1..78b5a70 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2386,6 +2386,36 @@ void mnt_make_shortterm(struct vfsmount *mnt)
 #endif
 }
 
+/**
+ * kern_mount - mounting internal file system by kernel 
+ * @type: file system type
+ *
+ * This is called to mount internal file system and returns
+ * mount point.  When the file system is unregistered, kern_unmount 
+ * should be called to release the mount point.
+ */
+struct vfsmount *kern_mount(struct file_system_type *type)
+{
+	return kern_mount_data(type, NULL);
+}
+
+/**
+ * kern_unmount - unmounting internal file system mounted by kernel
+ * @mnt: file system mount point
+ *
+ * This is called to unmount internal file system and release
+ * the mount point associated.  It should be called before
+ * an internal file system mounted by kernel is unregistered. 
+ */
+void kern_unmount(struct vfsmount *mnt)
+{
+	/* release long term mount so mount point can be released */
+	if (!IS_ERR_OR_NULL(mnt)) {
+		mnt_make_shortterm(mnt);
+		mntput(mnt);
+	}
+}
+
 /*
  * Allocate a new namespace structure and populate it with contents
  * copied from the namespace of the passed in task structure.
@@ -2719,8 +2749,27 @@ void put_mnt_ns(struct mnt_namespace *ns)
 }
 EXPORT_SYMBOL(put_mnt_ns);
 
+/**
+ * kern_mount_data - mounting internal file system by kernel 
+ * @type: file system type
+ * @data: pointer to mount data
+ *
+ * This is called to mount internal file system and returns
+ * mount point.  When the file system is unregistered, kern_unmount 
+ * should be called to release the mount point.
+ */
 struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
 {
-	return vfs_kern_mount(type, MS_KERNMOUNT, type->name, data);
+	struct vfsmount *mnt;
+
+	mnt = vfs_kern_mount(type, MS_KERNMOUNT, type->name, data);
+	if (!IS_ERR(mnt)) {
+		/* 
+		 * This is a longterm mount, don't release mnt until we
+		 * unmount it just before unregister_filesystem(). 
+		 */
+		mnt_make_longterm(mnt);
+	}
+	return mnt;
 }
 EXPORT_SYMBOL_GPL(kern_mount_data);
diff --git a/fs/pipe.c b/fs/pipe.c
index da42f7d..1b7f9af 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1291,8 +1291,8 @@ static int __init init_pipe_fs(void)
 
 static void __exit exit_pipe_fs(void)
 {
+	kern_unmount(pipe_mnt);
 	unregister_filesystem(&pipe_fs_type);
-	mntput(pipe_mnt);
 }
 
 fs_initcall(init_pipe_fs);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5b9792..79f2dae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1868,7 +1868,8 @@ static inline int sb_is_dirty(struct super_block *sb)
 extern int register_filesystem(struct file_system_type *);
 extern int unregister_filesystem(struct file_system_type *);
 extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
-#define kern_mount(type) kern_mount_data(type, NULL)
+extern struct vfsmount *kern_mount(struct file_system_type *type);
+extern void kern_unmount(struct vfsmount *mnt);
 extern int may_umount_tree(struct vfsmount *);
 extern int may_umount(struct vfsmount *);
 extern long do_mount(char *, char *, char *, unsigned long, void *);
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 3545934..de7900e 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1984,6 +1984,7 @@ __initcall(init_sel_fs);
 void exit_sel_fs(void)
 {
 	kobject_put(selinuxfs_kobj);
+	kern_unmount(selinuxfs_mount);
 	unregister_filesystem(&sel_fs_type);
 }
 #endif



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [Patch] VFS : mount lock scalability for files systems without mount point   (WAS vfsmount lock issues on very large ppc64 box)
  2011-07-25 22:39                   ` Tim Chen
@ 2011-07-25 22:51                     ` Al Viro
  2011-07-25 23:22                       ` Tim Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Al Viro @ 2011-07-25 22:51 UTC (permalink / raw)
  To: Tim Chen
  Cc: Christoph Hellwig, Eric Dumazet, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel

On Mon, Jul 25, 2011 at 03:39:10PM -0700, Tim Chen wrote:
> On Sat, 2011-07-23 at 09:24 -0400, Christoph Hellwig wrote:
> > I think you actually want this done in kern_mount_data, as both
> > ipc and proc want long-term references as well.  I also suspect with
> > additional creep of container awareness more internal mounts will switch
> > to kern_mount_data.  Al, what do you think about simply passing the
> > private data argument to kern_mount and kill kern_mount_data?  It's not
> > like the additional argument is going to cause us any pain.
> > 
> > 
> 
> I've updated the patch and merged mnt_make_longterm into
> kern_mount_data to address your comments.  I've left the kern_mount
> and kern_mount_data as separate functions for now.  Let me know
> if removing kern_mount_data is preferable instead.

It's mostly merged in mainline already...  BTW, you forgot to export
that sucker - mtdchar can be modular.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Patch] VFS : mount lock scalability for files systems without mount point   (WAS vfsmount lock issues on very large ppc64 box)
  2011-07-25 22:51                     ` Al Viro
@ 2011-07-25 23:22                       ` Tim Chen
  2011-07-26  6:00                         ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: Tim Chen @ 2011-07-25 23:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Eric Dumazet, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel

On Mon, 2011-07-25 at 23:51 +0100, Al Viro wrote:

> It's mostly merged in mainline already...  BTW, you forgot to export
> that sucker - mtdchar can be modular.

Thanks for fixing it up and merging it.

Tim


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Patch] VFS : mount lock scalability for files systems without mount point   (WAS vfsmount lock issues on very large ppc64 box)
  2011-07-25 23:22                       ` Tim Chen
@ 2011-07-26  6:00                         ` Eric Dumazet
  2011-07-26  8:21                             ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2011-07-26  6:00 UTC (permalink / raw)
  To: Tim Chen
  Cc: Al Viro, Christoph Hellwig, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel

Le lundi 25 juillet 2011 à 16:22 -0700, Tim Chen a écrit :
> On Mon, 2011-07-25 at 23:51 +0100, Al Viro wrote:
> 
> > It's mostly merged in mainline already...  BTW, you forgot to export
> > that sucker - mtdchar can be modular.
> 
> Thanks for fixing it up and merging it.
> 

Next step is to not chain pipes/sockets into superblock s_inodes list

inode_sb_list_add()/inode_sb_list_del() is the very last contention
point because of spin_lock(&inode_sb_list_lock);




^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list
  2011-07-26  6:00                         ` Eric Dumazet
@ 2011-07-26  8:21                             ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-26  8:21 UTC (permalink / raw)
  To: Tim Chen, Al Viro, David Miller
  Cc: Christoph Hellwig, Andi Kleen, Matthew Wilcox, Anton Blanchard,
	npiggin, linux-kernel, linux-fsdevel, netdev

Le mardi 26 juillet 2011 à 08:00 +0200, Eric Dumazet a écrit :

> Next step is to not chain pipes/sockets into superblock s_inodes list
> 
> inode_sb_list_add()/inode_sb_list_del() is the very last contention
> point because of spin_lock(&inode_sb_list_lock);

Well, not 'last' contention point, as we still hit remove_inode_hash(),
inode_wb_list_del(), inode_lru_list_del(), but thats a clear win on my
2x4x2 machine : 9 seconds instead of 22 on a close(socket()) benchmark.


[PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list

Workloads using pipes and sockets hit inode_sb_list_lock contention.

superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/anon_inodes.c   |    2 +-
 fs/inode.c         |   31 +++++++++++++++++++++----------
 fs/pipe.c          |    2 +-
 include/linux/fs.h |    3 ++-
 net/socket.c       |    2 +-
 5 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 4d433d3..269499e 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -187,7 +187,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
  */
 static struct inode *anon_inode_mkinode(void)
 {
-	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+	struct inode *inode = __new_inode(anon_inode_mnt->mnt_sb);
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
diff --git a/fs/inode.c b/fs/inode.c
index 96c77b8..8a6d62b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -362,9 +362,11 @@ EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	spin_lock(&inode_sb_list_lock);
-	list_del_init(&inode->i_sb_list);
-	spin_unlock(&inode_sb_list_lock);
+	if (!list_empty(&inode->i_sb_list)) {
+		spin_lock(&inode_sb_list_lock);
+		list_del_init(&inode->i_sb_list);
+		spin_unlock(&inode_sb_list_lock);
+	}
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -796,6 +798,19 @@ unsigned int get_next_ino(void)
 }
 EXPORT_SYMBOL(get_next_ino);
 
+struct inode *__new_inode(struct super_block *sb)
+{
+	struct inode *inode = alloc_inode(sb);
+
+	if (inode) {
+		spin_lock(&inode->i_lock);
+		inode->i_state = 0;
+		spin_unlock(&inode->i_lock);
+		INIT_LIST_HEAD(&inode->i_sb_list);
+	}
+	return inode;
+}
+
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -814,13 +829,9 @@ struct inode *new_inode(struct super_block *sb)
 
 	spin_lock_prefetch(&inode_sb_list_lock);
 
-	inode = alloc_inode(sb);
-	if (inode) {
-		spin_lock(&inode->i_lock);
-		inode->i_state = 0;
-		spin_unlock(&inode->i_lock);
-		inode_sb_list_add(inode);
-	}
+	inode = __new_inode(sb);
+	if (inode)
+			inode_sb_list_add(inode);
 	return inode;
 }
 EXPORT_SYMBOL(new_inode);
diff --git a/fs/pipe.c b/fs/pipe.c
index 1b7f9af..937b962 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -948,7 +948,7 @@ static const struct dentry_operations pipefs_dentry_operations = {
 
 static struct inode * get_pipe_inode(void)
 {
-	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+	struct inode *inode = __new_inode(pipe_mnt->mnt_sb);
 	struct pipe_inode_info *pipe;
 
 	if (!inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a665804..60be54f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2310,7 +2310,8 @@ extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void __destroy_inode(struct inode *);
-extern struct inode *new_inode(struct super_block *);
+extern struct inode *__new_inode(struct super_block *sb);
+extern struct inode *new_inode(struct super_block *sb);
 extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
diff --git a/net/socket.c b/net/socket.c
index 02dc82d..b4b8a08 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -467,7 +467,7 @@ static struct socket *sock_alloc(void)
 	struct inode *inode;
 	struct socket *sock;
 
-	inode = new_inode(sock_mnt->mnt_sb);
+	inode = __new_inode(sock_mnt->mnt_sb);
 	if (!inode)
 		return NULL;
 



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list
@ 2011-07-26  8:21                             ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-26  8:21 UTC (permalink / raw)
  To: Tim Chen, Al Viro, David Miller
  Cc: Christoph Hellwig, Andi Kleen, Matthew Wilcox, Anton Blanchard,
	npiggin, linux-kernel, linux-fsdevel, netdev

Le mardi 26 juillet 2011 à 08:00 +0200, Eric Dumazet a écrit :

> Next step is to not chain pipes/sockets into superblock s_inodes list
> 
> inode_sb_list_add()/inode_sb_list_del() is the very last contention
> point because of spin_lock(&inode_sb_list_lock);

Well, not 'last' contention point, as we still hit remove_inode_hash(),
inode_wb_list_del(), inode_lru_list_del(), but thats a clear win on my
2x4x2 machine : 9 seconds instead of 22 on a close(socket()) benchmark.


[PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list

Workloads using pipes and sockets hit inode_sb_list_lock contention.

superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/anon_inodes.c   |    2 +-
 fs/inode.c         |   31 +++++++++++++++++++++----------
 fs/pipe.c          |    2 +-
 include/linux/fs.h |    3 ++-
 net/socket.c       |    2 +-
 5 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 4d433d3..269499e 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -187,7 +187,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
  */
 static struct inode *anon_inode_mkinode(void)
 {
-	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+	struct inode *inode = __new_inode(anon_inode_mnt->mnt_sb);
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
diff --git a/fs/inode.c b/fs/inode.c
index 96c77b8..8a6d62b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -362,9 +362,11 @@ EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	spin_lock(&inode_sb_list_lock);
-	list_del_init(&inode->i_sb_list);
-	spin_unlock(&inode_sb_list_lock);
+	if (!list_empty(&inode->i_sb_list)) {
+		spin_lock(&inode_sb_list_lock);
+		list_del_init(&inode->i_sb_list);
+		spin_unlock(&inode_sb_list_lock);
+	}
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -796,6 +798,19 @@ unsigned int get_next_ino(void)
 }
 EXPORT_SYMBOL(get_next_ino);
 
+struct inode *__new_inode(struct super_block *sb)
+{
+	struct inode *inode = alloc_inode(sb);
+
+	if (inode) {
+		spin_lock(&inode->i_lock);
+		inode->i_state = 0;
+		spin_unlock(&inode->i_lock);
+		INIT_LIST_HEAD(&inode->i_sb_list);
+	}
+	return inode;
+}
+
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -814,13 +829,9 @@ struct inode *new_inode(struct super_block *sb)
 
 	spin_lock_prefetch(&inode_sb_list_lock);
 
-	inode = alloc_inode(sb);
-	if (inode) {
-		spin_lock(&inode->i_lock);
-		inode->i_state = 0;
-		spin_unlock(&inode->i_lock);
-		inode_sb_list_add(inode);
-	}
+	inode = __new_inode(sb);
+	if (inode)
+			inode_sb_list_add(inode);
 	return inode;
 }
 EXPORT_SYMBOL(new_inode);
diff --git a/fs/pipe.c b/fs/pipe.c
index 1b7f9af..937b962 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -948,7 +948,7 @@ static const struct dentry_operations pipefs_dentry_operations = {
 
 static struct inode * get_pipe_inode(void)
 {
-	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+	struct inode *inode = __new_inode(pipe_mnt->mnt_sb);
 	struct pipe_inode_info *pipe;
 
 	if (!inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a665804..60be54f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2310,7 +2310,8 @@ extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void __destroy_inode(struct inode *);
-extern struct inode *new_inode(struct super_block *);
+extern struct inode *__new_inode(struct super_block *sb);
+extern struct inode *new_inode(struct super_block *sb);
 extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
diff --git a/net/socket.c b/net/socket.c
index 02dc82d..b4b8a08 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -467,7 +467,7 @@ static struct socket *sock_alloc(void)
 	struct inode *inode;
 	struct socket *sock;
 
-	inode = new_inode(sock_mnt->mnt_sb);
+	inode = __new_inode(sock_mnt->mnt_sb);
 	if (!inode)
 		return NULL;
 


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list
  2011-07-26  8:21                             ` Eric Dumazet
  (?)
@ 2011-07-26  9:03                             ` Christoph Hellwig
  2011-07-26  9:36                               ` Eric Dumazet
  -1 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2011-07-26  9:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tim Chen, Al Viro, David Miller, Christoph Hellwig, Andi Kleen,
	Matthew Wilcox, Anton Blanchard, npiggin, linux-kernel,
	linux-fsdevel, netdev

On Tue, Jul 26, 2011 at 10:21:06AM +0200, Eric Dumazet wrote:
> Well, not 'last' contention point, as we still hit remove_inode_hash(),

There should be no ned to put pipe or anon inodes on the inode hash.
Probably sockets don't need it either, but I'd need to look at it in
detail.

> inode_wb_list_del()

The should never be on the wb list either, doing an unlocked check for
actually beeing on the list before taking the lock should help you.

> inode_lru_list_del(),

No real need to keep inodes in the LRU if we only allocate them using
new_inode but never look them up either.  You might want to try setting
.drop_inode to generic_delete_inode for these.

> +struct inode *__new_inode(struct super_block *sb)
> +{
> +	struct inode *inode = alloc_inode(sb);
> +
> +	if (inode) {
> +		spin_lock(&inode->i_lock);
> +		inode->i_state = 0;
> +		spin_unlock(&inode->i_lock);
> +		INIT_LIST_HEAD(&inode->i_sb_list);
> +	}
> +	return inode;
> +}

This needs a much better name like new_inode_pseudo, and a kerneldoc 
comment explaining when it is safe to use, and the consequences, which
appear to me:

 - fs may never be unmount
 - quotas can't work on the filesystem
 - writeback can't work on the filesystem

> @@ -814,13 +829,9 @@ struct inode *new_inode(struct super_block *sb)
>  
>  	spin_lock_prefetch(&inode_sb_list_lock);
>  
> -	inode = alloc_inode(sb);
> -	if (inode) {
> -		spin_lock(&inode->i_lock);
> -		inode->i_state = 0;
> -		spin_unlock(&inode->i_lock);
> -		inode_sb_list_add(inode);
> -	}
> +	inode = __new_inode(sb);
> +	if (inode)
> +			inode_sb_list_add(inode);

bad indentation.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list
  2011-07-26  9:03                             ` Christoph Hellwig
@ 2011-07-26  9:36                               ` Eric Dumazet
  2011-07-26  9:42                                 ` Christoph Hellwig
  2011-07-27 15:21                                   ` Eric Dumazet
  0 siblings, 2 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-26  9:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tim Chen, Al Viro, David Miller, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mardi 26 juillet 2011 à 05:03 -0400, Christoph Hellwig a écrit :
> On Tue, Jul 26, 2011 at 10:21:06AM +0200, Eric Dumazet wrote:
> > Well, not 'last' contention point, as we still hit remove_inode_hash(),
> 
> There should be no ned to put pipe or anon inodes on the inode hash.
> Probably sockets don't need it either, but I'd need to look at it in
> detail.
> 
> > inode_wb_list_del()
> 
> The should never be on the wb list either, doing an unlocked check for
> actually beeing on the list before taking the lock should help you.

Yes, it might even help regular inodes ;)

> 
> > inode_lru_list_del(),
> 
> No real need to keep inodes in the LRU if we only allocate them using
> new_inode but never look them up either.  You might want to try setting
> .drop_inode to generic_delete_inode for these.

Yes, I'll take a look, thanks.

> 
> > +struct inode *__new_inode(struct super_block *sb)
> > +{
> > +	struct inode *inode = alloc_inode(sb);
> > +
> > +	if (inode) {
> > +		spin_lock(&inode->i_lock);
> > +		inode->i_state = 0;
> > +		spin_unlock(&inode->i_lock);
> > +		INIT_LIST_HEAD(&inode->i_sb_list);
> > +	}
> > +	return inode;
> > +}
> 
> This needs a much better name like new_inode_pseudo, and a kerneldoc 
> comment explaining when it is safe to use, and the consequences, which
> appear to me:
> 
>  - fs may never be unmount
>  - quotas can't work on the filesystem
>  - writeback can't work on the filesystem

Thanks for reviewing, here is v2 of the patch, addressing your comments.


[PATCH v2] vfs: dont chain pipe/anon/socket on superblock s_inodes list

Workloads using pipes and sockets hit inode_sb_list_lock contention.

superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
v2: address Christoph comments

 fs/anon_inodes.c   |    2 +-
 fs/inode.c         |   39 ++++++++++++++++++++++++++++++---------
 fs/pipe.c          |    2 +-
 include/linux/fs.h |    3 ++-
 net/socket.c       |    2 +-
 5 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 4d433d3..f11e43e 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -187,7 +187,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
  */
 static struct inode *anon_inode_mkinode(void)
 {
-	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+	struct inode *inode = new_inode_pseudo(anon_inode_mnt->mnt_sb);
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
diff --git a/fs/inode.c b/fs/inode.c
index 96c77b8..319b93b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -362,9 +362,11 @@ EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	spin_lock(&inode_sb_list_lock);
-	list_del_init(&inode->i_sb_list);
-	spin_unlock(&inode_sb_list_lock);
+	if (!list_empty(&inode->i_sb_list)) {
+		spin_lock(&inode_sb_list_lock);
+		list_del_init(&inode->i_sb_list);
+		spin_unlock(&inode_sb_list_lock);
+	}
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -797,6 +799,29 @@ unsigned int get_next_ino(void)
 EXPORT_SYMBOL(get_next_ino);
 
 /**
+ *	new_inode_pseudo 	- obtain an inode
+ *	@sb: superblock
+ *
+ *	Allocates a new inode for given superblock.
+ *	Inode wont be chained in superblock s_inodes list
+ *	This means :
+ *	- fs can't be unmount
+ *	- quotas, fsnotify, writeback can't work
+ */
+struct inode *new_inode_pseudo(struct super_block *sb)
+{
+	struct inode *inode = alloc_inode(sb);
+
+	if (inode) {
+		spin_lock(&inode->i_lock);
+		inode->i_state = 0;
+		spin_unlock(&inode->i_lock);
+		INIT_LIST_HEAD(&inode->i_sb_list);
+	}
+	return inode;
+}
+
+/**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
  *
@@ -814,13 +839,9 @@ struct inode *new_inode(struct super_block *sb)
 
 	spin_lock_prefetch(&inode_sb_list_lock);
 
-	inode = alloc_inode(sb);
-	if (inode) {
-		spin_lock(&inode->i_lock);
-		inode->i_state = 0;
-		spin_unlock(&inode->i_lock);
+	inode = new_inode_pseudo(sb);
+	if (inode)
 		inode_sb_list_add(inode);
-	}
 	return inode;
 }
 EXPORT_SYMBOL(new_inode);
diff --git a/fs/pipe.c b/fs/pipe.c
index 1b7f9af..0e0be1d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -948,7 +948,7 @@ static const struct dentry_operations pipefs_dentry_operations = {
 
 static struct inode * get_pipe_inode(void)
 {
-	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+	struct inode *inode = new_inode_pseudo(pipe_mnt->mnt_sb);
 	struct pipe_inode_info *pipe;
 
 	if (!inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a665804..cc363fa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2310,7 +2310,8 @@ extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void __destroy_inode(struct inode *);
-extern struct inode *new_inode(struct super_block *);
+extern struct inode *new_inode_pseudo(struct super_block *sb);
+extern struct inode *new_inode(struct super_block *sb);
 extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
diff --git a/net/socket.c b/net/socket.c
index 02dc82d..26ed35c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -467,7 +467,7 @@ static struct socket *sock_alloc(void)
 	struct inode *inode;
 	struct socket *sock;
 
-	inode = new_inode(sock_mnt->mnt_sb);
+	inode = new_inode_pseudo(sock_mnt->mnt_sb);
 	if (!inode)
 		return NULL;
 



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list
  2011-07-26  9:36                               ` Eric Dumazet
@ 2011-07-26  9:42                                 ` Christoph Hellwig
  2011-07-26 10:43                                     ` Eric Dumazet
  2011-07-27 15:21                                   ` Eric Dumazet
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2011-07-26  9:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, Tim Chen, Al Viro, David Miller, Andi Kleen,
	Matthew Wilcox, Anton Blanchard, npiggin, linux-kernel,
	linux-fsdevel, netdev

On Tue, Jul 26, 2011 at 11:36:34AM +0200, Eric Dumazet wrote:
> [PATCH v2] vfs: dont chain pipe/anon/socket on superblock s_inodes list
> 
> Workloads using pipes and sockets hit inode_sb_list_lock contention.
> 
> superblock s_inodes list is needed for quota, dirty, pagecache and
> fsnotify management. pipe/anon/socket fs are clearly not candidates for
> these.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good to me,

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list
  2011-07-26  9:42                                 ` Christoph Hellwig
@ 2011-07-26 10:43                                     ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-26 10:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tim Chen, Al Viro, David Miller, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mardi 26 juillet 2011 à 05:42 -0400, Christoph Hellwig a écrit :
> On Tue, Jul 26, 2011 at 11:36:34AM +0200, Eric Dumazet wrote:
> > [PATCH v2] vfs: dont chain pipe/anon/socket on superblock s_inodes list
> > 
> > Workloads using pipes and sockets hit inode_sb_list_lock contention.
> > 
> > superblock s_inodes list is needed for quota, dirty, pagecache and
> > fsnotify management. pipe/anon/socket fs are clearly not candidates for
> > these.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Looks good to me,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 

Thanks !

BTW, we have one atomic op that could be avoided in new_inode()

spin_lock(&inode->i_lock);
inode->i_state = 0;
spin_unlock(&inode->i_lock);

can probably be changed to something less expensive...

inode->i_state = 0;
smp_wmb();

Not clear if we really need a memory barrier either....




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list
@ 2011-07-26 10:43                                     ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-26 10:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tim Chen, Al Viro, David Miller, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mardi 26 juillet 2011 à 05:42 -0400, Christoph Hellwig a écrit :
> On Tue, Jul 26, 2011 at 11:36:34AM +0200, Eric Dumazet wrote:
> > [PATCH v2] vfs: dont chain pipe/anon/socket on superblock s_inodes list
> > 
> > Workloads using pipes and sockets hit inode_sb_list_lock contention.
> > 
> > superblock s_inodes list is needed for quota, dirty, pagecache and
> > fsnotify management. pipe/anon/socket fs are clearly not candidates for
> > these.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Looks good to me,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 

Thanks !

BTW, we have one atomic op that could be avoided in new_inode()

spin_lock(&inode->i_lock);
inode->i_state = 0;
spin_unlock(&inode->i_lock);

can probably be changed to something less expensive...

inode->i_state = 0;
smp_wmb();

Not clear if we really need a memory barrier either....



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list
  2011-07-26 10:43                                     ` Eric Dumazet
  (?)
@ 2011-07-26 11:49                                     ` Christoph Hellwig
  -1 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2011-07-26 11:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, Tim Chen, Al Viro, David Miller, Andi Kleen,
	Matthew Wilcox, Anton Blanchard, npiggin, linux-kernel,
	linux-fsdevel, netdev

On Tue, Jul 26, 2011 at 12:43:33PM +0200, Eric Dumazet wrote:
> BTW, we have one atomic op that could be avoided in new_inode()
> 
> spin_lock(&inode->i_lock);
> inode->i_state = 0;
> spin_unlock(&inode->i_lock);
> 
> can probably be changed to something less expensive...
> 
> inode->i_state = 0;
> smp_wmb();
> 
> Not clear if we really need a memory barrier either....

I think we already had this in some of the earlier vfs/inode scale
series, but it got lost when Al asked to just put the fundamental
changes in.

For plain new_inode() the barrier shouldn't be needed as we take
the sb list lock just a little later.  I'm not sure about your new
variant, so I'll rather lave that to you.

There's a few other things missing from earlier iterations, most notable
the non-atomic i_count, and the bucket locks for the inode hash, if
you're eager enough to look into that area.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] vfs: avoid taking locks if inode not in lists
  2011-07-26  9:36                               ` Eric Dumazet
@ 2011-07-27 15:21                                   ` Eric Dumazet
  2011-07-27 15:21                                   ` Eric Dumazet
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-27 15:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tim Chen, Al Viro, David Miller, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mardi 26 juillet 2011 à 11:36 +0200, Eric Dumazet a écrit :
> Le mardi 26 juillet 2011 à 05:03 -0400, Christoph Hellwig a écrit :
> > On Tue, Jul 26, 2011 at 10:21:06AM +0200, Eric Dumazet wrote:
> > > Well, not 'last' contention point, as we still hit remove_inode_hash(),
> > 
> > There should be no ned to put pipe or anon inodes on the inode hash.
> > Probably sockets don't need it either, but I'd need to look at it in
> > detail.
> > 
> > > inode_wb_list_del()
> > 
> > The should never be on the wb list either, doing an unlocked check for
> > actually beeing on the list before taking the lock should help you.
> 
> Yes, it might even help regular inodes ;)
> 
> > 
> > > inode_lru_list_del(),
> > 
> > No real need to keep inodes in the LRU if we only allocate them using
> > new_inode but never look them up either.  You might want to try setting
> > .drop_inode to generic_delete_inode for these.
> 
> Yes, I'll take a look, thanks.

If I am not mistaken, we can add unlocked checks on the three hot spots.

After following patch, a close(socket(PF_INET, SOCK_DGRAM, 0)) pair on
my dev machine takes ~3us instead of ~9us.

Maybe its better to split it in three patches, just let me know.

22us -> 3us, thats a nice patch series ;)

Thanks

[PATCH] vfs: avoid taking locks if inode not in lists

sockets and pipes inodes destruction hits three possibly contended
locks :

system-wide inode_hash_lock in remove_inode_hash()
superblock s_inode_lru_lock in inode_lru_list_del()
bdi wb.list_lock in inode_wb_list_del()

Before even taking locks, we can perform an unlocked test to check if
inode can possibly be in the lists.

On a 2x4x2 machine, a close(socket()) pair can be 200% faster with these
changes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/fs-writeback.c |   10 ++++++----
 fs/inode.c        |    6 ++++++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1599aa9..8b90bdb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -182,11 +182,13 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
  */
 void inode_wb_list_del(struct inode *inode)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	if (!list_empty(&inode->i_wb_list)) {
+		struct backing_dev_info *bdi = inode_to_bdi(inode);
 
-	spin_lock(&bdi->wb.list_lock);
-	list_del_init(&inode->i_wb_list);
-	spin_unlock(&bdi->wb.list_lock);
+		spin_lock(&bdi->wb.list_lock);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&bdi->wb.list_lock);
+	}
 }
 
 /*
diff --git a/fs/inode.c b/fs/inode.c
index d0c72ff..796a420 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -338,6 +338,9 @@ static void inode_lru_list_add(struct inode *inode)
 
 static void inode_lru_list_del(struct inode *inode)
 {
+	if (list_empty(&inode->i_lru))
+		return;
+
 	spin_lock(&inode->i_sb->s_inode_lru_lock);
 	if (!list_empty(&inode->i_lru)) {
 		list_del_init(&inode->i_lru);
@@ -406,6 +409,9 @@ EXPORT_SYMBOL(__insert_inode_hash);
  */
 void remove_inode_hash(struct inode *inode)
 {
+	if (inode_unhashed(inode))
+		return;
+
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
 	hlist_del_init(&inode->i_hash);



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH] vfs: avoid taking locks if inode not in lists
@ 2011-07-27 15:21                                   ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-27 15:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tim Chen, Al Viro, David Miller, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mardi 26 juillet 2011 à 11:36 +0200, Eric Dumazet a écrit :
> Le mardi 26 juillet 2011 à 05:03 -0400, Christoph Hellwig a écrit :
> > On Tue, Jul 26, 2011 at 10:21:06AM +0200, Eric Dumazet wrote:
> > > Well, not 'last' contention point, as we still hit remove_inode_hash(),
> > 
> > There should be no ned to put pipe or anon inodes on the inode hash.
> > Probably sockets don't need it either, but I'd need to look at it in
> > detail.
> > 
> > > inode_wb_list_del()
> > 
> > The should never be on the wb list either, doing an unlocked check for
> > actually beeing on the list before taking the lock should help you.
> 
> Yes, it might even help regular inodes ;)
> 
> > 
> > > inode_lru_list_del(),
> > 
> > No real need to keep inodes in the LRU if we only allocate them using
> > new_inode but never look them up either.  You might want to try setting
> > .drop_inode to generic_delete_inode for these.
> 
> Yes, I'll take a look, thanks.

If I am not mistaken, we can add unlocked checks on the three hot spots.

After following patch, a close(socket(PF_INET, SOCK_DGRAM, 0)) pair on
my dev machine takes ~3us instead of ~9us.

Maybe its better to split it in three patches, just let me know.

22us -> 3us, thats a nice patch series ;)

Thanks

[PATCH] vfs: avoid taking locks if inode not in lists

sockets and pipes inodes destruction hits three possibly contended
locks :

system-wide inode_hash_lock in remove_inode_hash()
superblock s_inode_lru_lock in inode_lru_list_del()
bdi wb.list_lock in inode_wb_list_del()

Before even taking locks, we can perform an unlocked test to check if
inode can possibly be in the lists.

On a 2x4x2 machine, a close(socket()) pair can be 200% faster with these
changes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/fs-writeback.c |   10 ++++++----
 fs/inode.c        |    6 ++++++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1599aa9..8b90bdb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -182,11 +182,13 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
  */
 void inode_wb_list_del(struct inode *inode)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	if (!list_empty(&inode->i_wb_list)) {
+		struct backing_dev_info *bdi = inode_to_bdi(inode);
 
-	spin_lock(&bdi->wb.list_lock);
-	list_del_init(&inode->i_wb_list);
-	spin_unlock(&bdi->wb.list_lock);
+		spin_lock(&bdi->wb.list_lock);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&bdi->wb.list_lock);
+	}
 }
 
 /*
diff --git a/fs/inode.c b/fs/inode.c
index d0c72ff..796a420 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -338,6 +338,9 @@ static void inode_lru_list_add(struct inode *inode)
 
 static void inode_lru_list_del(struct inode *inode)
 {
+	if (list_empty(&inode->i_lru))
+		return;
+
 	spin_lock(&inode->i_sb->s_inode_lru_lock);
 	if (!list_empty(&inode->i_lru)) {
 		list_del_init(&inode->i_lru);
@@ -406,6 +409,9 @@ EXPORT_SYMBOL(__insert_inode_hash);
  */
 void remove_inode_hash(struct inode *inode)
 {
+	if (inode_unhashed(inode))
+		return;
+
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
 	hlist_del_init(&inode->i_hash);


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: avoid taking locks if inode not in lists
  2011-07-27 15:21                                   ` Eric Dumazet
  (?)
@ 2011-07-27 17:12                                   ` Andi Kleen
  -1 siblings, 0 replies; 35+ messages in thread
From: Andi Kleen @ 2011-07-27 17:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, Tim Chen, Al Viro, David Miller, Andi Kleen,
	Matthew Wilcox, Anton Blanchard, npiggin, linux-kernel,
	linux-fsdevel, netdev

> Before even taking locks, we can perform an unlocked test to check if
> inode can possibly be in the lists.
> 
> On a 2x4x2 machine, a close(socket()) pair can be 200% faster with these
> changes.

Great result! Seems simple and obvious to me.

-Andi


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: avoid taking locks if inode not in lists
  2011-07-27 15:21                                   ` Eric Dumazet
  (?)
  (?)
@ 2011-07-27 20:44                                   ` Christoph Hellwig
  2011-07-27 20:59                                     ` Andi Kleen
                                                       ` (2 more replies)
  -1 siblings, 3 replies; 35+ messages in thread
From: Christoph Hellwig @ 2011-07-27 20:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, Tim Chen, Al Viro, David Miller, Andi Kleen,
	Matthew Wilcox, Anton Blanchard, npiggin, linux-kernel,
	linux-fsdevel, netdev

On Wed, Jul 27, 2011 at 05:21:05PM +0200, Eric Dumazet wrote:
> If I am not mistaken, we can add unlocked checks on the three hot spots.
> 
> After following patch, a close(socket(PF_INET, SOCK_DGRAM, 0)) pair on
> my dev machine takes ~3us instead of ~9us.
> 
> Maybe its better to split it in three patches, just let me know.

I think three patches would be a lot cleaner.

As for safety of the unlocked checks:

 - inode are either hashed when created or never, so that one looks
   fine.
 - same for the sb list.
 - the writeback list is a bit more dynamic as we move things around
   quite a bit.  But in additon to the inode_wb_list_del call from
   evict() it only ever gets remove in writeback_single_inode, which
   for a freeing inode can only be called from the callers of evict().

Btw, I wonder if you should micro-optimize things a bit further by
moving the unhashed checks from the deletion functions into the callers
and thus save a function call for each of them.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: avoid taking locks if inode not in lists
  2011-07-27 20:44                                   ` Christoph Hellwig
@ 2011-07-27 20:59                                     ` Andi Kleen
  2011-07-27 21:01                                       ` Christoph Hellwig
  2011-07-28  4:41                                       ` Eric Dumazet
  2011-07-28  4:55                                       ` Eric Dumazet
  2 siblings, 1 reply; 35+ messages in thread
From: Andi Kleen @ 2011-07-27 20:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Dumazet, Tim Chen, Al Viro, David Miller, Andi Kleen,
	Matthew Wilcox, Anton Blanchard, npiggin, linux-kernel,
	linux-fsdevel, netdev

> Btw, I wonder if you should micro-optimize things a bit further by
> moving the unhashed checks from the deletion functions into the callers
> and thus save a function call for each of them.

If the caller is in the same file modern gcc is able to do that automatically
if you're lucky enough ("partial inlining")

I would not uglify the code for it.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: avoid taking locks if inode not in lists
  2011-07-27 20:59                                     ` Andi Kleen
@ 2011-07-27 21:01                                       ` Christoph Hellwig
  2011-07-28  4:11                                           ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2011-07-27 21:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Hellwig, Eric Dumazet, Tim Chen, Al Viro, David Miller,
	Matthew Wilcox, Anton Blanchard, npiggin, linux-kernel,
	linux-fsdevel, netdev

On Wed, Jul 27, 2011 at 10:59:57PM +0200, Andi Kleen wrote:
> > Btw, I wonder if you should micro-optimize things a bit further by
> > moving the unhashed checks from the deletion functions into the callers
> > and thus save a function call for each of them.
> 
> If the caller is in the same file modern gcc is able to do that automatically
> if you're lucky enough ("partial inlining")
> 
> I would not uglify the code for it.

Depending on how you look at it the code might actually be a tad
cleaner.  One of called functions is outside of inode.c.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] vfs: conditionally call inode_wb_list_del()
  2011-07-27 21:01                                       ` Christoph Hellwig
@ 2011-07-28  4:11                                           ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-28  4:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andi Kleen, Tim Chen, Al Viro, David Miller, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mercredi 27 juillet 2011 à 17:01 -0400, Christoph Hellwig a écrit :
> On Wed, Jul 27, 2011 at 10:59:57PM +0200, Andi Kleen wrote:
> > > Btw, I wonder if you should micro-optimize things a bit further by
> > > moving the unhashed checks from the deletion functions into the callers
> > > and thus save a function call for each of them.
> > 
> > If the caller is in the same file modern gcc is able to do that automatically
> > if you're lucky enough ("partial inlining")
> > 
> > I would not uglify the code for it.
> 
> Depending on how you look at it the code might actually be a tad
> cleaner.  One of called functions is outside of inode.c.
> 

Thats right, thanks again for your valuable input Christoph.

The following is a clear win, since we avoid the call to external
function.

[PATCH] vfs: conditionally call inode_wb_list_del()

Some inodes (pipes, sockets, ...) are not in bdi writeback list.

evict() can avoid calling inode_wb_list_del() and its expensive spinlock
by checking inode i_wb_list being empty or not.

At this point, no other cpu/user can concurrently manipulate this inode
i_wb_list

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/inode.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index d0c72ff..9dab13a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -454,7 +454,9 @@ static void evict(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(!list_empty(&inode->i_lru));
 
-	inode_wb_list_del(inode);
+	if (!list_empty(&inode->i_wb_list))
+		inode_wb_list_del(inode);
+
 	inode_sb_list_del(inode);
 
 	if (op->evict_inode) {



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH] vfs: conditionally call inode_wb_list_del()
@ 2011-07-28  4:11                                           ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-28  4:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andi Kleen, Tim Chen, Al Viro, David Miller, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mercredi 27 juillet 2011 à 17:01 -0400, Christoph Hellwig a écrit :
> On Wed, Jul 27, 2011 at 10:59:57PM +0200, Andi Kleen wrote:
> > > Btw, I wonder if you should micro-optimize things a bit further by
> > > moving the unhashed checks from the deletion functions into the callers
> > > and thus save a function call for each of them.
> > 
> > If the caller is in the same file modern gcc is able to do that automatically
> > if you're lucky enough ("partial inlining")
> > 
> > I would not uglify the code for it.
> 
> Depending on how you look at it the code might actually be a tad
> cleaner.  One of called functions is outside of inode.c.
> 

Thats right, thanks again for your valuable input Christoph.

The following is a clear win, since we avoid the call to external
function.

[PATCH] vfs: conditionally call inode_wb_list_del()

Some inodes (pipes, sockets, ...) are not in bdi writeback list.

evict() can avoid calling inode_wb_list_del() and its expensive spinlock
by checking inode i_wb_list being empty or not.

At this point, no other cpu/user can concurrently manipulate this inode
i_wb_list

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/inode.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index d0c72ff..9dab13a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -454,7 +454,9 @@ static void evict(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(!list_empty(&inode->i_lru));
 
-	inode_wb_list_del(inode);
+	if (!list_empty(&inode->i_wb_list))
+		inode_wb_list_del(inode);
+
 	inode_sb_list_del(inode);
 
 	if (op->evict_inode) {


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: avoid taking locks if inode not in lists
  2011-07-27 20:44                                   ` Christoph Hellwig
@ 2011-07-28  4:41                                       ` Eric Dumazet
  2011-07-28  4:41                                       ` Eric Dumazet
  2011-07-28  4:55                                       ` Eric Dumazet
  2 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-28  4:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tim Chen, Al Viro, David Miller, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mercredi 27 juillet 2011 à 16:44 -0400, Christoph Hellwig a écrit :
> On Wed, Jul 27, 2011 at 05:21:05PM +0200, Eric Dumazet wrote:
> > If I am not mistaken, we can add unlocked checks on the three hot spots.
> > 
> > After following patch, a close(socket(PF_INET, SOCK_DGRAM, 0)) pair on
> > my dev machine takes ~3us instead of ~9us.
> > 
> > Maybe its better to split it in three patches, just let me know.
> 
> I think three patches would be a lot cleaner.
> 
> As for safety of the unlocked checks:
> 
>  - inode are either hashed when created or never, so that one looks
>    fine.
>  - same for the sb list.
>  - the writeback list is a bit more dynamic as we move things around
>    quite a bit.  But in additon to the inode_wb_list_del call from
>    evict() it only ever gets remove in writeback_single_inode, which
>    for a freeing inode can only be called from the callers of evict().
> 
> Btw, I wonder if you should micro-optimize things a bit further by
> moving the unhashed checks from the deletion functions into the callers
> and thus save a function call for each of them.
> 

What about following patch, addressing the micro-optimization and Andi
Kleen concern about evict() readability ?

Thanks !

[PATCH] vfs: avoid taking inode_hash_lock on pipes and sockets

Some inodes (pipes, sockets, ...) are not hashed, no need to take
contended inode_hash_lock at dismantle time.

nice speedup on SMP machines on socket intensive workloads.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/inode.c         |    6 +++---
 include/linux/fs.h |    9 ++++++++-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d0c72ff..73b5598 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -399,12 +399,12 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 EXPORT_SYMBOL(__insert_inode_hash);
 
 /**
- *	remove_inode_hash - remove an inode from the hash
+ *	__remove_inode_hash - remove an inode from the hash
  *	@inode: inode to unhash
  *
  *	Remove an inode from the superblock.
  */
-void remove_inode_hash(struct inode *inode)
+void __remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
@@ -412,7 +412,7 @@ void remove_inode_hash(struct inode *inode)
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
-EXPORT_SYMBOL(remove_inode_hash);
+EXPORT_SYMBOL(__remove_inode_hash);
 
 void end_writeback(struct inode *inode)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f23bcb7..786b3b1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2317,11 +2317,18 @@ extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
 
 extern void __insert_inode_hash(struct inode *, unsigned long hashval);
-extern void remove_inode_hash(struct inode *);
 static inline void insert_inode_hash(struct inode *inode)
 {
 	__insert_inode_hash(inode, inode->i_ino);
 }
+
+extern void __remove_inode_hash(struct inode *);
+static inline void remove_inode_hash(struct inode *inode)
+{
+	if (!inode_unhashed(inode))
+		__remove_inode_hash(inode);
+}
+
 extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] vfs: avoid taking locks if inode not in lists
@ 2011-07-28  4:41                                       ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-28  4:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tim Chen, Al Viro, David Miller, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mercredi 27 juillet 2011 à 16:44 -0400, Christoph Hellwig a écrit :
> On Wed, Jul 27, 2011 at 05:21:05PM +0200, Eric Dumazet wrote:
> > If I am not mistaken, we can add unlocked checks on the three hot spots.
> > 
> > After following patch, a close(socket(PF_INET, SOCK_DGRAM, 0)) pair on
> > my dev machine takes ~3us instead of ~9us.
> > 
> > Maybe its better to split it in three patches, just let me know.
> 
> I think three patches would be a lot cleaner.
> 
> As for safety of the unlocked checks:
> 
>  - inode are either hashed when created or never, so that one looks
>    fine.
>  - same for the sb list.
>  - the writeback list is a bit more dynamic as we move things around
>    quite a bit.  But in additon to the inode_wb_list_del call from
>    evict() it only ever gets remove in writeback_single_inode, which
>    for a freeing inode can only be called from the callers of evict().
> 
> Btw, I wonder if you should micro-optimize things a bit further by
> moving the unhashed checks from the deletion functions into the callers
> and thus save a function call for each of them.
> 

What about following patch, addressing the micro-optimization and Andi
Kleen concern about evict() readability ?

Thanks !

[PATCH] vfs: avoid taking inode_hash_lock on pipes and sockets

Some inodes (pipes, sockets, ...) are not hashed, no need to take
contended inode_hash_lock at dismantle time.

nice speedup on SMP machines on socket intensive workloads.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/inode.c         |    6 +++---
 include/linux/fs.h |    9 ++++++++-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d0c72ff..73b5598 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -399,12 +399,12 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 EXPORT_SYMBOL(__insert_inode_hash);
 
 /**
- *	remove_inode_hash - remove an inode from the hash
+ *	__remove_inode_hash - remove an inode from the hash
  *	@inode: inode to unhash
  *
  *	Remove an inode from the superblock.
  */
-void remove_inode_hash(struct inode *inode)
+void __remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
@@ -412,7 +412,7 @@ void remove_inode_hash(struct inode *inode)
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
-EXPORT_SYMBOL(remove_inode_hash);
+EXPORT_SYMBOL(__remove_inode_hash);
 
 void end_writeback(struct inode *inode)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f23bcb7..786b3b1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2317,11 +2317,18 @@ extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
 
 extern void __insert_inode_hash(struct inode *, unsigned long hashval);
-extern void remove_inode_hash(struct inode *);
 static inline void insert_inode_hash(struct inode *inode)
 {
 	__insert_inode_hash(inode, inode->i_ino);
 }
+
+extern void __remove_inode_hash(struct inode *);
+static inline void remove_inode_hash(struct inode *inode)
+{
+	if (!inode_unhashed(inode))
+		__remove_inode_hash(inode);
+}
+
 extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH] vfs: avoid call to inode_lru_list_del() if possible
  2011-07-27 20:44                                   ` Christoph Hellwig
@ 2011-07-28  4:55                                       ` Eric Dumazet
  2011-07-28  4:41                                       ` Eric Dumazet
  2011-07-28  4:55                                       ` Eric Dumazet
  2 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-28  4:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tim Chen, Al Viro, David Miller, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mercredi 27 juillet 2011 à 16:44 -0400, Christoph Hellwig a écrit :
> On Wed, Jul 27, 2011 at 05:21:05PM +0200, Eric Dumazet wrote:
> > If I am not mistaken, we can add unlocked checks on the three hot spots.
> > 
> > After following patch, a close(socket(PF_INET, SOCK_DGRAM, 0)) pair on
> > my dev machine takes ~3us instead of ~9us.
> > 
> > Maybe its better to split it in three patches, just let me know.
> 
> I think three patches would be a lot cleaner.
> 
> As for safety of the unlocked checks:
> 
>  - inode are either hashed when created or never, so that one looks
>    fine.
>  - same for the sb list.
>  - the writeback list is a bit more dynamic as we move things around
>    quite a bit.  But in additon to the inode_wb_list_del call from
>    evict() it only ever gets remove in writeback_single_inode, which
>    for a freeing inode can only be called from the callers of evict().
> 
> Btw, I wonder if you should micro-optimize things a bit further by
> moving the unhashed checks from the deletion functions into the callers
> and thus save a function call for each of them.
> 

Here is the last patch, addressing inode_lru_list_del() call.

Only the call done from iput_final() can obviously benefit from checking
i_lru being empty or not, so it makes sense to perform the check at
caller site instead of doing it in inode_lru_list_del()

[PATCH] vfs: avoid call to inode_lru_list_del() if possible

inode_lru_list_del() is expensive because of per superblock lru locking,
while some inodes are not in lru list.

Adding a check in iput_final() can speedup pipe/sockets workloads on
SMP.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/inode.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index d0c72ff..b8b8939 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1328,7 +1328,8 @@ static void iput_final(struct inode *inode)
 	}
 
 	inode->i_state |= I_FREEING;
-	inode_lru_list_del(inode);
+	if (!list_empty(&inode->i_lru))
+		inode_lru_list_del(inode);
 	spin_unlock(&inode->i_lock);
 
 	evict(inode);



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH] vfs: avoid call to inode_lru_list_del() if possible
@ 2011-07-28  4:55                                       ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2011-07-28  4:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tim Chen, Al Viro, David Miller, Andi Kleen, Matthew Wilcox,
	Anton Blanchard, npiggin, linux-kernel, linux-fsdevel, netdev

Le mercredi 27 juillet 2011 à 16:44 -0400, Christoph Hellwig a écrit :
> On Wed, Jul 27, 2011 at 05:21:05PM +0200, Eric Dumazet wrote:
> > If I am not mistaken, we can add unlocked checks on the three hot spots.
> > 
> > After following patch, a close(socket(PF_INET, SOCK_DGRAM, 0)) pair on
> > my dev machine takes ~3us instead of ~9us.
> > 
> > Maybe its better to split it in three patches, just let me know.
> 
> I think three patches would be a lot cleaner.
> 
> As for safety of the unlocked checks:
> 
>  - inode are either hashed when created or never, so that one looks
>    fine.
>  - same for the sb list.
>  - the writeback list is a bit more dynamic as we move things around
>    quite a bit.  But in additon to the inode_wb_list_del call from
>    evict() it only ever gets remove in writeback_single_inode, which
>    for a freeing inode can only be called from the callers of evict().
> 
> Btw, I wonder if you should micro-optimize things a bit further by
> moving the unhashed checks from the deletion functions into the callers
> and thus save a function call for each of them.
> 

Here is the last patch, addressing inode_lru_list_del() call.

Only the call done from iput_final() can obviously benefit from checking
i_lru being empty or not, so it makes sense to perform the check at
caller site instead of doing it in inode_lru_list_del()

[PATCH] vfs: avoid call to inode_lru_list_del() if possible

inode_lru_list_del() is expensive because of per superblock lru locking,
while some inodes are not in lru list.

Adding a check in iput_final() can speedup pipe/sockets workloads on
SMP.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/inode.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index d0c72ff..b8b8939 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1328,7 +1328,8 @@ static void iput_final(struct inode *inode)
 	}
 
 	inode->i_state |= I_FREEING;
-	inode_lru_list_del(inode);
+	if (!list_empty(&inode->i_lru))
+		inode_lru_list_del(inode);
 	spin_unlock(&inode->i_lock);
 
 	evict(inode);


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2011-07-28  4:55 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-17  0:50 vfsmount lock issues on very large ppc64 box Anton Blanchard
2011-07-17  1:04 ` Matthew Wilcox
2011-07-17  8:46   ` Andi Kleen
2011-07-18  8:17     ` Eric Dumazet
2011-07-18 15:40       ` Christoph Hellwig
2011-07-18 15:51         ` Al Viro
2011-07-19 16:32           ` [Patch] VFS : mount lock scalability for files systems without mount point (WAS vfsmount lock issues on very large ppc64 box) Tim Chen
2011-07-21 20:40             ` Al Viro
2011-07-22  0:27               ` Tim Chen
2011-07-23 13:24                 ` Christoph Hellwig
2011-07-25 22:39                   ` Tim Chen
2011-07-25 22:51                     ` Al Viro
2011-07-25 23:22                       ` Tim Chen
2011-07-26  6:00                         ` Eric Dumazet
2011-07-26  8:21                           ` [PATCH] vfs: dont chain pipe/anon/socket on superblock s_inodes list Eric Dumazet
2011-07-26  8:21                             ` Eric Dumazet
2011-07-26  9:03                             ` Christoph Hellwig
2011-07-26  9:36                               ` Eric Dumazet
2011-07-26  9:42                                 ` Christoph Hellwig
2011-07-26 10:43                                   ` Eric Dumazet
2011-07-26 10:43                                     ` Eric Dumazet
2011-07-26 11:49                                     ` Christoph Hellwig
2011-07-27 15:21                                 ` [PATCH] vfs: avoid taking locks if inode not in lists Eric Dumazet
2011-07-27 15:21                                   ` Eric Dumazet
2011-07-27 17:12                                   ` Andi Kleen
2011-07-27 20:44                                   ` Christoph Hellwig
2011-07-27 20:59                                     ` Andi Kleen
2011-07-27 21:01                                       ` Christoph Hellwig
2011-07-28  4:11                                         ` [PATCH] vfs: conditionally call inode_wb_list_del() Eric Dumazet
2011-07-28  4:11                                           ` Eric Dumazet
2011-07-28  4:41                                     ` [PATCH] vfs: avoid taking locks if inode not in lists Eric Dumazet
2011-07-28  4:41                                       ` Eric Dumazet
2011-07-28  4:55                                     ` [PATCH] vfs: avoid call to inode_lru_list_del() if possible Eric Dumazet
2011-07-28  4:55                                       ` Eric Dumazet
2011-07-18 16:41   ` vfsmount lock issues on very large ppc64 box Tim Chen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.