* [rfc][patch] fs: turn iprune_mutex into rwsem
@ 2009-08-14 15:25 ` Nick Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2009-08-14 15:25 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, Jan Kara, Andrew Morton


We have had a report of memory allocation hangs during DVD-RAM (UDF) writing.

Jan tracked the cause of this down to UDF inode reclaim blocking:

gnome-screens D ffff810006d1d598     0 20686      1
 ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
 ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
 ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
Call Trace:
 [<ffffffff804477f3>] io_schedule+0x63/0xa5
 [<ffffffff802c2587>] sync_buffer+0x3b/0x3f
 [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79
 [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77
 [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21
 [<ffffffff802c442a>] __bread+0x70/0x86
 [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a
 [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c
 [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b
 [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394
 [<ffffffff802bd205>] write_inode_now+0x7d/0xc4
 [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53
 [<ffffffff802b39ae>] clear_inode+0xc2/0x11b
 [<ffffffff802b3ab1>] dispose_list+0x5b/0x102
 [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213
 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
 [<ffffffff802951fa>] alloc_page_vma+0x176/0x189
 [<ffffffff802822d8>] __do_fault+0x10c/0x417
 [<ffffffff80284232>] handle_mm_fault+0x466/0x940
 [<ffffffff8044b922>] do_page_fault+0x676/0xabf

This blocks with iprune_mutex held, which then blocks other
reclaimers:

X             D ffff81009d47c400     0 17285  14831
 ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
 ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
 ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
Call Trace:
 [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9
 [<ffffffff80447e1a>] mutex_lock+0x1e/0x22
 [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213
 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
 [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6
 [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d
 [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf
 [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73
 [<ffffffff802ad949>] do_select+0x308/0x506
 [<ffffffff802adced>] core_sys_select+0x1a6/0x254
 [<ffffffff802ae0b7>] sys_select+0xb5/0x157

Now I think the main problem is having the filesystem block (and do IO)
in inode reclaim. The problem is that this doesn't get accounted well
and penalizes a random allocator with a big latency spike caused by
work generated from elsewhere.

I think the best idea would be to avoid this. By design if possible,
or by deferring the hard work to an asynchronous context. If the latter,
then the fs would probably want to throttle creation of new work with
queue size of the deferred work, but let's not get into those details.
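
(To illustrate the deferred/throttled idea, here is a rough sketch. None of
these names exist in the tree -- fs_reclaim_work, fs_do_slow_reclaim and
FS_MAX_DEFERRED are all made up -- and a real version would have to carry
enough state to finish the teardown without touching the struct inode again:)

#include <linux/workqueue.h>
#include <linux/wait.h>
#include <linux/slab.h>
#include <asm/atomic.h>

#define FS_MAX_DEFERRED	128

static atomic_t fs_deferred = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(fs_deferred_wq);

struct fs_reclaim_work {
	struct work_struct work;
	/* plus whatever is needed to finish the slow part of teardown */
};

/* runs in process context, away from the allocator that triggered reclaim */
static void fs_do_slow_reclaim(struct work_struct *work)
{
	struct fs_reclaim_work *rw =
			container_of(work, struct fs_reclaim_work, work);

	/* ... the slow, IO-doing work formerly done in ->clear_inode ... */

	kfree(rw);
	if (atomic_dec_return(&fs_deferred) < FS_MAX_DEFERRED)
		wake_up(&fs_deferred_wq);
}

/* called from the fs reclaim path instead of doing the IO synchronously */
static void fs_defer_reclaim(struct fs_reclaim_work *rw)
{
	/*
	 * Throttle on the number of outstanding items so reclaim cannot
	 * build an unbounded backlog (the check is racy, but a soft limit
	 * is all that is wanted here).
	 */
	wait_event(fs_deferred_wq,
		   atomic_read(&fs_deferred) < FS_MAX_DEFERRED);
	atomic_inc(&fs_deferred);

	INIT_WORK(&rw->work, fs_do_slow_reclaim);
	schedule_work(&rw->work);
}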

Anyway, another obvious thing we looked at is the iprune_mutex which
is causing the cascading blocking. We could turn this into an rwsem to
improve concurrency. It is unreasonable to totally ban all potentially
slow or blocking operations in inode reclaim, so I think this is a cheap
way to get a small improvement.

This doesn't solve the whole problem of course. The process doing inode
reclaim will still take the latency hit, and concurrent processes may
end up contending on filesystem locks. So fs developers should keep
these problems in mind please (or discuss alternatives).

Jan points out this has the potential to uncover concurrency bugs in fs
code.

Comments?

Thanks,
Nick

---
 fs/inode.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -25,6 +25,7 @@
 #include <linux/fsnotify.h>
 #include <linux/mount.h>
 #include <linux/async.h>
+#include <linux/rwsem.h>
 
 /*
  * This is needed for the following functions:
@@ -93,7 +94,7 @@ DEFINE_SPINLOCK(inode_lock);
  * from its final dispose_list, the struct super_block they refer to
  * (for inode->i_sb->s_op) may already have been freed and reused.
  */
-static DEFINE_MUTEX(iprune_mutex);
+static DECLARE_RWSEM(iprune_sem);
 
 /*
  * Statistics gathering..
@@ -365,7 +366,7 @@ static int invalidate_list(struct list_h
 		/*
 		 * We can reschedule here without worrying about the list's
 		 * consistency because the per-sb list of inodes must not
-		 * change during umount anymore, and because iprune_mutex keeps
+		 * change during umount anymore, and because iprune_sem keeps
 		 * shrink_icache_memory() away.
 		 */
 		cond_resched_lock(&inode_lock);
@@ -404,7 +405,7 @@ int invalidate_inodes(struct super_block
 	int busy;
 	LIST_HEAD(throw_away);
 
-	mutex_lock(&iprune_mutex);
+	down_write(&iprune_sem);
 	spin_lock(&inode_lock);
 	inotify_unmount_inodes(&sb->s_inodes);
 	fsnotify_unmount_inodes(&sb->s_inodes);
@@ -412,7 +413,7 @@ int invalidate_inodes(struct super_block
 	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
-	mutex_unlock(&iprune_mutex);
+	up_write(&iprune_sem);
 
 	return busy;
 }
@@ -451,7 +452,7 @@ static void prune_icache(int nr_to_scan)
 	int nr_scanned;
 	unsigned long reap = 0;
 
-	mutex_lock(&iprune_mutex);
+	down_read(&iprune_sem);
 	spin_lock(&inode_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
@@ -493,7 +494,7 @@ static void prune_icache(int nr_to_scan)
 	spin_unlock(&inode_lock);
 
 	dispose_list(&freeable);
-	mutex_unlock(&iprune_mutex);
+	up_read(&iprune_sem);
 }
 
 /*

* Re: [rfc][patch] fs: turn iprune_mutex into rwsem
  2009-08-14 15:25 ` Nick Piggin
  (?)
@ 2009-08-14 22:58 ` Andrew Morton
  2009-08-14 23:39     ` Jan Kara
                     ` (2 more replies)
  -1 siblings, 3 replies; 16+ messages in thread
From: Andrew Morton @ 2009-08-14 22:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-mm, jack

On Fri, 14 Aug 2009 17:25:05 +0200
Nick Piggin <npiggin@suse.de> wrote:

> 
> We have had a report of memory allocation hangs during DVD-RAM (UDF) writing.
> 
> Jan tracked the cause of this down to UDF inode reclaim blocking:
> 
> gnome-screens D ffff810006d1d598     0 20686      1
>  ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
>  ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
>  ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
> Call Trace:
>  [<ffffffff804477f3>] io_schedule+0x63/0xa5
>  [<ffffffff802c2587>] sync_buffer+0x3b/0x3f
>  [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79
>  [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77
>  [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21
>  [<ffffffff802c442a>] __bread+0x70/0x86
>  [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a
>  [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c
>  [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b
>  [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394
>  [<ffffffff802bd205>] write_inode_now+0x7d/0xc4
>  [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53
>  [<ffffffff802b39ae>] clear_inode+0xc2/0x11b
>  [<ffffffff802b3ab1>] dispose_list+0x5b/0x102
>  [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213
>  [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
>  [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
>  [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
>  [<ffffffff802951fa>] alloc_page_vma+0x176/0x189
>  [<ffffffff802822d8>] __do_fault+0x10c/0x417
>  [<ffffffff80284232>] handle_mm_fault+0x466/0x940
>  [<ffffffff8044b922>] do_page_fault+0x676/0xabf
> 
> Which blocks with the inode lock held, which then blocks other
> reclaimers:
> 
> X             D ffff81009d47c400     0 17285  14831
>  ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
>  ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
>  ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
> Call Trace:
>  [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9
>  [<ffffffff80447e1a>] mutex_lock+0x1e/0x22
>  [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213
>  [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
>  [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
>  [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
>  [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6
>  [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d
>  [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf
>  [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73
>  [<ffffffff802ad949>] do_select+0x308/0x506
>  [<ffffffff802adced>] core_sys_select+0x1a6/0x254
>  [<ffffffff802ae0b7>] sys_select+0xb5/0x157

That isn't a hang.  When the bread() completes, everything proceeds.

> Now I think the main problem is having the filesystem block (and do IO
> in inode reclaim. The problem is that this doesn't get accounted well
> and penalizes a random allocator with a big latency spike caused by
> work generated from elsewhere.

Yes.  Why does UDF do all that stuff in ->clear_inode()?  Other
filesystems have very simple, non-blocking, non-IO-doing
->clear_inode() implementations.  This sounds like a design problem
within UDF.

> I think the best idea would be to avoid this. By design if possible,
> or by deferring the hard work to an asynchronous context. If the latter,
> then the fs would probably want to throttle creation of new work with
> queue size of the deferred work, but let's not get into those details.
> 
> Anyway, another obvious thing we looked at is the iprune_mutex which
> is causing the cascading blocking. We could turn this into an rwsem to
> improve concurrency. It is unreasonable to totally ban all potentially
> slow or blocking operations in inode reclaim, so I think this is a cheap
> way to get a small improvement.
> 
> This doesn't solve the whole problem of course. The process doing inode
> reclaim will still take the latency hit, and concurrent processes may
> end up contending on filesystem locks. So fs developers should keep
> these problems in mind please (or discuss alternatives).
> 
> Jan points out this has the potential to uncover concurrency bugs in fs
> code.
> 
> Comments?

I bet you found that nice comment over iprune_mutex to be useful, no?

That comment needs updating by this patch, btw.

> -	mutex_unlock(&iprune_mutex);
> +	up_write(&iprune_sem);

yup, the patch looks OK.  inode_lock protects the lists and each thread
will make each inode ineligible for lookup by other threads, so it's
hard to see how there could be races in the VFS code.

* Re: [rfc][patch] fs: turn iprune_mutex into rwsem
  2009-08-14 22:58 ` Andrew Morton
@ 2009-08-14 23:39     ` Jan Kara
  2009-08-15  4:45     ` Nick Piggin
  2009-08-15  5:14     ` Nick Piggin
  2 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2009-08-14 23:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-fsdevel, linux-mm

On Fri 14-08-09 15:58:47, Andrew Morton wrote:
> On Fri, 14 Aug 2009 17:25:05 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > 
> > We have had a report of memory allocation hangs during DVD-RAM (UDF) writing.
> > 
> > Jan tracked the cause of this down to UDF inode reclaim blocking:
> > 
> > gnome-screens D ffff810006d1d598     0 20686      1
> >  ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
> >  ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
> >  ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
> > Call Trace:
> >  [<ffffffff804477f3>] io_schedule+0x63/0xa5
> >  [<ffffffff802c2587>] sync_buffer+0x3b/0x3f
> >  [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79
> >  [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77
> >  [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21
> >  [<ffffffff802c442a>] __bread+0x70/0x86
> >  [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a
> >  [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c
> >  [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b
> >  [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394
> >  [<ffffffff802bd205>] write_inode_now+0x7d/0xc4
> >  [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53
> >  [<ffffffff802b39ae>] clear_inode+0xc2/0x11b
> >  [<ffffffff802b3ab1>] dispose_list+0x5b/0x102
> >  [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213
> >  [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
> >  [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
> >  [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
> >  [<ffffffff802951fa>] alloc_page_vma+0x176/0x189
> >  [<ffffffff802822d8>] __do_fault+0x10c/0x417
> >  [<ffffffff80284232>] handle_mm_fault+0x466/0x940
> >  [<ffffffff8044b922>] do_page_fault+0x676/0xabf
> > 
> > Which blocks with the inode lock held, which then blocks other
> > reclaimers:
> > 
> > X             D ffff81009d47c400     0 17285  14831
> >  ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
> >  ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
> >  ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
> > Call Trace:
> >  [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9
> >  [<ffffffff80447e1a>] mutex_lock+0x1e/0x22
> >  [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213
> >  [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
> >  [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
> >  [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
> >  [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6
> >  [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d
> >  [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf
> >  [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73
> >  [<ffffffff802ad949>] do_select+0x308/0x506
> >  [<ffffffff802adced>] core_sys_select+0x1a6/0x254
> >  [<ffffffff802ae0b7>] sys_select+0xb5/0x157
> 
> That isn't a hang.  When the bread() completes, everything proceeds.
> 
> > Now I think the main problem is having the filesystem block (and do IO
> > in inode reclaim. The problem is that this doesn't get accounted well
> > and penalizes a random allocator with a big latency spike caused by
> > work generated from elsewhere.
> 
> Yes.  Why does UDF do all that stuff in ->clear_inode()?  Other
> filesystems have very simple, non-blocking, non-IO-doing
> ->clear_inode() implementations.  This sounds like a design problem
> within UDF.
  Yes, it's a problem within the UDF code. I already got rid of discarding
the preallocation in clear_inode(), but the last extent is still truncated
there. The trouble with getting rid of that is that, according to the specs,
the length of the last extent has to exactly match i_size (and even for
directories or symlinks, i_size isn't blocksize aligned). So far we keep all
extent lengths block aligned and set the length of the last one in
clear_inode(). To get rid of that I'll probably set the length of the last
extent on each inode write, but one has to be careful about races with
truncate...
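
(Roughly, and just to illustrate the arithmetic -- this is a sketch of the
constraint rather than the real UDF code, and it assumes every earlier
extent is exactly block-sized:)

#include <linux/fs.h>

/* hypothetical helper: what the last extent would have to record so that
 * the extent lengths add up to i_size exactly */
static u32 udf_sketch_last_extent_len(struct inode *inode)
{
	loff_t isize = i_size_read(inode);
	u32 tail = isize & (inode->i_sb->s_blocksize - 1);

	if (!isize)
		return 0;
	return tail ? tail : inode->i_sb->s_blocksize;
}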

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [rfc][patch] fs: turn iprune_mutex into rwsem
  2009-08-14 22:58 ` Andrew Morton
@ 2009-08-15  4:45     ` Nick Piggin
  2009-08-15  4:45     ` Nick Piggin
  2009-08-15  5:14     ` Nick Piggin
  2 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2009-08-15  4:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, linux-mm, jack

On Fri, Aug 14, 2009 at 03:58:47PM -0700, Andrew Morton wrote:
> On Fri, 14 Aug 2009 17:25:05 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > 
> > We have had a report of memory allocation hangs during DVD-RAM (UDF) writing.
> > 
> > Jan tracked the cause of this down to UDF inode reclaim blocking:
> > 
> > gnome-screens D ffff810006d1d598     0 20686      1
> >  ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
> >  ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
> >  ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
> > Call Trace:
> >  [<ffffffff804477f3>] io_schedule+0x63/0xa5
> >  [<ffffffff802c2587>] sync_buffer+0x3b/0x3f
> >  [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79
> >  [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77
> >  [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21
> >  [<ffffffff802c442a>] __bread+0x70/0x86
> >  [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a
> >  [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c
> >  [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b
> >  [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394
> >  [<ffffffff802bd205>] write_inode_now+0x7d/0xc4
> >  [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53
> >  [<ffffffff802b39ae>] clear_inode+0xc2/0x11b
> >  [<ffffffff802b3ab1>] dispose_list+0x5b/0x102
> >  [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213
> >  [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
> >  [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
> >  [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
> >  [<ffffffff802951fa>] alloc_page_vma+0x176/0x189
> >  [<ffffffff802822d8>] __do_fault+0x10c/0x417
> >  [<ffffffff80284232>] handle_mm_fault+0x466/0x940
> >  [<ffffffff8044b922>] do_page_fault+0x676/0xabf
> > 
> > Which blocks with the inode lock held, which then blocks other
> > reclaimers:
> > 
> > X             D ffff81009d47c400     0 17285  14831
> >  ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
> >  ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
> >  ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
> > Call Trace:
> >  [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9
> >  [<ffffffff80447e1a>] mutex_lock+0x1e/0x22
> >  [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213
> >  [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
> >  [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
> >  [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
> >  [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6
> >  [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d
> >  [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf
> >  [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73
> >  [<ffffffff802ad949>] do_select+0x308/0x506
> >  [<ffffffff802adced>] core_sys_select+0x1a6/0x254
> >  [<ffffffff802ae0b7>] sys_select+0xb5/0x157
> 
> That isn't a hang.  When the bread() completes, everything proceeds.

Yes, sorry, bad wording (from our bug report). It isn't a hang, just
really bad latency, and everything stops for a while.

 
> > Now I think the main problem is having the filesystem block (and do IO
> > in inode reclaim. The problem is that this doesn't get accounted well
> > and penalizes a random allocator with a big latency spike caused by
> > work generated from elsewhere.
> 
> Yes.  Why does UDF do all that stuff in ->clear_inode()?  Other
> filesystems have very simple, non-blocking, non-IO-doing
> ->clear_inode() implementations.  This sounds like a design problem
> within UDF.

Yep.

 
> > I think the best idea would be to avoid this. By design if possible,
> > or by deferring the hard work to an asynchronous context. If the latter,
> > then the fs would probably want to throttle creation of new work with
> > queue size of the deferred work, but let's not get into those details.
> > 
> > Anyway, another obvious thing we looked at is the iprune_mutex which
> > is causing the cascading blocking. We could turn this into an rwsem to
> > improve concurrency. It is unreasonable to totally ban all potentially
> > slow or blocking operations in inode reclaim, so I think this is a cheap
> > way to get a small improvement.
> > 
> > This doesn't solve the whole problem of course. The process doing inode
> > reclaim will still take the latency hit, and concurrent processes may
> > end up contending on filesystem locks. So fs developers should keep
> > these problems in mind please (or discuss alternatives).
> > 
> > Jan points out this has the potential to uncover concurrency bugs in fs
> > code.
> > 
> > Comments?
> 
> I bet you found that nice comment over iprune_mutex to be useful, no?
> 
> That comment needs updating by this patch, btw.

Yes will do.

 
> > -	mutex_unlock(&iprune_mutex);
> > +	up_write(&iprune_sem);
> 
> yup, the patch looks OK.  inode_lock protects the lists and each thread
> will make each inode ineligible for lookup by other threads, so it's
> hard to see how there could be races in the VFS code.

I guess dispose_list does a bit of stuff unserialised now. However
most of the same stuff gets done in inode deletion as well, so I
think it is pretty low risk.


* Re: [rfc][patch] fs: turn iprune_mutex into rwsem
  2009-08-14 22:58 ` Andrew Morton
@ 2009-08-15  5:14     ` Nick Piggin
  2009-08-15  4:45     ` Nick Piggin
  2009-08-15  5:14     ` Nick Piggin
  2 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2009-08-15  5:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, linux-mm, jack

On Fri, Aug 14, 2009 at 03:58:47PM -0700, Andrew Morton wrote:
> On Fri, 14 Aug 2009 17:25:05 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> > Now I think the main problem is having the filesystem block (and do IO
> > in inode reclaim. The problem is that this doesn't get accounted well
> > and penalizes a random allocator with a big latency spike caused by
> > work generated from elsewhere.
> 
> Yes.  Why does UDF do all that stuff in ->clear_inode()?  Other
> filesystems have very simple, non-blocking, non-IO-doing
> ->clear_inode() implementations.  This sounds like a design problem
> within UDF.

BTW, some filesystems do some work there, and we can do blocking memory
allocations, take blocking locks etc., and a fair bit of work like
waiting for an inode to sync or throwing away pagecache and associated
buffers.

UDF is probably just doing more work than most, and is also more likely
to be used on a desktop and noticed than something like ocfs2 doing a
bit of work there.

OK, if you like the patch then I've fixed up the changelog and comment
(yes it was helpful). Maybe it could sit in -mm for a while?

Thanks,

--
We have had a report of bad memory allocation latency during DVD-RAM (UDF)
writing. This is causing the user's desktop session to become unusable.

Jan tracked the cause of this down to UDF inode reclaim blocking:

gnome-screens D ffff810006d1d598     0 20686      1
 ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
 ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
 ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
Call Trace:
 [<ffffffff804477f3>] io_schedule+0x63/0xa5
 [<ffffffff802c2587>] sync_buffer+0x3b/0x3f
 [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79
 [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77
 [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21
 [<ffffffff802c442a>] __bread+0x70/0x86
 [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a
 [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c
 [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b
 [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394
 [<ffffffff802bd205>] write_inode_now+0x7d/0xc4
 [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53
 [<ffffffff802b39ae>] clear_inode+0xc2/0x11b
 [<ffffffff802b3ab1>] dispose_list+0x5b/0x102
 [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213
 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
 [<ffffffff802951fa>] alloc_page_vma+0x176/0x189
 [<ffffffff802822d8>] __do_fault+0x10c/0x417
 [<ffffffff80284232>] handle_mm_fault+0x466/0x940
 [<ffffffff8044b922>] do_page_fault+0x676/0xabf

This blocks with iprune_mutex held, which then blocks other reclaimers:

X             D ffff81009d47c400     0 17285  14831
 ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
 ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
 ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
Call Trace:
 [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9
 [<ffffffff80447e1a>] mutex_lock+0x1e/0x22
 [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213
 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
 [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6
 [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d
 [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf
 [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73
 [<ffffffff802ad949>] do_select+0x308/0x506
 [<ffffffff802adced>] core_sys_select+0x1a6/0x254
 [<ffffffff802ae0b7>] sys_select+0xb5/0x157

Now I think the main problem is having the filesystem block (and do IO)
in inode reclaim. The problem is that this doesn't get accounted well
and penalizes a random allocator with a big latency spike caused by
work generated from elsewhere.

I think the best idea would be to avoid this. By design if possible,
or by deferring the hard work to an asynchronous context. If the latter,
then the fs would probably want to throttle creation of new work with
queue size of the deferred work, but let's not get into those details.

Anyway, the other obvious thing we looked at is the iprune_mutex which
is causing the cascading blocking. We could turn this into an rwsem to
improve concurrency. It is unreasonable to totally ban all potentially
slow or blocking operations in inode reclaim, so I think this is a cheap
way to get a small improvement.

This doesn't solve the whole problem of course. The process doing inode
reclaim will still take the latency hit, and concurrent processes may
end up contending on filesystem locks. So fs developers should keep
these problems in mind.

Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 fs/inode.c |   19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -14,6 +14,7 @@
 #include <linux/module.h>
 #include <linux/backing-dev.h>
 #include <linux/wait.h>
+#include <linux/rwsem.h>
 #include <linux/hash.h>
 #include <linux/swap.h>
 #include <linux/security.h>
@@ -87,14 +88,18 @@ static struct hlist_head *inode_hashtabl
 DEFINE_SPINLOCK(inode_lock);
 
 /*
- * iprune_mutex provides exclusion between the kswapd or try_to_free_pages
+ * iprune_sem provides exclusion between the kswapd or try_to_free_pages
  * icache shrinking path, and the umount path.  Without this exclusion,
  * by the time prune_icache calls iput for the inode whose pages it has
  * been invalidating, or by the time it calls clear_inode & destroy_inode
  * from its final dispose_list, the struct super_block they refer to
  * (for inode->i_sb->s_op) may already have been freed and reused.
+ *
+ * We make this an rwsem because the fastpath is icache shrinking. In
+ * some cases a filesystem may be doing a significant amount of work in
+ * its inode reclaim code, so this should improve parallelism.
  */
-static DEFINE_MUTEX(iprune_mutex);
+static DECLARE_RWSEM(iprune_sem);
 
 /*
  * Statistics gathering..
@@ -375,7 +380,7 @@ static int invalidate_list(struct list_h
 		/*
 		 * We can reschedule here without worrying about the list's
 		 * consistency because the per-sb list of inodes must not
-		 * change during umount anymore, and because iprune_mutex keeps
+		 * change during umount anymore, and because iprune_sem keeps
 		 * shrink_icache_memory() away.
 		 */
 		cond_resched_lock(&inode_lock);
@@ -414,7 +419,7 @@ int invalidate_inodes(struct super_block
 	int busy;
 	LIST_HEAD(throw_away);
 
-	mutex_lock(&iprune_mutex);
+	down_write(&iprune_sem);
 	spin_lock(&inode_lock);
 	inotify_unmount_inodes(&sb->s_inodes);
 	fsnotify_unmount_inodes(&sb->s_inodes);
@@ -422,7 +427,7 @@ int invalidate_inodes(struct super_block
 	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
-	mutex_unlock(&iprune_mutex);
+	up_write(&iprune_sem);
 
 	return busy;
 }
@@ -461,7 +466,7 @@ static void prune_icache(int nr_to_scan)
 	int nr_scanned;
 	unsigned long reap = 0;
 
-	mutex_lock(&iprune_mutex);
+	down_read(&iprune_sem);
 	spin_lock(&inode_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
@@ -503,7 +508,7 @@ static void prune_icache(int nr_to_scan)
 	spin_unlock(&inode_lock);
 
 	dispose_list(&freeable);
-	mutex_unlock(&iprune_mutex);
+	up_read(&iprune_sem);
 }
 
 /*

* Re: [rfc][patch] fs: turn iprune_mutex into rwsem
  2009-08-14 15:25 ` Nick Piggin
@ 2009-08-15 19:57   ` Christoph Hellwig
  -1 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2009-08-15 19:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-mm, Jan Kara, Andrew Morton

On Fri, Aug 14, 2009 at 05:25:05PM +0200, Nick Piggin wrote:
> Now I think the main problem is having the filesystem block (and do IO
> in inode reclaim. The problem is that this doesn't get accounted well
> and penalizes a random allocator with a big latency spike caused by
> work generated from elsewhere.
> 
> I think the best idea would be to avoid this. By design if possible,
> or by deferring the hard work to an asynchronous context. If the latter,
> then the fs would probably want to throttle creation of new work with
> queue size of the deferred work, but let's not get into those details.

I don't really see a good way to avoid this.  For any filesystem that
does some sort of preallocations we need to drop them in ->clear_inode.


* Re: [rfc][patch] fs: turn iprune_mutex into rwsem
  2009-08-15 19:57   ` Christoph Hellwig
@ 2009-08-16 10:05     ` Nick Piggin
  -1 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2009-08-16 10:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-mm, Jan Kara, Andrew Morton

On Sat, Aug 15, 2009 at 03:57:42PM -0400, Christoph Hellwig wrote:
> On Fri, Aug 14, 2009 at 05:25:05PM +0200, Nick Piggin wrote:
> > Now I think the main problem is having the filesystem block (and do IO
> > in inode reclaim. The problem is that this doesn't get accounted well
> > and penalizes a random allocator with a big latency spike caused by
> > work generated from elsewhere.
> > 
> > I think the best idea would be to avoid this. By design if possible,
> > or by deferring the hard work to an asynchronous context. If the latter,
> > then the fs would probably want to throttle creation of new work with
> > queue size of the deferred work, but let's not get into those details.
> 
> I don't really see a good way to avoid this.  For any filesystem that
> does some sort of preallocations we need to drop them in ->clear_inode.

OK, I agree it is sometimes not going to be possible. Although if the
preallocations are on disk, do you still have to drop them? If they
are not on disk, then no IO is required to drop them.

But anyway, I propose this patch exactly because it is not always
possible to avoid slow/blocking ops (even in the vfs there are some).

If it ever turns out to be a problem after that, I guess it might be
possible to have another callback to schedule async slow work before
dropping the inode. Or something like that.
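
One possible shape for that, purely as a sketch: the ->defer_drop_inode
hook, the foofs_* names and the workqueue below are all invented for
illustration, not an existing interface.

/*
 * Hypothetical hook, called by prune_icache() before it does the
 * blocking teardown inline.  Returning nonzero means "the slow part
 * has been queued; skip this inode for now".
 */
static int foofs_defer_drop_inode(struct inode *inode)
{
        struct foofs_inode_info *fi = FOOFS_I(inode);   /* hypothetical */

        if (!fi->needs_slow_cleanup)
                return 0;               /* nothing slow, reclaim inline */

        INIT_WORK(&fi->cleanup_work, foofs_async_cleanup);
        queue_work(foofs_cleanup_wq, &fi->cleanup_work);
        return 1;
}

The fs would then want to throttle how much of this deferred work it
lets pile up, as mentioned above.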


* Re: [rfc][patch] fs: turn iprune_mutex into rwsem
  2009-08-15 19:57   ` Christoph Hellwig
@ 2009-08-16 22:11     ` Andreas Dilger
  -1 siblings, 0 replies; 16+ messages in thread
From: Andreas Dilger @ 2009-08-16 22:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nick Piggin, linux-fsdevel, linux-mm, Jan Kara, Andrew Morton

On Aug 15, 2009  15:57 -0400, Christoph Hellwig wrote:
> On Fri, Aug 14, 2009 at 05:25:05PM +0200, Nick Piggin wrote:
> > Now I think the main problem is having the filesystem block (and do IO
> > in inode reclaim. The problem is that this doesn't get accounted well
> > and penalizes a random allocator with a big latency spike caused by
> > work generated from elsewhere.
> > 
> > I think the best idea would be to avoid this. By design if possible,
> > or by deferring the hard work to an asynchronous context. If the latter,
> > then the fs would probably want to throttle creation of new work with
> > queue size of the deferred work, but let's not get into those details.
> 
> I don't really see a good way to avoid this.  For any filesystem that
> does some sort of preallocations we need to drop them in ->clear_inode.

One of the problems I've seen in the past is that filesystem memory reclaim
(in particular dentry/inode cleanup) cannot happen within filesystems due
to potential deadlocks.  This is particularly problematic when there is a
lot of memory pressure from within the kernel and very little from userspace
(e.g. updatedb or find).

However, many (if not most) inodes/dentries in the filesystem could be
discarded quite easily and would not deadlock the system.  I wonder if
it makes sense to keep a mask in the inode, set by the filesystem, that
determines whether it is safe to clean up the inode even though __GFP_FS
is not set.  That would potentially allow e.g. shrink_icache_memory()
to free a large number of "non-tricky" inodes if needed (e.g. ones
without locks, preallocations, or other expensive cleanup).
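
As a rough sketch of that idea (the I_EASY_RECLAIM bit and the helper
are invented; nothing like this exists in the tree):

/*
 * Hypothetical i_state bit, set by the filesystem while the inode can
 * be torn down without taking fs locks or doing IO.
 */
#define I_EASY_RECLAIM          (1 << 8)

static int inode_safe_to_prune(struct inode *inode, gfp_t gfp_mask)
{
        if (gfp_mask & __GFP_FS)
                return 1;       /* full reclaim is allowed anyway */
        return inode->i_state & I_EASY_RECLAIM; /* "non-tricky" only */
}

prune_icache() could then call this per inode instead of the shrinker
bailing out completely when __GFP_FS is clear.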

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [rfc][patch] fs: turn iprune_mutex into rwsem
  2009-08-16 22:11     ` Andreas Dilger
  (?)
@ 2009-08-17  6:34     ` Nick Piggin
  -1 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2009-08-17  6:34 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Hellwig, linux-fsdevel, linux-mm, Jan Kara, Andrew Morton

On Sun, Aug 16, 2009 at 04:11:59PM -0600, Andreas Dilger wrote:
> On Aug 15, 2009  15:57 -0400, Christoph Hellwig wrote:
> > On Fri, Aug 14, 2009 at 05:25:05PM +0200, Nick Piggin wrote:
> > > Now I think the main problem is having the filesystem block (and do IO
> > > in inode reclaim. The problem is that this doesn't get accounted well
> > > and penalizes a random allocator with a big latency spike caused by
> > > work generated from elsewhere.
> > > 
> > > I think the best idea would be to avoid this. By design if possible,
> > > or by deferring the hard work to an asynchronous context. If the latter,
> > > then the fs would probably want to throttle creation of new work with
> > > queue size of the deferred work, but let's not get into those details.
> > 
> > I don't really see a good way to avoid this.  For any filesystem that
> > does some sort of preallocations we need to drop them in ->clear_inode.
> 
> One of the problems I've seen in the past is that filesystem memory reclaim
> (in particular dentry/inode cleanup) cannot happen within filesystems due
> to potential deadlocks.  This is particularly problematic when there is a
> lot of memory pressure from within the kernel and very little from userspace
> (e.g. updatedb or find).

Hm, OK. It should still kick off kswapd at least, which can do reclaim
with __GFP_FS set.

 
> However, many/most inodes/dentries in the filesystem could be discarded
> quite easily and would not deadlock the system.  I wonder if it makes
> sense to keep a mask in the inode that the filesystem could set that
> determines whether it is safe to clean up the inode even though __GFP_FS
> is not set?  That would potentially allow e.g. shrink_icache_memory()
> to free a large number of "non-tricky" inodes if needed (e.g. ones without
> locks/preallocation/expensive cleanup).

I guess it would be possible, but even before calling into the FS,
reclaim requires taking some locks. So we would need to make all
existing ~__GFP_FS allocations have all of the finer-grained FS bits
clear, and then, where it is safe, they could be converted to set some
of the FS bits but not others.
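
Concretely, that would mean something like the following (purely
illustrative; __GFP_FS is a single flag today and these bit values are
invented):

#define ___GFP_FS_INODE         0x1000000u      /* may touch icache */
#define ___GFP_FS_DENTRY        0x2000000u      /* may touch dcache */
#define ___GFP_FS_IO            0x4000000u      /* may do fs IO     */
#define __GFP_FS        (___GFP_FS_INODE | ___GFP_FS_DENTRY | ___GFP_FS_IO)

Existing GFP_NOFS callers would keep all three clear; a caller that is
only unsafe against, say, the dcache could set the other two.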

It's rather complex, though. Do you have any specific workloads you
can reproduce? Because we could also look at, well, this patch to
start with, but maybe also another callback which can schedule slow
work for tricky inodes.

Hmm, I guess we could just trylock the iprune_mutex and inode_lock
if __GFP_FS is not set; then the fs would have to work out what to do
with the gfp_mask it has.
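
i.e. roughly this shape (untested sketch; in mainline the locking sits
inside prune_icache(), and passing gfp_mask down to the fs is an
assumed change, not something the current code does):

static int shrink_icache_memory(int nr, gfp_t gfp_mask)
{
        if (nr) {
                if (gfp_mask & __GFP_FS) {
                        mutex_lock(&iprune_mutex);
                } else if (!mutex_trylock(&iprune_mutex)) {
                        /* don't block behind a reclaimer doing fs work */
                        return -1;
                }
                prune_icache(nr, gfp_mask);     /* sketch: new argument so
                                                 * ->clear_inode can see it */
                mutex_unlock(&iprune_mutex);
        }
        return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
}

The same trylock treatment could be applied to inode_lock inside
prune_icache().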




end of thread, other threads:[~2009-08-17  6:34 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-14 15:25 [rfc][patch] fs: turn iprune_mutex into rwsem Nick Piggin
2009-08-14 15:25 ` Nick Piggin
2009-08-14 22:58 ` Andrew Morton
2009-08-14 23:39   ` Jan Kara
2009-08-14 23:39     ` Jan Kara
2009-08-15  4:45   ` Nick Piggin
2009-08-15  4:45     ` Nick Piggin
2009-08-15  5:14   ` Nick Piggin
2009-08-15  5:14     ` Nick Piggin
2009-08-15 19:57 ` Christoph Hellwig
2009-08-15 19:57   ` Christoph Hellwig
2009-08-16 10:05   ` Nick Piggin
2009-08-16 10:05     ` Nick Piggin
2009-08-16 22:11   ` Andreas Dilger
2009-08-16 22:11     ` Andreas Dilger
2009-08-17  6:34     ` Nick Piggin
