linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH][BUG] Lack of mutex_lock in drop_pagecache_sb()
@ 2009-03-18  8:13 Masasyoshi MIZUMA
  2009-03-23 10:38 ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Masasyoshi MIZUMA @ 2009-03-18  8:13 UTC (permalink / raw)
  To: linux-fsdevel

I create the patch which fixes lack of mutex_lock in drop_pagecache_sb().
Please check the bug and the patch (below).

----------------------------------------------------------------------

When drop_pagecache_sb() frees inodes, it doesn't get mutex_lock of 
iprune_mutex. Therefore, if it races the process which frees inodes 
(ex. prune_icache()), OS panic may happen.

An example of the panic flow is the following:
----------------------------------------------------------------------
            [process A]               |         [process B]
 |                                    |
 |  shrink_icache_memory()            |
 |      |                             |
 |      V                             |
 |    prune_icache()                  |  drop_pagecache()
 |      mutex_lock(&iprune_mutex)     |      |
 |      spin_lock(&inode_lock)        |      |
 |          |                         |      V
 |          |                         |    drop_pagecache_sb()
 |          |                         |        |
 |          V                         |        V
 |      spin_unlock(&inode_lock)      |      spin_lock(&inode_lock)
 |          |                         |          |
 |          |                         |          |
 |          V                         |          V
 |      dispose_list()                |        __iget()
 |        list_del()                  |            |
 |            |                       |            |
 |            V                       |            V
 |        spin_lock(&inode_lock)      |          list_move() <----- PANIC !!
 |                                    |
 V                                    |
(time)
----------------------------------------------------------------------
If the inode which Process B do list_move() with is the same as the one which
Process A did list_del() with, OS may panic.

Fixing this bug requires the following:
- Adding mutex_lock of iprune_mutex to drop_pagecache_sb().
- Changing iprune_mutex from static to global.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 fs/drop_caches.c          |    2 ++
 fs/inode.c                |    2 +-
 include/linux/writeback.h |    1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff -Nurp linux-2.6.29-rc7.orig/fs/drop_caches.c linux-2.6.29-rc7/fs/drop_caches.c
--- linux-2.6.29-rc7.orig/fs/drop_caches.c      2009-03-10 06:38:17.137283010 +0900
+++ linux-2.6.29-rc7/fs/drop_caches.c   2009-03-11 04:04:29.705284864 +0900
@@ -16,6 +16,7 @@ static void drop_pagecache_sb(struct sup
 {
        struct inode *inode, *toput_inode = NULL;

+       mutex_lock(&iprune_mutex);
        spin_lock(&inode_lock);
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                if (inode->i_state & (I_FREEING|I_WILL_FREE))
@@ -31,6 +32,7 @@ static void drop_pagecache_sb(struct sup
        }
        spin_unlock(&inode_lock);
        iput(toput_inode);
+       mutex_unlock(&iprune_mutex);
 }

 static void drop_pagecache(void)
diff -Nurp linux-2.6.29-rc7.orig/fs/inode.c linux-2.6.29-rc7/fs/inode.c
--- linux-2.6.29-rc7.orig/fs/inode.c    2009-03-10 06:38:22.404282773 +0900
+++ linux-2.6.29-rc7/fs/inode.c 2009-03-18 07:17:31.032284423 +0900
@@ -91,7 +91,7 @@ DEFINE_SPINLOCK(inode_lock);
  * from its final dispose_list, the struct super_block they refer to
  * (for inode->i_sb->s_op) may already have been freed and reused.
  */
-static DEFINE_MUTEX(iprune_mutex);
+DEFINE_MUTEX(iprune_mutex);

 /*
  * Statistics gathering..
diff -Nurp linux-2.6.29-rc7.orig/include/linux/writeback.h linux-2.6.29-rc7/include/linux/writeback.h
--- linux-2.6.29-rc7.orig/include/linux/writeback.h     2009-03-10 06:38:27.133285255 +0900
+++ linux-2.6.29-rc7/include/linux/writeback.h  2009-03-11 05:27:43.679283477 +0900
@@ -10,6 +10,7 @@
 struct backing_dev_info;

 extern spinlock_t inode_lock;
+extern struct mutex iprune_mutex;
 extern struct list_head inode_in_use;
 extern struct list_head inode_unused;



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH][BUG] Lack of mutex_lock in drop_pagecache_sb()
  2009-03-18  8:13 [PATCH][BUG] Lack of mutex_lock in drop_pagecache_sb() Masasyoshi MIZUMA
@ 2009-03-23 10:38 ` Wu Fengguang
  2009-03-24  7:06   ` Masayoshi MIZUMA
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2009-03-23 10:38 UTC (permalink / raw)
  To: Masasyoshi MIZUMA; +Cc: linux-fsdevel

Masasyoshi, 

On Wed, Mar 18, 2009 at 05:13:35PM +0900, Masasyoshi MIZUMA wrote:
> I create the patch which fixes lack of mutex_lock in drop_pagecache_sb().
> Please check the bug and the patch (below).

Is this a real producible bug or a theory one?
IMHO the I_FREEING flag should avoid the race.

> ----------------------------------------------------------------------
> 
> When drop_pagecache_sb() frees inodes, it doesn't get mutex_lock of 
> iprune_mutex. Therefore, if it races the process which frees inodes 
> (ex. prune_icache()), OS panic may happen.
> 
> An example of the panic flow is the following:
> ----------------------------------------------------------------------
>             [process A]               |         [process B]
>  |                                    |
>  |  shrink_icache_memory()            |
>  |      |                             |
>  |      V                             |
>  |    prune_icache()                  |  drop_pagecache()
>  |      mutex_lock(&iprune_mutex)     |      |
>  |      spin_lock(&inode_lock)        |      |
>  |          |                         |      V
>  |          |                         |    drop_pagecache_sb()
>  |          |                         |        |

          inode->i_state |= I_FREEING;

>  |          V                         |        V
>  |      spin_unlock(&inode_lock)      |      spin_lock(&inode_lock)
>  |          |                         |          |

                                                if (inode->i_state & (I_FREEING|I_WILL_FREE))
                                                        continue;

>  |          |                         |          |
>  |          V                         |          V
>  |      dispose_list()                |        __iget()
>  |        list_del()                  |            |
>  |            |                       |            |
>  |            V                       |            V
>  |        spin_lock(&inode_lock)      |          list_move() <----- PANIC !!
>  |                                    |
>  V                                    |
> (time)
> ----------------------------------------------------------------------
> If the inode which Process B do list_move() with is the same as the one which
> Process A did list_del() with, OS may panic.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH][BUG] Lack of mutex_lock in drop_pagecache_sb()
  2009-03-23 10:38 ` Wu Fengguang
@ 2009-03-24  7:06   ` Masayoshi MIZUMA
  2009-03-24  7:44     ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Masayoshi MIZUMA @ 2009-03-24  7:06 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-fsdevel, viro

Hi, Fengguang

On Mon, 23 Mar 2009 18:38:46 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Masasyoshi, 
> 
> On Wed, Mar 18, 2009 at 05:13:35PM +0900, Masasyoshi MIZUMA wrote:
> > I create the patch which fixes lack of mutex_lock in drop_pagecache_sb().
> > Please check the bug and the patch (below).
> 
Thank you for your comment, and I apologize to you for my lack
of explanation.

> Is this a real producible bug or a theory one?
This is a real bug.

> IMHO the I_FREEING flag should avoid the race.
I supplement the explanation for this problem.

clear_inode() is called by dispose_list(), and sets the inode's 
i_state to I_CLEAR. Therefore, the following conditional expression 
doesn't match for the inode:
"if (inode->i_state & (I_FREEING|I_WILL_FREE)) continue;"
As the result, this problem can happen.

> 
> > ----------------------------------------------------------------------
> > 
> > When drop_pagecache_sb() frees inodes, it doesn't get mutex_lock of 
> > iprune_mutex. Therefore, if it races the process which frees inodes 
> > (ex. prune_icache()), OS panic may happen.
> > 
> > An example of the panic flow is the following:
> > ----------------------------------------------------------------------
> >             [process A]               |         [process B]
> >  |                                    |
> >  |  shrink_icache_memory()            |
> >  |      |                             |
> >  |      V                             |
> >  |    prune_icache()                  |  drop_pagecache()
> >  |      mutex_lock(&iprune_mutex)     |      |
> >  |      spin_lock(&inode_lock)        |      |
> >  |          |                         |      V
> >  |          |                         |    drop_pagecache_sb()
> >  |          |                         |        |
> 
>           inode->i_state |= I_FREEING;
> 
> >  |          V                         |        V
> >  |      spin_unlock(&inode_lock)      |      spin_lock(&inode_lock)
> >  |          |                         |          |
> 
>                                                 if (inode->i_state & (I_FREEING|I_WILL_FREE))
>                                                         continue;
> 
> >  |          |                         |          |
> >  |          V                         |          V
> >  |      dispose_list()                |        __iget()
> >  |        list_del()                  |            |
> >  |            |                       |            |
> >  |            V                       |            V
> >  |        spin_lock(&inode_lock)      |          list_move() <----- PANIC !!
> >  |                                    |
> >  V                                    |
> > (time)
> > ----------------------------------------------------------------------
> > If the inode which Process B do list_move() with is the same as the one which
> > Process A did list_del() with, OS may panic.

I applied your comment and then modified the panic flow figure.
Please check below:
----------------------------------------------------------------------
            [process A]               |        [process B]
 |                                    |
 |  shrink_icache_memory()            |
 |      |                             |
 |      V                             |
 |    prune_icache()                  | drop_pagecache()
 |      mutex_lock(&iprune_mutex)     |     |
 |      spin_lock(&inode_lock)        |     |
 |          |                         |     V
 |          |                         |   drop_pagecache_sb()
 |          |                         |       |
 |          V                         |       |
 |      inode->i_state |= I_FREEING;  |       |
 |          |                         |       |
 |          V                         |       V
 |      spin_unlock(&inode_lock)      |     spin_lock(&inode_lock)
 |          |                         |         |
 |          |                         |         |
 |          V                         |         |
 |      dispose_list()                |         |
 |        list_del()                  |         |
 |            |                       |         |
 |            V                       |         |
 |        clear_inode()               |         |
 |          inode->i_state = I_CLEAR  |         |
 |            |                       |         |
 |            |                       |         V
 |            |                       |      if (inode->i_state & (I_FREEING|I_WILL_FREE))
 |            |                       |              continue;           <---- NOT MATCH
 |            |                       |           |
 |            |                       |           V
 |            |                       |      __iget()   
 |            |                       |            |
 |            V                       |            V
 |        spin_lock(&inode_lock)      |        list_move() <----- PANIC !!
 |                                    |
 V                                    |
(time)
----------------------------------------------------------------------

Thanks,
Masayoshi MIZUMA



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH][BUG] Lack of mutex_lock in drop_pagecache_sb()
  2009-03-24  7:06   ` Masayoshi MIZUMA
@ 2009-03-24  7:44     ` Wu Fengguang
  2009-03-24 12:05       ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2009-03-24  7:44 UTC (permalink / raw)
  To: Masayoshi MIZUMA; +Cc: linux-fsdevel, viro, Jan Kara, Nick Piggin

Hi Masayoshi,

On Tue, Mar 24, 2009 at 03:06:45PM +0800, Masayoshi MIZUMA wrote:
> Hi, Fengguang
> 
> On Mon, 23 Mar 2009 18:38:46 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Masasyoshi, 
> > 
> > On Wed, Mar 18, 2009 at 05:13:35PM +0900, Masasyoshi MIZUMA wrote:
> > > I create the patch which fixes lack of mutex_lock in drop_pagecache_sb().
> > > Please check the bug and the patch (below).
> > 
> Thank you for your comment, and I apologize to you for my lack
> of explanation.
> 
> > Is this a real producible bug or a theory one?
> This is a real bug.
> 
> > IMHO the I_FREEING flag should avoid the race.
> I supplement the explanation for this problem.
> 
> clear_inode() is called by dispose_list(), and sets the inode's 
> i_state to I_CLEAR. Therefore, the following conditional expression 
> doesn't match for the inode:
> "if (inode->i_state & (I_FREEING|I_WILL_FREE)) continue;"
> As the result, this problem can happen.
> 
> > 
> > > ----------------------------------------------------------------------
> > > 
> > > When drop_pagecache_sb() frees inodes, it doesn't get mutex_lock of 
> > > iprune_mutex. Therefore, if it races the process which frees inodes 
> > > (ex. prune_icache()), OS panic may happen.
> > > 
> > > An example of the panic flow is the following:
> > > ----------------------------------------------------------------------
> > >             [process A]               |         [process B]
> > >  |                                    |
> > >  |  shrink_icache_memory()            |
> > >  |      |                             |
> > >  |      V                             |
> > >  |    prune_icache()                  |  drop_pagecache()
> > >  |      mutex_lock(&iprune_mutex)     |      |
> > >  |      spin_lock(&inode_lock)        |      |
> > >  |          |                         |      V
> > >  |          |                         |    drop_pagecache_sb()
> > >  |          |                         |        |
> > 
> >           inode->i_state |= I_FREEING;
> > 
> > >  |          V                         |        V
> > >  |      spin_unlock(&inode_lock)      |      spin_lock(&inode_lock)
> > >  |          |                         |          |
> > 
> >                                                 if (inode->i_state & (I_FREEING|I_WILL_FREE))
> >                                                         continue;
> > 
> > >  |          |                         |          |
> > >  |          V                         |          V
> > >  |      dispose_list()                |        __iget()
> > >  |        list_del()                  |            |
> > >  |            |                       |            |
> > >  |            V                       |            V
> > >  |        spin_lock(&inode_lock)      |          list_move() <----- PANIC !!
> > >  |                                    |
> > >  V                                    |
> > > (time)
> > > ----------------------------------------------------------------------
> > > If the inode which Process B do list_move() with is the same as the one which
> > > Process A did list_del() with, OS may panic.
> 
> I applied your comment and then modified the panic flow figure.
> Please check below:
> ----------------------------------------------------------------------
>             [process A]               |        [process B]
>  |                                    |
>  |  shrink_icache_memory()            |
>  |      |                             |
>  |      V                             |
>  |    prune_icache()                  | drop_pagecache()
>  |      mutex_lock(&iprune_mutex)     |     |
>  |      spin_lock(&inode_lock)        |     |
>  |          |                         |     V
>  |          |                         |   drop_pagecache_sb()
>  |          |                         |       |
>  |          V                         |       |
>  |      inode->i_state |= I_FREEING;  |       |
>  |          |                         |       |
>  |          V                         |       V
>  |      spin_unlock(&inode_lock)      |     spin_lock(&inode_lock)
>  |          |                         |         |
>  |          |                         |         |
>  |          V                         |         |
>  |      dispose_list()                |         |
>  |        list_del()                  |         |
>  |            |                       |         |
>  |            V                       |         |
>  |        clear_inode()               |         |
>  |          inode->i_state = I_CLEAR  |         |
>  |            |                       |         |
>  |            |                       |         V
>  |            |                       |      if (inode->i_state & (I_FREEING|I_WILL_FREE))
>  |            |                       |              continue;           <---- NOT MATCH
>  |            |                       |           |
>  |            |                       |           V
>  |            |                       |      __iget()   
>  |            |                       |            |
>  |            V                       |            V
>  |        spin_lock(&inode_lock)      |        list_move() <----- PANIC !!
>  |                                    |
>  V                                    |
> (time)
> ----------------------------------------------------------------------

Ah thanks for the explanation!

How about this lightweight fix? Since s_umount is already taken in
drop_pagecache(), it's not necessary to take iprune_mutex again.

Thanks,
Fengguang
---
 fs/drop_caches.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- mm.orig/fs/drop_caches.c
+++ mm/fs/drop_caches.c
@@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE))
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH][BUG] Lack of mutex_lock in drop_pagecache_sb()
  2009-03-24  7:44     ` Wu Fengguang
@ 2009-03-24 12:05       ` Jan Kara
  2009-03-24 12:11         ` Wu Fengguang
  2009-03-24 12:40         ` [PATCH] skip I_CLEAR state inodes Wu Fengguang
  0 siblings, 2 replies; 28+ messages in thread
From: Jan Kara @ 2009-03-24 12:05 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Masayoshi MIZUMA, linux-fsdevel, viro, Jan Kara, Nick Piggin

On Tue 24-03-09 15:44:57, Wu Fengguang wrote:
> Hi Masayoshi,
> 
> On Tue, Mar 24, 2009 at 03:06:45PM +0800, Masayoshi MIZUMA wrote:
> > Hi, Fengguang
> > 
> > On Mon, 23 Mar 2009 18:38:46 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Masasyoshi, 
> > > 
> > > On Wed, Mar 18, 2009 at 05:13:35PM +0900, Masasyoshi MIZUMA wrote:
> > > > I create the patch which fixes lack of mutex_lock in drop_pagecache_sb().
> > > > Please check the bug and the patch (below).
> > > 
> > Thank you for your comment, and I apologize to you for my lack
> > of explanation.
> > 
> > > Is this a real producible bug or a theory one?
> > This is a real bug.
> > 
> > > IMHO the I_FREEING flag should avoid the race.
> > I supplement the explanation for this problem.
> > 
> > clear_inode() is called by dispose_list(), and sets the inode's 
> > i_state to I_CLEAR. Therefore, the following conditional expression 
> > doesn't match for the inode:
> > "if (inode->i_state & (I_FREEING|I_WILL_FREE)) continue;"
> > As the result, this problem can happen.
> > 
> > > 
> > > > ----------------------------------------------------------------------
> > > > 
> > > > When drop_pagecache_sb() frees inodes, it doesn't get mutex_lock of 
> > > > iprune_mutex. Therefore, if it races the process which frees inodes 
> > > > (ex. prune_icache()), OS panic may happen.
> > > > 
> > > > An example of the panic flow is the following:
> > > > ----------------------------------------------------------------------
> > > >             [process A]               |         [process B]
> > > >  |                                    |
> > > >  |  shrink_icache_memory()            |
> > > >  |      |                             |
> > > >  |      V                             |
> > > >  |    prune_icache()                  |  drop_pagecache()
> > > >  |      mutex_lock(&iprune_mutex)     |      |
> > > >  |      spin_lock(&inode_lock)        |      |
> > > >  |          |                         |      V
> > > >  |          |                         |    drop_pagecache_sb()
> > > >  |          |                         |        |
> > > 
> > >           inode->i_state |= I_FREEING;
> > > 
> > > >  |          V                         |        V
> > > >  |      spin_unlock(&inode_lock)      |      spin_lock(&inode_lock)
> > > >  |          |                         |          |
> > > 
> > >                                                 if (inode->i_state & (I_FREEING|I_WILL_FREE))
> > >                                                         continue;
> > > 
> > > >  |          |                         |          |
> > > >  |          V                         |          V
> > > >  |      dispose_list()                |        __iget()
> > > >  |        list_del()                  |            |
> > > >  |            |                       |            |
> > > >  |            V                       |            V
> > > >  |        spin_lock(&inode_lock)      |          list_move() <----- PANIC !!
> > > >  |                                    |
> > > >  V                                    |
> > > > (time)
> > > > ----------------------------------------------------------------------
> > > > If the inode which Process B do list_move() with is the same as the one which
> > > > Process A did list_del() with, OS may panic.
> > 
> > I applied your comment and then modified the panic flow figure.
> > Please check below:
> > ----------------------------------------------------------------------
> >             [process A]               |        [process B]
> >  |                                    |
> >  |  shrink_icache_memory()            |
> >  |      |                             |
> >  |      V                             |
> >  |    prune_icache()                  | drop_pagecache()
> >  |      mutex_lock(&iprune_mutex)     |     |
> >  |      spin_lock(&inode_lock)        |     |
> >  |          |                         |     V
> >  |          |                         |   drop_pagecache_sb()
> >  |          |                         |       |
> >  |          V                         |       |
> >  |      inode->i_state |= I_FREEING;  |       |
> >  |          |                         |       |
> >  |          V                         |       V
> >  |      spin_unlock(&inode_lock)      |     spin_lock(&inode_lock)
> >  |          |                         |         |
> >  |          |                         |         |
> >  |          V                         |         |
> >  |      dispose_list()                |         |
> >  |        list_del()                  |         |
> >  |            |                       |         |
> >  |            V                       |         |
> >  |        clear_inode()               |         |
> >  |          inode->i_state = I_CLEAR  |         |
> >  |            |                       |         |
> >  |            |                       |         V
> >  |            |                       |      if (inode->i_state & (I_FREEING|I_WILL_FREE))
> >  |            |                       |              continue;           <---- NOT MATCH
> >  |            |                       |           |
> >  |            |                       |           V
> >  |            |                       |      __iget()   
> >  |            |                       |            |
> >  |            V                       |            V
> >  |        spin_lock(&inode_lock)      |        list_move() <----- PANIC !!
> >  |                                    |
> >  V                                    |
> > (time)
> > ----------------------------------------------------------------------
> 
> Ah thanks for the explanation!
> 
> How about this lightweight fix? Since s_umount is already taken in
> drop_pagecache(), it's not necessary to take iprune_mutex again.
  Yes, the patch looks fine like this. But I believe there are more places
which might miss similar check. Probably dquot.c: add_dquot_ref() needs
the same check and fs-writeback.c: generic_sync_sb_inodes() also needs
adding I_CLEAR to the test, doesn't it?

								Honza

> --- mm.orig/fs/drop_caches.c
> +++ mm/fs/drop_caches.c
> @@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
>  
>  	spin_lock(&inode_lock);
>  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> -		if (inode->i_state & (I_FREEING|I_WILL_FREE))
> +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
>  			continue;
>  		if (inode->i_mapping->nrpages == 0)
>  			continue;
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH][BUG] Lack of mutex_lock in drop_pagecache_sb()
  2009-03-24 12:05       ` Jan Kara
@ 2009-03-24 12:11         ` Wu Fengguang
  2009-03-24 12:40         ` [PATCH] skip I_CLEAR state inodes Wu Fengguang
  1 sibling, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2009-03-24 12:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: Masayoshi MIZUMA, linux-fsdevel, viro, Nick Piggin

On Tue, Mar 24, 2009 at 08:05:02PM +0800, Jan Kara wrote:
> On Tue 24-03-09 15:44:57, Wu Fengguang wrote:
> > Hi Masayoshi,
> > 
> > On Tue, Mar 24, 2009 at 03:06:45PM +0800, Masayoshi MIZUMA wrote:
> > > Hi, Fengguang
> > > 
> > > On Mon, 23 Mar 2009 18:38:46 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > Masasyoshi, 
> > > > 
> > > > On Wed, Mar 18, 2009 at 05:13:35PM +0900, Masasyoshi MIZUMA wrote:
> > > > > I create the patch which fixes lack of mutex_lock in drop_pagecache_sb().
> > > > > Please check the bug and the patch (below).
> > > > 
> > > Thank you for your comment, and I apologize to you for my lack
> > > of explanation.
> > > 
> > > > Is this a real producible bug or a theory one?
> > > This is a real bug.
> > > 
> > > > IMHO the I_FREEING flag should avoid the race.
> > > I supplement the explanation for this problem.
> > > 
> > > clear_inode() is called by dispose_list(), and sets the inode's 
> > > i_state to I_CLEAR. Therefore, the following conditional expression 
> > > doesn't match for the inode:
> > > "if (inode->i_state & (I_FREEING|I_WILL_FREE)) continue;"
> > > As the result, this problem can happen.
> > > 
> > > > 
> > > > > ----------------------------------------------------------------------
> > > > > 
> > > > > When drop_pagecache_sb() frees inodes, it doesn't get mutex_lock of 
> > > > > iprune_mutex. Therefore, if it races the process which frees inodes 
> > > > > (ex. prune_icache()), OS panic may happen.
> > > > > 
> > > > > An example of the panic flow is the following:
> > > > > ----------------------------------------------------------------------
> > > > >             [process A]               |         [process B]
> > > > >  |                                    |
> > > > >  |  shrink_icache_memory()            |
> > > > >  |      |                             |
> > > > >  |      V                             |
> > > > >  |    prune_icache()                  |  drop_pagecache()
> > > > >  |      mutex_lock(&iprune_mutex)     |      |
> > > > >  |      spin_lock(&inode_lock)        |      |
> > > > >  |          |                         |      V
> > > > >  |          |                         |    drop_pagecache_sb()
> > > > >  |          |                         |        |
> > > > 
> > > >           inode->i_state |= I_FREEING;
> > > > 
> > > > >  |          V                         |        V
> > > > >  |      spin_unlock(&inode_lock)      |      spin_lock(&inode_lock)
> > > > >  |          |                         |          |
> > > > 
> > > >                                                 if (inode->i_state & (I_FREEING|I_WILL_FREE))
> > > >                                                         continue;
> > > > 
> > > > >  |          |                         |          |
> > > > >  |          V                         |          V
> > > > >  |      dispose_list()                |        __iget()
> > > > >  |        list_del()                  |            |
> > > > >  |            |                       |            |
> > > > >  |            V                       |            V
> > > > >  |        spin_lock(&inode_lock)      |          list_move() <----- PANIC !!
> > > > >  |                                    |
> > > > >  V                                    |
> > > > > (time)
> > > > > ----------------------------------------------------------------------
> > > > > If the inode which Process B do list_move() with is the same as the one which
> > > > > Process A did list_del() with, OS may panic.
> > > 
> > > I applied your comment and then modified the panic flow figure.
> > > Please check below:
> > > ----------------------------------------------------------------------
> > >             [process A]               |        [process B]
> > >  |                                    |
> > >  |  shrink_icache_memory()            |
> > >  |      |                             |
> > >  |      V                             |
> > >  |    prune_icache()                  | drop_pagecache()
> > >  |      mutex_lock(&iprune_mutex)     |     |
> > >  |      spin_lock(&inode_lock)        |     |
> > >  |          |                         |     V
> > >  |          |                         |   drop_pagecache_sb()
> > >  |          |                         |       |
> > >  |          V                         |       |
> > >  |      inode->i_state |= I_FREEING;  |       |
> > >  |          |                         |       |
> > >  |          V                         |       V
> > >  |      spin_unlock(&inode_lock)      |     spin_lock(&inode_lock)
> > >  |          |                         |         |
> > >  |          |                         |         |
> > >  |          V                         |         |
> > >  |      dispose_list()                |         |
> > >  |        list_del()                  |         |
> > >  |            |                       |         |
> > >  |            V                       |         |
> > >  |        clear_inode()               |         |
> > >  |          inode->i_state = I_CLEAR  |         |
> > >  |            |                       |         |
> > >  |            |                       |         V
> > >  |            |                       |      if (inode->i_state & (I_FREEING|I_WILL_FREE))
> > >  |            |                       |              continue;           <---- NOT MATCH
> > >  |            |                       |           |
> > >  |            |                       |           V
> > >  |            |                       |      __iget()   
> > >  |            |                       |            |
> > >  |            V                       |            V
> > >  |        spin_lock(&inode_lock)      |        list_move() <----- PANIC !!
> > >  |                                    |
> > >  V                                    |
> > > (time)
> > > ----------------------------------------------------------------------
> > 
> > Ah thanks for the explanation!
> > 
> > How about this lightweight fix? Since s_umount is already taken in
> > drop_pagecache(), it's not necessary to take iprune_mutex again.
>   Yes, the patch looks fine like this. But I believe there are more places
> which might miss similar check. Probably dquot.c: add_dquot_ref() needs
> the same check and fs-writeback.c: generic_sync_sb_inodes() also needs
> adding I_CLEAR to the test, doesn't it?

Exactly! Basically any test on I_FREEING shall be coupled with I_CLEAR,
because of the I_FREEING => I_CLEAR transition _outside_ of inode_lock.

Thanks,
Fengguang

> > --- mm.orig/fs/drop_caches.c
> > +++ mm/fs/drop_caches.c
> > @@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
> >  
> >  	spin_lock(&inode_lock);
> >  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > -		if (inode->i_state & (I_FREEING|I_WILL_FREE))
> > +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> >  			continue;
> >  		if (inode->i_mapping->nrpages == 0)
> >  			continue;
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] skip I_CLEAR state inodes
  2009-03-24 12:05       ` Jan Kara
  2009-03-24 12:11         ` Wu Fengguang
@ 2009-03-24 12:40         ` Wu Fengguang
  2009-03-30  7:18           ` [PATCH][RESEND for 2.6.29-rc8-mm1] " Wu Fengguang
  2009-06-01 21:38           ` [PATCH] " Eric Sandeen
  1 sibling, 2 replies; 28+ messages in thread
From: Wu Fengguang @ 2009-03-24 12:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: stable, LKML, Jan Kara, Masayoshi MIZUMA, linux-fsdevel, viro,
	Nick Piggin

Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
add_dquot_ref().

clear_inode() will switch inode state from I_FREEING to I_CLEAR,
and do so _outside_ of inode_lock. So any I_FREEING testing is
incomplete without the testing of I_CLEAR.

Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
Jan Kara reminds fixing the other two cases. Thanks!

Reported-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/dquot.c        |    2 +-
 fs/drop_caches.c  |    2 +-
 fs/fs-writeback.c |    3 ++-
 3 files changed, 4 insertions(+), 3 deletions(-)

--- mm.orig/fs/drop_caches.c
+++ mm/fs/drop_caches.c
@@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE))
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
 		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 			struct address_space *mapping;
 
-			if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+			if (inode->i_state &
+					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
 				continue;
 			mapping = inode->i_mapping;
 			if (mapping->nrpages == 0)
--- mm.orig/fs/dquot.c
+++ mm/fs/dquot.c
@@ -793,7 +793,7 @@ static void add_dquot_ref(struct super_b
 			continue;
 		if (!dqinit_needed(inode, type))
 			continue;
-		if (inode->i_state & (I_FREEING|I_WILL_FREE))
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
 			continue;
 
 		__iget(inode);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH][RESEND for 2.6.29-rc8-mm1] skip I_CLEAR state inodes
  2009-03-24 12:40         ` [PATCH] skip I_CLEAR state inodes Wu Fengguang
@ 2009-03-30  7:18           ` Wu Fengguang
  2009-03-31 23:43             ` Andrew Morton
  2009-06-01 21:38           ` [PATCH] " Eric Sandeen
  1 sibling, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2009-03-30  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: stable, LKML, Jan Kara, Masayoshi MIZUMA, linux-fsdevel, viro,
	Nick Piggin

clear_inode() will switch inode state from I_FREEING to I_CLEAR,
and do so _outside_ of inode_lock. So any I_FREEING testing is
incomplete without a coupled testing of I_CLEAR.

So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
add_dquot_ref().

Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
reminds fixing the other two cases.

Reported-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/drop_caches.c  |    2 +-
 fs/fs-writeback.c |    3 ++-
 fs/quota/dquot.c  |    2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

--- mm.orig/fs/drop_caches.c
+++ mm/fs/drop_caches.c
@@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
 		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 			struct address_space *mapping;
 
-			if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+			if (inode->i_state &
+					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
 				continue;
 			mapping = inode->i_mapping;
 			if (mapping->nrpages == 0)
--- mm.orig/fs/quota/dquot.c
+++ mm/fs/quota/dquot.c
@@ -823,7 +823,7 @@ static void add_dquot_ref(struct super_b
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
 			continue;
 		if (!atomic_read(&inode->i_writecount))
 			continue;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH][RESEND for 2.6.29-rc8-mm1] skip I_CLEAR state inodes
  2009-03-30  7:18           ` [PATCH][RESEND for 2.6.29-rc8-mm1] " Wu Fengguang
@ 2009-03-31 23:43             ` Andrew Morton
  2009-04-01  0:53               ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2009-03-31 23:43 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: stable, linux-kernel, jack, m.mizuma, linux-fsdevel, viro, npiggin

On Mon, 30 Mar 2009 15:18:24 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> and do so _outside_ of inode_lock. So any I_FREEING testing is
> incomplete without a coupled testing of I_CLEAR.
> 
> So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> add_dquot_ref().
> 
> Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
> reminds fixing the other two cases.

ok...

But what is the user-visible consequence of this?  You cc'ed
stable@kernel.org so I assume it's serious.  People will want to know
what problem we're fixing!

> 
> --- mm.orig/fs/drop_caches.c
> +++ mm/fs/drop_caches.c
> @@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
>  
>  	spin_lock(&inode_lock);
>  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
>  			continue;
>  		if (inode->i_mapping->nrpages == 0)
>  			continue;
> --- mm.orig/fs/fs-writeback.c
> +++ mm/fs/fs-writeback.c
> @@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
>  		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>  			struct address_space *mapping;
>  
> -			if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> +			if (inode->i_state &
> +					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
>  				continue;
>  			mapping = inode->i_mapping;
>  			if (mapping->nrpages == 0)
> --- mm.orig/fs/quota/dquot.c
> +++ mm/fs/quota/dquot.c
> @@ -823,7 +823,7 @@ static void add_dquot_ref(struct super_b
>  
>  	spin_lock(&inode_lock);
>  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
>  			continue;
>  		if (!atomic_read(&inode->i_writecount))
>  			continue;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH][RESEND for 2.6.29-rc8-mm1] skip I_CLEAR state inodes
  2009-03-31 23:43             ` Andrew Morton
@ 2009-04-01  0:53               ` Wu Fengguang
  0 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2009-04-01  0:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: stable, linux-kernel, jack, m.mizuma, linux-fsdevel, viro, npiggin

On Wed, Apr 01, 2009 at 07:43:32AM +0800, Andrew Morton wrote:
> On Mon, 30 Mar 2009 15:18:24 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > incomplete without a coupled testing of I_CLEAR.
> >
> > So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > add_dquot_ref().
> >
> > Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
> > reminds fixing the other two cases.
>
> ok...
>
> But what is the user-visible consequence of this?  You cc'ed
> stable@kernel.org so I assume it's serious.  People will want to know
> what problem we're fixing!

Sorry, the changelog could be expanded with the following paragraph:

Fix real kernel panics.  Masayoshi MIZUMA has a nice panic flow:
----------------------------------------------------------------------
            [process A]               |        [process B]
 |                                    |
 |    prune_icache()                  | drop_pagecache()
 |      spin_lock(&inode_lock)        |   drop_pagecache_sb()
 |      inode->i_state |= I_FREEING;  |       |
 |      spin_unlock(&inode_lock)      |       V
 |          |                         |     spin_lock(&inode_lock)
 |          V                         |         |
 |      dispose_list()                |         |
 |        list_del()                  |         |
 |        clear_inode()               |         |
 |          inode->i_state = I_CLEAR  |         |
 |            |                       |         V
 |            |                       |      if (inode->i_state & (I_FREEING|I_WILL_FREE))
 |            |                       |              continue;           <==== NOT MATCH
 |            |                       |
 |            |                       | (DANGER from here on! Accessing disposing inode!)
 |            |                       |
 |            |                       |      __iget()
 |            |                       |        list_move() <===== PANIC on poisoned list !!
 V            V                       |
(time)
----------------------------------------------------------------------

Thanks,
Fengguang

> >
> > --- mm.orig/fs/drop_caches.c
> > +++ mm/fs/drop_caches.c
> > @@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
> >
> >  	spin_lock(&inode_lock);
> >  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> > +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> >  			continue;
> >  		if (inode->i_mapping->nrpages == 0)
> >  			continue;
> > --- mm.orig/fs/fs-writeback.c
> > +++ mm/fs/fs-writeback.c
> > @@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
> >  		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> >  			struct address_space *mapping;
> >
> > -			if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> > +			if (inode->i_state &
> > +					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> >  				continue;
> >  			mapping = inode->i_mapping;
> >  			if (mapping->nrpages == 0)
> > --- mm.orig/fs/quota/dquot.c
> > +++ mm/fs/quota/dquot.c
> > @@ -823,7 +823,7 @@ static void add_dquot_ref(struct super_b
> >
> >  	spin_lock(&inode_lock);
> >  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> > +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> >  			continue;
> >  		if (!atomic_read(&inode->i_writecount))
> >  			continue;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-03-24 12:40         ` [PATCH] skip I_CLEAR state inodes Wu Fengguang
  2009-03-30  7:18           ` [PATCH][RESEND for 2.6.29-rc8-mm1] " Wu Fengguang
@ 2009-06-01 21:38           ` Eric Sandeen
  2009-06-02  8:55             ` Wu Fengguang
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Sandeen @ 2009-06-01 21:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, Masayoshi MIZUMA, linux-fsdevel,
	viro, Nick Piggin

Wu Fengguang wrote:
> Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> add_dquot_ref().
> 
> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> and do so _outside_ of inode_lock. So any I_FREEING testing is
> incomplete without the testing of I_CLEAR.
> 
> Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> Jan Kara reminds fixing the other two cases. Thanks!

Is there a reason it's not done for __sync_single_inode as well?

Jeff Layton asked the question and I'm following it up :)

__sync_single_inode currently only tests I_FREEING, but I think we are
safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
I_SYNC to be cleared before it changes I_STATE.

On the other hand, testing I_CLEAR here probably would be safe anyway,
and it'd be bonus points for consistency?

Same basic question for generic_sync_sb_inodes, which has a
BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
as well?

Thanks,
-Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-01 21:38           ` [PATCH] " Eric Sandeen
@ 2009-06-02  8:55             ` Wu Fengguang
  2009-06-02 10:27               ` Jeff Layton
  2009-06-02 11:37               ` Jan Kara
  0 siblings, 2 replies; 28+ messages in thread
From: Wu Fengguang @ 2009-06-02  8:55 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Andrew Morton, LKML, Jan Kara, Masayoshi MIZUMA, linux-fsdevel,
	viro, Nick Piggin

On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> Wu Fengguang wrote:
> > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > add_dquot_ref().
> > 
> > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > incomplete without the testing of I_CLEAR.
> > 
> > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > Jan Kara reminds fixing the other two cases. Thanks!
> 
> Is there a reason it's not done for __sync_single_inode as well?

It missed the glance because it don't have an obvious '|' in the line ;)

> Jeff Layton asked the question and I'm following it up :)
> 
> __sync_single_inode currently only tests I_FREEING, but I think we are
> safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> I_SYNC to be cleared before it changes I_STATE.

But I_SYNC is removed just before the I_FREEING test, so we still have
a small race window?

> On the other hand, testing I_CLEAR here probably would be safe anyway,
> and it'd be bonus points for consistency?

So let's add the I_CLEAR test?

> Same basic question for generic_sync_sb_inodes, which has a
> BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> as well?

Yes, we can add I_CLEAR here to catch more error condition.

Thanks,
Fengguang

---
skip I_CLEAR state inodes in writeback routines

The I_FREEING test in __sync_single_inode() is racy because
clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
and the test of I_FREEING.

Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.

Reported-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
 	spin_lock(&inode_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state &= ~I_SYNC;
-	if (!(inode->i_state & I_FREEING)) {
+	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
 		if (!(inode->i_state & I_DIRTY) &&
 		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
@@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
 		if (current_is_pdflush() && !writeback_acquire(bdi))
 			break;
 
-		BUG_ON(inode->i_state & I_FREEING);
+		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
 		__iget(inode);
 		pages_skipped = wbc->pages_skipped;
 		__writeback_single_inode(inode, wbc);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02  8:55             ` Wu Fengguang
@ 2009-06-02 10:27               ` Jeff Layton
  2009-06-02 11:37               ` Jan Kara
  1 sibling, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2009-06-02 10:27 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Eric Sandeen, Andrew Morton, LKML, Jan Kara, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin

On Tue, 2 Jun 2009 16:55:23 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > Wu Fengguang wrote:
> > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > add_dquot_ref().
> > > 
> > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > incomplete without the testing of I_CLEAR.
> > > 
> > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > Jan Kara reminds fixing the other two cases. Thanks!
> > 
> > Is there a reason it's not done for __sync_single_inode as well?
> 
> It missed the glance because it don't have an obvious '|' in the line ;)
> 
> > Jeff Layton asked the question and I'm following it up :)
> > 
> > __sync_single_inode currently only tests I_FREEING, but I think we are
> > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > I_SYNC to be cleared before it changes I_STATE.
> 
> But I_SYNC is removed just before the I_FREEING test, so we still have
> a small race window?
> 

Yes, I think so. __sync_single_inode clears I_SYNC but doesn't wake up
the wait queue until the end of the function. So I think it's possible
(though unlikely) that another thread can race in and change the state
to I_CLEAR before the I_FREEING check.

> > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > and it'd be bonus points for consistency?
> 
> So let's add the I_CLEAR test?
> 
> > Same basic question for generic_sync_sb_inodes, which has a
> > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > as well?
> 
> Yes, we can add I_CLEAR here to catch more error condition.
> 
> Thanks,
> Fengguang
> 
> ---
> skip I_CLEAR state inodes in writeback routines
> 
> The I_FREEING test in __sync_single_inode() is racy because
> clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> and the test of I_FREEING.
> 
> Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> 
> Reported-by: Jeff Layton <jlayton@redhat.com>
> Reported-by: Eric Sandeen <sandeen@sandeen.net>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
>  	spin_lock(&inode_lock);
>  	WARN_ON(inode->i_state & I_NEW);
>  	inode->i_state &= ~I_SYNC;
> -	if (!(inode->i_state & I_FREEING)) {
> +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
>  		if (!(inode->i_state & I_DIRTY) &&
>  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
> @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
>  		if (current_is_pdflush() && !writeback_acquire(bdi))
>  			break;
>  
> -		BUG_ON(inode->i_state & I_FREEING);
> +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
>  		__iget(inode);
>  		pages_skipped = wbc->pages_skipped;
>  		__writeback_single_inode(inode, wbc);

Acked-by: Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02  8:55             ` Wu Fengguang
  2009-06-02 10:27               ` Jeff Layton
@ 2009-06-02 11:37               ` Jan Kara
  2009-06-02 21:48                 ` Eric Sandeen
                                   ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Jan Kara @ 2009-06-02 11:37 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Eric Sandeen, Andrew Morton, LKML, Jan Kara, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin

On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > Wu Fengguang wrote:
> > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > add_dquot_ref().
> > > 
> > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > incomplete without the testing of I_CLEAR.
> > > 
> > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > Jan Kara reminds fixing the other two cases. Thanks!
> > 
> > Is there a reason it's not done for __sync_single_inode as well?
> 
> It missed the glance because it don't have an obvious '|' in the line ;)
> 
> > Jeff Layton asked the question and I'm following it up :)
> > 
> > __sync_single_inode currently only tests I_FREEING, but I think we are
> > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > I_SYNC to be cleared before it changes I_STATE.
> 
> But I_SYNC is removed just before the I_FREEING test, so we still have
> a small race window?
> 
> > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > and it'd be bonus points for consistency?
> 
> So let's add the I_CLEAR test?
> 
> > Same basic question for generic_sync_sb_inodes, which has a
> > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > as well?
> 
> Yes, we can add I_CLEAR here to catch more error condition.
> 
> Thanks,
> Fengguang
> 
> ---
> skip I_CLEAR state inodes in writeback routines
> 
> The I_FREEING test in __sync_single_inode() is racy because
> clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> and the test of I_FREEING.
> 
> Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> 
> Reported-by: Jeff Layton <jlayton@redhat.com>
> Reported-by: Eric Sandeen <sandeen@sandeen.net>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
>  	spin_lock(&inode_lock);
>  	WARN_ON(inode->i_state & I_NEW);
>  	inode->i_state &= ~I_SYNC;
> -	if (!(inode->i_state & I_FREEING)) {
> +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
>  		if (!(inode->i_state & I_DIRTY) &&
>  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
  Is the whole if needed? I had an impression that everyone calling
__sync_single_inode() should better take care it does not race with inode
freeing... So WARN_ON would be more appropriate IMHO.

>  			/*
> @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
>  		if (current_is_pdflush() && !writeback_acquire(bdi))
>  			break;
>  
> -		BUG_ON(inode->i_state & I_FREEING);
> +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
>  		__iget(inode);
>  		pages_skipped = wbc->pages_skipped;
>  		__writeback_single_inode(inode, wbc);
  Looking at this code, it looks a bit suspicious. What prevents this s_io
list scan to race with inode freeing? In particular generic_forget_inode()
can drop inode_lock to write the inode and in the mean time
generic_sync_sb_inodes() can come, get a reference to the inode and start
it's writeback... Subsequent iput() would then call generic_forget_inode()
on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
inodes in this scan like we do for later in the function for another scan?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02 11:37               ` Jan Kara
@ 2009-06-02 21:48                 ` Eric Sandeen
  2009-06-03 10:45                   ` Jeff Layton
  2009-06-03 13:32                 ` Wu Fengguang
  2009-06-03 14:10                 ` Wu Fengguang
  2 siblings, 1 reply; 28+ messages in thread
From: Eric Sandeen @ 2009-06-02 21:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin

Jan Kara wrote:
> On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
>> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
>>> Wu Fengguang wrote:
>>>> Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
>>>> add_dquot_ref().
>>>>
>>>> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
>>>> and do so _outside_ of inode_lock. So any I_FREEING testing is
>>>> incomplete without the testing of I_CLEAR.
>>>>
>>>> Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
>>>> Jan Kara reminds fixing the other two cases. Thanks!
>>> Is there a reason it's not done for __sync_single_inode as well?
>> It missed the glance because it don't have an obvious '|' in the line ;)
>>
>>> Jeff Layton asked the question and I'm following it up :)
>>>
>>> __sync_single_inode currently only tests I_FREEING, but I think we are
>>> safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
>>> I_SYNC to be cleared before it changes I_STATE.
>> But I_SYNC is removed just before the I_FREEING test, so we still have
>> a small race window?

yep that's right.

        inode->i_state &= ~I_SYNC;
	>>> clear_inode->inode_sync_wait here and find it clear <<<
        if (!(inode->i_state & I_FREEING)) {

...

>> --- linux.orig/fs/fs-writeback.c
>> +++ linux/fs/fs-writeback.c
>> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
>>  	spin_lock(&inode_lock);
>>  	WARN_ON(inode->i_state & I_NEW);
>>  	inode->i_state &= ~I_SYNC;
>> -	if (!(inode->i_state & I_FREEING)) {
>> +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
>>  		if (!(inode->i_state & I_DIRTY) &&
>>  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {

>   Is the whole if needed? I had an impression that everyone calling
> __sync_single_inode() should better take care it does not race with inode
> freeing... So WARN_ON would be more appropriate IMHO.

Maybe both then (both a WARN on and then the test (defensive here, I
guess)) because if we continue we may wander into a poisoned list
pointer and explode, right?

-Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02 21:48                 ` Eric Sandeen
@ 2009-06-03 10:45                   ` Jeff Layton
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2009-06-03 10:45 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Wu Fengguang, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin

On Tue, 02 Jun 2009 16:48:57 -0500
Eric Sandeen <sandeen@sandeen.net> wrote:

> Jan Kara wrote:
> > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> >> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> >>> Wu Fengguang wrote:
> >>>> Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> >>>> add_dquot_ref().
> >>>>
> >>>> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> >>>> and do so _outside_ of inode_lock. So any I_FREEING testing is
> >>>> incomplete without the testing of I_CLEAR.
> >>>>
> >>>> Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> >>>> Jan Kara reminds fixing the other two cases. Thanks!
> >>> Is there a reason it's not done for __sync_single_inode as well?
> >> It missed the glance because it don't have an obvious '|' in the line ;)
> >>
> >>> Jeff Layton asked the question and I'm following it up :)
> >>>
> >>> __sync_single_inode currently only tests I_FREEING, but I think we are
> >>> safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> >>> I_SYNC to be cleared before it changes I_STATE.
> >> But I_SYNC is removed just before the I_FREEING test, so we still have
> >> a small race window?
> 
> yep that's right.
> 
>         inode->i_state &= ~I_SYNC;
> 	>>> clear_inode->inode_sync_wait here and find it clear <<<
>         if (!(inode->i_state & I_FREEING)) {
> 
> ...
> 
> >> --- linux.orig/fs/fs-writeback.c
> >> +++ linux/fs/fs-writeback.c
> >> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> >>  	spin_lock(&inode_lock);
> >>  	WARN_ON(inode->i_state & I_NEW);
> >>  	inode->i_state &= ~I_SYNC;
> >> -	if (!(inode->i_state & I_FREEING)) {
> >> +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> >>  		if (!(inode->i_state & I_DIRTY) &&
> >>  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> 
> >   Is the whole if needed? I had an impression that everyone calling
> > __sync_single_inode() should better take care it does not race with inode
> > freeing... So WARN_ON would be more appropriate IMHO.
> 
> Maybe both then (both a WARN on and then the test (defensive here, I
> guess)) because if we continue we may wander into a poisoned list
> pointer and explode, right?
> 

Right.

I think removing this check would be roughly equivalent to adding a
BUG_ON. It just wouldn't reliably pop since it would depend on the
timing of the race.

Keeping the check and adding a WARN seems like a reasonable thing to
do. Maybe after a few releases if no one has seen the WARN fire we can
look at removing the check (or maybe turn it into a BUG)?

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02 11:37               ` Jan Kara
  2009-06-02 21:48                 ` Eric Sandeen
@ 2009-06-03 13:32                 ` Wu Fengguang
  2009-06-03 14:00                   ` Jan Kara
  2009-06-03 14:10                 ` Wu Fengguang
  2 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2009-06-03 13:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin

[reply the easy part first]
On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > Wu Fengguang wrote:
> > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > add_dquot_ref().
> > > > 
> > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > incomplete without the testing of I_CLEAR.
> > > > 
> > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > 
> > > Is there a reason it's not done for __sync_single_inode as well?
> > 
> > It missed the glance because it don't have an obvious '|' in the line ;)
> > 
> > > Jeff Layton asked the question and I'm following it up :)
> > > 
> > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > I_SYNC to be cleared before it changes I_STATE.
> > 
> > But I_SYNC is removed just before the I_FREEING test, so we still have
> > a small race window?
> > 
> > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > and it'd be bonus points for consistency?
> > 
> > So let's add the I_CLEAR test?
> > 
> > > Same basic question for generic_sync_sb_inodes, which has a
> > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > as well?
> > 
> > Yes, we can add I_CLEAR here to catch more error condition.
> > 
> > Thanks,
> > Fengguang
> > 
> > ---
> > skip I_CLEAR state inodes in writeback routines
> > 
> > The I_FREEING test in __sync_single_inode() is racy because
> > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > and the test of I_FREEING.
> > 
> > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > 
> > Reported-by: Jeff Layton <jlayton@redhat.com>
> > Reported-by: Eric Sandeen <sandeen@sandeen.net>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > --- linux.orig/fs/fs-writeback.c
> > +++ linux/fs/fs-writeback.c
> > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> >  	spin_lock(&inode_lock);
> >  	WARN_ON(inode->i_state & I_NEW);
> >  	inode->i_state &= ~I_SYNC;
> > -	if (!(inode->i_state & I_FREEING)) {
> > +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> >  		if (!(inode->i_state & I_DIRTY) &&
> >  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>   Is the whole if needed? I had an impression that everyone calling
> __sync_single_inode() should better take care it does not race with inode
> freeing... So WARN_ON would be more appropriate IMHO.

The caller doesn't matter here, because we temporarily dropped inode_lock
in __sync_single_inode() ?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-03 13:32                 ` Wu Fengguang
@ 2009-06-03 14:00                   ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2009-06-03 14:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin

On Wed 03-06-09 21:32:02, Wu Fengguang wrote:
> [reply the easy part first]
> On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > > Wu Fengguang wrote:
> > > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > > add_dquot_ref().
> > > > > 
> > > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > > incomplete without the testing of I_CLEAR.
> > > > > 
> > > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > > 
> > > > Is there a reason it's not done for __sync_single_inode as well?
> > > 
> > > It missed the glance because it don't have an obvious '|' in the line ;)
> > > 
> > > > Jeff Layton asked the question and I'm following it up :)
> > > > 
> > > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > > I_SYNC to be cleared before it changes I_STATE.
> > > 
> > > But I_SYNC is removed just before the I_FREEING test, so we still have
> > > a small race window?
> > > 
> > > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > > and it'd be bonus points for consistency?
> > > 
> > > So let's add the I_CLEAR test?
> > > 
> > > > Same basic question for generic_sync_sb_inodes, which has a
> > > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > > as well?
> > > 
> > > Yes, we can add I_CLEAR here to catch more error condition.
> > > 
> > > Thanks,
> > > Fengguang
> > > 
> > > ---
> > > skip I_CLEAR state inodes in writeback routines
> > > 
> > > The I_FREEING test in __sync_single_inode() is racy because
> > > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > > and the test of I_FREEING.
> > > 
> > > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > > 
> > > Reported-by: Jeff Layton <jlayton@redhat.com>
> > > Reported-by: Eric Sandeen <sandeen@sandeen.net>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  fs/fs-writeback.c |    4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > --- linux.orig/fs/fs-writeback.c
> > > +++ linux/fs/fs-writeback.c
> > > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > >  	spin_lock(&inode_lock);
> > >  	WARN_ON(inode->i_state & I_NEW);
> > >  	inode->i_state &= ~I_SYNC;
> > > -	if (!(inode->i_state & I_FREEING)) {
> > > +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > >  		if (!(inode->i_state & I_DIRTY) &&
> > >  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> >   Is the whole if needed? I had an impression that everyone calling
> > __sync_single_inode() should better take care it does not race with inode
> > freeing... So WARN_ON would be more appropriate IMHO.
> 
> The caller doesn't matter here, because we temporarily dropped inode_lock
> in __sync_single_inode() ?
  But the problem is basically what I described in the part you cut from
the email - if the caller of __sync_single_inode() does not have a
reference to the inode or some other means of protecting inode from being
released, then we have a problem because e.g. __writeback_single_inode()
can sleep and before I_SYNC is set, all the clear_inode() can finish and
we end up calling do_writepages() on invalidated mapping.
  So in my opinion everybody calling __writeback_single_inode() and thus
__sync_single_inode() should do an equivalent of igrab() or be sure inode
cannot be freed e.g. because we have a file open.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-02 11:37               ` Jan Kara
  2009-06-02 21:48                 ` Eric Sandeen
  2009-06-03 13:32                 ` Wu Fengguang
@ 2009-06-03 14:10                 ` Wu Fengguang
  2009-06-03 14:16                   ` Jan Kara
  2 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2009-06-03 14:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin, Jeff Layton

On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > Wu Fengguang wrote:
> > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > add_dquot_ref().
> > > > 
> > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > incomplete without the testing of I_CLEAR.
> > > > 
> > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > 
> > > Is there a reason it's not done for __sync_single_inode as well?
> > 
> > It missed the glance because it don't have an obvious '|' in the line ;)
> > 
> > > Jeff Layton asked the question and I'm following it up :)
> > > 
> > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > I_SYNC to be cleared before it changes I_STATE.
> > 
> > But I_SYNC is removed just before the I_FREEING test, so we still have
> > a small race window?
> > 
> > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > and it'd be bonus points for consistency?
> > 
> > So let's add the I_CLEAR test?
> > 
> > > Same basic question for generic_sync_sb_inodes, which has a
> > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > as well?
> > 
> > Yes, we can add I_CLEAR here to catch more error condition.
> > 
> > Thanks,
> > Fengguang
> > 
> > ---
> > skip I_CLEAR state inodes in writeback routines
> > 
> > The I_FREEING test in __sync_single_inode() is racy because
> > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > and the test of I_FREEING.
> > 
> > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > 
> > Reported-by: Jeff Layton <jlayton@redhat.com>
> > Reported-by: Eric Sandeen <sandeen@sandeen.net>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > --- linux.orig/fs/fs-writeback.c
> > +++ linux/fs/fs-writeback.c
> > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> >  	spin_lock(&inode_lock);
> >  	WARN_ON(inode->i_state & I_NEW);
> >  	inode->i_state &= ~I_SYNC;
> > -	if (!(inode->i_state & I_FREEING)) {
> > +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> >  		if (!(inode->i_state & I_DIRTY) &&
> >  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>   Is the whole if needed? I had an impression that everyone calling
> __sync_single_inode() should better take care it does not race with inode
> freeing... So WARN_ON would be more appropriate IMHO.
> 
> >  			/*
> > @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> >  		if (current_is_pdflush() && !writeback_acquire(bdi))
> >  			break;
> >  
> > -		BUG_ON(inode->i_state & I_FREEING);
> > +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> >  		__iget(inode);
> >  		pages_skipped = wbc->pages_skipped;
> >  		__writeback_single_inode(inode, wbc);
>   Looking at this code, it looks a bit suspicious. What prevents this s_io
> list scan to race with inode freeing? In particular generic_forget_inode()

Good catch.

> can drop inode_lock to write the inode and in the mean time
> generic_sync_sb_inodes() can come, get a reference to the inode and start
> it's writeback... Subsequent iput() would then call generic_forget_inode()

Another possibility:

generic_forget_inode
  inode->i_state |= I_WILL_FREE;
  spin_unlock(&inode_lock);
                                                generic_sync_sb_inodes()
                                                  spin_lock(&inode_lock);
                                                  __iget(inode);
                                                  __writeback_single_inode
                                                    // see non zero i_count
                                                    WARN_ON(inode->i_state & I_WILL_FREE);

I'm wondering why didn't we saw reports on the last WARN_ON()?
Did we missed something?

> on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
> inodes in this scan like we do for later in the function for another scan?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-03 14:10                 ` Wu Fengguang
@ 2009-06-03 14:16                   ` Jan Kara
  2009-06-03 14:47                     ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2009-06-03 14:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin, Jeff Layton

On Wed 03-06-09 22:10:21, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > > Wu Fengguang wrote:
> > > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > > add_dquot_ref().
> > > > > 
> > > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > > incomplete without the testing of I_CLEAR.
> > > > > 
> > > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > > 
> > > > Is there a reason it's not done for __sync_single_inode as well?
> > > 
> > > It missed the glance because it don't have an obvious '|' in the line ;)
> > > 
> > > > Jeff Layton asked the question and I'm following it up :)
> > > > 
> > > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > > I_SYNC to be cleared before it changes I_STATE.
> > > 
> > > But I_SYNC is removed just before the I_FREEING test, so we still have
> > > a small race window?
> > > 
> > > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > > and it'd be bonus points for consistency?
> > > 
> > > So let's add the I_CLEAR test?
> > > 
> > > > Same basic question for generic_sync_sb_inodes, which has a
> > > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > > as well?
> > > 
> > > Yes, we can add I_CLEAR here to catch more error condition.
> > > 
> > > Thanks,
> > > Fengguang
> > > 
> > > ---
> > > skip I_CLEAR state inodes in writeback routines
> > > 
> > > The I_FREEING test in __sync_single_inode() is racy because
> > > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > > and the test of I_FREEING.
> > > 
> > > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > > 
> > > Reported-by: Jeff Layton <jlayton@redhat.com>
> > > Reported-by: Eric Sandeen <sandeen@sandeen.net>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  fs/fs-writeback.c |    4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > --- linux.orig/fs/fs-writeback.c
> > > +++ linux/fs/fs-writeback.c
> > > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > >  	spin_lock(&inode_lock);
> > >  	WARN_ON(inode->i_state & I_NEW);
> > >  	inode->i_state &= ~I_SYNC;
> > > -	if (!(inode->i_state & I_FREEING)) {
> > > +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > >  		if (!(inode->i_state & I_DIRTY) &&
> > >  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> >   Is the whole if needed? I had an impression that everyone calling
> > __sync_single_inode() should better take care it does not race with inode
> > freeing... So WARN_ON would be more appropriate IMHO.
> > 
> > >  			/*
> > > @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> > >  		if (current_is_pdflush() && !writeback_acquire(bdi))
> > >  			break;
> > >  
> > > -		BUG_ON(inode->i_state & I_FREEING);
> > > +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> > >  		__iget(inode);
> > >  		pages_skipped = wbc->pages_skipped;
> > >  		__writeback_single_inode(inode, wbc);
> >   Looking at this code, it looks a bit suspicious. What prevents this s_io
> > list scan to race with inode freeing? In particular generic_forget_inode()
> 
> Good catch.
> 
> > can drop inode_lock to write the inode and in the mean time
> > generic_sync_sb_inodes() can come, get a reference to the inode and start
> > it's writeback... Subsequent iput() would then call generic_forget_inode()
> 
> Another possibility:
> 
> generic_forget_inode
>   inode->i_state |= I_WILL_FREE;
>   spin_unlock(&inode_lock);
>                                                 generic_sync_sb_inodes()
>                                                   spin_lock(&inode_lock);
>                                                   __iget(inode);
>                                                   __writeback_single_inode
>                                                     // see non zero i_count
>                                                     WARN_ON(inode->i_state & I_WILL_FREE);
> 
> I'm wondering why didn't we saw reports on the last WARN_ON()?
> Did we missed something?
  I meant the above race in my description ;-). Anyway, the race can happen
only if we are unmounting the filesystem (normally, we bail out on
sb->s_flags & MS_ACTIVE check - yes, it's a bit hidden and it also took me
a while to understand why we weren't seeing tons of warnings...).

> > on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
> > inodes in this scan like we do for later in the function for another scan?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] skip I_CLEAR state inodes
  2009-06-03 14:16                   ` Jan Kara
@ 2009-06-03 14:47                     ` Wu Fengguang
  2009-06-06  3:07                       ` [PATCH] writeback: skip new or to-be-freed inodes Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2009-06-03 14:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin, Jeff Layton

On Wed, Jun 03, 2009 at 10:16:36PM +0800, Jan Kara wrote:
> On Wed 03-06-09 22:10:21, Wu Fengguang wrote:
> > On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> > > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > > > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > > > Wu Fengguang wrote:
> > > > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > > > add_dquot_ref().
> > > > > > 
> > > > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > > > incomplete without the testing of I_CLEAR.
> > > > > > 
> > > > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > > > 
> > > > > Is there a reason it's not done for __sync_single_inode as well?
> > > > 
> > > > It missed the glance because it don't have an obvious '|' in the line ;)
> > > > 
> > > > > Jeff Layton asked the question and I'm following it up :)
> > > > > 
> > > > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > > > I_SYNC to be cleared before it changes I_STATE.
> > > > 
> > > > But I_SYNC is removed just before the I_FREEING test, so we still have
> > > > a small race window?
> > > > 
> > > > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > > > and it'd be bonus points for consistency?
> > > > 
> > > > So let's add the I_CLEAR test?
> > > > 
> > > > > Same basic question for generic_sync_sb_inodes, which has a
> > > > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > > > as well?
> > > > 
> > > > Yes, we can add I_CLEAR here to catch more error condition.
> > > > 
> > > > Thanks,
> > > > Fengguang
> > > > 
> > > > ---
> > > > skip I_CLEAR state inodes in writeback routines
> > > > 
> > > > The I_FREEING test in __sync_single_inode() is racy because
> > > > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > > > and the test of I_FREEING.
> > > > 
> > > > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > > > 
> > > > Reported-by: Jeff Layton <jlayton@redhat.com>
> > > > Reported-by: Eric Sandeen <sandeen@sandeen.net>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > ---
> > > >  fs/fs-writeback.c |    4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > --- linux.orig/fs/fs-writeback.c
> > > > +++ linux/fs/fs-writeback.c
> > > > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > > >  	spin_lock(&inode_lock);
> > > >  	WARN_ON(inode->i_state & I_NEW);
> > > >  	inode->i_state &= ~I_SYNC;
> > > > -	if (!(inode->i_state & I_FREEING)) {
> > > > +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > > >  		if (!(inode->i_state & I_DIRTY) &&
> > > >  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> > >   Is the whole if needed? I had an impression that everyone calling
> > > __sync_single_inode() should better take care it does not race with inode
> > > freeing... So WARN_ON would be more appropriate IMHO.
> > > 
> > > >  			/*
> > > > @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> > > >  		if (current_is_pdflush() && !writeback_acquire(bdi))
> > > >  			break;
> > > >  
> > > > -		BUG_ON(inode->i_state & I_FREEING);
> > > > +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> > > >  		__iget(inode);
> > > >  		pages_skipped = wbc->pages_skipped;
> > > >  		__writeback_single_inode(inode, wbc);
> > >   Looking at this code, it looks a bit suspicious. What prevents this s_io
> > > list scan to race with inode freeing? In particular generic_forget_inode()
> > 
> > Good catch.
> > 
> > > can drop inode_lock to write the inode and in the mean time
> > > generic_sync_sb_inodes() can come, get a reference to the inode and start
> > > it's writeback... Subsequent iput() would then call generic_forget_inode()
> > 
> > Another possibility:
> > 
> > generic_forget_inode
> >   inode->i_state |= I_WILL_FREE;
> >   spin_unlock(&inode_lock);
> >                                                 generic_sync_sb_inodes()
> >                                                   spin_lock(&inode_lock);
> >                                                   __iget(inode);
> >                                                   __writeback_single_inode
> >                                                     // see non zero i_count
> >                                                     WARN_ON(inode->i_state & I_WILL_FREE);
> > 
> > I'm wondering why didn't we saw reports on the last WARN_ON()?
> > Did we missed something?
>   I meant the above race in my description ;-). Anyway, the race can happen
> only if we are unmounting the filesystem (normally, we bail out on
> sb->s_flags & MS_ACTIVE check - yes, it's a bit hidden and it also took me
> a while to understand why we weren't seeing tons of warnings...).

Ah OK. Just checked that all three callers of generic_sync_sb_inodes():
- writeback_inodes(): umount prevented
- pohmelfs_kill_super(): just before umount
- ubifs calls: too complex to be obvious..
At least the first two cases are safe, so we didn't see the error report ;)

> > > on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
> > > inodes in this scan like we do for later in the function for another scan?

Yes we should do this at least for safety.  I_WILL_FREE means
generic_forget_inode() is going to writeback the inode on its own, so
generic_sync_sb_inodes() would better not to wade in.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-03 14:47                     ` Wu Fengguang
@ 2009-06-06  3:07                       ` Wu Fengguang
  2009-06-08  7:03                         ` Artem Bityutskiy
  2009-06-08 17:07                         ` Jan Kara
  0 siblings, 2 replies; 28+ messages in thread
From: Wu Fengguang @ 2009-06-06  3:07 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin, Jeff Layton

1) I_FREEING tests should be coupled with I_CLEAR

The two I_FREEING tests are racy because clear_inode() can set i_state to
I_CLEAR between the clear of I_SYNC and the test of I_FREEING.

2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
   races with generic_forget_inode()

generic_forget_inode() sets I_WILL_FREE call writeback on its own, so
generic_sync_sb_inodes() shall not try to step in and create possible races:

  generic_forget_inode
    inode->i_state |= I_WILL_FREE;
    spin_unlock(&inode_lock);
                                       generic_sync_sb_inodes()
                                         spin_lock(&inode_lock);
                                         __iget(inode);
                                         __writeback_single_inode
                                           // see non zero i_count
 may WARN here ==>                         WARN_ON(inode->i_state & I_WILL_FREE);
                                         spin_unlock(&inode_lock);
 may call generic_forget_inode again ==> iput(inode);

The above race and warning didn't turn up because writeback_inodes() holds
the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
early. But we are not sure the UBIFS calls and future callers will guarantee
that. So skip I_WILL_FREE inodes for the sake of safety.

CC: Jan Kara <jack@suse.cz>
CC: Eric Sandeen <sandeen@sandeen.net>
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
 	spin_lock(&inode_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state &= ~I_SYNC;
-	if (!(inode->i_state & I_FREEING)) {
+	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
 		if (!(inode->i_state & I_DIRTY) &&
 		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
@@ -487,7 +487,7 @@ void generic_sync_sb_inodes(struct super
 			break;
 		}
 
-		if (inode->i_state & I_NEW) {
+		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
 			requeue_io(inode);
 			continue;
 		}
@@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
 		if (current_is_pdflush() && !writeback_acquire(bdi))
 			break;
 
-		BUG_ON(inode->i_state & I_FREEING);
+		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
 		__iget(inode);
 		pages_skipped = wbc->pages_skipped;
 		__writeback_single_inode(inode, wbc);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-06  3:07                       ` [PATCH] writeback: skip new or to-be-freed inodes Wu Fengguang
@ 2009-06-08  7:03                         ` Artem Bityutskiy
  2009-06-08  9:29                           ` Wu Fengguang
  2009-06-08 17:07                         ` Jan Kara
  1 sibling, 1 reply; 28+ messages in thread
From: Artem Bityutskiy @ 2009-06-08  7:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin, Jeff Layton

Wu Fengguang wrote:
> The above race and warning didn't turn up because writeback_inodes() holds
> the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
> early. But we are not sure the UBIFS calls and future callers will guarantee
> that. So skip I_WILL_FREE inodes for the sake of safety.

The inode states are a bit vague for me, but vs. UBIFS - feel
free to ask questions.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-08  7:03                         ` Artem Bityutskiy
@ 2009-06-08  9:29                           ` Wu Fengguang
  2009-06-08 10:45                             ` Christoph Hellwig
  2009-06-09  7:03                             ` Artem Bityutskiy
  0 siblings, 2 replies; 28+ messages in thread
From: Wu Fengguang @ 2009-06-08  9:29 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Jan Kara, Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin, Jeff Layton

Hi Artem,

On Mon, Jun 08, 2009 at 03:03:10PM +0800, Artem Bityutskiy wrote:
> Wu Fengguang wrote:
> > The above race and warning didn't turn up because writeback_inodes() holds
> > the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
> > early. But we are not sure the UBIFS calls and future callers will guarantee
> > that. So skip I_WILL_FREE inodes for the sake of safety.
>
> The inode states are a bit vague for me, but vs. UBIFS - feel
> free to ask questions.

Thank you. Basically I'm not sure if UBIFS guarantees it won't be
unmounted (hence the MS_ACTIVE bit is on) when calling
generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().

Thanks,
Fengguang

PS: our previous discussions

        > > Another possibility:
        > >
        > > generic_forget_inode
        > >   inode->i_state |= I_WILL_FREE;
        > >   spin_unlock(&inode_lock);
        > >                                                 generic_sync_sb_inodes()
        > >                                                   spin_lock(&inode_lock);
        > >                                                   __iget(inode);
        > >                                                   __writeback_single_inode
        > >                                                     // see non zero i_count
        > >                                                     WARN_ON(inode->i_state & I_WILL_FREE);
        > >
        > > I'm wondering why didn't we saw reports on the last WARN_ON()?
        > > Did we missed something?
        >   I meant the above race in my description ;-). Anyway, the race can happen
        > only if we are unmounting the filesystem (normally, we bail out on
        > sb->s_flags & MS_ACTIVE check - yes, it's a bit hidden and it also took me
        > a while to understand why we weren't seeing tons of warnings...).

        Ah OK. Just checked that all three callers of generic_sync_sb_inodes():
        - writeback_inodes(): umount prevented
        - pohmelfs_kill_super(): just before umount
        - ubifs calls: too complex to be obvious..
        At least the first two cases are safe, so we didn't see the error report ;)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-08  9:29                           ` Wu Fengguang
@ 2009-06-08 10:45                             ` Christoph Hellwig
  2009-06-09  7:24                               ` Artem Bityutskiy
  2009-06-09  7:03                             ` Artem Bityutskiy
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2009-06-08 10:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Artem Bityutskiy, Jan Kara, Eric Sandeen, Andrew Morton, LKML,
	Masayoshi MIZUMA, linux-fsdevel, viro, Nick Piggin, Jeff Layton

On Mon, Jun 08, 2009 at 05:29:30PM +0800, Wu Fengguang wrote:
> Thank you. Basically I'm not sure if UBIFS guarantees it won't be
> unmounted (hence the MS_ACTIVE bit is on) when calling
> generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().

Btw, the call in ubifs_sync_fs should be superflous in the
vfs-2.6#for-next tree.  We now do make sure that all inodes are flushed
before calling ->sync_fs with the wait parameter.

shrink_liability is a more interesting case, I don't understand enough
of ubifs to comment on it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-06  3:07                       ` [PATCH] writeback: skip new or to-be-freed inodes Wu Fengguang
  2009-06-08  7:03                         ` Artem Bityutskiy
@ 2009-06-08 17:07                         ` Jan Kara
  1 sibling, 0 replies; 28+ messages in thread
From: Jan Kara @ 2009-06-08 17:07 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin, Jeff Layton

On Sat 06-06-09 11:07:25, Wu Fengguang wrote:
> 1) I_FREEING tests should be coupled with I_CLEAR
> 
> The two I_FREEING tests are racy because clear_inode() can set i_state to
> I_CLEAR between the clear of I_SYNC and the test of I_FREEING.
> 
> 2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
>    races with generic_forget_inode()
> 
> generic_forget_inode() sets I_WILL_FREE call writeback on its own, so
> generic_sync_sb_inodes() shall not try to step in and create possible races:
> 
>   generic_forget_inode
>     inode->i_state |= I_WILL_FREE;
>     spin_unlock(&inode_lock);
>                                        generic_sync_sb_inodes()
>                                          spin_lock(&inode_lock);
>                                          __iget(inode);
>                                          __writeback_single_inode
>                                            // see non zero i_count
>  may WARN here ==>                         WARN_ON(inode->i_state & I_WILL_FREE);
>                                          spin_unlock(&inode_lock);
>  may call generic_forget_inode again ==> iput(inode);
> 
> The above race and warning didn't turn up because writeback_inodes() holds
> the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
> early. But we are not sure the UBIFS calls and future callers will guarantee
> that. So skip I_WILL_FREE inodes for the sake of safety.
  OK, the patch looks fine. Acked-by: Jan Kara <jack@suse.cz>

									Honza 
> CC: Jan Kara <jack@suse.cz>
> CC: Eric Sandeen <sandeen@sandeen.net>
> Acked-by: Jeff Layton <jlayton@redhat.com>
> CC: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
>  	spin_lock(&inode_lock);
>  	WARN_ON(inode->i_state & I_NEW);
>  	inode->i_state &= ~I_SYNC;
> -	if (!(inode->i_state & I_FREEING)) {
> +	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
>  		if (!(inode->i_state & I_DIRTY) &&
>  		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
> @@ -487,7 +487,7 @@ void generic_sync_sb_inodes(struct super
>  			break;
>  		}
>  
> -		if (inode->i_state & I_NEW) {
> +		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
>  			requeue_io(inode);
>  			continue;
>  		}
> @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
>  		if (current_is_pdflush() && !writeback_acquire(bdi))
>  			break;
>  
> -		BUG_ON(inode->i_state & I_FREEING);
> +		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
>  		__iget(inode);
>  		pages_skipped = wbc->pages_skipped;
>  		__writeback_single_inode(inode, wbc);
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-08  9:29                           ` Wu Fengguang
  2009-06-08 10:45                             ` Christoph Hellwig
@ 2009-06-09  7:03                             ` Artem Bityutskiy
  1 sibling, 0 replies; 28+ messages in thread
From: Artem Bityutskiy @ 2009-06-09  7:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Eric Sandeen, Andrew Morton, LKML, Masayoshi MIZUMA,
	linux-fsdevel, viro, Nick Piggin, Jeff Layton

Wu Fengguang wrote:
> Hi Artem,
> 
> On Mon, Jun 08, 2009 at 03:03:10PM +0800, Artem Bityutskiy wrote:
>> Wu Fengguang wrote:
>>> The above race and warning didn't turn up because writeback_inodes() holds
>>> the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
>>> early. But we are not sure the UBIFS calls and future callers will guarantee
>>> that. So skip I_WILL_FREE inodes for the sake of safety.
>> The inode states are a bit vague for me, but vs. UBIFS - feel
>> free to ask questions.
> 
> Thank you. Basically I'm not sure if UBIFS guarantees it won't be
> unmounted (hence the MS_ACTIVE bit is on) when calling
> generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().

To be frank my VFS knowledge is not good enough to give you a
good answer. MS_ACTIVE seems to be set by file-systems when
they are mounted, and cleaned by VFS when unmounting.
I guess MS_ACTIVE is used by FS-es to check whether unmounting
is in progress or not. Anyway, UBIFS does not use it.

The generic_sync_sb_inodes() is called only from within
VFS operations, e.g., from ->create, ->rename, ->mknod,
->write_begin, ->setattr, etc. I mean, it is called from
UBIFS implementations of the above calls. UBIFS never calls
generic_sync_sb_inodes() function by itself, e.g., from
the UBIFS background thread.

Also, all calls to generic_sync_sb_inodes() are from within
VFS operations which need writing, e.g., VFS called
'mnt_want_write()' for all of them.

I think VFS must protect the FS from being unmunted while
it is in the middle of an operation, right? I'm just not
sure how this mechanism works. This would mean that yes,
we cannot be unmounted while we are in
generic_sync_sb_inodes() called by UBIFS.

Did this help :-) ?

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] writeback: skip new or to-be-freed inodes
  2009-06-08 10:45                             ` Christoph Hellwig
@ 2009-06-09  7:24                               ` Artem Bityutskiy
  0 siblings, 0 replies; 28+ messages in thread
From: Artem Bityutskiy @ 2009-06-09  7:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu Fengguang, Jan Kara, Eric Sandeen, Andrew Morton, LKML,
	Masayoshi MIZUMA, linux-fsdevel, viro, Nick Piggin, Jeff Layton,
	Hunter Adrian (Nokia-D/Helsinki)

[CCed Adiran Hunter]

Christoph Hellwig wrote:
> On Mon, Jun 08, 2009 at 05:29:30PM +0800, Wu Fengguang wrote:
>> Thank you. Basically I'm not sure if UBIFS guarantees it won't be
>> unmounted (hence the MS_ACTIVE bit is on) when calling
>> generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().
> 
> Btw, the call in ubifs_sync_fs should be superflous in the
> vfs-2.6#for-next tree.  We now do make sure that all inodes are flushed
> before calling ->sync_fs with the wait parameter.

OK, thanks for letting know. Once this is merged upstream,
I'll amend UBIFS.

> shrink_liability is a more interesting case, I don't understand enough
> of ubifs to comment on it.

Well... I'm not sure if I can tell why we need this in few
words. But I'll try.

UBIFS supports on-the-flight compression. This, and other factors
lead to a situation when UBIFS does not know how much the dirty
data in the page/inode caches will take on the flash. Indeed,
how do we know how well the data will be compressed?

UBIFS has so called "budgeting" subsystem. This subsystem is
responsible for accounting flash space. If user writes a new
data page, which goes to the page cache and sits there, the
budgeting sub-system decrements the free space counters by
4KiB. And so on.

At some point the free space counters in the budgeting subsystem
reache zero, which means we do not have more space. However,
in 99% of the cases this is not true, because the budgeting
subsystem's calculations are _very_ pessimistic, they always
assume the worst case scenario like the data are uncompressible.

So consider a situation when user writes a new data page. First
of all, the ->write_begin function will call a budgeting
sub-system function in order to reserve flash space for this
new data page.

The budgeting subsystem will see that space counters are zero.
And what it will do it will call the 'shrink_liability()'
function, which, among other things, may call the
'generic_sync_sb_inodes()' function, which will force write-back,
and this will give us some space. Indeed, when we actually
write the data back, we'll see how much flash space they
really take. And in 99% of cases they will take less than
we budgeted for, usually much less.

This is the rough idea. In practice things are more complex,
and there are factors like inability to know how much of dirty
space may be reclaimed, what will be the index size after
commit, etc. This all makes the budgeting subsystem complex
and difficult to understand. Moreover, we still consider it
as a work in progress, because we use too rough calculations,
and there are too many heuristics.

Here you may read some more information about UBIFS flash
accounting issues:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_spaceacc

You may also find a lot of info here:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_documentation

Especially in this document:
http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf
but it is not easy reading. You may search for "budget"
in the doc.

HTH.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2009-06-09  7:25 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-18  8:13 [PATCH][BUG] Lack of mutex_lock in drop_pagecache_sb() Masasyoshi MIZUMA
2009-03-23 10:38 ` Wu Fengguang
2009-03-24  7:06   ` Masayoshi MIZUMA
2009-03-24  7:44     ` Wu Fengguang
2009-03-24 12:05       ` Jan Kara
2009-03-24 12:11         ` Wu Fengguang
2009-03-24 12:40         ` [PATCH] skip I_CLEAR state inodes Wu Fengguang
2009-03-30  7:18           ` [PATCH][RESEND for 2.6.29-rc8-mm1] " Wu Fengguang
2009-03-31 23:43             ` Andrew Morton
2009-04-01  0:53               ` Wu Fengguang
2009-06-01 21:38           ` [PATCH] " Eric Sandeen
2009-06-02  8:55             ` Wu Fengguang
2009-06-02 10:27               ` Jeff Layton
2009-06-02 11:37               ` Jan Kara
2009-06-02 21:48                 ` Eric Sandeen
2009-06-03 10:45                   ` Jeff Layton
2009-06-03 13:32                 ` Wu Fengguang
2009-06-03 14:00                   ` Jan Kara
2009-06-03 14:10                 ` Wu Fengguang
2009-06-03 14:16                   ` Jan Kara
2009-06-03 14:47                     ` Wu Fengguang
2009-06-06  3:07                       ` [PATCH] writeback: skip new or to-be-freed inodes Wu Fengguang
2009-06-08  7:03                         ` Artem Bityutskiy
2009-06-08  9:29                           ` Wu Fengguang
2009-06-08 10:45                             ` Christoph Hellwig
2009-06-09  7:24                               ` Artem Bityutskiy
2009-06-09  7:03                             ` Artem Bityutskiy
2009-06-08 17:07                         ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).