* [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
@ 2011-02-07 11:53 Masayoshi MIZUMA
  2011-02-15 16:06 ` Jan Kara
  2011-12-09  1:56 ` Masayoshi MIZUMA
  0 siblings, 2 replies; 121+ messages in thread
From: Masayoshi MIZUMA @ 2011-02-07 11:53 UTC (permalink / raw)
  To: Andreas Dilger, Theodore Ts'o, linux-ext4; +Cc: linux-fsdevel

Hi,

When I tested the freeze feature on an ext4 filesystem using the fsfreeze command
on 2.6.38-rc3, I got the following messages:

---------------------------------------------------------------------
Feb  7 15:05:09 RX300S6 kernel: INFO: task fsfreeze:2104 blocked for more than 120 seconds.
Feb  7 15:05:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  7 15:05:09 RX300S6 kernel: fsfreeze        D ffff880076d5f040     0  2104   2018 0x00000000
Feb  7 15:05:09 RX300S6 kernel: ffff88005a9f3d98 0000000000000086 ffff88005a9f3d38 ffffffff00000000
Feb  7 15:05:09 RX300S6 kernel: 0000000000014d40 ffff880076d5eab0 ffff880076d5f040 ffff88005a9f3fd8
Feb  7 15:05:09 RX300S6 kernel: ffff880076d5f048 0000000000014d40 ffff88005a9f2010 0000000000014d40
Feb  7 15:05:09 RX300S6 kernel: Call Trace:
Feb  7 15:05:09 RX300S6 kernel: [<ffffffff814aa5f5>] rwsem_down_failed_common+0xb5/0x140
Feb  7 15:05:09 RX300S6 kernel: [<ffffffff814aa693>] rwsem_down_write_failed+0x13/0x20
Feb  7 15:05:09 RX300S6 kernel: [<ffffffff8122f1a3>] call_rwsem_down_write_failed+0x13/0x20
Feb  7 15:05:09 RX300S6 kernel: [<ffffffff814a9c12>] ? down_write+0x32/0x40
Feb  7 15:05:09 RX300S6 kernel: [<ffffffff81155b48>] thaw_super+0x28/0xd0
Feb  7 15:05:09 RX300S6 kernel: [<ffffffff81164338>] do_vfs_ioctl+0x368/0x560
Feb  7 15:05:09 RX300S6 kernel: [<ffffffff81157c73>] ? sys_newfstat+0x33/0x40
Feb  7 15:05:09 RX300S6 kernel: [<ffffffff811645d1>] sys_ioctl+0xa1/0xb0
Feb  7 15:05:09 RX300S6 kernel: [<ffffffff8100bf82>] system_call_fastpath+0x16/0x1b
...
Feb  7 15:07:09 RX300S6 kernel: INFO: task flush-8:0:1409 blocked for more than 120 seconds.
Feb  7 15:07:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  7 15:07:09 RX300S6 kernel: flush-8:0       D ffff880037777a30     0  1409      2 0x00000000
Feb  7 15:07:09 RX300S6 kernel: ffff880037c95a80 0000000000000046 ffff88007c8037a0 0000000000000000
Feb  7 15:07:09 RX300S6 kernel: 0000000000014d40 ffff8800377774a0 ffff880037777a30 ffff880037c95fd8
Feb  7 15:07:09 RX300S6 kernel: ffff880037777a38 0000000000014d40 ffff880037c94010 0000000000014d40
Feb  7 15:07:09 RX300S6 kernel: Call Trace:
Feb  7 15:07:09 RX300S6 kernel: [<ffffffffa00abb85>] ext4_journal_start_sb+0x75/0x130 [ext4]
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81082fc0>] ? autoremove_wake_function+0x0/0x40
Feb  7 15:07:09 RX300S6 kernel: [<ffffffffa0097f0a>] ext4_da_writepages+0x27a/0x640 [ext4]
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81102c91>] do_writepages+0x21/0x40
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff811776b8>] writeback_single_inode+0x98/0x240
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81177cfe>] writeback_sb_inodes+0xce/0x170
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178709>] writeback_inodes_wb+0x99/0x160
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178a8b>] wb_writeback+0x2bb/0x430
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178e2c>] wb_do_writeback+0x22c/0x280
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178f32>] bdi_writeback_thread+0xb2/0x260
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81082936>] kthread+0x96/0xa0
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc4>] kernel_thread_helper+0x4/0x10
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff810828a0>] ? kthread+0x0/0xa0
Feb  7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc0>] ? kernel_thread_helper+0x0/0x10
---------------------------------------------------------------------

I think the following deadlock problem happened:

              [flush-8:0:1409]              |          [fsfreeze:2104]
--------------------------------------------+--------------------------------
writeback_inodes_wb                         |
 pin_sb_for_writeback                       |
   down_read_trylock(&sb->s_umount)         |
 writeback_sb_inodes                        |thaw_super
   writeback_single_inode                   | down_write(&sb->s_umount)
     do_writepages                          |  # stop until flush-8:0 releases
      ext4_da_writepages                    |  # read lock of sb->s_umount...
       ext4_journal_start_sb                |
        vfs_check_frozen                    |
          wait_event((sb)->s_wait_unfrozen, |
           ((sb)->s_frozen < (level)))      |
            # stop until being waked up by  |
            # fsfreeze...                   |
--------------------------------------------+--------------------------------

Could anyone check this problem?

Thanks,
Masayoshi Mizuma
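
[Editor's sketch, not part of the original message] The interleaving in the
diagram above can be checked with a tiny wait-for model. This is a userspace
illustration only: the resource names, struct task, holder_of() and
is_deadlocked() are hypothetical and do not exist in the kernel. "Holding"
RES_UNFROZEN stands for being the only task able to wake the waiters on
s_wait_unfrozen.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resources for this model only. */
enum resource { RES_NONE, RES_S_UMOUNT, RES_UNFROZEN };

struct task {
    const char *name;
    enum resource holds;     /* resource this task owns or controls */
    enum resource waits_for; /* resource this task is blocked on */
};

/* Return the index of the task holding res, or -1 if nobody holds it. */
static int holder_of(const struct task *t, size_t n, enum resource res)
{
    for (size_t i = 0; i < n; i++)
        if (res != RES_NONE && t[i].holds == res)
            return (int)i;
    return -1;
}

/* Follow wait-for edges from start; a path back to start is a deadlock. */
static bool is_deadlocked(const struct task *t, size_t n, size_t start)
{
    size_t cur = start;

    for (size_t steps = 0; steps < n; steps++) {
        int next = holder_of(t, n, t[cur].waits_for);

        if (next < 0)
            return false;   /* the awaited resource is free */
        if ((size_t)next == start)
            return true;    /* cycle closed */
        cur = (size_t)next;
    }
    return false;
}
```

With flush-8:0 holding RES_S_UMOUNT and waiting for RES_UNFROZEN, and fsfreeze
controlling RES_UNFROZEN and waiting for RES_S_UMOUNT, the model reports a
cycle. If the flusher does not hold s_umount, no cycle is found.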




* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-07 11:53 [BUG] ext4: cannot unfreeze a filesystem due to a deadlock Masayoshi MIZUMA
@ 2011-02-15 16:06 ` Jan Kara
  2011-02-15 17:03   ` Ted Ts'o
  2011-12-09  1:56 ` Masayoshi MIZUMA
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-02-15 16:06 UTC (permalink / raw)
  To: Masayoshi MIZUMA
  Cc: Andreas Dilger, Theodore Ts'o, linux-ext4, linux-fsdevel

On Mon 07-02-11 20:53:25, Masayoshi MIZUMA wrote:
> Hi,
> 
> When I checked the freeze feature for ext4 filesystem using fsfreeze command
> at 2.6.38-rc3, I got the following messeges:
> 
> ---------------------------------------------------------------------
> Feb  7 15:05:09 RX300S6 kernel: INFO: task fsfreeze:2104 blocked for more than 120 seconds.
> Feb  7 15:05:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb  7 15:05:09 RX300S6 kernel: fsfreeze        D ffff880076d5f040     0  2104   2018 0x00000000
> Feb  7 15:05:09 RX300S6 kernel: ffff88005a9f3d98 0000000000000086 ffff88005a9f3d38 ffffffff00000000
> Feb  7 15:05:09 RX300S6 kernel: 0000000000014d40 ffff880076d5eab0 ffff880076d5f040 ffff88005a9f3fd8
> Feb  7 15:05:09 RX300S6 kernel: ffff880076d5f048 0000000000014d40 ffff88005a9f2010 0000000000014d40
> Feb  7 15:05:09 RX300S6 kernel: Call Trace:
> Feb  7 15:05:09 RX300S6 kernel: [<ffffffff814aa5f5>] rwsem_down_failed_common+0xb5/0x140
> Feb  7 15:05:09 RX300S6 kernel: [<ffffffff814aa693>] rwsem_down_write_failed+0x13/0x20
> Feb  7 15:05:09 RX300S6 kernel: [<ffffffff8122f1a3>] call_rwsem_down_write_failed+0x13/0x20
> Feb  7 15:05:09 RX300S6 kernel: [<ffffffff814a9c12>] ? down_write+0x32/0x40
> Feb  7 15:05:09 RX300S6 kernel: [<ffffffff81155b48>] thaw_super+0x28/0xd0
> Feb  7 15:05:09 RX300S6 kernel: [<ffffffff81164338>] do_vfs_ioctl+0x368/0x560
> Feb  7 15:05:09 RX300S6 kernel: [<ffffffff81157c73>] ? sys_newfstat+0x33/0x40
> Feb  7 15:05:09 RX300S6 kernel: [<ffffffff811645d1>] sys_ioctl+0xa1/0xb0
> Feb  7 15:05:09 RX300S6 kernel: [<ffffffff8100bf82>] system_call_fastpath+0x16/0x1b
> ...
> Feb  7 15:07:09 RX300S6 kernel: INFO: task flush-8:0:1409 blocked for more than 120 seconds.
> Feb  7 15:07:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb  7 15:07:09 RX300S6 kernel: flush-8:0       D ffff880037777a30     0  1409      2 0x00000000
> Feb  7 15:07:09 RX300S6 kernel: ffff880037c95a80 0000000000000046 ffff88007c8037a0 0000000000000000
> Feb  7 15:07:09 RX300S6 kernel: 0000000000014d40 ffff8800377774a0 ffff880037777a30 ffff880037c95fd8
> Feb  7 15:07:09 RX300S6 kernel: ffff880037777a38 0000000000014d40 ffff880037c94010 0000000000014d40
> Feb  7 15:07:09 RX300S6 kernel: Call Trace:
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffffa00abb85>] ext4_journal_start_sb+0x75/0x130 [ext4]
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81082fc0>] ? autoremove_wake_function+0x0/0x40
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffffa0097f0a>] ext4_da_writepages+0x27a/0x640 [ext4]
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81102c91>] do_writepages+0x21/0x40
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff811776b8>] writeback_single_inode+0x98/0x240
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81177cfe>] writeback_sb_inodes+0xce/0x170
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178709>] writeback_inodes_wb+0x99/0x160
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178a8b>] wb_writeback+0x2bb/0x430
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178e2c>] wb_do_writeback+0x22c/0x280
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178f32>] bdi_writeback_thread+0xb2/0x260
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81082936>] kthread+0x96/0xa0
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc4>] kernel_thread_helper+0x4/0x10
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff810828a0>] ? kthread+0x0/0xa0
> Feb  7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc0>] ? kernel_thread_helper+0x0/0x10
> ---------------------------------------------------------------------
> 
> I think the following deadlock problem happened:
> 
>               [flush-8:0:1409]              |          [fsfreeze:2104]
> --------------------------------------------+--------------------------------
> writeback_inodes_wb                         |
>  pin_sb_for_writeback                       |
>    down_read_trylock(&sb->s_umount)         |
>  writeback_sb_inodes                        |thaw_super
>    writeback_single_inode                   | down_write(&sb->s_umount)
>      do_writepages                          |  # stop until flush-8:0 releases
>       ext4_da_writepages                    |  # read lock of sb->s_umount...
>        ext4_journal_start_sb                |
>         vfs_check_frozen                    |
>           wait_event((sb)->s_wait_unfrozen, |
>            ((sb)->s_frozen < (level)))      |
>             # stop until being waked up by  |
>             # fsfreeze...                   |
> --------------------------------------------+--------------------------------
> 
> Could anyone check this problem?
Thanks for the detailed analysis. Indeed this is a bug. Whenever we do IO
under the s_umount semaphore, we are prone to a deadlock like the one you
describe above.

Looking at the code, s_frozen acts as a lock but its lock ranking is
unclear. Logically, the only sane ranking seems to be to rank above all
other VFS locks because we return with s_frozen held to userspace. But then
the need to wait for s_frozen from inside the filesystem violates this
ranking and causes the above deadlock.

Gosh, this is so broken. The whole thing is made even worse because
filesystems can take different locks in their unfreeze_fs callbacks and we
can possibly deadlock on these locks the same way as we do on s_umount
semaphore.

I have to think about how this could possibly be fixed...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-15 16:06 ` Jan Kara
@ 2011-02-15 17:03   ` Ted Ts'o
  2011-02-15 17:29     ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Ted Ts'o @ 2011-02-15 17:03 UTC (permalink / raw)
  To: Jan Kara; +Cc: Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> under s_umount semaphore, we are prone to deadlock like the one you
> describe above.

One of the fundamental problems here is that the freeze and thaw
routines are using down_write(&sb->s_umount) for two purposes.  The
first is to prevent the resume/thaw from racing with a umount (which
it could do just as well by taking a read lock), but the second is to
prevent the resume/thaw code from racing with itself.  That's the core
fundamental problem here.

So I think we can solve this by introducing a new mutex, s_freeze, and
having the resume/thaw code first take the s_freeze mutex and then
take a read lock on s_umount.

						- Ted


* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-15 17:03   ` Ted Ts'o
@ 2011-02-15 17:29     ` Jan Kara
  2011-02-15 18:04       ` Ted Ts'o
  2011-02-15 23:17       ` Toshiyuki Okajima
  0 siblings, 2 replies; 121+ messages in thread
From: Jan Kara @ 2011-02-15 17:29 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Jan Kara, Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > under s_umount semaphore, we are prone to deadlock like the one you
> > describe above.
> 
> One of the fundamental problems here is that the freeze and thaw
> routines are using down_write(&sb->s_umount) for two purposes.  The
> first is to prevent the resume/thaw from racing with a umount (which
> it could do just as well by taking a read lock), but the second is to
> prevent the resume/thaw code from racing with itself.  That's the core
> fundamental problem here.
> 
> So I think we can solve this by introduce a new mutex, s_freeze, and
> having the the resume/thaw first take the s_freeze mutex and then
> second take a read lock on the s_umount.
  Sadly this does not quite work because even down_read(&sb->s_umount)
in thaw_super() can block if there is another process that tries to acquire
s_umount for writing - a situation like:
  TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
down_read(&sb->s_umount)
  block on s_frozen
				down_write(&sb->s_umount)
				  -blocked
								down_read(&sb->s_umount)
								  -blocked
behind the write access...

The only working solution I see is to check for frozen filesystem before
taking s_umount semaphore which seems rather ugly (but might be bearable if
we did so in some well described wrapper).

And in particular ext4 has another deadlock of this kind because it does
IO from ext4_remount() e.g. when doing online resize (I know it's a bit
artificial but still ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
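
[Editor's sketch, not part of the original message] The three-task scenario
above depends on one property stated in the message: a new reader queues behind
a writer that is already waiting. A minimal userspace model of that property
follows. struct rwsem_model and both helper functions are hypothetical names
for illustration; this is not the kernel's rwsem implementation.

```c
#include <stdbool.h>

/* Simplified model of a fair rw-semaphore. */
struct rwsem_model {
    int  active_readers;   /* tasks currently holding the read side */
    bool writer_active;    /* a task currently holds the write side */
    int  waiting_writers;  /* writers queued behind the current holders */
};

/* A new reader blocks behind an active writer or any queued writer. */
static bool model_down_read_would_block(const struct rwsem_model *s)
{
    return s->writer_active || s->waiting_writers > 0;
}

/* A writer blocks while any reader or another writer holds the lock. */
static bool model_down_write_would_block(const struct rwsem_model *s)
{
    return s->writer_active || s->active_readers > 0;
}
```

With only TASK 1 holding the read side, a second down_read() would succeed, so
the read lock in thaw_super() helps in the two-task case. Once TASK 2 queues
for write, TASK 3's down_read() blocks as well, and TASK 1 never gets woken.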


* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-15 17:29     ` Jan Kara
@ 2011-02-15 18:04       ` Ted Ts'o
  2011-02-15 19:11         ` Jan Kara
  2011-02-15 23:17       ` Toshiyuki Okajima
  1 sibling, 1 reply; 121+ messages in thread
From: Ted Ts'o @ 2011-02-15 18:04 UTC (permalink / raw)
  To: Jan Kara; +Cc: Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On Tue, Feb 15, 2011 at 06:29:54PM +0100, Jan Kara wrote:
>   Sadly this does not quite work because even down_read(&sb->s_umount)
> in thaw_super() can block if there is another process that tries to acquire
> s_umount for writing - a situation like:
>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> down_read(&sb->s_umount)
>   block on s_frozen
> 				down_write(&sb->s_umount)
> 				  -blocked
> 								down_read(&sb->s_umount)
> 								  -blocked
> behind the write access...

OK, sorry for being dense, but why does this cause a deadlock?  What
are you imagining TASK 3 doing that would impede the flusher from
eventually resuming?  Or how would TASK 3 prevent userspace from
completing whatever it needs to do (say, a device mapper ioctl)?

freeze_fs has always been inherently dangerous if the userspace does
not know what it's doing.  If it freezes the root file system, and
then while the file system is frozen, userspace attempts to modify
/etc/mtab, it's going to lose.  I've in the past argued for some kind
of safety timeout that prevents the system from wedging, but the
argument I've gotten back is (a) it's too complex, and (b) userspace
programmers aren't that stupid, and (c) it could cause the filesystem
to unfreeze when userspace wasn't expecting it.  Oh, and (d) if the
system wedges up due to userspace being stupid, it's acceptable.

Obviously, if the kernel does something to itself that causes a
deadlock, we need to fix it, but userspace doing something stupid has
been explicitly ruled out of scope, at least in previous
discussions...

> And in particular ext4 has another deadlock of this kind because it does
> IO from ext4_remount() e.g. when doing online resize (I know it's a bit
> artifical but still ;).

OK, I'm being dense again.  How do remount and online resize relate
to each other?  And it's not I/O in general which is a problem, it's
writeback activity which causes a problem because it takes a read lock
on s_umount, right?

						- Ted


* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-15 18:04       ` Ted Ts'o
@ 2011-02-15 19:11         ` Jan Kara
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Kara @ 2011-02-15 19:11 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Jan Kara, Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On Tue 15-02-11 13:04:35, Ted Ts'o wrote:
> On Tue, Feb 15, 2011 at 06:29:54PM +0100, Jan Kara wrote:
> >   Sadly this does not quite work because even down_read(&sb->s_umount)
> > in thaw_super() can block if there is another process that tries to acquire
> > s_umount for writing - a situation like:
> >   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > down_read(&sb->s_umount)
> >   block on s_frozen
> > 				down_write(&sb->s_umount)
> > 				  -blocked
> > 								down_read(&sb->s_umount)
> > 								  -blocked
> > behind the write access...
> 
> OK, sorry for being dense, but why does this cause a deadlock?  What
> are you imaging TASK 3 doing that would impede the flusher from
> eventually resuming?  Or how would TASK 3 prevent userspace from
> completing whatever it needs to do (say, a device mapper ioctl)?
  I was arguing that using down_read(sb->s_umount) in thaw_super() instead
of down_write() does not solve anything. The deadlock as originally
reported can still happen, you just need another task (TASK 2 in the above
scheme) to block in down_write() before thaw_super() happens.

> freeze_fs has always been inherently dangerous if the userspace does
> not know what it's doing.  If it freezes the root file system, and
> then while the file system is frozen, userspace attempts to modify
> /etc/mtab, it's going to lose.  I've in the past argued for some kind
> of safety timeout that prevents the system from wedging, but the
> argument I've gotten back is (a) it's too complex, and (b) userspace
> programmers aren't that stupid, and (c) it could cause the filesystem
> to unfreeze when userspace wasn't expecting it.  Oh, and (d) if the
> system wedges up due to userspace being stupid, it's acceptable.
> 
> Obviously, if the kernel does something to itself that causes a
> deadlock, we need to fix it, but userspace doing something stupid has
> been explicitly ruled out of scope, at least in previous
> discussions...
> 
> > And in particular ext4 has another deadlock of this kind because it does
> > IO from ext4_remount() e.g. when doing online resize (I know it's a bit
> > artifical but still ;).
> 
> OK, I'm being dense again.  How does remount and online resize relate
> with each other?  and it's not I/O in general which is a problem, it's
> writeback activity which causes a problem because it takes a read lock
> on s_umount, right?
  The problem is starting a transaction while holding the s_umount semaphore,
or actually any lock that thaw_super() (including per-filesystem
->unfreeze_fs() callback) needs. For ext4 this seems to be sb->s_lock.
I was actually wrong with the ext4 online resizing using resize option
causing possible deadlocks because do_remount_sb() refuses to do anything
with the superblock while it is frozen... But still if we ever happen to
start a transaction in ext4 while sb->s_lock is held, the deadlock with
freezing code can happen and that's just subtle and ugly IMHO.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
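
[Editor's sketch, not part of the original message] The rule stated above
(never start a transaction while holding a lock that thaw_super() or the
filesystem's unfreeze callback needs) can be written as a set intersection.
The lock identifiers and the function below are hypothetical and exist only in
this model.

```c
#include <stdbool.h>

/* Hypothetical lock identifiers, one bit per lock. */
enum model_lock {
    MODEL_S_UMOUNT   = 1 << 0,
    MODEL_S_LOCK     = 1 << 1,
    MODEL_OTHER_LOCK = 1 << 2,  /* any lock the thaw path does not take */
};

/*
 * Starting a transaction can wait for the thaw. If the caller holds any
 * lock the thaw path needs, the thaw can never run and both sides hang.
 */
static bool transaction_start_is_unsafe(unsigned held_at_start,
                                        unsigned needed_by_thaw)
{
    return (held_at_start & needed_by_thaw) != 0;
}
```

For example, if the thaw path is assumed to need MODEL_S_UMOUNT and
MODEL_S_LOCK, a caller holding either of them is flagged, while a caller
holding only MODEL_OTHER_LOCK is not.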


* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-15 17:29     ` Jan Kara
  2011-02-15 18:04       ` Ted Ts'o
@ 2011-02-15 23:17       ` Toshiyuki Okajima
  2011-02-16 14:56         ` Jan Kara
  1 sibling, 1 reply; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-02-15 23:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel

Hi.

On Tue, 15 Feb 2011 18:29:54 +0100
Jan Kara <jack@suse.cz> wrote:
> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > under s_umount semaphore, we are prone to deadlock like the one you
> > > describe above.
> > 
> > One of the fundamental problems here is that the freeze and thaw
> > routines are using down_write(&sb->s_umount) for two purposes.  The
> > first is to prevent the resume/thaw from racing with a umount (which
> > it could do just as well by taking a read lock), but the second is to
> > prevent the resume/thaw code from racing with itself.  That's the core
> > fundamental problem here.
> > 
> > So I think we can solve this by introduce a new mutex, s_freeze, and
> > having the the resume/thaw first take the s_freeze mutex and then
> > second take a read lock on the s_umount.
>   Sadly this does not quite work because even down_read(&sb->s_umount)
> in thaw_super() can block if there is another process that tries to acquire
> s_umount for writing - a situation like:
>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> down_read(&sb->s_umount)
>   block on s_frozen
> 				down_write(&sb->s_umount)
> 				  -blocked
> 								down_read(&sb->s_umount)
> 								  -blocked
> behind the write access...
> 
> The only working solution I see is to check for frozen filesystem before
> taking s_umount semaphore which seems rather ugly (but might be bearable if
> we did so in some well described wrapper).
I created the kind of patch you describe yesterday.

I got a reproducer from Mizuma-san yesterday and ran it on a kernel
without the fix. After an hour, I confirmed that the deadlock happened.

However, on a kernel with the fix applied, the deadlock still had not happened
after 12 hours.

The patch for linux-2.6.38-rc4 is as follows:
---
 fs/fs-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 59c6e49..1c9a05e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
        spin_unlock(&sb_lock);

        if (down_read_trylock(&sb->s_umount)) {
-               if (sb->s_root)
+               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
                        return true;
                up_read(&sb->s_umount);
        }
-- 

Best Regards,
Toshiyuki Okajima
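
[Editor's sketch, not part of the original message] The effect of the one-line
change can be tried out in a small userspace model. Everything prefixed with
model_ or MODEL_ is a hypothetical stand-in, not kernel code. The point shown
is that a frozen superblock is not pinned and the read side of s_umount is
released again, so the flusher never sleeps on the freeze while holding it.

```c
#include <stdbool.h>

/* Stand-ins for the superblock freeze levels. */
enum model_frozen {
    MODEL_SB_UNFROZEN     = 0,
    MODEL_SB_FREEZE_WRITE = 1,
    MODEL_SB_FREEZE_TRANS = 2,
};

struct model_sb {
    bool has_root;             /* stands in for sb->s_root != NULL */
    enum model_frozen frozen;  /* stands in for sb->s_frozen */
    int  umount_readers;       /* read holders of the modelled s_umount */
    bool umount_writer;        /* write holder of the modelled s_umount */
};

/* Take the read side only if no writer holds the lock. */
static bool model_down_read_trylock(struct model_sb *sb)
{
    if (sb->umount_writer)
        return false;
    sb->umount_readers++;
    return true;
}

static void model_up_read(struct model_sb *sb)
{
    sb->umount_readers--;
}

/* Mirrors the patched check: pin only a live, unfrozen superblock. */
static bool model_pin_sb_for_writeback(struct model_sb *sb)
{
    if (model_down_read_trylock(sb)) {
        if (sb->frozen == MODEL_SB_UNFROZEN && sb->has_root)
            return true;       /* caller keeps the read lock */
        model_up_read(sb);     /* frozen or dying: drop the lock */
    }
    return false;
}
```

The model covers only the flusher path through pin_sb_for_writeback(). Other
callers that take s_umount themselves are outside its scope.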


* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-15 23:17       ` Toshiyuki Okajima
@ 2011-02-16 14:56         ` Jan Kara
  2011-02-17  3:50           ` Toshiyuki Okajima
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-02-16 14:56 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel

On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> On Tue, 15 Feb 2011 18:29:54 +0100
> Jan Kara <jack@suse.cz> wrote:
> > On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > under s_umount semaphore, we are prone to deadlock like the one you
> > > > describe above.
> > > 
> > > One of the fundamental problems here is that the freeze and thaw
> > > routines are using down_write(&sb->s_umount) for two purposes.  The
> > > first is to prevent the resume/thaw from racing with a umount (which
> > > it could do just as well by taking a read lock), but the second is to
> > > prevent the resume/thaw code from racing with itself.  That's the core
> > > fundamental problem here.
> > > 
> > > So I think we can solve this by introduce a new mutex, s_freeze, and
> > > having the the resume/thaw first take the s_freeze mutex and then
> > > second take a read lock on the s_umount.
> >   Sadly this does not quite work because even down_read(&sb->s_umount)
> > in thaw_super() can block if there is another process that tries to acquire
> > s_umount for writing - a situation like:
> >   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > down_read(&sb->s_umount)
> >   block on s_frozen
> > 				down_write(&sb->s_umount)
> > 				  -blocked
> > 								down_read(&sb->s_umount)
> > 								  -blocked
> > behind the write access...
> > 
> > The only working solution I see is to check for frozen filesystem before
> > taking s_umount semaphore which seems rather ugly (but might be bearable if
> > we did so in some well described wrapper).
> I created the patch that you imagine yesterday.
>  
> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> without a fixed patch. After an hour, I confirmed that this deadlock happened.
> 
> However, on the kernel with a fixed patch, this deadlock doesn't still happen 
> after 12 hours passed.
> 
> The patch for linux-2.6.38-rc4 is as follows:
> ---
>  fs/fs-writeback.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 59c6e49..1c9a05e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>         spin_unlock(&sb_lock);
> 
>         if (down_read_trylock(&sb->s_umount)) {
> -               if (sb->s_root)
> +               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
>                         return true;
>                 up_read(&sb->s_umount);
  So this is something along the lines I had in mind, but it actually won't
work, for example, if sync(1) is run while the filesystem is frozen (that takes
s_umount semaphore in a different place). And generally, I'm not convinced
there are not other places that try to do IO while holding s_umount
semaphore...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-16 14:56         ` Jan Kara
@ 2011-02-17  3:50           ` Toshiyuki Okajima
  2011-02-17  5:13             ` Andreas Dilger
  2011-02-17 10:45             ` Jan Kara
  0 siblings, 2 replies; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-02-17  3:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel

(2011/02/16 23:56), Jan Kara wrote:
> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>> On Tue, 15 Feb 2011 18:29:54 +0100
>> Jan Kara <jack@suse.cz> wrote:
>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
>>>>> under s_umount semaphore, we are prone to deadlock like the one you
>>>>> describe above.
>>>>
>>>> One of the fundamental problems here is that the freeze and thaw
>>>> routines are using down_write(&sb->s_umount) for two purposes.  The
>>>> first is to prevent the resume/thaw from racing with a umount (which
>>>> it could do just as well by taking a read lock), but the second is to
>>>> prevent the resume/thaw code from racing with itself.  That's the core
>>>> fundamental problem here.
>>>>
>>>> So I think we can solve this by introduce a new mutex, s_freeze, and
>>>> having the the resume/thaw first take the s_freeze mutex and then
>>>> second take a read lock on the s_umount.
>>>    Sadly this does not quite work because even down_read(&sb->s_umount)
>>> in thaw_super() can block if there is another process that tries to acquire
>>> s_umount for writing - a situation like:
>>>    TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
>>> down_read(&sb->s_umount)
>>>    block on s_frozen
>>> 				down_write(&sb->s_umount)
>>> 				  -blocked
>>> 								down_read(&sb->s_umount)
>>> 								  -blocked
>>> behind the write access...
>>>
>>> The only working solution I see is to check for frozen filesystem before
>>> taking s_umount semaphore which seems rather ugly (but might be bearable if
>>> we did so in some well described wrapper).
>> I created the patch that you imagine yesterday.
>>
>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
>> without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>
>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>> after 12 hours passed.
>>
>> The patch for linux-2.6.38-rc4 is as follows:
>> ---
>>   fs/fs-writeback.c |    2 +-
>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index 59c6e49..1c9a05e 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>          spin_unlock(&sb_lock);
>>
>>          if (down_read_trylock(&sb->s_umount)) {
>> -               if (sb->s_root)
>> +               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
>>                          return true;
>>                  up_read(&sb->s_umount);

>    So this is something along the lines I thought but it actually won't work
> for example if sync(1) is run while the filesystem is frozen (that takes
> s_umount semaphore in a different place). And generally, I'm not convinced
> there are not other places that try to do IO while holding s_umount
> semaphore...
OK. I understand.

This code only fixes the case for the following path:
writeback_inodes_wb
-> ext4_da_writepages
    -> ext4_journal_start_sb
       -> vfs_check_frozen
But the code doesn't fix the other cases.

Must we modify each local filesystem in order to fix all cases?

Regards,
Toshiyuki Okajima



* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-17  3:50           ` Toshiyuki Okajima
@ 2011-02-17  5:13             ` Andreas Dilger
  2011-02-17 10:41               ` Jan Kara
  2011-02-17 10:45             ` Jan Kara
  1 sibling, 1 reply; 121+ messages in thread
From: Andreas Dilger @ 2011-02-17  5:13 UTC (permalink / raw)
  To: toshi.okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, linux-ext4, linux-fsdevel

On 2011-02-16, at 20:50, Toshiyuki Okajima wrote:
> (2011/02/16 23:56), Jan Kara wrote:
>> 
>>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>> 
>>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>>> after 12 hours passed.
>>> 
>>> The patch for linux-2.6.38-rc4 is as follows:
>>> ---
>>>  fs/fs-writeback.c |    2 +-
>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>> 
>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>> index 59c6e49..1c9a05e 100644
>>> --- a/fs/fs-writeback.c
>>> +++ b/fs/fs-writeback.c
>>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>>         spin_unlock(&sb_lock);
>>> 
>>>         if (down_read_trylock(&sb->s_umount)) {
>>> -               if (sb->s_root)
>>> +               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
>>>                         return true;
>>>                 up_read(&sb->s_umount);

This seems like a very low-risk fix.

>>   So this is something along the lines I thought but it actually won't work
>> for example if sync(1) is run while the filesystem is frozen (that takes
>> s_umount semaphore in a different place). And generally, I'm not convinced
>> there are not other places that try to do IO while holding s_umount
>> semaphore...
> 
> OK. I understand.
> 
> This code only fixes the case for the following path:
> writeback_inodes_wb
> -> ext4_da_writepages
>   -> ext4_journal_start_sb
>      -> vfs_check_frozen
> But, the code doesn't fix the other cases.
> 
> We must modify the local filesystem part in order to fix all cases...?

It seems worthwhile to implement the low-risk fix that covers the common case, and if/when someone hits the rare 3-process case and/or submits a patch for it then that one will be fixed also.

Cheers, Andreas







* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-17  5:13             ` Andreas Dilger
@ 2011-02-17 10:41               ` Jan Kara
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Kara @ 2011-02-17 10:41 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: toshi.okajima, Jan Kara, Ted Ts'o, Masayoshi MIZUMA,
	linux-ext4, linux-fsdevel

On Wed 16-02-11 22:13:53, Andreas Dilger wrote:
> On 2011-02-16, at 20:50, Toshiyuki Okajima wrote:
> > (2011/02/16 23:56), Jan Kara wrote:
> >> 
> >>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel without a fixed patch. After an hour, I confirmed that this deadlock happened.
> >>> 
> >>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
> >>> after 12 hours passed.
> >>> 
> >>> The patch for linux-2.6.38-rc4 is as follows:
> >>> ---
> >>>  fs/fs-writeback.c |    2 +-
> >>>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>> 
> >>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >>> index 59c6e49..1c9a05e 100644
> >>> --- a/fs/fs-writeback.c
> >>> +++ b/fs/fs-writeback.c
> >>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> >>>         spin_unlock(&sb_lock);
> >>> 
> >>>         if (down_read_trylock(&sb->s_umount)) {
> >>> -               if (sb->s_root)
> >>> +               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
> >>>                         return true;
> >>>                 up_read(&sb->s_umount);
> 
> This seems like a very low-risk fix.
> 
> >>   So this is something along the lines I thought but it actually won't work
> >> for example if sync(1) is run while the filesystem is frozen (that takes
> >> s_umount semaphore in a different place). And generally, I'm not convinced
> >> there are not other places that try to do IO while holding s_umount
> >> semaphore...
> > 
> > OK. I understand.
> > 
> > This code only fixes the case for the following path:
> > writeback_inodes_wb
> > -> ext4_da_writepages
> >   -> ext4_journal_start_sb
> >      -> vfs_check_frozen
> > But, the code doesn't fix the other cases.
> > 
> > We must modify the local filesystem part in order to fix all cases...?
> 
> It seems worthwhile to implement the low-risk fix that covers the common
> case, and if/when someone hits the rare 3-process case and/or submits a
> patch for it then that one will be fixed also.
  Yes, the fix is simple enough that I won't oppose it getting in as a
band aid and if we add this band aid to fs/sync.c:sync_one_sb(), it would
even be a reasonably reliable band aid. But that doesn't change the fact
that the locking is simply broken ;).

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-17  3:50           ` Toshiyuki Okajima
  2011-02-17  5:13             ` Andreas Dilger
@ 2011-02-17 10:45             ` Jan Kara
  2011-03-28  8:06               ` [RFC][PATCH] " Toshiyuki Okajima
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-02-17 10:45 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel

On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> (2011/02/16 23:56), Jan Kara wrote:
> >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>On Tue, 15 Feb 2011 18:29:54 +0100
> >>Jan Kara<jack@suse.cz>  wrote:
> >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> >>>>>describe above.
> >>>>
> >>>>One of the fundamental problems here is that the freeze and thaw
> >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> >>>>first is to prevent the resume/thaw from racing with a umount (which
> >>>>it could do just as well by taking a read lock), but the second is to
> >>>>prevent the resume/thaw code from racing with itself.  That's the core
> >>>>fundamental problem here.
> >>>>
> >>>>So I think we can solve this by introducing a new mutex, s_freeze, and
> >>>>having the resume/thaw first take the s_freeze mutex and then
> >>>>second take a read lock on the s_umount.
> >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> >>>in thaw_super() can block if there is another process that tries to acquire
> >>>s_umount for writing - a situation like:
> >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> >>>down_read(&sb->s_umount)
> >>>   block on s_frozen
> >>>				down_write(&sb->s_umount)
> >>>				  -blocked
> >>>								down_read(&sb->s_umount)
> >>>								  -blocked
> >>>behind the write access...
> >>>
> >>>The only working solution I see is to check for frozen filesystem before
> >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> >>>we did so in some well described wrapper).
> >>I created the patch that you imagine yesterday.
> >>
> >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> >>
> >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> >>after 12 hours passed.
> >>
> >>The patch for linux-2.6.38-rc4 is as follows:
> >>---
> >>  fs/fs-writeback.c |    2 +-
> >>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >>index 59c6e49..1c9a05e 100644
> >>--- a/fs/fs-writeback.c
> >>+++ b/fs/fs-writeback.c
> >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> >>         spin_unlock(&sb_lock);
> >>
> >>         if (down_read_trylock(&sb->s_umount)) {
> >>-               if (sb->s_root)
> >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> >>                         return true;
> >>                 up_read(&sb->s_umount);
> 
> >   So this is something along the lines I thought but it actually won't work
> >for example if sync(1) is run while the filesystem is frozen (that takes
> >s_umount semaphore in a different place). And generally, I'm not convinced
> >there are not other places that try to do IO while holding s_umount
> >semaphore...
> OK. I understand.
> 
> This code only fixes the case for the following path:
> writeback_inodes_wb
> -> ext4_da_writepages
>    -> ext4_journal_start_sb
>       -> vfs_check_frozen
> But, the code doesn't fix the other cases.
> 
> We must modify the local filesystem part in order to fix all cases...?
  Yes, possibly. But most importantly we should first find clear locking
rules for frozen filesystem that avoid deadlocks like the one above. And
the freezing / unfreezing code might become subtle for that reason, that's
fine, but it would be really good to avoid any complicated things for the
code in the rest of the VFS / filesystems.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-17 10:45             ` Jan Kara
@ 2011-03-28  8:06               ` Toshiyuki Okajima
  2011-03-30 14:12                 ` Jan Kara
  2011-03-31 23:40                 ` Dave Chinner
  0 siblings, 2 replies; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-03-28  8:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel

Hi.

On Thu, 17 Feb 2011 11:45:52 +0100
Jan Kara <jack@suse.cz> wrote:
> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > (2011/02/16 23:56), Jan Kara wrote:
> > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > >>Jan Kara<jack@suse.cz>  wrote:
> > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > >>>>>describe above.
> > >>>>
> > >>>>One of the fundamental problems here is that the freeze and thaw
> > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > >>>>it could do just as well by taking a read lock), but the second is to
> > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > >>>>fundamental problem here.
> > >>>>
> > >>>>So I think we can solve this by introducing a new mutex, s_freeze, and
> > >>>>having the resume/thaw first take the s_freeze mutex and then
> > >>>>second take a read lock on the s_umount.
> > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > >>>in thaw_super() can block if there is another process that tries to acquire
> > >>>s_umount for writing - a situation like:
> > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > >>>down_read(&sb->s_umount)
> > >>>   block on s_frozen
> > >>>				down_write(&sb->s_umount)
> > >>>				  -blocked
> > >>>								down_read(&sb->s_umount)
> > >>>								  -blocked
> > >>>behind the write access...
> > >>>
> > >>>The only working solution I see is to check for frozen filesystem before
> > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > >>>we did so in some well described wrapper).
> > >>I created the patch that you imagine yesterday.
> > >>
> > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > >>
> > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > >>after 12 hours passed.
> > >>
> > >>The patch for linux-2.6.38-rc4 is as follows:
> > >>---
> > >>  fs/fs-writeback.c |    2 +-
> > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > >>
> > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > >>index 59c6e49..1c9a05e 100644
> > >>--- a/fs/fs-writeback.c
> > >>+++ b/fs/fs-writeback.c
> > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > >>         spin_unlock(&sb_lock);
> > >>
> > >>         if (down_read_trylock(&sb->s_umount)) {
> > >>-               if (sb->s_root)
> > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > >>                         return true;
> > >>                 up_read(&sb->s_umount);
> > 
> > >   So this is something along the lines I thought but it actually won't work
> > >for example if sync(1) is run while the filesystem is frozen (that takes
> > >s_umount semaphore in a different place). And generally, I'm not convinced
> > >there are not other places that try to do IO while holding s_umount
> > >semaphore...
> > OK. I understand.
> > 
> > This code only fixes the case for the following path:
> > writeback_inodes_wb
> > -> ext4_da_writepages
> >    -> ext4_journal_start_sb
> >       -> vfs_check_frozen
> > But, the code doesn't fix the other cases.
> > 
> > We must modify the local filesystem part in order to fix all cases...?
>   Yes, possibly. But most importantly we should first find clear locking
> rules for frozen filesystem that avoid deadlocks like the one above. And
> the freezing / unfreezing code might become subtle for that reason, that's
> fine, but it would be really good to avoid any complicated things for the
> code in the rest of the VFS / filesystems.
I have continued to examine the root cause of this problem in depth, and
I believe I have found it.

The cause is that we can still write to memory which is mmapped to a file.
That memory then becomes dirty, so the flusher thread (e.g. wb_do_writeback)
tries to write it back.

Therefore, the root cause of this hang is not only the ext4 component (with
the delayed allocation feature) but also the writeback mechanism for mmap.
With other filesystems, you can still write something to the filesystem even
though you have frozen it.

A sample program is attached to this mail. If you run it, you can confirm
that some data can be written to your filesystem while the filesystem is
frozen.
(If you change the FS variable in go.sh from ext3 to ext4 and run
"fsfreeze -u mnt" manually from another prompt, you can also confirm this
deadlock.)

I think the best approach to fix this problem is to prevent users from
writing to memory which is mapped to a file while the filesystem is frozen.
However, it is very difficult to stop users from writing to memory which
has already been mapped to the file.

Therefore, I think the only practical method is to stop the writeback thread
in order to resolve the mmap problem. This fix also solves the original
problem (ext4 delayed write vs unfreeze).

I created a patch for this problem. Please review it.

------------------------------------------------------------------------------
----------
reproducer
----------
[run script] go.sh
#!/bin/sh

FS=ext3
gcc -o ./write ./write.c
dd if=/dev/zero of=/tmp/loop.$$ bs=1k seek=64k count=1 > /dev/null 2>&1
/sbin/mkfs.$FS -Fq /tmp/loop.$$
/sbin/losetup /dev/loop7 /tmp/loop.$$
mkdir -p mnt
mount -t $FS /dev/loop7 mnt
dd if=/dev/zero of=mnt/file bs=4k count=100 > /dev/null 2>&1
./write mnt/file &
pid=$!
# write 0 then 1
/bin/kill -SIGUSR1 $pid 
/bin/kill -SIGUSR1 $pid 
/sbin/fsfreeze -f mnt
cp /tmp/loop.$$ /tmp/loop.$$.pre
/bin/kill -SIGUSR1 $pid
sync
sleep 30
cp /tmp/loop.$$ /tmp/loop.$$.post
/sbin/fsfreeze -u mnt
/bin/kill -SIGTERM $pid
umount mnt
/sbin/losetup -d /dev/loop7
/usr/bin/cmp /tmp/loop.$$.pre /tmp/loop.$$.post > /dev/null 2>&1
if [ $? -ne 0 ]; then
        echo "freeze doesn't work correctly!"
else
        echo "freeze works correctly!"
fi
rm -f /tmp/loop.$$* 
exit 0

[program] write.c
#define _LARGEFILE64_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <signal.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

int counter = 0;
char *mmap_addr;
int fd;

#define LOOP 100
#define UNIT 4096
#define MMAPSZ  (UNIT*LOOP)
#define FILENAME "./mnt/file"

void
write_inc(int sig)
{
        int i;

        for (i = 0; i < LOOP; i++) 
                *((int*)(mmap_addr + UNIT*i)) = counter;
        counter ++;
}

void
main_exit(int sig)
{
        munmap(mmap_addr, MMAPSZ);
        close(fd);
        exit(0);
}

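int main(int argc, char *argv[])
{
        /* Use the path given on the command line (go.sh passes mnt/file). */
        char *file = (argc > 1) ? argv[1] : FILENAME;

        if ((fd = open(file, O_RDWR)) < 0) {
                perror("open error");
                exit(1);
        }
        /* Shared mapping: stores to it dirty the file's page cache pages. */
        if ((mmap_addr = mmap(0, MMAPSZ, PROT_WRITE, MAP_SHARED, fd, 0)) ==
            MAP_FAILED) {
                perror("mmap error");
                close(fd);
                exit(2);
        }
        /* SIGUSR1 dirties every page of the mapping; SIGTERM cleans up. */
        sigset(SIGTERM, main_exit);
        sigset(SIGUSR1, write_inc);
        while (1)
                pause();
}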

[steps to reproduce]
# sh ./go.sh 
------------------------------------------------------------------------------

[patch]
Currently, we can write to memory which is mapped to a file while
the filesystem to which it belongs is frozen.
Therefore, the filesystem can be modified even while it is frozen.
This fix prevents the flusher thread from updating the filesystem.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
---
 fs/fs-writeback.c   |    2 +-
 fs/super.c          |    7 ++++++-
 mm/page-writeback.c |    2 ++
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b5ed541..2a60148 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -477,7 +477,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
 	spin_unlock(&sb_lock);
 
 	if (down_read_trylock(&sb->s_umount)) {
-		if (sb->s_root)
+		if (sb->s_frozen == 0 && sb->s_root)
 			return true;
 		up_read(&sb->s_umount);
 	}
diff --git a/fs/super.c b/fs/super.c
index 8a06881..bac28c4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -432,8 +432,13 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
 			continue;
 		sb->s_count++;
 		spin_unlock(&sb_lock);
-
+retry:
 		down_read(&sb->s_umount);
+		if (sb->s_frozen > 0) {
+			up_read(&sb->s_umount);
+			cond_resched();
+			goto retry;
+		}
 		if (sb->s_root)
 			f(sb, arg);
 		up_read(&sb->s_umount);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 31f6988..eb19642 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1058,7 +1058,9 @@ EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	int ret;
+	struct super_block *sb = mapping->host->i_sb;
 
+	vfs_check_frozen(sb, SB_FREEZE_TRANS);
 	if (wbc->nr_to_write <= 0)
 		return 0;
 	if (mapping->a_ops->writepages)
-- 
1.5.5.6


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-28  8:06               ` [RFC][PATCH] " Toshiyuki Okajima
@ 2011-03-30 14:12                 ` Jan Kara
  2011-03-31  8:37                   ` Yongqiang Yang
  2011-03-31 12:03                   ` Toshiyuki Okajima
  2011-03-31 23:40                 ` Dave Chinner
  1 sibling, 2 replies; 121+ messages in thread
From: Jan Kara @ 2011-03-30 14:12 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel

  Hello,

On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> On Thu, 17 Feb 2011 11:45:52 +0100
> Jan Kara <jack@suse.cz> wrote:
> > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > (2011/02/16 23:56), Jan Kara wrote:
> > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > >>Jan Kara<jack@suse.cz>  wrote:
> > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > >>>>>describe above.
> > > >>>>
> > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > >>>>it could do just as well by taking a read lock), but the second is to
> > > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > > >>>>fundamental problem here.
> > > >>>>
> > > >>>>So I think we can solve this by introducing a new mutex, s_freeze, and
> > > >>>>having the resume/thaw first take the s_freeze mutex and then
> > > >>>>second take a read lock on the s_umount.
> > > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > >>>s_umount for writing - a situation like:
> > > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > > >>>down_read(&sb->s_umount)
> > > >>>   block on s_frozen
> > > >>>				down_write(&sb->s_umount)
> > > >>>				  -blocked
> > > >>>								down_read(&sb->s_umount)
> > > >>>								  -blocked
> > > >>>behind the write access...
> > > >>>
> > > >>>The only working solution I see is to check for frozen filesystem before
> > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > >>>we did so in some well described wrapper).
> > > >>I created the patch that you imagine yesterday.
> > > >>
> > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > >>
> > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > >>after 12 hours passed.
> > > >>
> > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > >>---
> > > >>  fs/fs-writeback.c |    2 +-
> > > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > > >>
> > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > >>index 59c6e49..1c9a05e 100644
> > > >>--- a/fs/fs-writeback.c
> > > >>+++ b/fs/fs-writeback.c
> > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > >>         spin_unlock(&sb_lock);
> > > >>
> > > >>         if (down_read_trylock(&sb->s_umount)) {
> > > >>-               if (sb->s_root)
> > > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > > >>                         return true;
> > > >>                 up_read(&sb->s_umount);
> > > 
> > > >   So this is something along the lines I thought but it actually won't work
> > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > >there are not other places that try to do IO while holding s_umount
> > > >semaphore...
> > > OK. I understand.
> > > 
> > > This code only fixes the case for the following path:
> > > writeback_inodes_wb
> > > -> ext4_da_writepages
> > >    -> ext4_journal_start_sb
> > >       -> vfs_check_frozen
> > > But, the code doesn't fix the other cases.
> > > 
> > > We must modify the local filesystem part in order to fix all cases...?
> >   Yes, possibly. But most importantly we should first find clear locking
> > rules for frozen filesystem that avoid deadlocks like the one above. And
> > the freezing / unfreezing code might become subtle for that reason, that's
> > fine, but it would be really good to avoid any complicated things for the
> > code in the rest of the VFS / filesystems.
> I have continued to examine the root cause of this problem in depth, and
> I believe I have found it.
> 
> The cause is that we can still write to memory which is mmapped to a file.
> That memory then becomes dirty, so the flusher thread (e.g. wb_do_writeback)
> tries to write it back.
> 
> Therefore, the root cause of this hang is not only the ext4 component (with
> the delayed allocation feature) but also the writeback mechanism for mmap.
> With other filesystems, you can still write something to the filesystem even
> though you have frozen it.
  Well, you can write something only in the caches, not to the on disk
image. So it's not a problem as such.

> A sample program is attached to this mail. If you run it, you can confirm
> that some data can be written to your filesystem while the filesystem is
> frozen.
> (If you change the FS variable in go.sh from ext3 to ext4 and run
> "fsfreeze -u mnt" manually from another prompt, you can also confirm this
> deadlock.)
> 
> I think the best approach to fix this problem is to prevent users from
> writing to memory which is mapped to a file while the filesystem is frozen.
> However, it is very difficult to stop users from writing to memory which
> has already been mapped to the file.
  It is actually possible. In case of ext4, you could add a check (+ wait)
in ext4_page_mkwrite() whether the filesystem is frozen or in the process
of being frozen and if so, wait for it to get unfrozen. The only tough
problem here might be the locking as ext4_page_mkwrite() is called with
mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
But you'd have to fix all filesystems (and all paths possibly creating
dirty data) in this way.
 
> Therefore, I think the only practical method is to stop the writeback thread
> in order to resolve the mmap problem. This fix also solves the original
> problem (ext4 delayed write vs unfreeze).
  Hmm, I had a look at the code again and think we could fix the issue
cleanly (i.e. all possible users of s_umount) as follows: The lock
ordering will be
  s_umount -> "fs frozen"
and there will be a new mutex s_freeze_mutex protecting changes of
s_frozen.

freeze_bdev() already observes this lock ordering, it will only take
s_freeze_mutex for the changes of s_frozen values. The only other code
that is relevant for the lock ordering is thaw_super() (the freezing
process is not expected to reenter kernel for the frozen filesystem).
In thaw_super() we could take s_freeze_mutex, do all the thawing work,
set s_frozen, release s_freeze_mutex and put superblock reference.
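
As a rough illustration of why dropping s_umount from the thaw path
matters, here is a single-threaded userspace toy model (not kernel code).
struct toy_sb, the toy_* helpers and the TOY_* result codes are all
made-up names for this sketch; TOY_WOULD_BLOCK marks the point where the
real code would sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Same numeric values as the s_frozen states in include/linux/fs.h. */
enum { SB_UNFROZEN = 0, SB_FREEZE_WRITE = 1, SB_FREEZE_TRANS = 2 };

enum toy_result { TOY_OK, TOY_WOULD_BLOCK, TOY_EINVAL };

/* Hypothetical stand-in for the locking state of struct super_block. */
struct toy_sb {
	int  umount_readers;     /* tasks holding s_umount for read */
	bool umount_writer;      /* a task holds s_umount for write */
	bool freeze_mutex_held;  /* models the proposed s_freeze_mutex */
	int  s_frozen;
};

/* Models down_read_trylock(&sb->s_umount). */
bool toy_down_read_trylock(struct toy_sb *sb)
{
	if (sb->umount_writer)
		return false;
	sb->umount_readers++;
	return true;
}

/* Models up_read(&sb->s_umount). */
void toy_up_read(struct toy_sb *sb)
{
	assert(sb->umount_readers > 0);
	sb->umount_readers--;
}

/* Models a non-sleeping attempt at down_write(&sb->s_umount). */
bool toy_down_write_trylock(struct toy_sb *sb)
{
	if (sb->umount_writer || sb->umount_readers > 0)
		return false;
	sb->umount_writer = true;
	return true;
}

/* Old scheme: thawing needs s_umount for write. */
enum toy_result toy_thaw_old(struct toy_sb *sb)
{
	if (!toy_down_write_trylock(sb))
		return TOY_WOULD_BLOCK;  /* the real code sleeps here */
	if (sb->s_frozen == SB_UNFROZEN) {
		sb->umount_writer = false;
		return TOY_EINVAL;
	}
	sb->s_frozen = SB_UNFROZEN;
	sb->umount_writer = false;
	return TOY_OK;
}

/* Proposed scheme: thawing takes only s_freeze_mutex. */
enum toy_result toy_thaw_new(struct toy_sb *sb)
{
	if (sb->freeze_mutex_held)
		return TOY_WOULD_BLOCK;
	sb->freeze_mutex_held = true;
	if (sb->s_frozen != SB_FREEZE_TRANS) {
		sb->freeze_mutex_held = false;
		return TOY_EINVAL;
	}
	sb->s_frozen = SB_UNFROZEN;
	sb->freeze_mutex_held = false;
	return TOY_OK;
}
```

With s_frozen == SB_FREEZE_TRANS and one reader holding s_umount (the
flusher stuck in writeback), toy_thaw_old() reports TOY_WOULD_BLOCK, while
toy_thaw_new() succeeds and resets s_frozen to SB_UNFROZEN.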

So something like the patch below - it seems to work for me, can you test
it please?

From 0939f4c2fd5d69d7d1bf7ece9a641bb561e9d0dd Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 30 Mar 2011 15:21:44 +0200
Subject: [PATCH] vfs: Fix deadlocks on frozen filesystem

When a filesystem is frozen and the flusher thread decides to do writeback
for the frozen filesystem (e.g. because pages were marked dirty by mmaped
write) we deadlock because we take s_umount semaphore and then try to write
dirty pages which blocks. In this situation there is no way to unfreeze
the filesystem because thawing code requires s_umount semaphore.

Fix the problem by removing the need to take s_umount from the thawing code.
Instead we introduce a new s_freeze_mutex to provide the necessary exclusion.

Reported-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
 include/linux/fs.h |    1 +
 2 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index e848649..4f74718 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,6 +77,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
+		mutex_init(&s->s_freeze_mutex);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
 		/*
 		 * The locking rules for s_lock are up to the
@@ -971,6 +972,24 @@ out:
  * Syncs the super to make sure the filesystem is consistent and calls the fs's
  * freeze_fs.  Subsequent calls to this without first thawing the fs will return
  * -EBUSY.
+ *
+ * Locking of freeze / thaw is tricky (if not messy). Freezing is protected by
+ * exclusively taking s_umount to avoid races with mount / remount / umount and
+ * also provide exclusion of concurrent freeze calls. Then we have
+ * s_freeze_mutex which protects changes to s_frozen and the call ->freeze_fs()
+ * against races with thawing code.
+ *
+ * Thawing code must not take s_umount before the filesystem is unfrozen
+ * because that would cause deadlocks (e.g. background flushing takes s_umount
+ * and then does writeback which blocks on a frozen filesystem). So we take
+ * only s_freeze_mutex, which provides us exclusion against concurrent
+ * freezing, and hold it until the thawing is finished. We are protected
+ * against superblock going away by holding an active sb reference and against
+ * remounting by the fact that the sb is frozen.
+ *
+ * Notes: s_freeze_mutex cannot be merged with bd_fsfreeze_mutex because we
+ * can freeze block devices without filesystems and also freeze filesystems
+ * not backed by block devices.
  */
 int freeze_super(struct super_block *sb)
 {
@@ -978,7 +997,9 @@ int freeze_super(struct super_block *sb)
 
 	atomic_inc(&sb->s_active);
 	down_write(&sb->s_umount);
-	if (sb->s_frozen) {
+	mutex_lock(&sb->s_freeze_mutex);
+	if (sb->s_frozen != SB_UNFROZEN) {
+		mutex_unlock(&sb->s_freeze_mutex);
 		deactivate_locked_super(sb);
 		return -EBUSY;
 	}
@@ -986,15 +1007,18 @@ int freeze_super(struct super_block *sb)
 	if (sb->s_flags & MS_RDONLY) {
 		sb->s_frozen = SB_FREEZE_TRANS;
 		smp_wmb();
+		mutex_unlock(&sb->s_freeze_mutex);
 		up_write(&sb->s_umount);
 		return 0;
 	}
 
 	sb->s_frozen = SB_FREEZE_WRITE;
+	mutex_unlock(&sb->s_freeze_mutex);
 	smp_wmb();
 
 	sync_filesystem(sb);
 
+	mutex_lock(&sb->s_freeze_mutex);
 	sb->s_frozen = SB_FREEZE_TRANS;
 	smp_wmb();
 
@@ -1005,10 +1029,12 @@ int freeze_super(struct super_block *sb)
 			printk(KERN_ERR
 				"VFS:Filesystem freeze failed\n");
 			sb->s_frozen = SB_UNFROZEN;
+			mutex_unlock(&sb->s_freeze_mutex);
 			deactivate_locked_super(sb);
 			return ret;
 		}
 	}
+	mutex_unlock(&sb->s_freeze_mutex);
 	up_write(&sb->s_umount);
 	return 0;
 }
@@ -1019,14 +1045,15 @@ EXPORT_SYMBOL(freeze_super);
  * @sb: the super to thaw
  *
  * Unlocks the filesystem and marks it writeable again after freeze_super().
+ * See freeze_super() for locking comments.
  */
 int thaw_super(struct super_block *sb)
 {
 	int error;
 
-	down_write(&sb->s_umount);
-	if (sb->s_frozen == SB_UNFROZEN) {
-		up_write(&sb->s_umount);
+	mutex_lock(&sb->s_freeze_mutex);
+	if (sb->s_frozen != SB_FREEZE_TRANS) {
+		mutex_unlock(&sb->s_freeze_mutex);
 		return -EINVAL;
 	}
 
@@ -1039,7 +1066,7 @@ int thaw_super(struct super_block *sb)
 			printk(KERN_ERR
 				"VFS:Filesystem thaw failed\n");
 			sb->s_frozen = SB_FREEZE_TRANS;
-			up_write(&sb->s_umount);
+			mutex_unlock(&sb->s_freeze_mutex);
 			return error;
 		}
 	}
@@ -1048,7 +1075,8 @@ out:
 	sb->s_frozen = SB_UNFROZEN;
 	smp_wmb();
 	wake_up(&sb->s_wait_unfrozen);
-	deactivate_locked_super(sb);
+	mutex_unlock(&sb->s_freeze_mutex);
+	deactivate_super(sb);
 
 	return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7061a85..230892d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1382,6 +1382,7 @@ struct super_block {
 	struct dentry		*s_root;
 	struct rw_semaphore	s_umount;
 	struct mutex		s_lock;
+	struct mutex		s_freeze_mutex;
 	int			s_count;
 	atomic_t		s_active;
 #ifdef CONFIG_SECURITY
-- 
1.7.1



* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-30 14:12                 ` Jan Kara
@ 2011-03-31  8:37                   ` Yongqiang Yang
  2011-03-31  8:48                     ` Yongqiang Yang
  2011-03-31 14:04                     ` Eric Sandeen
  2011-03-31 12:03                   ` Toshiyuki Okajima
  1 sibling, 2 replies; 121+ messages in thread
From: Yongqiang Yang @ 2011-03-31  8:37 UTC (permalink / raw)
  To: Jan Kara, Amir Goldstein
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

Hi everyone,

Amir hit a deadlock while testing ext4 with snapshots.  The deadlock
was reported at
https://github.com/amir73il/ext4-snapshots/commit/56396185d922a73524a091b545e665543abf741a.
It is difficult to reproduce.  Another deadlock was
reported at http://www.spinics.net/lists/linux-ext4/msg23018.html.
These two deadlocks actually have the same root cause.

Below is my analysis of the first one.  Mail is not a good place to
describe parallel processes, so I have also submitted the analysis to
https://github.com/YANGYongqiang/ext4-snapshots/blob/9e0ae9ae9907125e6bf45aa91db296d4cc041b17/fs/ext4/BUGS#L143.
It is much more readable there.

-- deadlock in ext4 with snapshot
       ext4 with snapshot calls freeze_super() to bring
       the fs into a clean state when a user takes a snapshot.

     freeze                  truncate              kjournald

                    |  ext4_ext_truncate     |
    freeze_super()  |   starts a handle      |
    sets s_frozen   |                        |
                    |  ext4_ext_truncate     |
                    |  holds i_data_sem      |
  ext4_freeze()     |                        |   commit_transaction()
   wait for updates |                        |   waits for i_data_sem
                    |  ext4_free_blocks      |
                    |  calls dquot_free_block|
                    |                        |
                    |  dquot_free_block calls|
                    |  ext4_dirty_inode      |
                    |                        |
                    |  ext4_dirty_inode      |
                    |  tries to start handle |
                    |                        |
                    |  block due to s_frozen |
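
The diagram can be read as a wait-for graph. The sketch below is a
user-space toy model, not kernel code: the task names, the edges and the
helper task_in_wait_cycle() are assumptions taken from my reading of the
diagram (freeze waits for the open handle, truncate waits for s_frozen to
clear, kjournald waits for i_data_sem).

```c
#include <assert.h>
#include <stdbool.h>

/* Tasks from the diagram above. */
enum task { FREEZE, TRUNCATE, KJOURNALD, NR_TASKS };

/* Marks a task that is not waiting on anyone. */
#define NO_TASK (-1)

/*
 * waits_for[a] == b means task a cannot make progress until task b does.
 * Each task waits on at most one other task in this simplified model.
 * Returns true if following the chain from 'start' leads back to 'start'.
 */
static bool task_in_wait_cycle(const int waits_for[NR_TASKS], int start)
{
	int cur = waits_for[start];
	int steps;

	for (steps = 0; steps < NR_TASKS && cur != NO_TASK; steps++) {
		if (cur == start)
			return true;
		cur = waits_for[cur];
	}
	return false;
}

/* Edges as read from the diagram (an interpretation, not kernel code). */
static void build_reported_scenario(int waits_for[NR_TASKS])
{
	waits_for[FREEZE] = TRUNCATE;    /* ext4_freeze() waits for the open handle */
	waits_for[TRUNCATE] = FREEZE;    /* nested handle start blocks on s_frozen */
	waits_for[KJOURNALD] = TRUNCATE; /* commit waits for i_data_sem */
}

/* Hypothetical variant: truncate's nested handle start does not block. */
static void build_nonblocking_scenario(int waits_for[NR_TASKS])
{
	waits_for[FREEZE] = TRUNCATE;
	waits_for[TRUNCATE] = NO_TASK;
	waits_for[KJOURNALD] = TRUNCATE;
}
```

In the reported scenario freeze and truncate wait on each other, and
kjournald is stuck behind them without being part of the cycle itself.
Removing the single edge from truncate back to freeze breaks the cycle.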

In ext3, ext3_freeze() prevents the journal from being updated by calling
journal_lock_updates(), and ext3_unfreeze() allows updates again by calling
journal_unlock_updates().

In ext4, however, ext4_freeze() unlocks the journal before it returns, and
ext4 relies on s_frozen to prevent journal updates.  s_frozen lives in
an upper layer (the VFS), so it is outside ext4's control and deadlocks can
easily happen.

Could someone explain why ext4 does this instead of following ext3?

Yongqiang.
-- 
Best Wishes
Yongqiang Yang

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-31  8:37                   ` Yongqiang Yang
@ 2011-03-31  8:48                     ` Yongqiang Yang
  2011-03-31 14:04                     ` Eric Sandeen
  1 sibling, 0 replies; 121+ messages in thread
From: Yongqiang Yang @ 2011-03-31  8:48 UTC (permalink / raw)
  To: Jan Kara, Amir Goldstein
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Thu, Mar 31, 2011 at 4:37 PM, Yongqiang Yang <xiaoqiangnk@gmail.com> wrote:
> Hi everyone,
>
> Amir met a deadlock when he tested ext4 with snapshot.  The deadlock
> was reported on
> https://github.com/amir73il/ext4-snapshots/commit/56396185d922a73524a091b545e665543abf741a.
>  It is difficult to reproduce the deadlock.  There is a deadlock
> reported on http://www.spinics.net/lists/linux-ext4/msg23018.html.
> Actually, these two deadlocks come from a same source.
>
> Below are my analysis on the 1st one.  Mail is not a good place to
> describe parallel processes.  I have submitted the analysis to
> https://github.com/YANGYongqiang/ext4-snapshots/blob/9e0ae9ae9907125e6bf45aa91db296d4cc041b17/fs/ext4/BUGS#L143.
>  It is much more readable.
>
> -- deadlock in ext4 with snapshot
>       ext4 with snapshot calls freeze_super() to bring
>       a fs be in a clean state when a user takes a snapshot.
>
>     freeze                  truncate              kjournald
>
>                    |  ext4_ext_truncate     |
>    freeze_super()  |   starts a handle      |
>    sets s_frozen   |                        |
>                    |  ext4_ext_truncate     |
>                    |  holds i_data_sem      |
>  ext4_freeze()     |                        |   commit_transaction()
>   wait for updates |                        |   waits for i_data_sem
>                    |  ext4_free_blocks      |
>                    |  calls dquot_free_block|
>                    |                        |
>                    |  dquot_free_block call |
>                    |  ext4_dirty_inode      |
>                    |                        |
>                    |  ext4_dirty_inode      |
>                    |  trys to start a handle|
>                    |                        |
>                    |  block due to s_frozen |
>
> in ext3, ext3_freeze() prevents journal from being updated by
> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
> unlock_journal_updates().
>
> in ext4, however, before ext4_freeze() returns, it unlock journal, and
> ext4 prevents journal from being updated by s_frozen. s_frozen is in
> an upper layer, so it is out control of ext4 and deadlock is easy to
> happen.

Actually, it is not right to block in ext4_journal_start_sb() before we
confirm that the current thread has no active handle.  But that is what ext4
does, so deadlocks can easily happen.  Right?

>
> Could someone explain why ext4 does like above but not follow ext3?
>
> Yongqiang.
> --
> Best Wishes
> Yongqiang Yang
>



-- 
Best Wishes
Yongqiang Yang
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-30 14:12                 ` Jan Kara
  2011-03-31  8:37                   ` Yongqiang Yang
@ 2011-03-31 12:03                   ` Toshiyuki Okajima
  2011-04-05 10:25                     ` Toshiyuki Okajima
  1 sibling, 1 reply; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-03-31 12:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel

Hi, thanks for your review.

(2011/03/30 23:12), Jan Kara wrote:
>    Hello,
>
> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>> On Thu, 17 Feb 2011 11:45:52 +0100
>> Jan Kara<jack@suse.cz>  wrote:
>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>> Jan Kara<jack@suse.cz>   wrote:
>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
<SNIP>
>> I have deeply continued to examined the root cause of this problem, then
>> I found it.
>>
>> It is that we can write a memory which is mmaped to a file. Then the memory
>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>> "writeback" the memory.
>>
>> Therefore, the root cause of this hangup is not only ext4 component (with
>> delayed allocation feature) but also writeback mechanism for mmap. If you
>> use the other filesystem, you can write something to the filesystem though
>> you have freezed the filesystem.

>    Well, you can write something only in the caches, not to the on disk
> image. So it's not a problem as such.
My reproducer uses a loopback device (/dev/loopX). With it, I have confirmed that
writes reach not only the caches but also the loopback device itself. However,
I have not yet confirmed that writes reach a real device (/dev/sdaX).

>
>> A sample problem is attached on this mail.  Try to execute it then you can
>> confirm that we can write some data to your filesystem while freezing the
>> filesystem.
>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>
>> I think the best approach to fix this problem is to let users not to write
>> memory which is mapped to a certain file while the filesystem is freezing.
>> However, it is very difficult to control users not to write memory which has
>> been already mapped to the file.
>    It is actually possible. In case of ext4, you could add a check (+ wait)
> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> of being frozen and if so, wait for it to get unfrozen. The only tough
> problem here might be the locking as ext4_page_mkwrite() is called with
> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> But you'd have to fix all filesystems (and all paths possibly creating
> dirty data) in this way.
>

>> Therefore, I think there is only actual method that we stop writeback thread
>> to resolve the mmap problem. Also, by this fix, the original problem
>> (ext4 delayed write vs unfreeze) can be solved.
>    Hmm, I had a look at the code again and think we could fix the issue
> cleanly (i.e. all possible users of s_umount) as follows: The lock
> ordering will be
>    s_umount ->  "fs frozen"
> and there will be a new mutex s_freeze_mutex protecting changes of
> s_frozen.
>
> freeze_bdev() already observes this lock ordering, it will only take
> s_freeze_mutex for the changes of s_frozen values. The only other code
> that is relevant for the lock ordering is thaw_super() (the freezing
> process is not expected to reenter kernel for the frozen filesystem).
> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> set s_frozen, release s_freeze_mutex and put superblock reference.
>
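
The thaw path described above can be modeled in user space roughly as
follows. This is a single-threaded sketch under stated assumptions: struct
sb_model, model_lock(), model_unlock() and thaw_super_model() are
hypothetical names, the "locks" only record ownership, and the field names
are borrowed from the patch for readability.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum { SB_UNFROZEN = 0, SB_FREEZE_WRITE = 1, SB_FREEZE_TRANS = 2 };

/* Stand-in for struct super_block; not the kernel structure. */
struct sb_model {
	int  s_frozen;
	int  s_active;             /* reference taken when the fs was frozen */
	bool s_umount_held;        /* true while some other task holds s_umount */
	bool s_freeze_mutex_held;
};

/* Toy lock helpers: they record ownership and catch double locking. */
static void model_lock(bool *held)
{
	assert(!*held);
	*held = true;
}

static void model_unlock(bool *held)
{
	assert(*held);
	*held = false;
}

/*
 * Take s_freeze_mutex, do the thawing work, set s_frozen, release the
 * mutex, then drop the superblock reference.  s_umount is never taken,
 * so a task sleeping with s_umount held cannot block the thaw.
 */
static int thaw_super_model(struct sb_model *sb)
{
	model_lock(&sb->s_freeze_mutex_held);
	if (sb->s_frozen == SB_UNFROZEN) {
		model_unlock(&sb->s_freeze_mutex_held);
		return -EINVAL;
	}
	sb->s_frozen = SB_UNFROZEN;   /* the kernel also wakes s_wait_unfrozen */
	model_unlock(&sb->s_freeze_mutex_held);
	sb->s_active--;               /* stands in for deactivate_super() */
	return 0;
}
```

The point of the model is only the ordering: the thaw completes even when
s_umount_held is true, because the thaw path never looks at it.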

> So something like the patch below - it seems to work for me, can you test
> it please?
I think your patch looks good, and it seems to solve the original problem.
OK, I will test your patch.
I cannot test it this weekend, so I will reply next week.

Thanks,
Toshiyuki Okajima


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-31  8:37                   ` Yongqiang Yang
  2011-03-31  8:48                     ` Yongqiang Yang
@ 2011-03-31 14:04                     ` Eric Sandeen
  2011-03-31 14:36                       ` Yongqiang Yang
  1 sibling, 1 reply; 121+ messages in thread
From: Eric Sandeen @ 2011-03-31 14:04 UTC (permalink / raw)
  To: Yongqiang Yang
  Cc: Jan Kara, Amir Goldstein, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On 3/31/11 3:37 AM, Yongqiang Yang wrote:

> in ext3, ext3_freeze() prevents journal from being updated by
> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
> unlock_journal_updates().
> 
> in ext4, however, before ext4_freeze() returns, it unlock journal, and
> ext4 prevents journal from being updated by s_frozen. s_frozen is in
> an upper layer, so it is out control of ext4 and deadlock is easy to
> happen.
> 
> Could someone explain why ext4 does like above but not follow ext3?
> 
> Yongqiang.

That was me, I think ...

commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Sun May 16 02:00:00 2010 -0400

    ext4: don't return to userspace after freezing the fs with a mutex held

    ext4_freeze() used jbd2_journal_lock_updates() which takes
    the j_barrier mutex, and then returns to userspace.  The
    kernel does not like this:

    ================================================
    [ BUG: lock held when returning to user space! ]
    ------------------------------------------------
    lvcreate/1075 is leaving the kernel with locks still held!
    1 lock held by lvcreate/1075:
     #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
    jbd2_journal_lock_updates+0xe1/0xf0

    Use vfs_check_frozen() added to ext4_journal_start_sb() and
    ext4_force_commit() instead.

    Addresses-Red-Hat-Bugzilla: #568503


    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-31 14:04                     ` Eric Sandeen
@ 2011-03-31 14:36                       ` Yongqiang Yang
  2011-03-31 15:25                         ` Eric Sandeen
  2011-03-31 16:28                         ` Jan Kara
  0 siblings, 2 replies; 121+ messages in thread
From: Yongqiang Yang @ 2011-03-31 14:36 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Amir Goldstein, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On Thu, Mar 31, 2011 at 10:04 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> On 3/31/11 3:37 AM, Yongqiang Yang wrote:
>
>> in ext3, ext3_freeze() prevents journal from being updated by
>> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
>> unlock_journal_updates().
>>
>> in ext4, however, before ext4_freeze() returns, it unlock journal, and
>> ext4 prevents journal from being updated by s_frozen. s_frozen is in
>> an upper layer, so it is out control of ext4 and deadlock is easy to
>> happen.
>>
>> Could someone explain why ext4 does like above but not follow ext3?
>>
>> Yongqiang.
>
> That was me, I think ...

Thank you, Eric.

I think ext4_journal_start() should check whether the current thread has an
active handle before calling vfs_check_frozen(); if so, the current handle
should be returned.  That way we can avoid these deadlocks.

Do you agree with me?  If I am right, I will send a patch.
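
The proposed ordering can be sketched in user space as follows. This is a
toy model and not the eventual patch: struct task_model, struct
handle_model and journal_start_model() are hypothetical names, and where
the kernel would sleep in vfs_check_frozen() the model just returns
START_WOULD_BLOCK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a journal handle; h_ref counts nested starts. */
struct handle_model {
	int h_ref;
};

/* Stand-in for the per-task state that remembers the active handle. */
struct task_model {
	struct handle_model *journal_info;  /* active handle, or NULL */
};

enum start_result { START_NEW, START_NESTED, START_WOULD_BLOCK };

/*
 * Look for an active handle first, and only consult the frozen state
 * when there is none.  A task that already holds a handle is allowed to
 * finish its work even while the filesystem is being frozen.
 */
static enum start_result journal_start_model(struct task_model *task,
					     bool fs_frozen,
					     struct handle_model *fresh)
{
	if (task->journal_info) {
		task->journal_info->h_ref++;
		return START_NESTED;
	}
	if (fs_frozen)
		return START_WOULD_BLOCK;  /* kernel: sleep until unfrozen */
	fresh->h_ref = 1;
	task->journal_info = fresh;
	return START_NEW;
}
```

With this ordering only tasks that hold no handle can end up waiting on
s_frozen, and those are exactly the tasks the freezer does not wait for.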
>
> commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
> Author: Eric Sandeen <sandeen@redhat.com>
> Date:   Sun May 16 02:00:00 2010 -0400
>
>    ext4: don't return to userspace after freezing the fs with a mutex held
>
>    ext4_freeze() used jbd2_journal_lock_updates() which takes
>    the j_barrier mutex, and then returns to userspace.  The
>    kernel does not like this:
>
>    ================================================
>    [ BUG: lock held when returning to user space! ]
>    ------------------------------------------------
>    lvcreate/1075 is leaving the kernel with locks still held!
>    1 lock held by lvcreate/1075:
>     #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>    jbd2_journal_lock_updates+0xe1/0xf0
>
>    Use vfs_check_frozen() added to ext4_journal_start_sb() and
>    ext4_force_commit() instead.
>
>    Addresses-Red-Hat-Bugzilla: #568503
>
>
>    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>



-- 
Best Wishes
Yongqiang Yang

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-31 14:36                       ` Yongqiang Yang
@ 2011-03-31 15:25                         ` Eric Sandeen
  2011-03-31 16:28                         ` Jan Kara
  1 sibling, 0 replies; 121+ messages in thread
From: Eric Sandeen @ 2011-03-31 15:25 UTC (permalink / raw)
  To: Yongqiang Yang
  Cc: Jan Kara, Amir Goldstein, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On 3/31/11 9:36 AM, Yongqiang Yang wrote:
> On Thu, Mar 31, 2011 at 10:04 PM, Eric Sandeen <sandeen@redhat.com> wrote:
>> On 3/31/11 3:37 AM, Yongqiang Yang wrote:
>>
>>> in ext3, ext3_freeze() prevents journal from being updated by
>>> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
>>> unlock_journal_updates().
>>>
>>> in ext4, however, before ext4_freeze() returns, it unlock journal, and
>>> ext4 prevents journal from being updated by s_frozen. s_frozen is in
>>> an upper layer, so it is out control of ext4 and deadlock is easy to
>>> happen.
>>>
>>> Could someone explain why ext4 does like above but not follow ext3?
>>>
>>> Yongqiang.
>>
>> That was me, I think ...
> 
> Thank you, Eric.
> 
> I think ext4_journal_start() should check if current thread has an
> active handle before  vfs_check_frozen(), if so, current handle will
> be returned. Thus, we can avoid deadlocks.
> 
> Do you agree with me?  If I am right, I will send a patch.

If you have a testcase to test it with, sure.  Plus, a patch would help me know for sure what you propose :)

Sorry for breaking it (if I did!)  But holding a mutex and returning to userspace was pretty bad, too :(

Thanks,
-Eric

>>
>> commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
>> Author: Eric Sandeen <sandeen@redhat.com>
>> Date:   Sun May 16 02:00:00 2010 -0400
>>
>>    ext4: don't return to userspace after freezing the fs with a mutex held
>>
>>    ext4_freeze() used jbd2_journal_lock_updates() which takes
>>    the j_barrier mutex, and then returns to userspace.  The
>>    kernel does not like this:
>>
>>    ================================================
>>    [ BUG: lock held when returning to user space! ]
>>    ------------------------------------------------
>>    lvcreate/1075 is leaving the kernel with locks still held!
>>    1 lock held by lvcreate/1075:
>>     #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>>    jbd2_journal_lock_updates+0xe1/0xf0
>>
>>    Use vfs_check_frozen() added to ext4_journal_start_sb() and
>>    ext4_force_commit() instead.
>>
>>    Addresses-Red-Hat-Bugzilla: #568503
>>
>>
>>    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>>    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>>
> 
> 
> 


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-31 14:36                       ` Yongqiang Yang
  2011-03-31 15:25                         ` Eric Sandeen
@ 2011-03-31 16:28                         ` Jan Kara
  1 sibling, 0 replies; 121+ messages in thread
From: Jan Kara @ 2011-03-31 16:28 UTC (permalink / raw)
  To: Yongqiang Yang
  Cc: Eric Sandeen, Jan Kara, Amir Goldstein, Toshiyuki Okajima,
	Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel

On Thu 31-03-11 22:36:46, Yongqiang Yang wrote:
> On Thu, Mar 31, 2011 at 10:04 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> > On 3/31/11 3:37 AM, Yongqiang Yang wrote:
> >
> >> in ext3, ext3_freeze() prevents journal from being updated by
> >> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
> >> unlock_journal_updates().
> >>
> >> in ext4, however, before ext4_freeze() returns, it unlock journal, and
> >> ext4 prevents journal from being updated by s_frozen. s_frozen is in
> >> an upper layer, so it is out control of ext4 and deadlock is easy to
> >> happen.
> >>
> >> Could someone explain why ext4 does like above but not follow ext3?
> >>
> >> Yongqiang.
> >
> > That was me, I think ...
> 
> Thank you, Eric.
> 
> I think ext4_journal_start() should check if current thread has an
> active handle before  vfs_check_frozen(), if so, current handle will
> be returned. Thus, we can avoid deadlocks.
> 
> Do you agree with me?  If I am right, I will send a patch.
  Yes, definitely. This was exactly what I wanted to propose as well.
Thanks for looking into this.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-28  8:06               ` [RFC][PATCH] " Toshiyuki Okajima
  2011-03-30 14:12                 ` Jan Kara
@ 2011-03-31 23:40                 ` Dave Chinner
  2011-03-31 23:53                   ` Eric Sandeen
                                     ` (2 more replies)
  1 sibling, 3 replies; 121+ messages in thread
From: Dave Chinner @ 2011-03-31 23:40 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel

On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> Hi.
> 
> On Thu, 17 Feb 2011 11:45:52 +0100
> Jan Kara <jack@suse.cz> wrote:
> > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > (2011/02/16 23:56), Jan Kara wrote:
> > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > >>Jan Kara<jack@suse.cz>  wrote:
> > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > >>>>>describe above.
> > > >>>>
> > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > >>>>it could do just as well by taking a read lock), but the second is to
> > > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > > >>>>fundamental problem here.
> > > >>>>
> > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > >>>>second take a read lock on the s_umount.
> > > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > >>>s_umount for writing - a situation like:
> > > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > > >>>down_read(&sb->s_umount)
> > > >>>   block on s_frozen
> > > >>>				down_write(&sb->s_umount)
> > > >>>				  -blocked
> > > >>>								down_read(&sb->s_umount)
> > > >>>								  -blocked
> > > >>>behind the write access...
> > > >>>
> > > >>>The only working solution I see is to check for frozen filesystem before
> > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > >>>we did so in some well described wrapper).
> > > >>I created the patch that you imagine yesterday.
> > > >>
> > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > >>
> > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > >>after 12 hours passed.
> > > >>
> > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > >>---
> > > >>  fs/fs-writeback.c |    2 +-
> > > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > > >>
> > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > >>index 59c6e49..1c9a05e 100644
> > > >>--- a/fs/fs-writeback.c
> > > >>+++ b/fs/fs-writeback.c
> > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > >>         spin_unlock(&sb_lock);
> > > >>
> > > >>         if (down_read_trylock(&sb->s_umount)) {
> > > >>-               if (sb->s_root)
> > > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > > >>                         return true;
> > > >>                 up_read(&sb->s_umount);
> > > 
> > > >   So this is something along the lines I thought but it actually won't work
> > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > >there are not other places that try to do IO while holding s_umount
> > > >semaphore...
> > > OK. I understand.
> > > 
> > > This code only fixes the case for the following path:
> > > writeback_inodes_wb
> > > -> ext4_da_writepages
> > >    -> ext4_journal_start_sb
> > >       -> vfs_check_frozen
> > > But, the code doesn't fix the other cases.
> > > 
> > > We must modify the local filesystem part in order to fix all cases...?
> >   Yes, possibly. But most importantly we should first find clear locking
> > rules for frozen filesystem that avoid deadlocks like the one above. And
> > the freezing / unfreezing code might become subtle for that reason, that's
> > fine, but it would be really good to avoid any complicated things for the
> > code in the rest of the VFS / filesystems.
> I have deeply continued to examined the root cause of this problem, then 
> I found it.
> 
> It is that we can write a memory which is mmaped to a file. Then the memory 
> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> "writeback" the memory. 

Then surely the issue is that .page_mkwrite is not checking that the
filesystem is frozen before allowing the page fault to continue and
dirty the page?

> I think the best approach to fix this problem is to let users not to write
> memory which is mapped to a certain file while the filesystem is freezing. 
> However, it is very difficult to control users not to write memory which has 
> been already mapped to the file.

If you don't allow the page to be dirtied in the first place, then
nothing needs to be done to the writeback path because there is
nothing dirty for it to write back.
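
As a rough illustration of that argument, here is a user-space toy model.
It is only a sketch under stated assumptions: struct sb_model, struct
page_model, page_mkwrite_model() and count_dirty() are hypothetical names,
and where a real handler would have to sleep or retry, the model just
returns FAULT_MUST_WAIT.

```c
#include <assert.h>
#include <stdbool.h>

enum { SB_UNFROZEN = 0, SB_FREEZE_WRITE = 1, SB_FREEZE_TRANS = 2 };

struct sb_model {
	int s_frozen;
};

struct page_model {
	bool dirty;
};

enum fault_result { FAULT_OK, FAULT_MUST_WAIT };

/* Write fault on a mapped page: refuse to dirty it while frozen. */
static enum fault_result page_mkwrite_model(const struct sb_model *sb,
					    struct page_model *page)
{
	if (sb->s_frozen != SB_UNFROZEN)
		return FAULT_MUST_WAIT;  /* a real handler would sleep or retry */
	page->dirty = true;
	return FAULT_OK;
}

/* What writeback would find: the number of pages it has to write. */
static int count_dirty(const struct page_model *pages, int n)
{
	int i, dirty = 0;

	for (i = 0; i < n; i++)
		if (pages[i].dirty)
			dirty++;
	return dirty;
}
```

If every write fault goes through such a check, count_dirty() stays at
zero for the whole time the filesystem is frozen, so the flusher has no
reason to start a transaction.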

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-31 23:40                 ` Dave Chinner
@ 2011-03-31 23:53                   ` Eric Sandeen
  2011-04-01 14:08                   ` Jan Kara
  2011-04-05 10:44                   ` Toshiyuki Okajima
  2 siblings, 0 replies; 121+ messages in thread
From: Eric Sandeen @ 2011-03-31 23:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Toshiyuki Okajima, Jan Kara, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On 3/31/11 6:40 PM, Dave Chinner wrote:
> On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
>> Hi.
>>
>> On Thu, 17 Feb 2011 11:45:52 +0100
>> Jan Kara <jack@suse.cz> wrote:
>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>> Jan Kara<jack@suse.cz>  wrote:
>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>>>>>> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
>>>>>>>>> under s_umount semaphore, we are prone to deadlock like the one you
>>>>>>>>> describe above.
>>>>>>>>
>>>>>>>> One of the fundamental problems here is that the freeze and thaw
>>>>>>>> routines are using down_write(&sb->s_umount) for two purposes.  The
>>>>>>>> first is to prevent the resume/thaw from racing with a umount (which
>>>>>>>> it could do just as well by taking a read lock), but the second is to
>>>>>>>> prevent the resume/thaw code from racing with itself.  That's the core
>>>>>>>> fundamental problem here.
>>>>>>>>
>>>>>>>> So I think we can solve this by introduce a new mutex, s_freeze, and
>>>>>>>> having the the resume/thaw first take the s_freeze mutex and then
>>>>>>>> second take a read lock on the s_umount.
>>>>>>>   Sadly this does not quite work because even down_read(&sb->s_umount)
>>>>>>> in thaw_super() can block if there is another process that tries to acquire
>>>>>>> s_umount for writing - a situation like:
>>>>>>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
>>>>>>> down_read(&sb->s_umount)
>>>>>>>   block on s_frozen
>>>>>>> 				down_write(&sb->s_umount)
>>>>>>> 				  -blocked
>>>>>>> 								down_read(&sb->s_umount)
>>>>>>> 								  -blocked
>>>>>>> behind the write access...
>>>>>>>
>>>>>>> The only working solution I see is to check for frozen filesystem before
>>>>>>> taking s_umount semaphore which seems rather ugly (but might be bearable if
>>>>>>> we did so in some well described wrapper).
>>>>>> I created the patch that you imagine yesterday.
>>>>>>
>>>>>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
>>>>>> without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>>>>>
>>>>>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>>>>>> after 12 hours passed.
>>>>>>
>>>>>> The patch for linux-2.6.38-rc4 is as follows:
>>>>>> ---
>>>>>>  fs/fs-writeback.c |    2 +-
>>>>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>>>>> index 59c6e49..1c9a05e 100644
>>>>>> --- a/fs/fs-writeback.c
>>>>>> +++ b/fs/fs-writeback.c
>>>>>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>>>>>         spin_unlock(&sb_lock);
>>>>>>
>>>>>>         if (down_read_trylock(&sb->s_umount)) {
>>>>>> -               if (sb->s_root)
>>>>>> +               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
>>>>>>                         return true;
>>>>>>                 up_read(&sb->s_umount);
>>>>
>>>>>   So this is something along the lines I thought but it actually won't work
>>>>> for example if sync(1) is run while the filesystem is frozen (that takes
>>>>> s_umount semaphore in a different place). And generally, I'm not convinced
>>>>> there are not other places that try to do IO while holding s_umount
>>>>> semaphore...
>>>> OK. I understand.
>>>>
>>>> This code only fixes the case for the following path:
>>>> writeback_inodes_wb
>>>> -> ext4_da_writepages
>>>>    -> ext4_journal_start_sb
>>>>       -> vfs_check_frozen
>>>> But, the code doesn't fix the other cases.
>>>>
>>>> We must modify the local filesystem part in order to fix all cases...?
>>>   Yes, possibly. But most importantly we should first find clear locking
>>> rules for frozen filesystem that avoid deadlocks like the one above. And
>>> the freezing / unfreezing code might become subtle for that reason, that's
>>> fine, but it would be really good to avoid any complicated things for the
>>> code in the rest of the VFS / filesystems.
>> I have deeply continued to examined the root cause of this problem, then 
>> I found it.
>>
>> It is that we can write a memory which is mmaped to a file. Then the memory 
>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>> "writeback" the memory. 
> 
> Then surely the issue is that .page_mkwrite is not checking that the
> filesystem is frozen before allowing the page fault to continue and
> dirty the page?
> 
>> I think the best approach to fix this problem is to let users not to write
>> memory which is mapped to a certain file while the filesystem is freezing. 
>> However, it is very difficult to control users not to write memory which has 
>> been already mapped to the file.
> 
> If you don't allow the page to be dirtied in the first place, then
> nothing needs to be done to the writeback path because there is
> nothing dirty for it to write back.

I floated 

[PATCH, RFC] check for frozen filesystems in the mmap path

a long time ago, but it went nowhere; maybe time to revive that approach.

-Eric

> Cheers,
> 
> Dave.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-31 23:40                 ` Dave Chinner
  2011-03-31 23:53                   ` Eric Sandeen
@ 2011-04-01 14:08                   ` Jan Kara
  2011-04-06  5:40                     ` Dave Chinner
  2011-04-05 10:44                   ` Toshiyuki Okajima
  2 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-04-01 14:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Toshiyuki Okajima, Jan Kara, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> > On Thu, 17 Feb 2011 11:45:52 +0100
> > Jan Kara <jack@suse.cz> wrote:
> > > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > > (2011/02/16 23:56), Jan Kara wrote:
> > > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > > >>Jan Kara<jack@suse.cz>  wrote:
> > > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > > >>>>>describe above.
> > > > >>>>
> > > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > > >>>>it could do just as well by taking a read lock), but the second is to
> > > > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > > > >>>>fundamental problem here.
> > > > >>>>
> > > > > >>>>So I think we can solve this by introducing a new mutex, s_freeze, and
> > > > > >>>>having the resume/thaw first take the s_freeze mutex and then
> > > > >>>>second take a read lock on the s_umount.
> > > > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > > >>>s_umount for writing - a situation like:
> > > > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > > > >>>down_read(&sb->s_umount)
> > > > >>>   block on s_frozen
> > > > >>>				down_write(&sb->s_umount)
> > > > >>>				  -blocked
> > > > >>>								down_read(&sb->s_umount)
> > > > >>>								  -blocked
> > > > >>>behind the write access...
> > > > >>>
> > > > >>>The only working solution I see is to check for frozen filesystem before
> > > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > > >>>we did so in some well described wrapper).
> > > > >>I created the patch that you imagine yesterday.
> > > > >>
> > > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > > >>
> > > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > > >>after 12 hours passed.
> > > > >>
> > > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > > >>---
> > > > >>  fs/fs-writeback.c |    2 +-
> > > > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > >>
> > > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > >>index 59c6e49..1c9a05e 100644
> > > > >>--- a/fs/fs-writeback.c
> > > > >>+++ b/fs/fs-writeback.c
> > > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > > >>         spin_unlock(&sb_lock);
> > > > >>
> > > > >>         if (down_read_trylock(&sb->s_umount)) {
> > > > >>-               if (sb->s_root)
> > > > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > > > >>                         return true;
> > > > >>                 up_read(&sb->s_umount);
> > > > 
> > > > >   So this is something along the lines I thought but it actually won't work
> > > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > > >there are not other places that try to do IO while holding s_umount
> > > > >semaphore...
> > > > OK. I understand.
> > > > 
> > > > This code only fixes the case for the following path:
> > > > writeback_inodes_wb
> > > > -> ext4_da_writepages
> > > >    -> ext4_journal_start_sb
> > > >       -> vfs_check_frozen
> > > > But, the code doesn't fix the other cases.
> > > > 
> > > > We must modify the local filesystem part in order to fix all cases...?
> > >   Yes, possibly. But most importantly we should first find clear locking
> > > rules for frozen filesystem that avoid deadlocks like the one above. And
> > > the freezing / unfreezing code might become subtle for that reason, that's
> > > fine, but it would be really good to avoid any complicated things for the
> > > code in the rest of the VFS / filesystems.
> > I have deeply continued to examined the root cause of this problem, then 
> > I found it.
> > 
> > It is that we can write a memory which is mmaped to a file. Then the memory 
> > becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> > "writeback" the memory. 
> 
> Then surely the issue is that .page_mkwrite is not checking that the
> filesystem is frozen before allowing the page fault to continue and
> dirty the page?
  And is this a bug? That isn't clear to me...

> > I think the best approach to fix this problem is to let users not to write
> > memory which is mapped to a certain file while the filesystem is freezing. 
> > However, it is very difficult to control users not to write memory which has 
> > been already mapped to the file.
> 
> If you don't allow the page to be dirtied in the first place, then
> nothing needs to be done to the writeback path because there is
> nothing dirty for it to write back.
  Sure but that's only the problem he was able to hit. But generally,
there's a problem with needing s_umount for unfreezing because it isn't
clear there aren't other code paths which can block with s_umount held
waiting for fs to get unfrozen. And these code paths would cause the same
deadlock. That's why I chose to get rid of s_umount during thawing.
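
A minimal userspace sketch of that argument, not kernel code: it only does
bookkeeping for who holds which lock. The struct, the helper names and the
freeze_mutex flag (standing in for the proposed s_freeze_mutex) are
assumptions made up for illustration; the SB_* values follow the s_frozen
states used by kernels of this period. A task that holds s_umount for read
and is parked waiting for the thaw makes any thaw that needs
down_write(&sb->s_umount) wait forever, while a thaw that takes only the
dedicated mutex completes.

```c
#include <assert.h>
#include <stdbool.h>

enum { SB_UNFROZEN = 0, SB_FREEZE_WRITE = 1, SB_FREEZE_TRANS = 2 };

/* Userspace bookkeeping model (assumption), not the kernel's super_block. */
struct sb_model {
	int  umount_readers;  /* tasks holding s_umount for read */
	bool umount_writer;   /* a task holds s_umount for write */
	bool freeze_mutex;    /* hypothetical s_freeze_mutex is held */
	int  s_frozen;        /* one of the SB_* states above */
};

/* Models down_write(&sb->s_umount) as a trylock: false means "would sleep". */
bool try_down_write_umount(struct sb_model *sb)
{
	if (sb->umount_readers > 0 || sb->umount_writer)
		return false;
	sb->umount_writer = true;
	return true;
}

/* Thaw that requires s_umount, as in the deadlock discussed in this thread. */
bool thaw_needs_umount(struct sb_model *sb)
{
	if (!try_down_write_umount(sb))
		return false;  /* stuck behind the reader that waits for the thaw */
	sb->s_frozen = SB_UNFROZEN;
	sb->umount_writer = false;
	return true;
}

/* Thaw that takes only the dedicated mutex and never touches s_umount. */
bool thaw_with_freeze_mutex(struct sb_model *sb)
{
	if (sb->freeze_mutex)
		return false;  /* another freeze/thaw is in progress */
	sb->freeze_mutex = true;
	sb->s_frozen = SB_UNFROZEN;
	sb->freeze_mutex = false;
	return true;
}
```

With one reader recorded and s_frozen at SB_FREEZE_TRANS,
thaw_needs_umount() reports that it would block and leaves the state
frozen, whereas thaw_with_freeze_mutex() returns true and resets s_frozen
to SB_UNFROZEN, after which the parked reader could continue.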

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-31 12:03                   ` Toshiyuki Okajima
@ 2011-04-05 10:25                     ` Toshiyuki Okajima
  2011-04-05 22:54                       ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-04-05 10:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel, sandeen

Hi.

(2011/03/31 21:03), Toshiyuki Okajima wrote:
> Hi, thanks for your reviewing.
>
> (2011/03/30 23:12), Jan Kara wrote:
>> Hello,
>>
>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>> Jan Kara<jack@suse.cz> wrote:
>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>> Jan Kara<jack@suse.cz> wrote:
>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> <SNIP>
>>> I have deeply continued to examined the root cause of this problem, then
>>> I found it.
>>>
>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>> "writeback" the memory.
>>>
>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>> use the other filesystem, you can write something to the filesystem though
>>> you have freezed the filesystem.
>
>> Well, you can write something only in the caches, not to the on disk
>> image. So it's not a problem as such.
> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> we can write in not only the caches but also the loopback device. However,
> I don't still confirm that we can write to the real device(/dev/sdaX).
>
>>
>>> A sample problem is attached on this mail. Try to execute it then you can
>>> confirm that we can write some data to your filesystem while freezing the
>>> filesystem.
>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>
>>> I think the best approach to fix this problem is to let users not to write
>>> memory which is mapped to a certain file while the filesystem is freezing.
>>> However, it is very difficult to control users not to write memory which has
>>> been already mapped to the file.
>> It is actually possible. In case of ext4, you could add a check (+ wait)
>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>> of being frozen and if so, wait for it to get unfrozen. The only tough
>> problem here might be the locking as ext4_page_mkwrite() is called with
>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>> But you'd have to fix all filesystems (and all paths possibly creating
>> dirty data) in this way.
>>
>
>>> Therefore, I think there is only actual method that we stop writeback thread
>>> to resolve the mmap problem. Also, by this fix, the original problem
>>> (ext4 delayed write vs unfreeze) can be solved.
>> Hmm, I had a look at the code again and think we could fix the issue
>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>> ordering will be
>> s_umount -> "fs frozen"
>> and there will be a new mutex s_freeze_mutex protecting changes of
>> s_frozen.
>>
>> freeze_bdev() already observes this lock ordering, it will only take
>> s_freeze_mutex for the changes of s_frozen values. The only other code
>> that is relevant for the lock ordering is thaw_super() (the freezing
>> process is not expected to reenter kernel for the frozen filesystem).
>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>
>
>> So something like the patch below - it seems to work for me, can you test
>> it please?
> I think your patch looks good, so, the original problem seems to be solved.
> OK, I will test your patch.
> This weekend I cannot test it. So, I will reply next week.
I have tested whether Mizuma-san's reproducer can still cause the deadlock with
your patch applied. No problems occurred while the reproducer was running.

I think your patch solves the original deadlock problem reported by
Mizuma-san.

> Reported-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
>  include/linux/fs.h |    1 +
>  2 files changed, 35 insertions(+), 6 deletions(-)

However, I think the write that causes the deadlock comes from mmapped dirty
pages. So I guess we also need a fix in the mmap path for the frozen case.

> I floated
>
> [PATCH, RFC] check for frozen filesystems in the mmap path
>
> a long time ago, but it went nowhere; maybe time to revive that approach.

Thanks,
Toshiyuki Okajima



* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-03-31 23:40                 ` Dave Chinner
  2011-03-31 23:53                   ` Eric Sandeen
  2011-04-01 14:08                   ` Jan Kara
@ 2011-04-05 10:44                   ` Toshiyuki Okajima
  2 siblings, 0 replies; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-04-05 10:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

Hi.

(2011/04/01 8:40), Dave Chinner wrote:
> On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
>> Hi.
>>
>> On Thu, 17 Feb 2011 11:45:52 +0100
>> Jan Kara<jack@suse.cz>  wrote:
>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>> Jan Kara<jack@suse.cz>   wrote:
>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>>>>>> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
>>>>>>>>> under s_umount semaphore, we are prone to deadlock like the one you
>>>>>>>>> describe above.
>>>>>>>>
>>>>>>>> One of the fundamental problems here is that the freeze and thaw
>>>>>>>> routines are using down_write(&sb->s_umount) for two purposes.  The
>>>>>>>> first is to prevent the resume/thaw from racing with a umount (which
>>>>>>>> it could do just as well by taking a read lock), but the second is to
>>>>>>>> prevent the resume/thaw code from racing with itself.  That's the core
>>>>>>>> fundamental problem here.
>>>>>>>>
>>>>>>>> So I think we can solve this by introducing a new mutex, s_freeze, and
>>>>>>>> having the resume/thaw first take the s_freeze mutex and then
>>>>>>>> second take a read lock on the s_umount.
>>>>>>>    Sadly this does not quite work because even down_read(&sb->s_umount)
>>>>>>> in thaw_super() can block if there is another process that tries to acquire
>>>>>>> s_umount for writing - a situation like:
>>>>>>>    TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
>>>>>>> down_read(&sb->s_umount)
>>>>>>>    block on s_frozen
>>>>>>> 				down_write(&sb->s_umount)
>>>>>>> 				  -blocked
>>>>>>> 								down_read(&sb->s_umount)
>>>>>>> 								  -blocked
>>>>>>> behind the write access...
>>>>>>>
>>>>>>> The only working solution I see is to check for frozen filesystem before
>>>>>>> taking s_umount semaphore which seems rather ugly (but might be bearable if
>>>>>>> we did so in some well described wrapper).
>>>>>> I created the patch that you imagine yesterday.
>>>>>>
>>>>>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
>>>>>> without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>>>>>
>>>>>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>>>>>> after 12 hours passed.
>>>>>>
>>>>>> The patch for linux-2.6.38-rc4 is as follows:
>>>>>> ---
>>>>>>   fs/fs-writeback.c |    2 +-
>>>>>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>>>>> index 59c6e49..1c9a05e 100644
>>>>>> --- a/fs/fs-writeback.c
>>>>>> +++ b/fs/fs-writeback.c
>>>>>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>>>>>          spin_unlock(&sb_lock);
>>>>>>
>>>>>>          if (down_read_trylock(&sb->s_umount)) {
>>>>>> -               if (sb->s_root)
>>>>>> +               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
>>>>>>                          return true;
>>>>>>                  up_read(&sb->s_umount);
>>>>
>>>>>    So this is something along the lines I thought but it actually won't work
>>>>> for example if sync(1) is run while the filesystem is frozen (that takes
>>>>> s_umount semaphore in a different place). And generally, I'm not convinced
>>>>> there are not other places that try to do IO while holding s_umount
>>>>> semaphore...
>>>> OK. I understand.
>>>>
>>>> This code only fixes the case for the following path:
>>>> writeback_inodes_wb
>>>> ->  ext4_da_writepages
>>>>     ->  ext4_journal_start_sb
>>>>        ->  vfs_check_frozen
>>>> But, the code doesn't fix the other cases.
>>>>
>>>> We must modify the local filesystem part in order to fix all cases...?
>>>    Yes, possibly. But most importantly we should first find clear locking
>>> rules for frozen filesystem that avoid deadlocks like the one above. And
>>> the freezing / unfreezing code might become subtle for that reason, that's
>>> fine, but it would be really good to avoid any complicated things for the
>>> code in the rest of the VFS / filesystems.
>> I have deeply continued to examined the root cause of this problem, then
>> I found it.
>>
>> It is that we can write a memory which is mmaped to a file. Then the memory
>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>> "writeback" the memory.
>
> Then surely the issue is that .page_mkwrite is not checking that the
> filesystem is frozen before allowing the page fault to continue and
> dirty the page?
>
>> I think the best approach to fix this problem is to let users not to write
>> memory which is mapped to a certain file while the filesystem is freezing.
>> However, it is very difficult to control users not to write memory which has
>> been already mapped to the file.
>

> If you don't allow the page to be dirtied in the first place, then
> nothing needs to be done to the writeback path because there is
> nothing dirty for it to write back.
OK. We can block the write operation by not allowing the page to be
dirtied in the first place.

But we cannot easily stop later writes to a page which is *already mapped*.
Therefore I think writeback of such pages can only be blocked in the
flusher thread.

Thanks,
Toshiyuki Okajima



* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-05 10:25                     ` Toshiyuki Okajima
@ 2011-04-05 22:54                       ` Jan Kara
  2011-04-06  5:09                         ` Toshiyuki Okajima
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-04-05 22:54 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
> (2011/03/31 21:03), Toshiyuki Okajima wrote:
> >Hi, thanks for your reviewing.
> >
> >(2011/03/30 23:12), Jan Kara wrote:
> >>Hello,
> >>
> >>On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> >>>On Thu, 17 Feb 2011 11:45:52 +0100
> >>>Jan Kara<jack@suse.cz> wrote:
> >>>>On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> >>>>>(2011/02/16 23:56), Jan Kara wrote:
> >>>>>>On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>>>>>>On Tue, 15 Feb 2011 18:29:54 +0100
> >>>>>>>Jan Kara<jack@suse.cz> wrote:
> >>>>>>>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>>>>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> ><SNIP>
> >>>I have deeply continued to examined the root cause of this problem, then
> >>>I found it.
> >>>
> >>>It is that we can write a memory which is mmaped to a file. Then the memory
> >>>becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> >>>"writeback" the memory.
> >>>
> >>>Therefore, the root cause of this hangup is not only ext4 component (with
> >>>delayed allocation feature) but also writeback mechanism for mmap. If you
> >>>use the other filesystem, you can write something to the filesystem though
> >>>you have freezed the filesystem.
> >
> >>Well, you can write something only in the caches, not to the on disk
> >>image. So it's not a problem as such.
> >My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> >we can write in not only the caches but also the loopback device. However,
> >I don't still confirm that we can write to the real device(/dev/sdaX).
> >
> >>
> >>>A sample problem is attached on this mail. Try to execute it then you can
> >>>confirm that we can write some data to your filesystem while freezing the
> >>>filesystem.
> >>>(If you change FS variable in go.sh from ext3 to ext4 and you execute
> >>>"fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
> >>>
> >>>I think the best approach to fix this problem is to let users not to write
> >>>memory which is mapped to a certain file while the filesystem is freezing.
> >>>However, it is very difficult to control users not to write memory which has
> >>>been already mapped to the file.
> >>It is actually possible. In case of ext4, you could add a check (+ wait)
> >>in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> >>of being frozen and if so, wait for it to get unfrozen. The only tough
> >>problem here might be the locking as ext4_page_mkwrite() is called with
> >>mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> >>But you'd have to fix all filesystems (and all paths possibly creating
> >>dirty data) in this way.
> >>
> >
> >>>Therefore, I think there is only actual method that we stop writeback thread
> >>>to resolve the mmap problem. Also, by this fix, the original problem
> >>>(ext4 delayed write vs unfreeze) can be solved.
> >>Hmm, I had a look at the code again and think we could fix the issue
> >>cleanly (i.e. all possible users of s_umount) as follows: The lock
> >>ordering will be
> >>s_umount -> "fs frozen"
> >>and there will be a new mutex s_freeze_mutex protecting changes of
> >>s_frozen.
> >>
> >>freeze_bdev() already observes this lock ordering, it will only take
> >>s_freeze_mutex for the changes of s_frozen values. The only other code
> >>that is relevant for the lock ordering is thaw_super() (the freezing
> >>process is not expected to reenter kernel for the frozen filesystem).
> >>In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> >>set s_frozen, release s_freeze_mutex and put superblock reference.
> >>
> >
> >>So something like the patch below - it seems to work for me, can you test
> >>it please?
> >I think your patch looks good, so, the original problem seems to be solved.
> >OK, I will test your patch.
> >This weekend I cannot test it. So, I will reply next week.
> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
> patch. And then any problems didn't hit while the reproducer was running.
>
> I think your patch solves the original deadlock problem which is reported by
> Mizuma-san.
  Good. Thanks.

> >Reported-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> >Signed-off-by: Jan Kara <jack@suse.cz>
> >---
> > fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
> > include/linux/fs.h |    1 +
> > 2 files changed, 35 insertions(+), 6 deletions(-)
> 
> However, I think a write which causes the deadlock is from mmapped dirty
> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
  Why? If you dirty a page, writeback thread can come and try to write it -
which blocks - but now that does not matter...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-05 22:54                       ` Jan Kara
@ 2011-04-06  5:09                         ` Toshiyuki Okajima
  2011-04-06  5:57                           ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-04-06  5:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel, sandeen

Hi.

(2011/04/06 7:54), Jan Kara wrote:
> On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
>> (2011/03/31 21:03), Toshiyuki Okajima wrote:
>>> Hi, thanks for your reviewing.
>>>
>>> (2011/03/30 23:12), Jan Kara wrote:
>>>> Hello,
>>>>
>>>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>>>> Jan Kara<jack@suse.cz>  wrote:
>>>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>>>> Jan Kara<jack@suse.cz>  wrote:
>>>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>> <SNIP>
>>>>> I have deeply continued to examined the root cause of this problem, then
>>>>> I found it.
>>>>>
>>>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>>>> "writeback" the memory.
>>>>>
>>>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>>>> use the other filesystem, you can write something to the filesystem though
>>>>> you have freezed the filesystem.
>>>
>>>> Well, you can write something only in the caches, not to the on disk
>>>> image. So it's not a problem as such.
>>> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
>>> we can write in not only the caches but also the loopback device. However,
>>> I don't still confirm that we can write to the real device(/dev/sdaX).
>>>
>>>>
>>>>> A sample problem is attached on this mail. Try to execute it then you can
>>>>> confirm that we can write some data to your filesystem while freezing the
>>>>> filesystem.
>>>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>>>
>>>>> I think the best approach to fix this problem is to let users not to write
>>>>> memory which is mapped to a certain file while the filesystem is freezing.
>>>>> However, it is very difficult to control users not to write memory which has
>>>>> been already mapped to the file.
>>>> It is actually possible. In case of ext4, you could add a check (+ wait)
>>>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>>>> of being frozen and if so, wait for it to get unfrozen. The only tough
>>>> problem here might be the locking as ext4_page_mkwrite() is called with
>>>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>>>> But you'd have to fix all filesystems (and all paths possibly creating
>>>> dirty data) in this way.
>>>>
>>>
>>>>> Therefore, I think there is only actual method that we stop writeback thread
>>>>> to resolve the mmap problem. Also, by this fix, the original problem
>>>>> (ext4 delayed write vs unfreeze) can be solved.
>>>> Hmm, I had a look at the code again and think we could fix the issue
>>>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>>>> ordering will be
>>>> s_umount ->  "fs frozen"
>>>> and there will be a new mutex s_freeze_mutex protecting changes of
>>>> s_frozen.
>>>>
>>>> freeze_bdev() already observes this lock ordering, it will only take
>>>> s_freeze_mutex for the changes of s_frozen values. The only other code
>>>> that is relevant for the lock ordering is thaw_super() (the freezing
>>>> process is not expected to reenter kernel for the frozen filesystem).
>>>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>>>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>>>
>>>
>>>> So something like the patch below - it seems to work for me, can you test
>>>> it please?
>>> I think your patch looks good, so, the original problem seems to be solved.
>>> OK, I will test your patch.
>>> This weekend I cannot test it. So, I will reply next week.
>> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
>> patch. And then any problems didn't hit while the reproducer was running.
>>
>> I think your patch solves the original deadlock problem which is reported by
>> Mizuma-san.
>    Good. Thanks.
>
>>> Reported-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
>>> Signed-off-by: Jan Kara<jack@suse.cz>
>>> ---
>>> fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
>>> include/linux/fs.h |    1 +
>>> 2 files changed, 35 insertions(+), 6 deletions(-)
>>

>> However, I think a write which causes the deadlock is from mmapped dirty
>> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
>    Why? If you dirty a page, writeback thread can come and try to write it -
> which blocks - but now that does not matter...
I do not understand the code around the writeback thread very well...
Could you tell me the concrete name of the function which blocks the writes?

Mizuma-san's reproducer also writes data through a mapping of the file (mmap).
The original problem happens after the fsfreeze operation is done.
I understand that a normal write operation (not mmap) is blocked while
the filesystem is frozen. So I guess we do not always block all write
operations while it is frozen.

Thanks
Toshiyuki Okajima



* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-01 14:08                   ` Jan Kara
@ 2011-04-06  5:40                     ` Dave Chinner
  2011-04-06  6:18                       ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Dave Chinner @ 2011-04-06  5:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> > > On Thu, 17 Feb 2011 11:45:52 +0100
> > > Jan Kara <jack@suse.cz> wrote:
> > > > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > > > (2011/02/16 23:56), Jan Kara wrote:
> > > > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > > > >>Jan Kara<jack@suse.cz>  wrote:
> > > > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > > > >>>>>describe above.
> > > > > >>>>
> > > > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > > > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > > > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > > > >>>>it could do just as well by taking a read lock), but the second is to
> > > > > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > > > > >>>>fundamental problem here.
> > > > > >>>>
> > > > > > >>>>So I think we can solve this by introducing a new mutex, s_freeze, and
> > > > > > >>>>having the resume/thaw first take the s_freeze mutex and then
> > > > > >>>>second take a read lock on the s_umount.
> > > > > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > > > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > > > >>>s_umount for writing - a situation like:
> > > > > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > > > > >>>down_read(&sb->s_umount)
> > > > > >>>   block on s_frozen
> > > > > >>>				down_write(&sb->s_umount)
> > > > > >>>				  -blocked
> > > > > >>>								down_read(&sb->s_umount)
> > > > > >>>								  -blocked
> > > > > >>>behind the write access...
> > > > > >>>
> > > > > >>>The only working solution I see is to check for frozen filesystem before
> > > > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > > > >>>we did so in some well described wrapper).
> > > > > >>I created the patch that you imagine yesterday.
> > > > > >>
> > > > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > > > >>
> > > > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > > > >>after 12 hours passed.
> > > > > >>
> > > > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > > > >>---
> > > > > >>  fs/fs-writeback.c |    2 +-
> > > > > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > >>
> > > > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > > >>index 59c6e49..1c9a05e 100644
> > > > > >>--- a/fs/fs-writeback.c
> > > > > >>+++ b/fs/fs-writeback.c
> > > > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > > > >>         spin_unlock(&sb_lock);
> > > > > >>
> > > > > >>         if (down_read_trylock(&sb->s_umount)) {
> > > > > >>-               if (sb->s_root)
> > > > > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > > > > >>                         return true;
> > > > > >>                 up_read(&sb->s_umount);
> > > > > 
> > > > > >   So this is something along the lines I thought but it actually won't work
> > > > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > > > >there are not other places that try to do IO while holding s_umount
> > > > > >semaphore...
> > > > > OK. I understand.
> > > > > 
> > > > > This code only fixes the case for the following path:
> > > > > writeback_inodes_wb
> > > > > -> ext4_da_writepages
> > > > >    -> ext4_journal_start_sb
> > > > >       -> vfs_check_frozen
> > > > > But, the code doesn't fix the other cases.
> > > > > 
> > > > > We must modify the local filesystem part in order to fix all cases...?
> > > >   Yes, possibly. But most importantly we should first find clear locking
> > > > rules for frozen filesystem that avoid deadlocks like the one above. And
> > > > the freezing / unfreezing code might become subtle for that reason, that's
> > > > fine, but it would be really good to avoid any complicated things for the
> > > > code in the rest of the VFS / filesystems.
> > > I have deeply continued to examined the root cause of this problem, then 
> > > I found it.
> > > 
> > > It is that we can write a memory which is mmaped to a file. Then the memory 
> > > becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> > > "writeback" the memory. 
> > 
> > Then surely the issue is that .page_mkwrite is not checking that the
> > filesystem is frozen before allowing the page fault to continue and
> > dirty the page?
>   And is this a bug? That isn't clear to me...

Given the semantics of a frozen filesystem, letting any object be
dirtied while frozen (be it an inode, a page, a metadata block, etc)
is definitely a bug.

The way the freeze code is architected is that incoming dirtying
events are prevented so that the writeback side does not need to
care about the frozen state of the filesystem at all. The freeze
operation is supposed to block new dirtiers, then flush all dirty
objects resulting in everything being clean in the filesystem.

Hence if no objects are being dirtied, then there should never be
any need to block writeback threads due to the filesystem being
frozen because, by definition, there should be no work for them to
do. Hence if objects are being dirtied while the filesystem is
frozen, then that is a bug.
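
As an illustration of that design, here is a minimal userspace sketch in
C. It is a toy model under stated assumptions, not kernel code: every
toy_* name is hypothetical, and where real code would sleep until thaw
the model simply returns false. The point it shows is that only the
dirtying side looks at the frozen state; the writeback side never does.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum toy_frozen { TOY_UNFROZEN, TOY_FREEZE_WRITE, TOY_FROZEN };

struct toy_sb {
	enum toy_frozen frozen;
	int ndirty;		/* number of dirty objects (pages, inodes, ...) */
};

/* Dirtying side: the only place that checks the frozen state.
 * Returns false where a real implementation would sleep until thaw. */
bool toy_dirty_object(struct toy_sb *sb)
{
	if (sb->frozen != TOY_UNFROZEN)
		return false;
	sb->ndirty++;
	return true;
}

/* Writeback side: never checks the frozen state.
 * Returns the number of objects written. */
int toy_writeback(struct toy_sb *sb)
{
	int written = sb->ndirty;

	sb->ndirty = 0;
	return written;
}

void toy_freeze(struct toy_sb *sb)
{
	sb->frozen = TOY_FREEZE_WRITE;	/* 1. block new dirtiers */
	toy_writeback(sb);		/* 2. flush what is already dirty */
	sb->frozen = TOY_FROZEN;	/* 3. everything is clean now */
}

void toy_thaw(struct toy_sb *sb)
{
	sb->frozen = TOY_UNFROZEN;
}
```

In this model, a writeback pass on a frozen superblock always finds zero
objects, so it has nothing it could block on. Any path that increments
ndirty without going through the check breaks that invariant.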

> > > I think the best approach to fix this problem is to let users not to write
> > > memory which is mapped to a certain file while the filesystem is freezing. 
> > > However, it is very difficult to control users not to write memory which has 
> > > been already mapped to the file.
> > 
> > If you don't allow the page to be dirtied in the fist place, then
> > nothing needs to be done to the writeback path because there is
> > nothing dirty for it to write back.
>   Sure but that's only the problem he was able to hit. But generally,
> there's a problem with needing s_umount for unfreezing because it isn't
> clear there aren't other code paths which can block with s_umount held
> waiting for fs to get unfrozen. And these code paths would cause the same
> deadlock. That's why I chose to get rid of s_umount during thawing.

Holding the s_umount lock while checking if frozen and sleeping
is essentially an ABBA lock inversion bug that can bite in many more
places than just thawing the filesystem. Anywhere this is done
should be fixed, so I don't think just removing the s_umount lock
from the thaw path is sufficient to avoid problems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06  5:09                         ` Toshiyuki Okajima
@ 2011-04-06  5:57                           ` Jan Kara
  2011-04-06  7:40                             ` Toshiyuki Okajima
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-04-06  5:57 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
> (2011/04/06 7:54), Jan Kara wrote:
> >On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
> >>(2011/03/31 21:03), Toshiyuki Okajima wrote:
> >>>Hi, thanks for your reviewing.
> >>>
> >>>(2011/03/30 23:12), Jan Kara wrote:
> >>>>Hello,
> >>>>
> >>>>On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> >>>>>On Thu, 17 Feb 2011 11:45:52 +0100
> >>>>>Jan Kara<jack@suse.cz>  wrote:
> >>>>>>On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> >>>>>>>(2011/02/16 23:56), Jan Kara wrote:
> >>>>>>>>On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>>>>>>>>On Tue, 15 Feb 2011 18:29:54 +0100
> >>>>>>>>>Jan Kara<jack@suse.cz>  wrote:
> >>>>>>>>>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>>>>>>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> >>><SNIP>
> >>>>>I have deeply continued to examined the root cause of this problem, then
> >>>>>I found it.
> >>>>>
> >>>>>It is that we can write a memory which is mmaped to a file. Then the memory
> >>>>>becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> >>>>>"writeback" the memory.
> >>>>>
> >>>>>Therefore, the root cause of this hangup is not only ext4 component (with
> >>>>>delayed allocation feature) but also writeback mechanism for mmap. If you
> >>>>>use the other filesystem, you can write something to the filesystem though
> >>>>>you have freezed the filesystem.
> >>>
> >>>>Well, you can write something only in the caches, not to the on disk
> >>>>image. So it's not a problem as such.
> >>>My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> >>>we can write in not only the caches but also the loopback device. However,
> >>>I don't still confirm that we can write to the real device(/dev/sdaX).
> >>>
> >>>>
> >>>>>A sample problem is attached on this mail. Try to execute it then you can
> >>>>>confirm that we can write some data to your filesystem while freezing the
> >>>>>filesystem.
> >>>>>(If you change FS variable in go.sh from ext3 to ext4 and you execute
> >>>>>"fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
> >>>>>
> >>>>>I think the best approach to fix this problem is to let users not to write
> >>>>>memory which is mapped to a certain file while the filesystem is freezing.
> >>>>>However, it is very difficult to control users not to write memory which has
> >>>>>been already mapped to the file.
> >>>>It is actually possible. In case of ext4, you could add a check (+ wait)
> >>>>in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> >>>>of being frozen and if so, wait for it to get unfrozen. The only tough
> >>>>problem here might be the locking as ext4_page_mkwrite() is called with
> >>>>mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> >>>>But you'd have to fix all filesystems (and all paths possibly creating
> >>>>dirty data) in this way.
> >>>>
> >>>
> >>>>>Therefore, I think there is only actual method that we stop writeback thread
> >>>>>to resolve the mmap problem. Also, by this fix, the original problem
> >>>>>(ext4 delayed write vs unfreeze) can be solved.
> >>>>Hmm, I had a look at the code again and think we could fix the issue
> >>>>cleanly (i.e. all possible users of s_umount) as follows: The lock
> >>>>ordering will be
> >>>>s_umount ->  "fs frozen"
> >>>>and there will be a new mutex s_freeze_mutex protecting changes of
> >>>>s_frozen.
> >>>>
> >>>>freeze_bdev() already observes this lock ordering, it will only take
> >>>>s_freeze_mutex for the changes of s_frozen values. The only other code
> >>>>that is relevant for the lock ordering is thaw_super() (the freezing
> >>>>process is not expected to reenter kernel for the frozen filesystem).
> >>>>In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> >>>>set s_frozen, release s_freeze_mutex and put superblock reference.
> >>>>
> >>>
> >>>>So something like the patch below - it seems to work for me, can you test
> >>>>it please?
> >>>I think your patch looks good, so, the original problem seems to be solved.
> >>>OK, I will test your patch.
> >>>This weekend I cannot test it. So, I will reply next week.
> >>I have tested whether Mizuma-san's reproducer can cause to deadlock with your
> >>patch. And then any problems didn't hit while the reproducer was running.
> >>
> >>I think your patch solves the original deadlock problem which is reported by
> >>Mizuma-san.
> >   Good. Thanks.
> >
> >>>Reported-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
> >>>Signed-off-by: Jan Kara<jack@suse.cz>
> >>>---
> >>>fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
> >>>include/linux/fs.h |    1 +
> >>>2 files changed, 35 insertions(+), 6 deletions(-)
> >>
> 
> >>However, I think a write which causes the deadlock is from mmapped dirty
> >>pages. So, I guess we also need to fix in the mmap path while fsfreezing.
> >   Why? If you dirty a page, writeback thread can come and try to write it -
> >which blocks - but now that does not matter...
> I have not understood the code around writeback thread very much...
> Please explain me the concrete function name which blocks some writes?
  It would block in ext4_da_writepages() function.

> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> The original problem happens after the fsfreeze operation is done.
> I understand the normal write operation (not mmap) can be blocked while
> fsfreezing. So, I guess we don't always block all the write operation
> while fsfreezing.
  Technically speaking, we block all the transaction starts, which means we
end up blocking all the writes from going to disk. But that does not mean
we block all the writes from going to the in-memory cache - as you correctly
note, the mmap case is one such exception.
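
The difference between the two write paths can be sketched in a small
userspace model in C. This is only an illustration under assumptions:
all toy_* names and the mkwrite_checks_frozen flag are hypothetical, and
a return value of false stands for "the real code would sleep here". The
flag models the check in the page fault handler that was suggested
earlier in the thread.

```c
#include <stdbool.h>

/* Toy model only: every toy_* name is hypothetical, not a kernel symbol. */
struct toy_fs {
	bool frozen;
	bool mkwrite_checks_frozen;	/* models a frozen check in the fault path */
	int dirty_pages;		/* pages dirty in the in-memory cache */
	int pages_on_disk;		/* pages that reached the device */
};

/* write(2)-like path: refused (real code would sleep) while frozen. */
bool toy_write(struct toy_fs *fs)
{
	if (fs->frozen)
		return false;
	fs->dirty_pages++;
	return true;
}

/* Store through an existing mapping: blocked only if the fault path checks. */
bool toy_mmap_store(struct toy_fs *fs)
{
	if (fs->frozen && fs->mkwrite_checks_frozen)
		return false;
	fs->dirty_pages++;
	return true;
}

/* Writeback needs a transaction start, which is refused while frozen.
 * Returns false where the flusher thread would block. */
bool toy_writepages(struct toy_fs *fs)
{
	if (fs->dirty_pages == 0)
		return true;		/* nothing to do, nothing to block on */
	if (fs->frozen)
		return false;		/* transaction start would sleep here */
	fs->pages_on_disk += fs->dirty_pages;
	fs->dirty_pages = 0;
	return true;
}
```

With the flag clear, a store through the mapping dirties a page on a
frozen filesystem and the next toy_writepages() call blocks, although
nothing reaches the disk. With the flag set, no page becomes dirty and
writeback has nothing to wait for.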

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06  5:40                     ` Dave Chinner
@ 2011-04-06  6:18                       ` Jan Kara
  2011-04-06 11:21                         ` Dave Chinner
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-04-06  6:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> > > > On Thu, 17 Feb 2011 11:45:52 +0100
> > > > Jan Kara <jack@suse.cz> wrote:
> > > > > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > > > > (2011/02/16 23:56), Jan Kara wrote:
> > > > > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > > > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > > > > >>Jan Kara<jack@suse.cz>  wrote:
> > > > > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > > > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > > > > >>>>>describe above.
> > > > > > >>>>
> > > > > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > > > > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > > > > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > > > > >>>>it could do just as well by taking a read lock), but the second is to
> > > > > > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > > > > > >>>>fundamental problem here.
> > > > > > >>>>
> > > > > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > > > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > > > > >>>>second take a read lock on the s_umount.
> > > > > > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > > > > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > > > > >>>s_umount for writing - a situation like:
> > > > > > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > > > > > >>>down_read(&sb->s_umount)
> > > > > > >>>   block on s_frozen
> > > > > > >>>				down_write(&sb->s_umount)
> > > > > > >>>				  -blocked
> > > > > > >>>								down_read(&sb->s_umount)
> > > > > > >>>								  -blocked
> > > > > > >>>behind the write access...
> > > > > > >>>
> > > > > > >>>The only working solution I see is to check for frozen filesystem before
> > > > > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > > > > >>>we did so in some well described wrapper).
> > > > > > >>I created the patch that you imagine yesterday.
> > > > > > >>
> > > > > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > > > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > > > > >>
> > > > > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > > > > >>after 12 hours passed.
> > > > > > >>
> > > > > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > > > > >>---
> > > > > > >>  fs/fs-writeback.c |    2 +-
> > > > > > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > > >>
> > > > > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > > > >>index 59c6e49..1c9a05e 100644
> > > > > > >>--- a/fs/fs-writeback.c
> > > > > > >>+++ b/fs/fs-writeback.c
> > > > > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > > > > >>         spin_unlock(&sb_lock);
> > > > > > >>
> > > > > > >>         if (down_read_trylock(&sb->s_umount)) {
> > > > > > >>-               if (sb->s_root)
> > > > > > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > > > > > >>                         return true;
> > > > > > >>                 up_read(&sb->s_umount);
> > > > > > 
> > > > > > >   So this is something along the lines I thought but it actually won't work
> > > > > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > > > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > > > > >there are not other places that try to do IO while holding s_umount
> > > > > > >semaphore...
> > > > > > OK. I understand.
> > > > > > 
> > > > > > This code only fixes the case for the following path:
> > > > > > writeback_inodes_wb
> > > > > > -> ext4_da_writepages
> > > > > >    -> ext4_journal_start_sb
> > > > > >       -> vfs_check_frozen
> > > > > > But, the code doesn't fix the other cases.
> > > > > > 
> > > > > > We must modify the local filesystem part in order to fix all cases...?
> > > > >   Yes, possibly. But most importantly we should first find clear locking
> > > > > rules for frozen filesystem that avoid deadlocks like the one above. And
> > > > > the freezing / unfreezing code might become subtle for that reason, that's
> > > > > fine, but it would be really good to avoid any complicated things for the
> > > > > code in the rest of the VFS / filesystems.
> > > > I have deeply continued to examined the root cause of this problem, then 
> > > > I found it.
> > > > 
> > > > It is that we can write a memory which is mmaped to a file. Then the memory 
> > > > becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> > > > "writeback" the memory. 
> > > 
> > > Then surely the issue is that .page_mkwrite is not checking that the
> > > filesystem is frozen before allowing the page fault to continue and
> > > dirty the page?
> >   And is this a bug? That isn't clear to me...
> 
> Given the semantics of a frozen filesystem, letting any object be
> dirtied while frozen (be it an inode, a page, a metadata block, etc)
> is definitely a bug.
>
> The way the freeze code is architected is that incoming dirtying
> events are prevented so that the writeback side does not need to
> care about the frozen state of the filesystem at all. The freeze
> operation is supposed to block new dirtiers, then flush all dirty
> objects resulting in everything being clean in the filesystem.
> 
> Hence if no objects are being dirtied, then there should never be
> any need to block writeback threads due to the filesytem being
> frozen because, by definition, there should be no work for them to
> do. Hence if objects are being dirtied while the filesystem is
> frozen, then that is a bug.
  OK, after some thought I start to agree with you that it would be nice
if we didn't allow the pages to be dirtied in the first place. Otherwise
things get a bit fragile as writing a data block does *not* need a
transaction start as such (we just happen to do it in all code paths)...

> > > > I think the best approach to fix this problem is to let users not to write
> > > > memory which is mapped to a certain file while the filesystem is freezing. 
> > > > However, it is very difficult to control users not to write memory which has 
> > > > been already mapped to the file.
> > > 
> > > If you don't allow the page to be dirtied in the fist place, then
> > > nothing needs to be done to the writeback path because there is
> > > nothing dirty for it to write back.
> >   Sure but that's only the problem he was able to hit. But generally,
> > there's a problem with needing s_umount for unfreezing because it isn't
> > clear there aren't other code paths which can block with s_umount held
> > waiting for fs to get unfrozen. And these code paths would cause the same
> > deadlock. That's why I chose to get rid of s_umount during thawing.
> 
> Holding the s_umount lock while checking if frozen and sleeping
> is essentially an ABBA lock inversion bug that can bite in many more
> places that just thawing the filesystem.  Any where this is done should
> be fixed, so I don't think just removing the s_umount lock from the thaw
> path is sufficient to avoid problems.
  That's easily said but hard to do - any transaction start in ext3/4 may
block on filesystem being frozen (this seems to be similar for XFS as I'm
looking into the code) and transaction start traditionally nests inside
s_umount (and basically there's no way around that since sync() calls your
fs code with s_umount held). So I'm afraid we are not going to get rid of
this ABBA dependency unless we declare that s_umount ranks above filesystem
being frozen - but surely I'm open to suggestions.

Another possibility is just to hide the problem e.g. by checking for frozen
filesystem whenever we try to get s_umount. But that looks a bit ugly to
me.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06  5:57                           ` Jan Kara
@ 2011-04-06  7:40                             ` Toshiyuki Okajima
  2011-04-06 17:46                               ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-04-06  7:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel, sandeen

Hi.

(2011/04/06 14:57), Jan Kara wrote:
> On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
>> (2011/04/06 7:54), Jan Kara wrote:
>>> On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
>>>> (2011/03/31 21:03), Toshiyuki Okajima wrote:
>>>>> Hi, thanks for your reviewing.
>>>>>
>>>>> (2011/03/30 23:12), Jan Kara wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>>>>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>>>>>> Jan Kara<jack@suse.cz>   wrote:
>>>>>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>>>>>> Jan Kara<jack@suse.cz>   wrote:
>>>>>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>> <SNIP>
>>>>>>> I have deeply continued to examined the root cause of this problem, then
>>>>>>> I found it.
>>>>>>>
>>>>>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>>>>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>>>>>> "writeback" the memory.
>>>>>>>
>>>>>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>>>>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>>>>>> use the other filesystem, you can write something to the filesystem though
>>>>>>> you have freezed the filesystem.
>>>>>
>>>>>> Well, you can write something only in the caches, not to the on disk
>>>>>> image. So it's not a problem as such.
>>>>> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
>>>>> we can write in not only the caches but also the loopback device. However,
>>>>> I don't still confirm that we can write to the real device(/dev/sdaX).
>>>>>
>>>>>>
>>>>>>> A sample problem is attached on this mail. Try to execute it then you can
>>>>>>> confirm that we can write some data to your filesystem while freezing the
>>>>>>> filesystem.
>>>>>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>>>>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>>>>>
>>>>>>> I think the best approach to fix this problem is to let users not to write
>>>>>>> memory which is mapped to a certain file while the filesystem is freezing.
>>>>>>> However, it is very difficult to control users not to write memory which has
>>>>>>> been already mapped to the file.
>>>>>> It is actually possible. In case of ext4, you could add a check (+ wait)
>>>>>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>>>>>> of being frozen and if so, wait for it to get unfrozen. The only tough
>>>>>> problem here might be the locking as ext4_page_mkwrite() is called with
>>>>>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>>>>>> But you'd have to fix all filesystems (and all paths possibly creating
>>>>>> dirty data) in this way.
>>>>>>
>>>>>
>>>>>>> Therefore, I think there is only actual method that we stop writeback thread
>>>>>>> to resolve the mmap problem. Also, by this fix, the original problem
>>>>>>> (ext4 delayed write vs unfreeze) can be solved.
>>>>>> Hmm, I had a look at the code again and think we could fix the issue
>>>>>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>>>>>> ordering will be
>>>>>> s_umount ->   "fs frozen"
>>>>>> and there will be a new mutex s_freeze_mutex protecting changes of
>>>>>> s_frozen.
>>>>>>
>>>>>> freeze_bdev() already observes this lock ordering, it will only take
>>>>>> s_freeze_mutex for the changes of s_frozen values. The only other code
>>>>>> that is relevant for the lock ordering is thaw_super() (the freezing
>>>>>> process is not expected to reenter kernel for the frozen filesystem).
>>>>>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>>>>>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>>>>>
>>>>>
>>>>>> So something like the patch below - it seems to work for me, can you test
>>>>>> it please?
>>>>> I think your patch looks good, so, the original problem seems to be solved.
>>>>> OK, I will test your patch.
>>>>> This weekend I cannot test it. So, I will reply next week.
>>>> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
>>>> patch. And then any problems didn't hit while the reproducer was running.
>>>>
>>>> I think your patch solves the original deadlock problem which is reported by
>>>> Mizuma-san.
>>>    Good. Thanks.
>>>
>>>>> Reported-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
>>>>> Signed-off-by: Jan Kara<jack@suse.cz>
>>>>> ---
>>>>> fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
>>>>> include/linux/fs.h |    1 +
>>>>> 2 files changed, 35 insertions(+), 6 deletions(-)
>>>>
>>
>>>> However, I think a write which causes the deadlock is from mmapped dirty
>>>> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
>>>    Why? If you dirty a page, writeback thread can come and try to write it -
>>> which blocks - but now that does not matter...

>> I have not understood the code around writeback thread very much...
>> Please explain me the concrete function name which blocks some writes?
>    It would block in ext4_da_writepages() function.
In the ext4 delayed allocation case, I understand that it blocks.
(The original deadlock is exactly this case.)
But for ext4 without delayed allocation, or for other filesystems, which
function blocks the write?

>
>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>> The original problem happens after the fsfreeze operation is done.
>> I understand the normal write operation (not mmap) can be blocked while
>> fsfreezing. So, I guess we don't always block all the write operation
>> while fsfreezing.
>    Technically speaking, we block all the transaction starts which means we
> end up blocking all the writes from going to disk. But that does not mean
> we block all the writes from going to in-memory cache - as you properly
> note the mmap case is one of such exceptions.
Hm, I also think we can allow writes to the in-memory cache, but we must not
allow writes to disk while the filesystem is frozen. I suspect that the mmap
path can write to disk while frozen, because this deadlock happens after the
fsfreeze operation has completed...

Thanks,
Toshiyuki Okajima


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06  6:18                       ` Jan Kara
@ 2011-04-06 11:21                         ` Dave Chinner
  2011-04-06 13:44                           ` Christoph Hellwig
                                             ` (2 more replies)
  0 siblings, 3 replies; 121+ messages in thread
From: Dave Chinner @ 2011-04-06 11:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > If you don't allow the page to be dirtied in the fist place, then
> > > > nothing needs to be done to the writeback path because there is
> > > > nothing dirty for it to write back.
> > >   Sure but that's only the problem he was able to hit. But generally,
> > > there's a problem with needing s_umount for unfreezing because it isn't
> > > clear there aren't other code paths which can block with s_umount held
> > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > 
> > Holding the s_umount lock while checking if frozen and sleeping
> > is essentially an ABBA lock inversion bug that can bite in many more
> > places that just thawing the filesystem.  Any where this is done should
> > be fixed, so I don't think just removing the s_umount lock from the thaw
> > path is sufficient to avoid problems.
>   That's easily said but hard to do - any transaction start in ext3/4 may
> block on filesystem being frozen (this seems to be similar for XFS as I'm
> looking into the code) and transaction start traditionally nests inside
> s_umount (and basically there's no way around that since sync() calls your
> fs code with s_umount held).

Sure, but the question must be asked - why is ext3/4 even starting a
transaction on a clean filesystem during sync? A frozen filesystem,
by definition, is a clean filesytem, and therefore sync calls of any
kind should not be trying to write to the FS or start transactions.
XFS does this just fine, so I'd consider such behaviour on a frozen
filesystem a bug in ext3/4...

> So I'm afraid we are not going to get rid of
> this ABBA dependency unless we declare that s_umount ranks above filesystem
> being frozen - but surely I'm open to suggestions.

Not sure I understand what you are saying there - this is already
the case, isn't it? i.e. it has to be held exclusive to freeze a
filesystem...

> Another possibility is just to hide the problem e.g. by checking for frozen
> filesystem whenever we try to get s_umount. But that looks a bit ugly to
> me.

And not necessary, AFAICT.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06 11:21                         ` Dave Chinner
@ 2011-04-06 13:44                           ` Christoph Hellwig
  2011-04-06 22:59                             ` Dave Chinner
  2011-04-06 17:40                           ` Jan Kara
  2011-05-02  9:07                           ` Surbhi Palande
  2 siblings, 1 reply; 121+ messages in thread
From: Christoph Hellwig @ 2011-04-06 13:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Wed, Apr 06, 2011 at 09:21:35PM +1000, Dave Chinner wrote:
> Sure, but the question must be asked - why is ext3/4 even starting a
> transaction on a clean filesystem during sync? A frozen filesystem,
> by definition, is a clean filesytem, and therefore sync calls of any
> kind should not be trying to write to the FS or start transactions.
> XFS does this just fine, so I'd consider such behaviour on a frozen
> filesystem a bug in ext3/4...

XFS does have one special case for this.  When writing the dummy log
record at the end of the freeze process we use _xfs_alloc_trans to
bypass the frozen filesystem check as we have to write out this record
when the filesystem already is frozen.  But that's after the main
sync with its normal transactions.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06 11:21                         ` Dave Chinner
  2011-04-06 13:44                           ` Christoph Hellwig
@ 2011-04-06 17:40                           ` Jan Kara
  2011-04-06 22:54                             ` Dave Chinner
  2011-05-02  9:07                           ` Surbhi Palande
  2 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-04-06 17:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Wed 06-04-11 21:21:35, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> > On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > > If you don't allow the page to be dirtied in the fist place, then
> > > > > nothing needs to be done to the writeback path because there is
> > > > > nothing dirty for it to write back.
> > > >   Sure but that's only the problem he was able to hit. But generally,
> > > > there's a problem with needing s_umount for unfreezing because it isn't
> > > > clear there aren't other code paths which can block with s_umount held
> > > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > > 
> > > Holding the s_umount lock while checking if frozen and sleeping
> > > is essentially an ABBA lock inversion bug that can bite in many more
> > > places that just thawing the filesystem.  Any where this is done should
> > > be fixed, so I don't think just removing the s_umount lock from the thaw
> > > path is sufficient to avoid problems.
> >   That's easily said but hard to do - any transaction start in ext3/4 may
> > block on filesystem being frozen (this seems to be similar for XFS as I'm
> > looking into the code) and transaction start traditionally nests inside
> > s_umount (and basically there's no way around that since sync() calls your
> > fs code with s_umount held).
> 
> Sure, but the question must be asked - why is ext3/4 even starting a
> transaction on a clean filesystem during sync? A frozen filesystem,
> by definition, is a clean filesystem, and therefore sync calls of any
> kind should not be trying to write to the FS or start transactions.
> XFS does this just fine, so I'd consider such behaviour on a frozen
> filesystem a bug in ext3/4...
  But by this you are essentially agreeing that the lock inversion is there
in principle. We just hide it by relying on the fact that no code path
trying to change anything with s_umount held (which is the right lock
ordering) gets called while the fs is frozen.  And that is fragile.
Actually, I've looked for a while and if you call quotactl(), it will get
s_umount and then tell filesystem to update quota information which blocks
inside the fs waiting for filesystem being unfrozen => deadlock. We can
change this code path to wait for frozen filesystem before taking s_umount
but that essentially just restates my point - it's fragile and IMHO we
need some more consistent way to handle this...

> > So I'm afraid we are not going to get rid of
> > this ABBA dependency unless we declare that s_umount ranks above filesystem
> > being frozen - but surely I'm open to suggestions.
> 
> Not sure I understand what you are saying there - this is already
> the case, isn't it? i.e. it has to be held exclusive to freeze a
> filesystem...
  Not really. We freeze the fs under s_umount but freezing essentially
implements trylock semantics while setting s_frozen so that does not really
establish any lock dependency. What establishes lock dependency is the
thawing path which blocks on s_umount while the filesystem is still frozen.
And this dependency is the other way around - i.e., freezing above
s_umount. This is why I was messing with thawing code to fix this...
 
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06  7:40                             ` Toshiyuki Okajima
@ 2011-04-06 17:46                               ` Jan Kara
  2011-04-15 13:39                                 ` Toshiyuki Okajima
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-04-06 17:46 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

  Hello,

On Wed 06-04-11 16:40:15, Toshiyuki Okajima wrote:
> (2011/04/06 14:57), Jan Kara wrote:
> >On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
> >>(2011/04/06 7:54), Jan Kara wrote:
> >>>On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
> >>>>(2011/03/31 21:03), Toshiyuki Okajima wrote:
> >>>>>Hi, thanks for your reviewing.
> >>>>>
> >>>>>(2011/03/30 23:12), Jan Kara wrote:
> >>>>>>Hello,
> >>>>>>
> >>>>>>On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> >>>>>>>On Thu, 17 Feb 2011 11:45:52 +0100
> >>>>>>>Jan Kara<jack@suse.cz>   wrote:
> >>>>>>>>On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> >>>>>>>>>(2011/02/16 23:56), Jan Kara wrote:
> >>>>>>>>>>On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>>>>>>>>>>On Tue, 15 Feb 2011 18:29:54 +0100
> >>>>>>>>>>>Jan Kara<jack@suse.cz>   wrote:
> >>>>>>>>>>>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>>>>>>>>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> >>>>><SNIP>
> >>>>>>>I have deeply continued to examined the root cause of this problem, then
> >>>>>>>I found it.
> >>>>>>>
> >>>>>>>It is that we can write a memory which is mmaped to a file. Then the memory
> >>>>>>>becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> >>>>>>>"writeback" the memory.
> >>>>>>>
> >>>>>>>Therefore, the root cause of this hangup is not only ext4 component (with
> >>>>>>>delayed allocation feature) but also writeback mechanism for mmap. If you
> >>>>>>>use the other filesystem, you can write something to the filesystem though
> >>>>>>>you have freezed the filesystem.
> >>>>>
> >>>>>>Well, you can write something only in the caches, not to the on disk
> >>>>>>image. So it's not a problem as such.
> >>>>>My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> >>>>>we can write in not only the caches but also the loopback device. However,
> >>>>>I don't still confirm that we can write to the real device(/dev/sdaX).
> >>>>>
> >>>>>>
> >>>>>>>A sample problem is attached on this mail. Try to execute it then you can
> >>>>>>>confirm that we can write some data to your filesystem while freezing the
> >>>>>>>filesystem.
> >>>>>>>(If you change FS variable in go.sh from ext3 to ext4 and you execute
> >>>>>>>"fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
> >>>>>>>
> >>>>>>>I think the best approach to fix this problem is to let users not to write
> >>>>>>>memory which is mapped to a certain file while the filesystem is freezing.
> >>>>>>>However, it is very difficult to control users not to write memory which has
> >>>>>>>been already mapped to the file.
> >>>>>>It is actually possible. In case of ext4, you could add a check (+ wait)
> >>>>>>in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> >>>>>>of being frozen and if so, wait for it to get unfrozen. The only tough
> >>>>>>problem here might be the locking as ext4_page_mkwrite() is called with
> >>>>>>mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> >>>>>>But you'd have to fix all filesystems (and all paths possibly creating
> >>>>>>dirty data) in this way.
> >>>>>>
> >>>>>
> >>>>>>>Therefore, I think there is only actual method that we stop writeback thread
> >>>>>>>to resolve the mmap problem. Also, by this fix, the original problem
> >>>>>>>(ext4 delayed write vs unfreeze) can be solved.
> >>>>>>Hmm, I had a look at the code again and think we could fix the issue
> >>>>>>cleanly (i.e. all possible users of s_umount) as follows: The lock
> >>>>>>ordering will be
> >>>>>>s_umount ->   "fs frozen"
> >>>>>>and there will be a new mutex s_freeze_mutex protecting changes of
> >>>>>>s_frozen.
> >>>>>>
> >>>>>>freeze_bdev() already observes this lock ordering, it will only take
> >>>>>>s_freeze_mutex for the changes of s_frozen values. The only other code
> >>>>>>that is relevant for the lock ordering is thaw_super() (the freezing
> >>>>>>process is not expected to reenter kernel for the frozen filesystem).
> >>>>>>In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> >>>>>>set s_frozen, release s_freeze_mutex and put superblock reference.
> >>>>>>
> >>>>>
> >>>>>>So something like the patch below - it seems to work for me, can you test
> >>>>>>it please?
> >>>>>I think your patch looks good, so, the original problem seems to be solved.
> >>>>>OK, I will test your patch.
> >>>>>This weekend I cannot test it. So, I will reply next week.
> >>>>I have tested whether Mizuma-san's reproducer can cause to deadlock with your
> >>>>patch. And then any problems didn't hit while the reproducer was running.
> >>>>
> >>>>I think your patch solves the original deadlock problem which is reported by
> >>>>Mizuma-san.
> >>>   Good. Thanks.
> >>>
> >>>>>Reported-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
> >>>>>Signed-off-by: Jan Kara<jack@suse.cz>
> >>>>>---
> >>>>>fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
> >>>>>include/linux/fs.h |    1 +
> >>>>>2 files changed, 35 insertions(+), 6 deletions(-)
> >>>>
> >>
> >>>>However, I think a write which causes the deadlock is from mmapped dirty
> >>>>pages. So, I guess we also need to fix in the mmap path while fsfreezing.
> >>>   Why? If you dirty a page, writeback thread can come and try to write it -
> >>>which blocks - but now that does not matter...
> 
> >>I have not understood the code around writeback thread very much...
> >>Please explain me the concrete function name which blocks some writes?
> >   It would block in ext4_da_writepages() function.
> In ext4 with delayed allocation case, I understand it blocks.
> (Original deadlock problem is just this case.)
> But in ext4 without delayed allocation or other filesystems case, which function
> can block writing?
  For ext3 or ext4 without delayed allocation we block inside writepage()
function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
get modified to block while minor-faulting the page on frozen fs because
when blocks are already allocated we may skip starting a transaction and so
we could possibly modify the filesystem.

> >>Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>The original problem happens after the fsfreeze operation is done.
> >>I understand the normal write operation (not mmap) can be blocked while
> >>fsfreezing. So, I guess we don't always block all the write operation
> >>while fsfreezing.
> >   Technically speaking, we block all the transaction starts which means we
> >end up blocking all the writes from going to disk. But that does not mean
> >we block all the writes from going to in-memory cache - as you properly
> >note the mmap case is one of such exceptions.
> Hm, I also think we can allow the writes to in-memory cache but we can't allow
> the writes to disk while fsfreezing. I am considering that mmap path can
> write to disk while fsfreezing because this deadlock problem happens after
> fsfreeze operation is done...
  I'm sorry I don't understand now - are you speaking about the case above
when writepage() does not wait for filesystem being frozen or something
else?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06 17:40                           ` Jan Kara
@ 2011-04-06 22:54                             ` Dave Chinner
  2011-04-08 21:33                               ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Dave Chinner @ 2011-04-06 22:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Wed, Apr 06, 2011 at 07:40:01PM +0200, Jan Kara wrote:
> On Wed 06-04-11 21:21:35, Dave Chinner wrote:
> > On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> > > On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > > > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > > > If you don't allow the page to be dirtied in the first place, then
> > > > > > nothing needs to be done to the writeback path because there is
> > > > > > nothing dirty for it to write back.
> > > > >   Sure but that's only the problem he was able to hit. But generally,
> > > > > there's a problem with needing s_umount for unfreezing because it isn't
> > > > > clear there aren't other code paths which can block with s_umount held
> > > > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > > > 
> > > > Holding the s_umount lock while checking if frozen and sleeping
> > > > is essentially an ABBA lock inversion bug that can bite in many more
> > > > places than just thawing the filesystem.  Anywhere this is done should
> > > > be fixed, so I don't think just removing the s_umount lock from the thaw
> > > > path is sufficient to avoid problems.
> > >   That's easily said but hard to do - any transaction start in ext3/4 may
> > > block on filesystem being frozen (this seems to be similar for XFS as I'm
> > > looking into the code) and transaction start traditionally nests inside
> > > s_umount (and basically there's no way around that since sync() calls your
> > > fs code with s_umount held).
> > 
> > Sure, but the question must be asked - why is ext3/4 even starting a
> > transaction on a clean filesystem during sync? A frozen filesystem,
> > by definition, is a clean filesystem, and therefore sync calls of any
> > kind should not be trying to write to the FS or start transactions.
> > XFS does this just fine, so I'd consider such behaviour on a frozen
> > filesystem a bug in ext3/4...
>   But by this you are essentially agreeing that the lock inversion is there
> in principle. We just hide it by relying on the fact that no code path
> trying to change anything with s_umount held (which is the right lock
> ordering) gets called while the fs is frozen.  And that is fragile.

It's just another lock ordering rule. i.e. don't sleep on a frozen
filesystem with s_umount held.  It's no more fragile than the many
other lock ordering rules we have.

> Actually, I've looked for a while and if you call quotactl(), it will get
> s_umount and then tell filesystem to update quota information which blocks
> inside the fs waiting for filesystem being unfrozen => deadlock.

Which is a bug according to the above locking rule.

> We can
> change this code path to wait for frozen filesystem before taking s_umount
> but that essentially just restates my point - it's fragile and IMHO we
> need some more consistent way to handle this...
> 
> > > So I'm afraid we are not going to get rid of
> > > this ABBA dependency unless we declare that s_umount ranks above filesystem
> > > being frozen - but surely I'm open to suggestions.
> > 
> > Not sure I understand what you are saying there - this is already
> > the case, isn't it? i.e. it has to be held exclusive to freeze a
> > filesystem...
>   Not really. We freeze the fs under s_umount but freezing essentially
> implements trylock semantics while setting s_frozen so that does not really
> establish any lock dependency. What establishes lock dependency is the
> thawing path which blocks on s_umount while the filesystem is still frozen.
> And this dependency is the other way around - i.e., freezing above
> s_umount. This is why I was messing with thawing code to fix this...

It's just the tip of the iceberg. If we allow s_umount to be held
while waiting on a frozen filesystem, we open ourselves up to all
manner of problems. Such as umount hanging on a frozen fs,
(which means a shutdown with a frozen filesystem will hang), it can
hang sync, it can hang memory reclaim, it can hang in any path that
takes s_umount and hence do all sorts of bad things.

Yes, unthawing the filesystem will get things moving again with your
patch, but my point is that it simply does not address the problems
caused by the bad behaviour that has already occurred while the FS
is frozen. Fixing the thaw code in this way is like shooting the
messenger - it doesn't fix the problems being reported.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06 13:44                           ` Christoph Hellwig
@ 2011-04-06 22:59                             ` Dave Chinner
  0 siblings, 0 replies; 121+ messages in thread
From: Dave Chinner @ 2011-04-06 22:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Wed, Apr 06, 2011 at 09:44:28AM -0400, Christoph Hellwig wrote:
> On Wed, Apr 06, 2011 at 09:21:35PM +1000, Dave Chinner wrote:
> > Sure, but the question must be asked - why is ext3/4 even starting a
> > transaction on a clean filesystem during sync? A frozen filesystem,
> > by definition, is a clean filesystem, and therefore sync calls of any
> > kind should not be trying to write to the FS or start transactions.
> > XFS does this just fine, so I'd consider such behaviour on a frozen
> > filesystem a bug in ext3/4...
> 
> XFS does have one special case for this.  When writing the dummy log
> record at the end of the freeze process we use _xfs_alloc_trans to
> bypass the frozen filesystem check as we have to write out this record
> when the filesystem already is frozen.  But that's after the main
> sync with its normal transactions.

Right, that is a special case in the _freeze process_ (i.e. before
we've declared the FS frozen), not a normal operation on a frozen
filesystem.

If you want to list exceptions (i.e. where we explicitly avoid
writes to frozen fs), look for xfs_fs_writeable(), which stops
various write operations from proceeding when the fs is either
frozen, read-only or shut down.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06 22:54                             ` Dave Chinner
@ 2011-04-08 21:33                               ` Jan Kara
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Kara @ 2011-04-08 21:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Thu 07-04-11 08:54:01, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 07:40:01PM +0200, Jan Kara wrote:
> > On Wed 06-04-11 21:21:35, Dave Chinner wrote:
> > > On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> > > > On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > > > > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > > > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > > > > If you don't allow the page to be dirtied in the first place, then
> > > > > > > nothing needs to be done to the writeback path because there is
> > > > > > > nothing dirty for it to write back.
> > > > > >   Sure but that's only the problem he was able to hit. But generally,
> > > > > > there's a problem with needing s_umount for unfreezing because it isn't
> > > > > > clear there aren't other code paths which can block with s_umount held
> > > > > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > > > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > > > > 
> > > > > Holding the s_umount lock while checking if frozen and sleeping
> > > > > is essentially an ABBA lock inversion bug that can bite in many more
> > > > > places than just thawing the filesystem.  Anywhere this is done should
> > > > > be fixed, so I don't think just removing the s_umount lock from the thaw
> > > > > path is sufficient to avoid problems.
> > > >   That's easily said but hard to do - any transaction start in ext3/4 may
> > > > block on filesystem being frozen (this seems to be similar for XFS as I'm
> > > > looking into the code) and transaction start traditionally nests inside
> > > > s_umount (and basically there's no way around that since sync() calls your
> > > > fs code with s_umount held).
> > > 
> > > Sure, but the question must be asked - why is ext3/4 even starting a
> > > transaction on a clean filesystem during sync? A frozen filesystem,
> > > by definition, is a clean filesystem, and therefore sync calls of any
> > > kind should not be trying to write to the FS or start transactions.
> > > XFS does this just fine, so I'd consider such behaviour on a frozen
> > > filesystem a bug in ext3/4...
> >   But by this you are essentially agreeing that the lock inversion is there
> > in principle. We just hide it by relying on the fact that no code path
> > trying to change anything with s_umount held (which is the right lock
> > ordering) gets called while the fs is frozen.  And that is fragile.
> 
> It's just another lock ordering rule. i.e. don't sleep on a frozen
> filesystem with s_umount held.  It's no more fragile than the many
> other lock ordering rules we have.
  Except that for all the filesystems transaction start => sleep on a
frozen filesystem and in some code paths we have s_umount held while doing
a transaction start. So I don't buy the argument that it's just another
normal locking rule because normally we require that all the code paths
follow correct lock ordering. Now we have some paths (like sync) which do
not follow the correct lock ordering and we just make sure they are not
called if they could cause deadlocks by other means...

> > Actually, I've looked for a while and if you call quotactl(), it will get
> > s_umount and then tell filesystem to update quota information which blocks
> > inside the fs waiting for filesystem being unfrozen => deadlock.
> 
> Which is a bug according to the above locking rule.
  Yes, I was just trying to demonstrate that the locking rule changes
"block until the fs is unfrozen" into "kernel is deadlocked" in
unexpected places... fsync_bdev() is another case which currently
deadlocks.

> > We can
> > change this code path to wait for frozen filesystem before taking s_umount
> > but that essentially just restates my point - it's fragile and IMHO we
> > need some more consistent way to handle this...
> > 
> > > > So I'm afraid we are not going to get rid of
> > > > this ABBA dependency unless we declare that s_umount ranks above filesystem
> > > > being frozen - but surely I'm open to suggestions.
> > > 
> > > Not sure I understand what you are saying there - this is already
> > > the case, isn't it? i.e. it has to be held exclusive to freeze a
> > > filesystem...
> >   Not really. We freeze the fs under s_umount but freezing essentially
> > implements trylock semantics while setting s_frozen so that does not really
> > establish any lock dependency. What establishes lock dependency is the
> > thawing path which blocks on s_umount while the filesystem is still frozen.
> > And this dependency is the other way around - i.e., freezing above
> > s_umount. This is why I was messing with thawing code to fix this...
> 
> It's just the tip of the iceberg. If we allow s_umount to be held
> while waiting on a frozen filesystem, we open ourselves up to all
> manner of problems. Such as umount hanging on a frozen fs,
> (which means a shutdown with a frozen filesystem will hang), it can
> hang sync, it can hang memory reclaim, it can hang in any path that
> takes s_umount and hence do all sorts of bad things.
  I see. The umount hang (especially in the shutdown case) is not nice.
Direct reclaim won't be blocked AFAICS if we stop dirtying pages while the
fs is frozen (which, as I already wrote, I agree is not a good thing to do
after some thought). Since you can block while accessing the frozen
filesystem anyway (because of atime updates or just because of writing
process waiting with i_mutex held for fs to be unfrozen) I'm not sure how
much worse it would be if s_umount lock would be another lock with which
we can wait for fs to get unfrozen...

> Yes, unthawing the filesystem will get things moving again with your
> patch, but my point is that it simply does not address the problems
> caused by the bad behaviour that has already occurred while the FS
> is frozen. Fixing the thaw code in this way is like shooting the
> messenger - it doesn't fix the problems being reported.
  I don't think there has been any really bad behavior - you tried to access a frozen
filesystem and you got blocked. But OK, I'll invest some more thought into
how to not block with s_umount held without sprinkling frozen checks over
the tree...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06 17:46                               ` Jan Kara
@ 2011-04-15 13:39                                 ` Toshiyuki Okajima
  2011-04-15 17:13                                   ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-04-15 13:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: toshi.okajima, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

Hi, sorry for my late response.

(2011/04/07 2:46), Jan Kara wrote:
>    Hello,
>
> On Wed 06-04-11 16:40:15, Toshiyuki Okajima wrote:
>> (2011/04/06 14:57), Jan Kara wrote:
>>> On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
>>>> (2011/04/06 7:54), Jan Kara wrote:
>>>>> On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
>>>>>> (2011/03/31 21:03), Toshiyuki Okajima wrote:
>>>>>>> Hi, thanks for your reviewing.
>>>>>>>
>>>>>>> (2011/03/30 23:12), Jan Kara wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>>>>>>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>>>>>>>> Jan Kara<jack@suse.cz>    wrote:
>>>>>>>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>>>>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>>>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>>>>>>>> Jan Kara<jack@suse.cz>    wrote:
>>>>>>>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>>>> <SNIP>
>>>>>>>>> I have deeply continued to examined the root cause of this problem, then
>>>>>>>>> I found it.
>>>>>>>>>
>>>>>>>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>>>>>>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>>>>>>>> "writeback" the memory.
>>>>>>>>>
>>>>>>>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>>>>>>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>>>>>>>> use the other filesystem, you can write something to the filesystem though
>>>>>>>>> you have freezed the filesystem.
>>>>>>>
>>>>>>>> Well, you can write something only in the caches, not to the on disk
>>>>>>>> image. So it's not a problem as such.
>>>>>>> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
>>>>>>> we can write in not only the caches but also the loopback device. However,
>>>>>>> I don't still confirm that we can write to the real device(/dev/sdaX).
>>>>>>>
>>>>>>>>
>>>>>>>>> A sample problem is attached on this mail. Try to execute it then you can
>>>>>>>>> confirm that we can write some data to your filesystem while freezing the
>>>>>>>>> filesystem.
>>>>>>>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>>>>>>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>>>>>>>
>>>>>>>>> I think the best approach to fix this problem is to let users not to write
>>>>>>>>> memory which is mapped to a certain file while the filesystem is freezing.
>>>>>>>>> However, it is very difficult to control users not to write memory which has
>>>>>>>>> been already mapped to the file.
>>>>>>>> It is actually possible. In case of ext4, you could add a check (+ wait)
>>>>>>>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>>>>>>>> of being frozen and if so, wait for it to get unfrozen. The only tough
>>>>>>>> problem here might be the locking as ext4_page_mkwrite() is called with
>>>>>>>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>>>>>>>> But you'd have to fix all filesystems (and all paths possibly creating
>>>>>>>> dirty data) in this way.
>>>>>>>>
>>>>>>>
>>>>>>>>> Therefore, I think there is only actual method that we stop writeback thread
>>>>>>>>> to resolve the mmap problem. Also, by this fix, the original problem
>>>>>>>>> (ext4 delayed write vs unfreeze) can be solved.
>>>>>>>> Hmm, I had a look at the code again and think we could fix the issue
>>>>>>>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>>>>>>>> ordering will be
>>>>>>>> s_umount ->    "fs frozen"
>>>>>>>> and there will be a new mutex s_freeze_mutex protecting changes of
>>>>>>>> s_frozen.
>>>>>>>>
>>>>>>>> freeze_bdev() already observes this lock ordering, it will only take
>>>>>>>> s_freeze_mutex for the changes of s_frozen values. The only other code
>>>>>>>> that is relevant for the lock ordering is thaw_super() (the freezing
>>>>>>>> process is not expected to reenter kernel for the frozen filesystem).
>>>>>>>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>>>>>>>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>>>>>>>
>>>>>>>
>>>>>>>> So something like the patch below - it seems to work for me, can you test
>>>>>>>> it please?
>>>>>>> I think your patch looks good, so, the original problem seems to be solved.
>>>>>>> OK, I will test your patch.
>>>>>>> This weekend I cannot test it. So, I will reply next week.
>>>>>> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
>>>>>> patch. And then any problems didn't hit while the reproducer was running.
>>>>>>
>>>>>> I think your patch solves the original deadlock problem which is reported by
>>>>>> Mizuma-san.
>>>>>    Good. Thanks.
>>>>>
>>>>>>> Reported-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
>>>>>>> Signed-off-by: Jan Kara<jack@suse.cz>
>>>>>>> ---
>>>>>>> fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
>>>>>>> include/linux/fs.h |    1 +
>>>>>>> 2 files changed, 35 insertions(+), 6 deletions(-)
>>>>>>
>>>>
>>>>>> However, I think a write which causes the deadlock is from mmapped dirty
>>>>>> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
>>>>>    Why? If you dirty a page, writeback thread can come and try to write it -
>>>>> which blocks - but now that does not matter...
>>
>>>> I have not understood the code around writeback thread very much...
>>>> Please explain me the concrete function name which blocks some writes?
>>>    It would block in ext4_da_writepages() function.
>> In ext4 with delayed allocation case, I understand it blocks.
>> (Original deadlock problem is just this case.)
>> But in ext4 without delayed allocation or other filesystems case, which function
>> can block writing?

>    For ext3 or ext4 without delayed allocation we block inside writepage()
> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> get modified to block while minor-faulting the page on frozen fs because
> when blocks are already allocated we may skip starting a transaction and so
> we could possibly modify the filesystem.
OK. I think ->page_mkwrite() should also block writes to minor-faulting pages.

(minor-pagefault)
-> do_wp_page()
    -> page_mkwrite(= ext4_page_mkwrite())
       => BLOCK!

(major-pagefault)
-> do_linear_fault()
    -> page_mkwrite(= ext4_page_mkwrite())
       => BLOCK!

>
>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>> The original problem happens after the fsfreeze operation is done.
>>>> I understand the normal write operation (not mmap) can be blocked while
>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>> while fsfreezing.
>>>    Technically speaking, we block all the transaction starts which means we
>>> end up blocking all the writes from going to disk. But that does not mean
>>> we block all the writes from going to in-memory cache - as you properly
>>> note the mmap case is one of such exceptions.
>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>> the writes to disk while fsfreezing. I am considering that mmap path can
>> write to disk while fsfreezing because this deadlock problem happens after
>> fsfreeze operation is done...
>    I'm sorry I don't understand now - are you speaking about the case above
> when writepage() does not wait for filesystem being frozen or something
> else?
Sorry, I didn't understand the page fault path well.
So I read the kernel source code around it, and I think I understand it now...

I am worried about whether we can update the file data in the mmap case while
the filesystem is frozen. Of course, I understand that we can write to the
in-memory cache, and that is not a problem. However, if we can write to disk
while the filesystem is frozen, it is a problem.
So, I summarize below the cases in which we can or cannot write to disk.

--------------------------------------------------------------------------
Cases (whether data written through the mmap of the file can reach the
disk while the filesystem is frozen)

[1] One of the mmapped pages is not bound, and
  the page is not allocated yet. (major fault?)

    (1) user dirties a page
    (2) a page fault occurs (do_page_fault)
    (3) __do_fault is called.
    (4) ext4_page_mkwrite is called
    (5) ext4_write_begin is called
    (6) ext4_journal_start_sb       => We can STOP!

[2] One of the mmapped pages is not bound, but
  the page is already allocated, and the buffer_heads of the page
  are not mapped (BH_Mapped not set).  (minor fault?)

    (1) user dirties a page
    (2) a page fault occurs (do_page_fault)
    (3) do_wp_page is called.
    (4) ext4_page_mkwrite is called
    (5) ext4_write_begin is called
    (6) ext4_journal_start_sb       => We can STOP!

[3] One of the mmapped pages is not bound, but
  the page is already allocated, and the buffer_heads of the page
  are mapped (BH_Mapped set).  (minor fault?)

    (1) user dirties a page
    (2) a page fault occurs (do_page_fault)
    (3) do_wp_page is called.
    (4) ext4_page_mkwrite is called
    * Cannot block the dirty page from being written because every bh is mapped.
    (5) user munmaps the page (munmap)
    (6) zap_pte_range dirties the page (struct page) whose PTE is dirty (pte_dirty).
    (7) writeback thread writes the page (struct page) to disk
                                            => We cannot STOP!

[4] One of the mmapped pages is bound, and
  the page is already allocated.

    (1) user dirties a page
    ( ) no page fault occurs
    (2) user munmaps the page (munmap)
    (3) zap_pte_range dirties the page (struct page) whose PTE is dirty (pte_dirty).
    (4) writeback thread writes the page (struct page) to disk
                                            => We cannot STOP!
--------------------------------------------------------------------------

So, we can block cases [1] and [2].
But I think we cannot block cases [3] and [4] at present.
If we fix ->page_mkwrite(), we can also block case [3].
But case [4] is still not blocked, because no page fault occurs
when we dirty the mmapped page.

Therefore, to repair this problem, we need to fix cases [3] and [4].
I think we must modify the writeback thread to fix case [4].
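
For readers who want to see the userspace side of the cases above, here is a
minimal sketch in C of a store through a shared file mapping, followed by
msync() and munmap(). It is not Mizuma-san's reproducer and it does not freeze
anything; the function name mmap_store_roundtrip() and the scratch file under
/tmp are assumptions made only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Store msg into a MAP_SHARED mapping of a scratch file, flush and unmap it,
 * then read the bytes back with pread(). Returns 0 if the bytes match, -1 on
 * any error. pread() is served from the page cache, so this checks that the
 * store reached the file's pages, not that it reached the disk.
 */
static int mmap_store_roundtrip(const char *msg)
{
    char path[] = "/tmp/mmap-store-XXXXXX";  /* hypothetical scratch file */
    char readback[128];
    long page = sysconf(_SC_PAGESIZE);
    size_t len;
    char *map;
    int ret = -1;
    int fd;

    assert(msg != NULL);
    len = strlen(msg) + 1;
    if (page <= 0 || len > sizeof(readback) || len > (size_t)page)
        return -1;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file vanishes once fd is closed */

    if (ftruncate(fd, page) != 0)            /* give the file one page */
        goto out_close;

    map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;

    memcpy(map, msg, len);                   /* the first store takes a page fault */

    if (msync(map, (size_t)page, MS_SYNC) != 0) {  /* ask for writeback now */
        munmap(map, (size_t)page);
        goto out_close;
    }
    if (munmap(map, (size_t)page) != 0)
        goto out_close;

    if (pread(fd, readback, len, 0) == (ssize_t)len &&
        memcmp(readback, msg, len) == 0)
        ret = 0;

out_close:
    close(fd);
    return ret;
}
```

Calling mmap_store_roundtrip("hello") on an unfrozen filesystem is expected to
return 0. Which of the kernel paths in cases [1]-[4] a given store takes
depends on the state of the page and its PTE at that moment.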

Thanks,
Toshiyuki Okajima


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-15 13:39                                 ` Toshiyuki Okajima
@ 2011-04-15 17:13                                   ` Jan Kara
  2011-04-15 17:17                                     ` Eric Sandeen
  2011-04-18  9:05                                     ` Toshiyuki Okajima
  0 siblings, 2 replies; 121+ messages in thread
From: Jan Kara @ 2011-04-15 17:13 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

  Hello,

On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> So, we can block cases [1] and [2].
> But I think we cannot block cases [3] and [4] at present.
> If we fix ->page_mkwrite(), we can also block case [3].
> But case [4] is still not blocked, because no page fault occurs
> when we dirty the mmapped page.
> 
> Therefore, to repair this problem, we need to fix cases [3] and [4].
> I think we must modify the writeback thread to fix case [4].
  The trick here is that when we write a page to disk, we write-protect
the page (you seem to call this "the page is bound", I'm not sure why).
So we are guaranteed to receive a minor fault (case [3]) if user tries to
modify a page after we finish writeback while freezing the filesystem.
So in principle all we need to do is just wait in ext4_page_mkwrite().

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-15 17:13                                   ` Jan Kara
@ 2011-04-15 17:17                                     ` Eric Sandeen
  2011-04-15 17:37                                       ` Jan Kara
  2011-04-18  9:05                                     ` Toshiyuki Okajima
  1 sibling, 1 reply; 121+ messages in thread
From: Eric Sandeen @ 2011-04-15 17:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On 4/15/11 12:13 PM, Jan Kara wrote:
>   The trick here is that when we write a page to disk, we write-protect
> the page (you seem to call this "the page is bound", I'm not sure why).
> So we are guaranteed to receive a minor fault (case [3]) if user tries to
> modify a page after we finish writeback while freezing the filesystem.
> So in principle all we need to do is just wait in ext4_page_mkwrite().

I've been kind of absent from this thread, sorry, but why would we wait in ext4_page_mkwrite(), rather than in mm/memory.c prior to any page_mkwrite call on any fs?

no frozen fs should be able to dirty & write pages via mmap, right?

-Eric
 
> 								Honza


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-15 17:17                                     ` Eric Sandeen
@ 2011-04-15 17:37                                       ` Jan Kara
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Kara @ 2011-04-15 17:37 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Fri 15-04-11 12:17:06, Eric Sandeen wrote:
> On 4/15/11 12:13 PM, Jan Kara wrote:
> >   The trick here is that when we write a page to disk, we write-protect
> > the page (you seem to call this "the page is bound", I'm not sure why).
> > So we are guaranteed to receive a minor fault (case [3]) if user tries to
> > modify a page after we finish writeback while freezing the filesystem.
> > So in principle all we need to do is just wait in ext4_page_mkwrite().
> 
> I've been kind of absent from this thread, sorry, but why would we wait in ext4_page_mkwrite(), rather than in mm/memory.c prior to any page_mkwrite call on any fs?
> 
> no frozen fs should be able to dirty & write pages via mmap, right?
  I have not put that much thought into it but locking might be kind of
tricky in the generic code. We have to be sure that freezing waits for
the page which is just being faulted. That means we have to take page lock
(now writepage() called during fs_freeze will wait for us), check whether
fs is frozen. If yes, unlock page, do vfs_check_frozen(), and retry. This
call sequence is much better suited for block_page_mkwrite() than for
code in memory.c I think.
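
The lock, check, unlock, wait, retry sequence described above can be sketched
as a small single-threaded userspace model in C. This is only an illustration
under stated assumptions: struct model_sb, struct model_page,
model_wait_for_thaw() and model_page_mkwrite() are hypothetical names, not
kernel code, and the sleep in vfs_check_frozen() is replaced by a counter so
that the example terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a superblock and a page; not kernel types. */
struct model_sb {
    bool frozen;
    int thaw_after;   /* the model fs thaws after this many waits */
};

struct model_page {
    bool locked;
    bool dirty;
};

/*
 * Stand-in for vfs_check_frozen(): the real call sleeps until the fs is
 * thawed. Here every call counts as one wait and the fs thaws once the
 * counter reaches zero.
 */
static void model_wait_for_thaw(struct model_sb *sb)
{
    if (sb->thaw_after > 0)
        sb->thaw_after--;
    if (sb->thaw_after == 0)
        sb->frozen = false;
}

/*
 * Model of the proposed wait in the page_mkwrite path. Returns how many
 * times the fault had to drop the page lock and wait for a thaw.
 */
static int model_page_mkwrite(struct model_sb *sb, struct model_page *page)
{
    int waits = 0;

    assert(!page->locked);
    for (;;) {
        page->locked = true;        /* take the page lock first */
        if (!sb->frozen)
            break;                  /* not frozen: proceed with lock held */
        page->locked = false;       /* never sleep with the page locked */
        model_wait_for_thaw(sb);
        waits++;                    /* then retry from the top */
    }
    page->dirty = true;             /* page may now be dirtied */
    page->locked = false;
    return waits;
}
```

With a model_sb that starts frozen and thaw_after = 2, model_page_mkwrite()
returns 2 and leaves the page dirty and unlocked; on an unfrozen model_sb it
returns 0. The page lock is checked before the frozen flag so that, in the
real code, writepage() running during the freeze would wait for the fault
rather than miss it.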

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-15 17:13                                   ` Jan Kara
  2011-04-15 17:17                                     ` Eric Sandeen
@ 2011-04-18  9:05                                     ` Toshiyuki Okajima
  2011-04-18 10:51                                       ` Jan Kara
  2011-05-03 11:01                                       ` Surbhi Palande
  1 sibling, 2 replies; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-04-18  9:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: toshi.okajima, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

Hi,

(2011/04/16 2:13), Jan Kara wrote:
>    The trick here is that when we write a page to disk, we write-protect
> the page (you seem to call this "the page is bound", I'm not sure why).
Hm, I want to understand how the page gets write-protected during fsfreeze.
But, anyway, I understand we don't need to consider case [4].

> So we are guaranteed to receive a minor fault (case [3]) if user tries to
> modify a page after we finish writeback while freezing the filesystem.
> So in principle all we need to do is just wait in ext4_page_mkwrite().
OK. I understand.
Are there any concrete ideas to fix this?
For ext4, we can handle case [3] by modifying ext4_page_mkwrite().
But for ext3 or other filesystems, do we have to implement ->page_mkwrite() to prevent it?

Thanks,
Toshiyuki Okajima


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-18  9:05                                     ` Toshiyuki Okajima
@ 2011-04-18 10:51                                       ` Jan Kara
  2011-04-19  9:43                                         ` Toshiyuki Okajima
  2011-05-03 11:01                                       ` Surbhi Palande
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-04-18 10:51 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
> >   The trick here is that when we write a page to disk, we write-protect
> >the page (you seem to call this that "the page is bound", I'm not sure why).
> Hm, I want to understand how to write-protect the page under fsfreezing.
  Look at what page_mkclean() called from clear_page_dirty_for_io() does...

> But, anyway, I understand we don't need to consider the case [4].
  Yes.

> >So we are guaranteed to receive a minor fault (case [3]) if user tries to
> >modify a page after we finish writeback while freezing the filesystem.
> >So in principle all we need to do is just wait in ext4_page_mkwrite().
> OK. I understand.
> Are there any concrete ideas to fix this?
> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
  Yes.
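
  The write-protect trick discussed above can be sketched as a small
userspace model. This is only an illustration under stated assumptions: the
names toy_page, toy_writeback and toy_store are hypothetical and are not
kernel APIs, and "must wait" stands in for sleeping until the filesystem is
thawed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one page-cache page mapped into a process.
 * All names are hypothetical; nothing here is a kernel API. */
struct toy_page {
	bool pte_writable;	/* stores go through without a fault */
	bool dirty;		/* page holds data not yet on disk */
};

enum store_result { STORE_DONE, STORE_MUST_WAIT };

/* Writeback cleans the page and write-protects the mapping, which is the
 * effect the thread attributes to page_mkclean(). */
static void toy_writeback(struct toy_page *pg)
{
	assert(pg != NULL);
	pg->dirty = false;
	pg->pte_writable = false;
}

/* A user store. A write-protected page takes a minor fault first, and the
 * fault handler is the place where a frozen filesystem can hold the writer. */
static enum store_result toy_store(struct toy_page *pg, bool fs_frozen)
{
	assert(pg != NULL);
	if (!pg->pte_writable) {
		if (fs_frozen)
			return STORE_MUST_WAIT;	/* would sleep until thaw */
		pg->pte_writable = true;	/* fault handled */
	}
	pg->dirty = true;
	return STORE_DONE;
}
```

  In this model a page that was written back during the freeze can never be
redirtied silently: the first store after toy_writeback() always goes through
the fault branch, which is why case [4] does not need separate handling.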

> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
  Sadly I don't see a simple way to fix this issue for all filesystems at
once. Implementing a proper wait in block_page_mkwrite() should fix the issue
for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
separately, as will ext4. For ext3, we'd have to add ->page_mkwrite() support.
I have had patches for this for some time but I still have to test them
properly in more exotic conditions like 64k pages...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-18 10:51                                       ` Jan Kara
@ 2011-04-19  9:43                                         ` Toshiyuki Okajima
  2011-04-22  6:58                                           ` Toshiyuki Okajima
  0 siblings, 1 reply; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-04-19  9:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: toshi.okajima, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

Hi,

(2011/04/18 19:51), Jan Kara wrote:
> On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>    For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
>>>>> get modified to block while minor-faulting the page on frozen fs because
>>>>> when blocks are already allocated we may skip starting a transaction and so
>>>>> we could possibly modify the filesystem.
>>>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>>>>
>>>> (minor-pagefault)
>>>> ->   do_wp_page()
>>>>     ->   page_mkwrite(= ext4_page_mkwrite())
>>>>        =>   BLOCK!
>>>>
>>>> (major-pagefault)
>>>> ->   do_linear_fault()
>>>>     ->   page_mkwrite(= ext4_page_mkwrite())
>>>>        =>   BLOCK!
>>>>
>>>>>
>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>> I understand the normal write operation (not mmap) can be blocked while
>>>>>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>>>>>> while fsfreezing.
>>>>>>>    Technically speaking, we block all the transaction starts which means we
>>>>>>> end up blocking all the writes from going to disk. But that does not mean
>>>>>>> we block all the writes from going to in-memory cache - as you properly
>>>>>>> note the mmap case is one of such exceptions.
>>>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>>>>>> the writes to disk while fsfreezing. I am considering that mmap path can
>>>>>> write to disk while fsfreezing because this deadlock problem happens after
>>>>>> fsfreeze operation is done...
>>>>>    I'm sorry I don't understand now - are you speaking about the case above
>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>> else?
>>>> Sorry, I didn't understand around the page fault path.
>>>> So, I had read the kernel source code around it, then I maybe understand...
>>>>
>>>> I worry whether we can update the file data in mmap case while fsfreezing.
>>>> Of course, I understand that we can write to in-memory cache, and it is not a
>>>> problem. However, if we can write to disk while fsfreezing, it is a problem.
>>>> So, I summarize the cases whether we can write to disk or not.
>>>>
>>>> --------------------------------------------------------------------------
>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>> while fsfreezing)
>>>>
>>>> [1] One of the page which has been mmapped is not bound. And
>>>>   the page is not allocated yet. (major fault?)
>>>>
>>>>     (1) user dirtys a page
>>>>     (2) a page fault occurs (do_page_fault)
>>>>     (3) __do_fault is called.
>>>>     (4) ext4_page_mkwrite is called
>>>>     (5) ext4_write_begin is called
>>>>     (6) ext4_journal_start_sb       =>   We can STOP!
>>>>
>>>> [2] One of the page which has been mmapped is not bound. But
>>>>   the page is already allocated, and the buffer_heads of the page
>>>>   are not mapped (BH_Mapped).  (minor fault?)
>>>>
>>>>     (1) user dirtys a page
>>>>     (2) a page fault occurs (do_page_fault)
>>>>     (3) do_wp_page is called.
>>>>     (4) ext4_page_mkwrite is called
>>>>     (5) ext4_write_begin is called
>>>>     (6) ext4_journal_start_sb       =>   We can STOP!
>>>>
>>>> [3] One of the page which has been mmapped is not bound. But
>>>>   the page is already allocated, and the buffer_heads of the page
>>>>   are mapped (BH_Mapped).  (minor fault?)
>>>>
>>>>     (1) user dirtys a page
>>>>     (2) a page fault occurs (do_page_fault)
>>>>     (3) do_wp_page is called.
>>>>     (4) ext4_page_mkwrite is called
>>>>     * Cannot block the dirty page to be written because all bh is mapped.
>>>>     (5) user munmaps the page (munmap)
>>>>     (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>>     (7) writeback thread writes the page (struct page) to disk
>>>>                                             =>   We cannot STOP!
>>>>
>>>> [4] One of the page which has been mmapped is bound. And
>>>>   the page is already allocated.
>>>>
>>>>     (1) user dirtys a page
>>>>     ( ) no page fault occurs
>>>>     (2) user munmaps the page (munmap)
>>>>     (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>>     (4) writeback thread writes the page (struct page) to disk
>>>>                                             =>   We cannot STOP!
>>>> --------------------------------------------------------------------------
>>>>
>>>> So, we can block the cases [1], [2].
>>>> But I think we cannot block the cases [3], [4] now.
>>>> If fixing the page_mkwrite, we can also block the case [3].
>>>> But the case [4] is not blocked because no page fault occurs
>>>> when we dirty the mmapped page.
>>>>
>>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>>>> I think we must modify the writeback thread to fix the case [4].
>>>    The trick here is that when we write a page to disk, we write-protect
>>> the page (you seem to call this that "the page is bound", I'm not sure why).
>> Hm, I want to understand how to write-protect the page under fsfreezing.
>    Look at what page_mkclean() called from clear_page_dirty_for_io() does...
Thanks. I'll read that.

>
>> But, anyway, I understand we don't need to consider the case [4].
>    Yes.
>
>>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
>>> modify a page after we finish writeback while freezing the filesystem.
>>> So in principle all we need to do is just wait in ext4_page_mkwrite().
>> OK. I understand.
>> Are there any concrete ideas to fix this?
>> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
>    Yes.
>
>> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
>    Sadly I don't see a simple way to fix this issue for all filesystems at
> once. Implementing proper wait in block_page_mkwrite() should fix the issue
> for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
> separately as ext4. For ext3, we'd have to add ->page_mkwrite() support. I
> have patches for this already for some time but I have to get to properly
> testing them in more exotic conditions like 64k pages...
OK. I understand the current status of your work on fixing the problem that
data can still be written through the mmap path while fsfreezing.

Thanks,
Toshiyuki Okajima



* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-19  9:43                                         ` Toshiyuki Okajima
@ 2011-04-22  6:58                                           ` Toshiyuki Okajima
  2011-04-22 21:26                                             ` Peter M. Petrakis
  2011-04-22 22:10                                             ` Jan Kara
  0 siblings, 2 replies; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-04-22  6:58 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen, toshi.okajima

Hi,

On Tue, 19 Apr 2011 18:43:16 +0900
Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> wrote:
> Hi,
> 
> (2011/04/18 19:51), Jan Kara wrote:
> > On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
> >>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>>>    For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> >>>>> get modified to block while minor-faulting the page on frozen fs because
> >>>>> when blocks are already allocated we may skip starting a transaction and so
> >>>>> we could possibly modify the filesystem.
> >>>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
> >>>>
> >>>> (minor-pagefault)
> >>>> ->   do_wp_page()
> >>>>     ->   page_mkwrite(= ext4_page_mkwrite())
> >>>>        =>   BLOCK!
> >>>>
> >>>> (major-pagefault)
> >>>> ->   do_linear_fault()
> >>>>     ->   page_mkwrite(= ext4_page_mkwrite())
> >>>>        =>   BLOCK!
> >>>>
> >>>>>
> >>>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>>>>>>> The original problem happens after the fsfreeze operation is done.
> >>>>>>>> I understand the normal write operation (not mmap) can be blocked while
> >>>>>>>> fsfreezing. So, I guess we don't always block all the write operation
> >>>>>>>> while fsfreezing.
> >>>>>>>    Technically speaking, we block all the transaction starts which means we
> >>>>>>> end up blocking all the writes from going to disk. But that does not mean
> >>>>>>> we block all the writes from going to in-memory cache - as you properly
> >>>>>>> note the mmap case is one of such exceptions.
> >>>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
> >>>>>> the writes to disk while fsfreezing. I am considering that mmap path can
> >>>>>> write to disk while fsfreezing because this deadlock problem happens after
> >>>>>> fsfreeze operation is done...
> >>>>>    I'm sorry I don't understand now - are you speaking about the case above
> >>>>> when writepage() does not wait for filesystem being frozen or something
> >>>>> else?
> >>>> Sorry, I didn't understand around the page fault path.
> >>>> So, I had read the kernel source code around it, then I maybe understand...
> >>>>
> >>>> I worry whether we can update the file data in mmap case while fsfreezing.
> >>>> Of course, I understand that we can write to in-memory cache, and it is not a
> >>>> problem. However, if we can write to disk while fsfreezing, it is a problem.
> >>>> So, I summarize the cases whether we can write to disk or not.
> >>>>
> >>>> --------------------------------------------------------------------------
> >>>> Cases (Whether we can write the data mmapped to the file on the disk
> >>>> while fsfreezing)
> >>>>
> >>>> [1] One of the page which has been mmapped is not bound. And
> >>>>   the page is not allocated yet. (major fault?)
> >>>>
> >>>>     (1) user dirtys a page
> >>>>     (2) a page fault occurs (do_page_fault)
> >>>>     (3) __do_fault is called.
> >>>>     (4) ext4_page_mkwrite is called
> >>>>     (5) ext4_write_begin is called
> >>>>     (6) ext4_journal_start_sb       =>   We can STOP!
> >>>>
> >>>> [2] One of the page which has been mmapped is not bound. But
> >>>>   the page is already allocated, and the buffer_heads of the page
> >>>>   are not mapped (BH_Mapped).  (minor fault?)
> >>>>
> >>>>     (1) user dirtys a page
> >>>>     (2) a page fault occurs (do_page_fault)
> >>>>     (3) do_wp_page is called.
> >>>>     (4) ext4_page_mkwrite is called
> >>>>     (5) ext4_write_begin is called
> >>>>     (6) ext4_journal_start_sb       =>   We can STOP!
> >>>>
> >>>> [3] One of the page which has been mmapped is not bound. But
> >>>>   the page is already allocated, and the buffer_heads of the page
> >>>>   are mapped (BH_Mapped).  (minor fault?)
> >>>>
> >>>>     (1) user dirtys a page
> >>>>     (2) a page fault occurs (do_page_fault)
> >>>>     (3) do_wp_page is called.
> >>>>     (4) ext4_page_mkwrite is called
> >>>>     * Cannot block the dirty page to be written because all bh is mapped.
> >>>>     (5) user munmaps the page (munmap)
> >>>>     (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >>>>     (7) writeback thread writes the page (struct page) to disk
> >>>>                                             =>   We cannot STOP!
> >>>>
> >>>> [4] One of the page which has been mmapped is bound. And
> >>>>   the page is already allocated.
> >>>>
> >>>>     (1) user dirtys a page
> >>>>     ( ) no page fault occurs
> >>>>     (2) user munmaps the page (munmap)
> >>>>     (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >>>>     (4) writeback thread writes the page (struct page) to disk
> >>>>                                             =>   We cannot STOP!
> >>>> --------------------------------------------------------------------------
> >>>>
> >>>> So, we can block the cases [1], [2].
> >>>> But I think we cannot block the cases [3], [4] now.
> >>>> If fixing the page_mkwrite, we can also block the case [3].
> >>>> But the case [4] is not blocked because no page fault occurs
> >>>> when we dirty the mmapped page.
> >>>>
> >>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
> >>>> I think we must modify the writeback thread to fix the case [4].
> >>>    The trick here is that when we write a page to disk, we write-protect
> >>> the page (you seem to call this that "the page is bound", I'm not sure why).
> >> Hm, I want to understand how to write-protect the page under fsfreezing.
> >    Look at what page_mkclean() called from clear_page_dirty_for_io() does...
> Thanks. I'll read that.
> 
> >
> >> But, anyway, I understand we don't need to consider the case [4].
> >    Yes.
> >
> >>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
> >>> modify a page after we finish writeback while freezing the filesystem.
> >>> So in principle all we need to do is just wait in ext4_page_mkwrite().
> >> OK. I understand.
> >> Are there any concrete ideas to fix this?
> >> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
> >    Yes.
> >
> >> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
> >    Sadly I don't see a simple way to fix this issue for all filesystems at
> > once. Implementing proper wait in block_page_mkwrite() should fix the issue
> > for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
> > separately as ext4. For ext3, we'd have to add ->page_mkwrite() support. I
> > have patches for this already for some time but I have to get to properly
> > testing them in more exotic conditions like 64k pages...
> OK. I understand the current status of your works to fix the problem which
> can be written with some data at mmap path while fsfreezing.
I have confirmed that the following patch works fine while my reproducer or
Mizuma-san's reproducer is running. With this patch, a write to data that is
mmapped to a file is blocked at the page fault while fsfreezing, so it can
no longer reach the disk.

I think this patch fixes the following two problems:
- A deadlock occurs between ext4_da_writepages() (called from
writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
- Data that is mmapped to a file can still be written
  to disk while fsfreezing (ext3/ext4).
                                       (reported by me)
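
The control-flow change can be sketched as a small userspace model. This is
only an illustration, not the kernel code: toy_page_mkwrite and the PATH_*
names are hypothetical, and the three results stand for "the fault returns at
once", "the handler waits in the freeze check", and "write_begin starts a
transaction, which a frozen filesystem already blocks".

```c
#include <assert.h>
#include <stdbool.h>

enum mkwrite_path {
	PATH_RETURN_NOW,	/* fault completes, page becomes writable */
	PATH_WAIT_FOR_THAW,	/* explicit freeze check on the fast path */
	PATH_JOURNAL_START	/* write_begin path, stopped at journal start */
};

/* Hypothetical model of the decision made in ->page_mkwrite().
 * patched == false models the behaviour before this patch. */
static enum mkwrite_path toy_page_mkwrite(bool all_blocks_mapped,
					  bool fs_frozen, bool patched)
{
	if (all_blocks_mapped) {
		/* Fast path: no transaction is started, so nothing else
		 * would stop the writer on a frozen filesystem. */
		if (patched && fs_frozen)
			return PATH_WAIT_FOR_THAW;
		return PATH_RETURN_NOW;
	}
	return PATH_JOURNAL_START;
}

/* The gap the patch is meant to close: mapped blocks on a frozen fs. */
static void toy_check(void)
{
	assert(toy_page_mkwrite(true, true, false) == PATH_RETURN_NOW);
	assert(toy_page_mkwrite(true, true, true) == PATH_WAIT_FOR_THAW);
}
```

Only the fast path changes in this model. The path that allocates blocks is
the same with and without the patch, because it was already stopped when the
transaction is started.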

Please examine this patch.

Thanks,
Toshiyuki Okajima
---
 fs/ext3/file.c          |   19 ++++++++++++-
 fs/ext3/inode.c         |   71 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/inode.c         |    4 ++-
 include/linux/ext3_fs.h |    1 +
 4 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index f55df0e..6d376ef 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -52,6 +52,23 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
 	return 0;
 }
 
+static const struct vm_operations_struct ext3_file_vm_ops = {
+	.fault          = filemap_fault,
+	.page_mkwrite   = ext3_page_mkwrite,
+};
+
+static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct address_space *mapping = file->f_mapping;
+
+	if (!mapping->a_ops->readpage)
+		return -ENOEXEC;
+	file_accessed(file);
+	vma->vm_ops = &ext3_file_vm_ops;
+	vma->vm_flags |= VM_CAN_NONLINEAR;
+	return 0;
+}
+
 const struct file_operations ext3_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
@@ -62,7 +79,7 @@ const struct file_operations ext3_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext3_compat_ioctl,
 #endif
-	.mmap		= generic_file_mmap,
+	.mmap		= ext3_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext3_release_file,
 	.fsync		= ext3_sync_file,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 68b2e43..66c31dd 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -3496,3 +3496,74 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)
 
 	return err;
 }
+
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	loff_t size;
+	unsigned long len;
+	int ret = -EINVAL;
+	void *fsdata;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct address_space *mapping = inode->i_mapping;
+
+	/*
+	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
+	 * get i_mutex because we are already holding mmap_sem.
+	 */
+	down_read(&inode->i_alloc_sem);
+	size = i_size_read(inode);
+	if (page->mapping != mapping || size <= page_offset(page)
+	   || !PageUptodate(page)) {
+		/* page got truncated from under us? */
+		goto out_unlock;
+	}
+	ret = 0;
+	if (PageMappedToDisk(page))
+		goto out_frozen;
+
+	if (page->index == size >> PAGE_CACHE_SHIFT)
+		len = size & ~PAGE_CACHE_MASK;
+	else
+		len = PAGE_CACHE_SIZE;
+
+	lock_page(page);
+	/*
+	 * Return if we have all the buffers mapped. This avoids
+	 * the need to call write_begin/write_end, which do a
+	 * journal_start/journal_stop that can block and take a
+	 * long time.
+	 */
+	if (page_has_buffers(page)) {
+		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
+					buffer_unmapped)) {
+			unlock_page(page);
+out_frozen:
+			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+			goto out_unlock;
+		}
+	}
+	unlock_page(page);
+	/*
+	 * OK, we need to fill the hole... Do write_begin write_end
+	 * to do block allocation/reservation. We are not holding
+	 * inode->i_mutex here. That allows parallel write_begin,
+	 * write_end calls. lock_page prevents this from happening
+	 * on the same page though.
+	 */
+	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
+			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+	if (ret < 0)
+		goto out_unlock;
+	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
+			len, len, page, fsdata);
+	if (ret < 0)
+		goto out_unlock;
+	ret = 0;
+out_unlock:
+	if (ret)
+		ret = VM_FAULT_SIGBUS;
+	up_read(&inode->i_alloc_sem);
+	return ret;
+}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..44979ae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	ret = 0;
 	if (PageMappedToDisk(page))
-		goto out_unlock;
+		goto out_frozen;
 
 	if (page->index == size >> PAGE_CACHE_SHIFT)
 		len = size & ~PAGE_CACHE_MASK;
@@ -5830,6 +5830,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
 			unlock_page(page);
+out_frozen:
+			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 			goto out_unlock;
 		}
 	}
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 85c1d30..a0e39ca 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -919,6 +919,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
 extern void ext3_set_aops(struct inode *inode);
 extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
+extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* ioctl.c */
 extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
-- 
1.5.5.6


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-22  6:58                                           ` Toshiyuki Okajima
@ 2011-04-22 21:26                                             ` Peter M. Petrakis
  2011-04-22 21:40                                               ` Jan Kara
  2011-04-22 22:10                                             ` Jan Kara
  1 sibling, 1 reply; 121+ messages in thread
From: Peter M. Petrakis @ 2011-04-22 21:26 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen, Craig Magina

Hi All,

On 04/22/2011 02:58 AM, Toshiyuki Okajima wrote:
> Hi,
> 
> On Tue, 19 Apr 2011 18:43:16 +0900
> Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> wrote:
>> Hi,
>>
>> (2011/04/18 19:51), Jan Kara wrote:
>>> On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
>>>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>>>    For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
>>>>>>> get modified to block while minor-faulting the page on frozen fs because
>>>>>>> when blocks are already allocated we may skip starting a transaction and so
>>>>>>> we could possibly modify the filesystem.
>>>>>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>>>>>>
>>>>>> (minor-pagefault)
>>>>>> ->   do_wp_page()
>>>>>>     ->   page_mkwrite(= ext4_page_mkwrite())
>>>>>>        =>   BLOCK!
>>>>>>
>>>>>> (major-pagefault)
>>>>>> ->   do_linear_fault()
>>>>>>     ->   page_mkwrite(= ext4_page_mkwrite())
>>>>>>        =>   BLOCK!
>>>>>>
>>>>>>>
>>>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>>>> I understand the normal write operation (not mmap) can be blocked while
>>>>>>>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>>>>>>>> while fsfreezing.
>>>>>>>>>    Technically speaking, we block all the transaction starts which means we
>>>>>>>>> end up blocking all the writes from going to disk. But that does not mean
>>>>>>>>> we block all the writes from going to in-memory cache - as you properly
>>>>>>>>> note the mmap case is one of such exceptions.
>>>>>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>>>>>>>> the writes to disk while fsfreezing. I am considering that mmap path can
>>>>>>>> write to disk while fsfreezing because this deadlock problem happens after
>>>>>>>> fsfreeze operation is done...
>>>>>>>    I'm sorry I don't understand now - are you speaking about the case above
>>>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>>>> else?
>>>>>> Sorry, I didn't understand around the page fault path.
>>>>>> So, I had read the kernel source code around it, then I maybe understand...
>>>>>>
>>>>>> I worry whether we can update the file data in mmap case while fsfreezing.
>>>>>> Of course, I understand that we can write to in-memory cache, and it is not a
>>>>>> problem. However, if we can write to disk while fsfreezing, it is a problem.
>>>>>> So, I summarize the cases whether we can write to disk or not.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>>>> while fsfreezing)
>>>>>>
>>>>>> [1] One of the page which has been mmapped is not bound. And
>>>>>>   the page is not allocated yet. (major fault?)
>>>>>>
>>>>>>     (1) user dirtys a page
>>>>>>     (2) a page fault occurs (do_page_fault)
>>>>>>     (3) __do_fault is called.
>>>>>>     (4) ext4_page_mkwrite is called
>>>>>>     (5) ext4_write_begin is called
>>>>>>     (6) ext4_journal_start_sb       =>   We can STOP!
>>>>>>
>>>>>> [2] One of the page which has been mmapped is not bound. But
>>>>>>   the page is already allocated, and the buffer_heads of the page
>>>>>>   are not mapped (BH_Mapped).  (minor fault?)
>>>>>>
>>>>>>     (1) user dirtys a page
>>>>>>     (2) a page fault occurs (do_page_fault)
>>>>>>     (3) do_wp_page is called.
>>>>>>     (4) ext4_page_mkwrite is called
>>>>>>     (5) ext4_write_begin is called
>>>>>>     (6) ext4_journal_start_sb       =>   We can STOP!
>>>>>>
>>>>>> [3] One of the page which has been mmapped is not bound. But
>>>>>>   the page is already allocated, and the buffer_heads of the page
>>>>>>   are mapped (BH_Mapped).  (minor fault?)
>>>>>>
>>>>>>     (1) user dirtys a page
>>>>>>     (2) a page fault occurs (do_page_fault)
>>>>>>     (3) do_wp_page is called.
>>>>>>     (4) ext4_page_mkwrite is called
>>>>>>     * Cannot block the dirty page to be written because all bh is mapped.
>>>>>>     (5) user munmaps the page (munmap)
>>>>>>     (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>>>>     (7) writeback thread writes the page (struct page) to disk
>>>>>>                                             =>   We cannot STOP!
>>>>>>
>>>>>> [4] One of the page which has been mmapped is bound. And
>>>>>>   the page is already allocated.
>>>>>>
>>>>>>     (1) user dirtys a page
>>>>>>     ( ) no page fault occurs
>>>>>>     (2) user munmaps the page (munmap)
>>>>>>     (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>>>>     (4) writeback thread writes the page (struct page) to disk
>>>>>>                                             =>   We cannot STOP!
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> So, we can block the cases [1], [2].
>>>>>> But I think we cannot block the cases [3], [4] now.
>>>>>> If fixing the page_mkwrite, we can also block the case [3].
>>>>>> But the case [4] is not blocked because no page fault occurs
>>>>>> when we dirty the mmapped page.
>>>>>>
>>>>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>>>>>> I think we must modify the writeback thread to fix the case [4].
>>>>>    The trick here is that when we write a page to disk, we write-protect
>>>>> the page (you seem to call this that "the page is bound", I'm not sure why).
>>>> Hm, I want to understand how to write-protect the page under fsfreezing.
>>>    Look at what page_mkclean() called from clear_page_dirty_for_io() does...
>> Thanks. I'll read that.
>>
>>>
>>>> But, anyway, I understand we don't need to consider the case [4].
>>>    Yes.
>>>
>>>>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
>>>>> modify a page after we finish writeback while freezing the filesystem.
>>>>> So in principle all we need to do is just wait in ext4_page_mkwrite().
>>>> OK. I understand.
>>>> Are there any concrete ideas to fix this?
>>>> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
>>>    Yes.
>>>
>>>> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
>>>    Sadly I don't see a simple way to fix this issue for all filesystems at
>>> once. Implementing proper wait in block_page_mkwrite() should fix the issue
>>> for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
>>> separately as ext4. For ext3, we'd have to add ->page_mkwrite() support. I
>>> have patches for this already for some time but I have to get to properly
>>> testing them in more exotic conditions like 64k pages...
>> OK. I understand the current status of your works to fix the problem which
>> can be written with some data at mmap path while fsfreezing.
> I have confirmed that the following patch works fine while my or
> Mizuma-san's reproducer is running. Therefore,
>  we can block to write the data, which is mmapped to a file, into a disk
> by a page-fault while fsfreezing. 
> 
> I think this patch fixes the following two problems:
> - A deadlock occurs between ext4_da_writepages() (called from
> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> - We can also write the data, which is mmapped to a file,
>   into a disk while fsfreezing (ext3/ext4).
>                                        (reported by me)
> 
> Please examine this patch.

We've recently identified the same root cause in 2.6.32, though the hit rate
there is much higher. The configuration is a SAN with ALUA Active/Standby
paths using multipath. The s_wait_unfrozen/s_umount deadlock is regularly
encountered when a path comes back into service, as a result of a kpartx
invocation made by this udev rule:

/lib/udev/rules.d/95-kpartx.rules

# Create dm tables for partitions
ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
        RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"


Below are the logs of the current incarnation of the fault with your current patch applied to 2.6.38.
We are still working to obtain a viable crashdump.

[ 1898.017614] mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880c3c815200)

[ 1898.025995] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814780)

[ 1898.034625] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c814b40), event = (0x12)

[ 1898.044803] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814b40)

[ 1898.053475] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815c80), event = (0x12)

[ 1898.063690] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815c80)

[ 1898.072316] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815200), event = (0x0f)

[ 1898.082544] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815200)

[ 1898.571426] sd 0:0:1:0: alua: port group 01 state S supports toluSnA

[ 1898.578635] device-mapper: multipath: Failing path 8:32.

[ 2041.345645] INFO: task kjournald:595 blocked for more than 120 seconds.

[ 2041.353075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.361891] kjournald       D ffff88063acb9a90     0   595      2 0x00000000

[ 2041.369891]  ffff88063ace1c30 0000000000000046 ffff88063c282140 ffff880600000000

[ 2041.378416]  0000000000013cc0 ffff88063acb96e0 ffff88063acb9a90 ffff88063ace1fd8

[ 2041.386954]  ffff88063acb9a98 0000000000013cc0 ffff88063ace0010 0000000000013cc0

[ 2041.395561] Call Trace:

[ 2041.398358]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50

[ 2041.404342]  [<ffffffff815d3120>] io_schedule+0x70/0xc0

[ 2041.410227]  [<ffffffff811923c5>] sync_buffer+0x45/0x50

[ 2041.416179]  [<ffffffff815d378f>] __wait_on_bit+0x5f/0x90

[ 2041.422258]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50

[ 2041.428275]  [<ffffffff815d3838>] out_of_line_wait_on_bit+0x78/0x90

[ 2041.435324]  [<ffffffff81086b90>] ? wake_bit_function+0x0/0x40

[ 2041.441958]  [<ffffffff8119237e>] __wait_on_buffer+0x2e/0x30

[ 2041.448333]  [<ffffffff8123ab14>] journal_commit_transaction+0x7e4/0xec0

[ 2041.455873]  [<ffffffff81038d09>] ? default_spin_lock_flags+0x9/0x10

[ 2041.463020]  [<ffffffff8107443c>] ? lock_timer_base+0x3c/0x70

[ 2041.469514]  [<ffffffff81074e33>] ? try_to_del_timer_sync+0x83/0xe0

[ 2041.476563]  [<ffffffff8123df7d>] kjournald+0xed/0x250

[ 2041.482349]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2041.489624]  [<ffffffff8123de90>] ? kjournald+0x0/0x250

[ 2041.495504]  [<ffffffff810865e6>] kthread+0x96/0xa0

[ 2041.501003]  [<ffffffff8100ce64>] kernel_thread_helper+0x4/0x10

[ 2041.507667]  [<ffffffff81086550>] ? kthread+0x0/0xa0

[ 2041.513301]  [<ffffffff8100ce60>] ? kernel_thread_helper+0x0/0x10

[ 2041.520247] INFO: task rsyslogd:1854 blocked for more than 120 seconds.

[ 2041.527677] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.536499] rsyslogd        D ffff88063c513170     0  1854      1 0x00000000

[ 2041.544533]  ffff88063d0e3cd8 0000000000000082 ffff88063c479180 0000000000000000

[ 2041.553108]  0000000000013cc0 ffff88063c512dc0 ffff88063c513170 ffff88063d0e3fd8

[ 2041.561691]  ffff88063c513178 0000000000013cc0 ffff88063d0e2010 0000000000013cc0

[ 2041.570323] Call Trace:

[ 2041.573108]  [<ffffffff8110c78d>] __generic_file_aio_write+0xbd/0x470

[ 2041.580447]  [<ffffffff8108a82d>] ? hrtimer_try_to_cancel+0x3d/0xd0

[ 2041.587496]  [<ffffffff81097e3d>] ? futex_wait_queue_me+0xcd/0x110

[ 2041.594489]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2041.601833]  [<ffffffff8110cba2>] generic_file_aio_write+0x62/0xd0

[ 2041.608831]  [<ffffffff81163a9a>] do_sync_write+0xda/0x120

[ 2041.615165]  [<ffffffff812de756>] ? rb_erase+0xd6/0x160

[ 2041.621050]  [<ffffffff812ac918>] ? apparmor_file_permission+0x18/0x20

[ 2041.628395]  [<ffffffff81279b23>] ? security_file_permission+0x23/0x90

[ 2041.635827]  [<ffffffff81164018>] vfs_write+0xc8/0x190

[ 2041.641649]  [<ffffffff811641d1>] sys_write+0x51/0x90

[ 2041.647337]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2041.654091] INFO: task multipathd:1337 blocked for more than 120 seconds.

[ 2041.661750] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.670669] multipathd      D ffff88063e3303b0     0  1337      1 0x00000000

[ 2041.678746]  ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000

[ 2041.687219]  0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8

[ 2041.695818]  ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0

[ 2041.704369] Call Trace:

[ 2041.707128]  [<ffffffff815d349d>] schedule_timeout+0x21d/0x300

[ 2041.713679]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2041.719846]  [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410

[ 2041.726301]  [<ffffffff815d2436>] wait_for_common+0xd6/0x180

[ 2041.732685]  [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20

[ 2041.739138]  [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20

[ 2041.746079]  [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20

[ 2041.752716]  [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0

[ 2041.759853]  [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240

[ 2041.766503]  [<ffffffff8107e060>] __request_module+0x190/0x210

[ 2041.773054]  [<ffffffff812e0c28>] ? sscanf+0x38/0x40

[ 2041.778636]  [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240

[ 2041.785121]  [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0

[ 2041.791312]  [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140

[ 2041.797671]  [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250

[ 2041.804413]  [<ffffffff8149de3a>] table_load+0xca/0x2f0

[ 2041.810317]  [<ffffffff8149dd70>] ? table_load+0x0/0x2f0

[ 2041.816316]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2041.822184]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2041.828188]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2041.834250]  [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170

[ 2041.840219]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2041.845898]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2041.852639] INFO: task iozone:1871 blocked for more than 120 seconds.

[ 2041.859921] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.868760] iozone          D ffff880c3bc21a90     0  1871   1869 0x00000000

[ 2041.876728]  ffff880c3e743e20 0000000000000086 0000000000000001 ffff880c00000000

[ 2041.885177]  0000000000013cc0 ffff880c3bc216e0 ffff880c3bc21a90 ffff880c3e743fd8

[ 2041.893647]  ffff880c3bc21a98 0000000000013cc0 ffff880c3e742010 0000000000013cc0

[ 2041.902112] Call Trace:

[ 2041.906302]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2041.912494]  [<ffffffff815d4ddd>] rwsem_down_failed_common+0xcd/0x170

[ 2041.919718]  [<ffffffff8118f480>] ? sync_one_sb+0x0/0x30

[ 2041.925719]  [<ffffffff815d4eb5>] rwsem_down_read_failed+0x15/0x17

[ 2041.932690]  [<ffffffff812e41a4>] call_rwsem_down_read_failed+0x14/0x30

[ 2041.940116]  [<ffffffff815d4207>] ? down_read+0x17/0x20

[ 2041.945990]  [<ffffffff811665e1>] iterate_supers+0x71/0xf0

[ 2041.952149]  [<ffffffff8118f4df>] sys_sync+0x2f/0x70

[ 2041.957763]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2041.964575] INFO: task kpartx:1897 blocked for more than 120 seconds.

[ 2041.971801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.980626] kpartx          D ffff88063d05df30     0  1897   1896 0x00000000

[ 2041.988607]  ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000

[ 2041.997056]  0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8

[ 2042.005496]  ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0

[ 2042.013939] Call Trace:

[ 2042.016702]  [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150

[ 2042.023089]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2042.030321]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2042.036584]  [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70

[ 2042.042552]  [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330

[ 2042.049133]  [<ffffffff81115391>] ? do_writepages+0x21/0x40

[ 2042.055423]  [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60

[ 2042.062944]  [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90

[ 2042.069430]  [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70

[ 2042.075690]  [<ffffffff81166a85>] freeze_super+0x55/0x100

[ 2042.081754]  [<ffffffff811993b8>] freeze_bdev+0x98/0xe0

[ 2042.087625]  [<ffffffff81499001>] dm_suspend+0xa1/0x2e0

[ 2042.093495]  [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0

[ 2042.099948]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2042.105916]  [<ffffffff8149e29b>] do_resume+0x17b/0x1b0

[ 2042.111784]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2042.117753]  [<ffffffff8149e365>] dev_suspend+0x95/0xb0

[ 2042.123621]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2042.129591]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2042.135493]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2042.141770]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2042.147739]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2042.153801]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2042.159478]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2161.971321] INFO: task rsyslogd:1854 blocked for more than 120 seconds.

[ 2161.978798] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2161.987656] rsyslogd        D ffff88063c513170     0  1854      1 0x00000000

[ 2161.995718]  ffff88063d0e3cd8 0000000000000082 ffff88063c479180 0000000000000000

[ 2162.004340]  0000000000013cc0 ffff88063c512dc0 ffff88063c513170 ffff88063d0e3fd8

[ 2162.012932]  ffff88063c513178 0000000000013cc0 ffff88063d0e2010 0000000000013cc0

[ 2162.021481] Call Trace:

[ 2162.024290]  [<ffffffff8110c78d>] __generic_file_aio_write+0xbd/0x470

[ 2162.031627]  [<ffffffff8108a82d>] ? hrtimer_try_to_cancel+0x3d/0xd0

[ 2162.038711]  [<ffffffff81097e3d>] ? futex_wait_queue_me+0xcd/0x110

[ 2162.045662]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2162.053007]  [<ffffffff8110cba2>] generic_file_aio_write+0x62/0xd0

[ 2162.059962]  [<ffffffff81163a9a>] do_sync_write+0xda/0x120

[ 2162.066165]  [<ffffffff812de756>] ? rb_erase+0xd6/0x160

[ 2162.072048]  [<ffffffff812ac918>] ? apparmor_file_permission+0x18/0x20

[ 2162.079387]  [<ffffffff81279b23>] ? security_file_permission+0x23/0x90

[ 2162.086761]  [<ffffffff81164018>] vfs_write+0xc8/0x190

[ 2162.092552]  [<ffffffff811641d1>] sys_write+0x51/0x90

[ 2162.098247]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2162.105042] INFO: task multipathd:1337 blocked for more than 120 seconds.

[ 2162.112667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2162.121487] multipathd      D ffff88063e3303b0     0  1337      1 0x00000000

[ 2162.129517]  ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000

[ 2162.138112]  0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8

[ 2162.146688]  ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0

[ 2162.155253] Call Trace:

[ 2162.158073]  [<ffffffff815d349d>] schedule_timeout+0x21d/0x300

[ 2162.164639]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2162.170886]  [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410

[ 2162.177389]  [<ffffffff815d2436>] wait_for_common+0xd6/0x180

[ 2162.183852]  [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20

[ 2162.190317]  [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20

[ 2162.197304]  [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20

[ 2162.203968]  [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0

[ 2162.211111]  [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240

[ 2162.217807]  [<ffffffff8107e060>] __request_module+0x190/0x210

[ 2162.224461]  [<ffffffff812e0c28>] ? sscanf+0x38/0x40

[ 2162.230054]  [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240

[ 2162.236503]  [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0

[ 2162.242673]  [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140

[ 2162.249079]  [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250

[ 2162.255840]  [<ffffffff8149de3a>] table_load+0xca/0x2f0

[ 2162.261719]  [<ffffffff8149dd70>] ? table_load+0x0/0x2f0

[ 2162.267701]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2162.273621]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2162.279592]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2162.285710]  [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170

[ 2162.291694]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2162.297383]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2162.304169] INFO: task iozone:1871 blocked for more than 120 seconds.

[ 2162.311407] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2162.320229] iozone          D ffff880c3bc21a90     0  1871   1869 0x00000000

[ 2162.328317]  ffff880c3e743e20 0000000000000086 0000000000000001 ffff880c00000000

[ 2162.336901]  0000000000013cc0 ffff880c3bc216e0 ffff880c3bc21a90 ffff880c3e743fd8

[ 2162.345415]  ffff880c3bc21a98 0000000000013cc0 ffff880c3e742010 0000000000013cc0

[ 2162.353887] Call Trace:

[ 2162.356650]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2162.362815]  [<ffffffff815d4ddd>] rwsem_down_failed_common+0xcd/0x170

[ 2162.370042]  [<ffffffff8118f480>] ? sync_one_sb+0x0/0x30

[ 2162.376121]  [<ffffffff815d4eb5>] rwsem_down_read_failed+0x15/0x17

[ 2162.383075]  [<ffffffff812e41a4>] call_rwsem_down_read_failed+0x14/0x30

[ 2162.390575]  [<ffffffff815d4207>] ? down_read+0x17/0x20

[ 2162.396501]  [<ffffffff811665e1>] iterate_supers+0x71/0xf0

[ 2162.402768]  [<ffffffff8118f4df>] sys_sync+0x2f/0x70

[ 2162.408360]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2162.415159] INFO: task kpartx:1897 blocked for more than 120 seconds.

[ 2162.422493] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2162.431405] kpartx          D ffff88063d05df30     0  1897   1896 0x00000000

[ 2162.439440]  ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000

[ 2162.448021]  0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8

[ 2162.456468]  ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0

[ 2162.464962] Call Trace:

[ 2162.467724]  [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150

[ 2162.474088]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2162.481319]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2162.487577]  [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70

[ 2162.493548]  [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330

[ 2162.500107]  [<ffffffff81115391>] ? do_writepages+0x21/0x40

[ 2162.506415]  [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60

[ 2162.513947]  [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90

[ 2162.520514]  [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70

[ 2162.526783]  [<ffffffff81166a85>] freeze_super+0x55/0x100

[ 2162.532896]  [<ffffffff811993b8>] freeze_bdev+0x98/0xe0

[ 2162.538819]  [<ffffffff81499001>] dm_suspend+0xa1/0x2e0

[ 2162.544705]  [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0

[ 2162.551174]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2162.557160]  [<ffffffff8149e29b>] do_resume+0x17b/0x1b0

[ 2162.563082]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2162.569102]  [<ffffffff8149e365>] dev_suspend+0x95/0xb0

[ 2162.574987]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2162.581068]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2162.586954]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2162.593217]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2162.599190]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2162.605298]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2162.610990]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2191.336354] Uhhuh. NMI received for unknown reason 21 on CPU 0.

[ 2191.343064] Do you have a strange power saving mode enabled?

[ 2191.349476] Kernel panic - not syncing: NMI: Not continuing

[ 2191.355753] Pid: 0, comm: swapper Not tainted 2.6.38-8-server #43

[ 2191.362593] Call Trace:

[ 2191.365380]  <NMI>  [<ffffffff815d2083>] ? panic+0x91/0x19e

[ 2191.371779]  [<ffffffff815d21f8>] ? printk+0x68/0x70

[ 2191.377381]  [<ffffffff815d6333>] ? default_do_nmi+0x1f3/0x200

[ 2191.383929]  [<ffffffff815d63c0>] ? do_nmi+0x80/0x90

[ 2191.389526]  [<ffffffff815d5b50>] ? nmi+0x20/0x30

[ 2191.394816]  [<ffffffff81332d74>] ? intel_idle+0x94/0x120

[ 2191.400897]  <<EOE>>  [<ffffffff814b3472>] ? cpuidle_idle_call+0xb2/0x1b0

[ 2191.408606]  [<ffffffff8100b067>] ? cpu_idle+0xb7/0x110

[ 2191.414497]  [<ffffffff815b7682>] ? rest_init+0x72/0x80

[ 2191.420367]  [<ffffffff81ae2c95>] ? start_kernel+0x374/0x37b

[ 2191.426780]  [<ffffffff81ae2346>] ? x86_64_start_reservations+0x131/0x135

[ 2191.434457]  [<ffffffff81ae244d>] ? x86_64_start_kernel+0x103/0x112


Thanks.

Peter




> 
> Thanks,
> Toshiyuki Okajima
> ---
>  fs/ext3/file.c          |   19 ++++++++++++-
>  fs/ext3/inode.c         |   71 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/ext4/inode.c         |    4 ++-
>  include/linux/ext3_fs.h |    1 +
>  4 files changed, 93 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext3/file.c b/fs/ext3/file.c
> index f55df0e..6d376ef 100644
> --- a/fs/ext3/file.c
> +++ b/fs/ext3/file.c
> @@ -52,6 +52,23 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
>  	return 0;
>  }
>  
> +static const struct vm_operations_struct ext3_file_vm_ops = {
> +	.fault          = filemap_fault,
> +	.page_mkwrite   = ext3_page_mkwrite,
> +};
> +
> +static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct address_space *mapping = file->f_mapping;
> +
> +	if (!mapping->a_ops->readpage)
> +		return -ENOEXEC;
> +	file_accessed(file);
> +	vma->vm_ops = &ext3_file_vm_ops;
> +	vma->vm_flags |= VM_CAN_NONLINEAR;
> +	return 0;
> +}
> +
>  const struct file_operations ext3_file_operations = {
>  	.llseek		= generic_file_llseek,
>  	.read		= do_sync_read,
> @@ -62,7 +79,7 @@ const struct file_operations ext3_file_operations = {
>  #ifdef CONFIG_COMPAT
>  	.compat_ioctl	= ext3_compat_ioctl,
>  #endif
> -	.mmap		= generic_file_mmap,
> +	.mmap		= ext3_file_mmap,
>  	.open		= dquot_file_open,
>  	.release	= ext3_release_file,
>  	.fsync		= ext3_sync_file,
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 68b2e43..66c31dd 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -3496,3 +3496,74 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)
>  
>  	return err;
>  }
> +
> +int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct page *page = vmf->page;
> +	loff_t size;
> +	unsigned long len;
> +	int ret = -EINVAL;
> +	void *fsdata;
> +	struct file *file = vma->vm_file;
> +	struct inode *inode = file->f_path.dentry->d_inode;
> +	struct address_space *mapping = inode->i_mapping;
> +
> +	/*
> +	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
> +	 * get i_mutex because we are already holding mmap_sem.
> +	 */
> +	down_read(&inode->i_alloc_sem);
> +	size = i_size_read(inode);
> +	if (page->mapping != mapping || size <= page_offset(page)
> +	   || !PageUptodate(page)) {
> +		/* page got truncated from under us? */
> +		goto out_unlock;
> +	}
> +	ret = 0;
> +	if (PageMappedToDisk(page))
> +		goto out_frozen;
> +
> +	if (page->index == size >> PAGE_CACHE_SHIFT)
> +		len = size & ~PAGE_CACHE_MASK;
> +	else
> +		len = PAGE_CACHE_SIZE;
> +
> +	lock_page(page);
> +	/*
> +	 * return if we have all the buffers mapped. This avoid
> +	 * the need to call write_begin/write_end which does a
> +	 * journal_start/journal_stop which can block and take
> +	 * long time
> +	 */
> +	if (page_has_buffers(page)) {
> +		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> +					buffer_unmapped)) {
> +			unlock_page(page);
> +out_frozen:
> +			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> +			goto out_unlock;
> +		}
> +	}
> +	unlock_page(page);
> +	/*
> +	 * OK, we need to fill the hole... Do write_begin write_end
> +	 * to do block allocation/reservation.We are not holding
> +	 * inode.i__mutex here. That allow * parallel write_begin,
> +	 * write_end call. lock_page prevent this from happening
> +	 * on the same page though
> +	 */
> +	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
> +			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
> +	if (ret < 0)
> +		goto out_unlock;
> +	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
> +			len, len, page, fsdata);
> +	if (ret < 0)
> +		goto out_unlock;
> +	ret = 0;
> +out_unlock:
> +	if (ret)
> +		ret = VM_FAULT_SIGBUS;
> +	up_read(&inode->i_alloc_sem);
> +	return ret;
> +}
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f2fa5e8..44979ae 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	}
>  	ret = 0;
>  	if (PageMappedToDisk(page))
> -		goto out_unlock;
> +		goto out_frozen;
>  
>  	if (page->index == size >> PAGE_CACHE_SHIFT)
>  		len = size & ~PAGE_CACHE_MASK;
> @@ -5830,6 +5830,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
>  					ext4_bh_unmapped)) {
>  			unlock_page(page);
> +out_frozen:
> +			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
>  			goto out_unlock;
>  		}
>  	}
> diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
> index 85c1d30..a0e39ca 100644
> --- a/include/linux/ext3_fs.h
> +++ b/include/linux/ext3_fs.h
> @@ -919,6 +919,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
>  extern void ext3_set_aops(struct inode *inode);
>  extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		       u64 start, u64 len);
> +extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
>  /* ioctl.c */
>  extern long ext3_ioctl(struct file *, unsigned int, unsigned long);

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-22 21:26                                             ` Peter M. Petrakis
@ 2011-04-22 21:40                                               ` Jan Kara
  2011-04-22 22:57                                                 ` Peter M. Petrakis
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-04-22 21:40 UTC (permalink / raw)
  To: Peter M. Petrakis
  Cc: Toshiyuki Okajima, Jan Kara, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, sandeen, Craig Magina

  Hello,

On Fri 22-04-11 17:26:07, Peter M. Petrakis wrote:
> On 04/22/2011 02:58 AM, Toshiyuki Okajima wrote:
> > On Tue, 19 Apr 2011 18:43:16 +0900
> > I have confirmed that the following patch works fine while my or
> > Mizuma-san's reproducer is running. Therefore,
> >  we can block to write the data, which is mmapped to a file, into a disk
> > by a page-fault while fsfreezing. 
> > 
> > I think this patch fixes the following two problems:
> > - A deadlock occurs between ext4_da_writepages() (called from
> > writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> > - We can also write the data, which is mmapped to a file,
> >   into a disk while fsfreezing (ext3/ext4).
> >                                        (reported by me)
> > 
> > Please examine this patch.
> 
> We've recently identified the same root cause in 2.6.32 though the hit rate
> is much much higher. The configuration is a SAN ALUA Active/Standby using
> multipath. The s_wait_unfrozen/s_umount deadlock is regularly encountered
> when a path comes back into service, as a result of a kpartx invocation on
> behalf of this udev rule.
> 
> /lib/udev/rules.d/95-kpartx.rules
> 
> # Create dm tables for partitions
> ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
>         RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"
  Hmm, I don't think this is the same problem... See:

> [ 1898.017614] mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880c3c815200)
> [ 1898.025995] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814780)
> [ 1898.034625] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c814b40), event = (0x12)
> [ 1898.044803] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814b40)
> [ 1898.053475] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815c80), event = (0x12)
> [ 1898.063690] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815c80)
> [ 1898.072316] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815200), event = (0x0f)
> [ 1898.082544] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815200)
> [ 1898.571426] sd 0:0:1:0: alua: port group 01 state S supports toluSnA
> [ 1898.578635] device-mapper: multipath: Failing path 8:32.
> [ 2041.345645] INFO: task kjournald:595 blocked for more than 120 seconds.
> [ 2041.353075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2041.361891] kjournald       D ffff88063acb9a90     0   595      2 0x00000000
> [ 2041.369891]  ffff88063ace1c30 0000000000000046 ffff88063c282140 ffff880600000000
> [ 2041.378416]  0000000000013cc0 ffff88063acb96e0 ffff88063acb9a90 ffff88063ace1fd8
> [ 2041.386954]  ffff88063acb9a98 0000000000013cc0 ffff88063ace0010 0000000000013cc0
> 
> [ 2041.395561] Call Trace:
> [ 2041.398358]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50
> [ 2041.404342]  [<ffffffff815d3120>] io_schedule+0x70/0xc0
> [ 2041.410227]  [<ffffffff811923c5>] sync_buffer+0x45/0x50
> [ 2041.416179]  [<ffffffff815d378f>] __wait_on_bit+0x5f/0x90
> [ 2041.422258]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50
> [ 2041.428275]  [<ffffffff815d3838>] out_of_line_wait_on_bit+0x78/0x90
> [ 2041.435324]  [<ffffffff81086b90>] ? wake_bit_function+0x0/0x40
> [ 2041.441958]  [<ffffffff8119237e>] __wait_on_buffer+0x2e/0x30
> [ 2041.448333]  [<ffffffff8123ab14>] journal_commit_transaction+0x7e4/0xec0
  So kjournald is committing a transaction and waiting for IO to complete.
That may never happen if multipath is in transition, and that would be a
bug...

> [ 2041.670669] multipathd      D ffff88063e3303b0     0  1337      1 0x00000000
> [ 2041.678746]  ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000
> [ 2041.687219]  0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8
> [ 2041.695818]  ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0
> [ 2041.704369] Call Trace:
> [ 2041.707128]  [<ffffffff815d349d>] schedule_timeout+0x21d/0x300
> [ 2041.713679]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90
> [ 2041.719846]  [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410
> [ 2041.726301]  [<ffffffff815d2436>] wait_for_common+0xd6/0x180
> [ 2041.732685]  [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20
> [ 2041.739138]  [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
> [ 2041.746079]  [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20
> [ 2041.752716]  [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0
> [ 2041.759853]  [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240
> [ 2041.766503]  [<ffffffff8107e060>] __request_module+0x190/0x210
> [ 2041.773054]  [<ffffffff812e0c28>] ? sscanf+0x38/0x40
> [ 2041.778636]  [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240
> [ 2041.785121]  [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0
> [ 2041.791312]  [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140
> [ 2041.797671]  [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250
> [ 2041.804413]  [<ffffffff8149de3a>] table_load+0xca/0x2f0
> [ 2041.810317]  [<ffffffff8149dd70>] ? table_load+0x0/0x2f0
> [ 2041.816316]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
> [ 2041.822184]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
> [ 2041.828188]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
> [ 2041.834250]  [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170
> [ 2041.840219]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
> [ 2041.845898]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
  multipathd is hung waiting for a module to be loaded? How come?

> [ 2041.964575] INFO: task kpartx:1897 blocked for more than 120 seconds.
> [ 2041.971801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2041.980626] kpartx          D ffff88063d05df30     0  1897   1896 0x00000000
> [ 2041.988607]  ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000
> [ 2041.997056]  0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8
> [ 2042.005496]  ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0
> [ 2042.013939] Call Trace:
> [ 2042.016702]  [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150
> [ 2042.023089]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40
> [ 2042.030321]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
> [ 2042.036584]  [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70
> [ 2042.042552]  [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330
> [ 2042.049133]  [<ffffffff81115391>] ? do_writepages+0x21/0x40
> [ 2042.055423]  [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60
> [ 2042.062944]  [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90
> [ 2042.069430]  [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70
> [ 2042.075690]  [<ffffffff81166a85>] freeze_super+0x55/0x100
> [ 2042.081754]  [<ffffffff811993b8>] freeze_bdev+0x98/0xe0
> [ 2042.087625]  [<ffffffff81499001>] dm_suspend+0xa1/0x2e0
> [ 2042.093495]  [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0
> [ 2042.099948]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
> [ 2042.105916]  [<ffffffff8149e29b>] do_resume+0x17b/0x1b0
> [ 2042.111784]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
> [ 2042.117753]  [<ffffffff8149e365>] dev_suspend+0x95/0xb0
> [ 2042.123621]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
> [ 2042.129591]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
> [ 2042.135493]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
> [ 2042.141770]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
> [ 2042.147739]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
> [ 2042.153801]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
> [ 2042.159478]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
  kpartx is waiting for kjournald to finish the transaction commit while it
is holding s_umount, but that doesn't really seem to be a problem...

So, as I said, find the reason why kjournald is not able to finish committing
the transaction and you should solve this riddle ;).
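
A minimal sketch of the wait-for reasoning above, in C. It assumes one
reading of the traces: iozone's sync() waits for s_umount, kpartx holds
s_umount and waits for the commit, and kjournald waits for buffer I/O. The
iozone edge is inferred from its stack trace and is an assumption. The names
`struct waiter` and `root_blocker` are made up for illustration and are not
kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a blocked task: the index of the task it waits
 * on, or -1 if it waits on something outside the set (here: block I/O). */
struct waiter {
	const char *name;
	int waits_on;
};

/* Tasks from the hung-task report; the edges are an assumed reading. */
static const struct waiter tasks[] = {
	{ "kjournald", -1 },	/* __wait_on_buffer: waits for I/O */
	{ "kpartx",     0 },	/* log_wait_commit: waits for kjournald */
	{ "iozone",     1 },	/* down_read(s_umount): held by kpartx */
};

#define NTASKS (sizeof(tasks) / sizeof(tasks[0]))

/*
 * Follow the wait-for chain from task 'start'.
 * Returns the index of the task that waits on nothing inside the set
 * (the root blocker), or -1 if the chain loops back on itself.
 */
static int root_blocker(const struct waiter *t, size_t n, int start)
{
	int cur = start;
	size_t steps;

	for (steps = 0; steps < n; steps++) {
		if (t[cur].waits_on < 0)
			return cur;
		cur = t[cur].waits_on;
	}
	return -1;	/* n hops without reaching an end: a cycle */
}
```

Calling `root_blocker(tasks, NTASKS, 2)` walks iozone -> kpartx -> kjournald
and stops there. Under the stated assumptions the chain ends at stalled I/O
and does not loop, which is the difference from the s_umount deadlock
reported at the start of this thread.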

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-22  6:58                                           ` Toshiyuki Okajima
  2011-04-22 21:26                                             ` Peter M. Petrakis
@ 2011-04-22 22:10                                             ` Jan Kara
  2011-04-25  6:28                                               ` Toshiyuki Okajima
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-04-22 22:10 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

On Fri 22-04-11 15:58:39, Toshiyuki Okajima wrote:
> I have confirmed that the following patch works fine while my or
> Mizuma-san's reproducer is running. Therefore,
>  we can block to write the data, which is mmapped to a file, into a disk
> by a page-fault while fsfreezing. 
> 
> I think this patch fixes the following two problems:
> - A deadlock occurs between ext4_da_writepages() (called from
> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> - We can also write the data, which is mmapped to a file,
>   into a disk while fsfreezing (ext3/ext4).
>                                        (reported by me)
> 
> Please examine this patch.
  Thanks for the patch. The ext3 part is not as easy as this. You cannot
really take i_alloc_sem in ext3_page_mkwrite() because mmap_sem is already
held by the page fault code and i_alloc_sem has to be acquired before it
(yes, I know, ext4 already has this bug, which should be fixed when I get to
it). Also, you'll find that the performance of random writers via mmap (which
is relatively common) is going to be rather bad with this patch (because the
file will be heavily fragmented). We have to be more clever, which is
exactly why my patch is taking me so long :) But tests are already
running, so if everything goes fine, I should have patches to submit next
week.
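
The ordering constraint can be sketched as a tiny lock-rank checker in C.
This is an illustration only: the ranks encode the rule stated above
(i_alloc_sem before mmap_sem), and `enum lock_rank`, `struct held_locks`,
`acquire` and `release` are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Ranks encode the required order: a lower rank must be taken first.
 * The ranking follows the rule described in the text above. */
enum lock_rank {
	RANK_I_ALLOC_SEM = 1,
	RANK_MMAP_SEM    = 2,
};

#define MAX_HELD 8

/* Hypothetical per-task record of the locks currently held. */
struct held_locks {
	enum lock_rank stack[MAX_HELD];
	int depth;
};

/* Record an acquisition. Returns false, without recording it, when a
 * lock of higher rank is already held (an ordering violation). */
static bool acquire(struct held_locks *h, enum lock_rank r)
{
	int i;

	assert(h->depth < MAX_HELD);
	for (i = 0; i < h->depth; i++)
		if (h->stack[i] > r)
			return false;
	h->stack[h->depth++] = r;
	return true;
}

/* Drop the most recently acquired lock. */
static void release(struct held_locks *h)
{
	assert(h->depth > 0);
	h->depth--;
}
```

A task that models the page fault path takes RANK_MMAP_SEM first, so a later
`acquire(&h, RANK_I_ALLOC_SEM)` returns false. A task that takes
RANK_I_ALLOC_SEM first and RANK_MMAP_SEM second passes the check.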

The ext4 part looks correct. I'd also like to see some comments explaining
how freeze handling is done, because it's kind of subtle.
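
As a rough userspace analogy of the freeze check, here is a pthread sketch in
C. It assumes that vfs_check_frozen() sleeps on s_wait_unfrozen until the
freeze level drops below the requested one; that reading of the macro is an
assumption of this sketch. `struct fake_sb`, `check_frozen`, `set_frozen` and
`mkwrite_thread` are hypothetical names, and it should be built with
-pthread.

```c
#include <assert.h>
#include <pthread.h>

/* Freeze levels for this sketch only; they imitate the ordering of the
 * kernel's unfrozen and write-frozen states. */
enum { UNFROZEN = 0, FREEZE_WRITE = 1 };

/* Hypothetical stand-in for the relevant super_block fields. */
struct fake_sb {
	pthread_mutex_t lock;
	pthread_cond_t  wait_unfrozen;
	int frozen;		/* current freeze level */
	int writes_done;	/* pages "dirtied" so far */
};

/* Analog of the freeze check: sleep until frozen < level. */
static void check_frozen(struct fake_sb *sb, int level)
{
	pthread_mutex_lock(&sb->lock);
	while (sb->frozen >= level)
		pthread_cond_wait(&sb->wait_unfrozen, &sb->lock);
	pthread_mutex_unlock(&sb->lock);
}

/* Change the freeze level and wake every waiter so it can recheck. */
static void set_frozen(struct fake_sb *sb, int level)
{
	pthread_mutex_lock(&sb->lock);
	sb->frozen = level;
	pthread_cond_broadcast(&sb->wait_unfrozen);
	pthread_mutex_unlock(&sb->lock);
}

/* Analog of the write-fault path: wait for thaw, then dirty a page. */
static void *mkwrite_thread(void *arg)
{
	struct fake_sb *sb = arg;

	check_frozen(sb, FREEZE_WRITE);
	pthread_mutex_lock(&sb->lock);
	sb->writes_done++;
	pthread_mutex_unlock(&sb->lock);
	return NULL;
}
```

A thread started while `frozen` is FREEZE_WRITE cannot increment
`writes_done` until `set_frozen(&sb, UNFROZEN)` is called. Note that in this
sketch nothing stops a new freeze from starting between the check and the
increment; the sketch does not try to close that window.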

								Honza

> diff --git a/fs/ext3/file.c b/fs/ext3/file.c
> index f55df0e..6d376ef 100644
> --- a/fs/ext3/file.c
> +++ b/fs/ext3/file.c
> @@ -52,6 +52,23 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
>  	return 0;
>  }
>  
> +static const struct vm_operations_struct ext3_file_vm_ops = {
> +	.fault          = filemap_fault,
> +	.page_mkwrite   = ext3_page_mkwrite,
> +};
> +
> +static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct address_space *mapping = file->f_mapping;
> +
> +	if (!mapping->a_ops->readpage)
> +		return -ENOEXEC;
> +	file_accessed(file);
> +	vma->vm_ops = &ext3_file_vm_ops;
> +	vma->vm_flags |= VM_CAN_NONLINEAR;
> +	return 0;
> +}
> +
>  const struct file_operations ext3_file_operations = {
>  	.llseek		= generic_file_llseek,
>  	.read		= do_sync_read,
> @@ -62,7 +79,7 @@ const struct file_operations ext3_file_operations = {
>  #ifdef CONFIG_COMPAT
>  	.compat_ioctl	= ext3_compat_ioctl,
>  #endif
> -	.mmap		= generic_file_mmap,
> +	.mmap		= ext3_file_mmap,
>  	.open		= dquot_file_open,
>  	.release	= ext3_release_file,
>  	.fsync		= ext3_sync_file,
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 68b2e43..66c31dd 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -3496,3 +3496,74 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)
>  
>  	return err;
>  }
> +
> +int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct page *page = vmf->page;
> +	loff_t size;
> +	unsigned long len;
> +	int ret = -EINVAL;
> +	void *fsdata;
> +	struct file *file = vma->vm_file;
> +	struct inode *inode = file->f_path.dentry->d_inode;
> +	struct address_space *mapping = inode->i_mapping;
> +
> +	/*
> +	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
> +	 * get i_mutex because we are already holding mmap_sem.
> +	 */
> +	down_read(&inode->i_alloc_sem);
> +	size = i_size_read(inode);
> +	if (page->mapping != mapping || size <= page_offset(page)
> +	   || !PageUptodate(page)) {
> +		/* page got truncated from under us? */
> +		goto out_unlock;
> +	}
> +	ret = 0;
> +	if (PageMappedToDisk(page))
> +		goto out_frozen;
> +
> +	if (page->index == size >> PAGE_CACHE_SHIFT)
> +		len = size & ~PAGE_CACHE_MASK;
> +	else
> +		len = PAGE_CACHE_SIZE;
> +
> +	lock_page(page);
> +	/*
> +	 * return if we have all the buffers mapped. This avoid
> +	 * the need to call write_begin/write_end which does a
> +	 * journal_start/journal_stop which can block and take
> +	 * long time
> +	 */
> +	if (page_has_buffers(page)) {
> +		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> +					buffer_unmapped)) {
> +			unlock_page(page);
> +out_frozen:
> +			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> +			goto out_unlock;
> +		}
> +	}
> +	unlock_page(page);
> +	/*
> +	 * OK, we need to fill the hole... Do write_begin write_end
> +	 * to do block allocation/reservation.We are not holding
> +	 * inode.i__mutex here. That allow * parallel write_begin,
> +	 * write_end call. lock_page prevent this from happening
> +	 * on the same page though
> +	 */
> +	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
> +			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
> +	if (ret < 0)
> +		goto out_unlock;
> +	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
> +			len, len, page, fsdata);
> +	if (ret < 0)
> +		goto out_unlock;
> +	ret = 0;
> +out_unlock:
> +	if (ret)
> +		ret = VM_FAULT_SIGBUS;
> +	up_read(&inode->i_alloc_sem);
> +	return ret;
> +}
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f2fa5e8..44979ae 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	}
>  	ret = 0;
>  	if (PageMappedToDisk(page))
> -		goto out_unlock;
> +		goto out_frozen;
>  
>  	if (page->index == size >> PAGE_CACHE_SHIFT)
>  		len = size & ~PAGE_CACHE_MASK;
> @@ -5830,6 +5830,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
>  					ext4_bh_unmapped)) {
>  			unlock_page(page);
> +out_frozen:
> +			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
>  			goto out_unlock;
>  		}
>  	}
> diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
> index 85c1d30..a0e39ca 100644
> --- a/include/linux/ext3_fs.h
> +++ b/include/linux/ext3_fs.h
> @@ -919,6 +919,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
>  extern void ext3_set_aops(struct inode *inode);
>  extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		       u64 start, u64 len);
> +extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
>  /* ioctl.c */
>  extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
> -- 
> 1.5.5.6
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-22 21:40                                               ` Jan Kara
@ 2011-04-22 22:57                                                 ` Peter M. Petrakis
  0 siblings, 0 replies; 121+ messages in thread
From: Peter M. Petrakis @ 2011-04-22 22:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, sandeen, Craig Magina



On 04/22/2011 05:40 PM, Jan Kara wrote:
>   Hello,
> 
> On Fri 22-04-11 17:26:07, Peter M. Petrakis wrote:
>> On 04/22/2011 02:58 AM, Toshiyuki Okajima wrote:
>>> On Tue, 19 Apr 2011 18:43:16 +0900
>>> I have confirmed that the following patch works fine while my or
>>> Mizuma-san's reproducer is running. Therefore,
>>>  we can block to write the data, which is mmapped to a file, into a disk
>>> by a page-fault while fsfreezing. 
>>>
>>> I think this patch fixes the following two problems:
>>> - A deadlock occurs between ext4_da_writepages() (called from
>>> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
>>> - We can also write the data, which is mmapped to a file,
>>>   into a disk while fsfreezing (ext3/ext4).
>>>                                        (reported by me)
>>>
>>> Please examine this patch.
>>
>> We've recently identified the same root cause in 2.6.32 though the hit rate
>> is much much higher. The configuration is a SAN ALUA Active/Standby using
>> multipath. The s_wait_unfrozen/s_umount deadlock is regularly encountered
>> when a path comes back into service, as a result of a kpartx invocation on
>> behalf of this udev rule.
>>
>> /lib/udev/rules.d/95-kpartx.rules
>>
>> # Create dm tables for partitions
>> ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
>>         RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"
>   Hmm, I don't think this is the same problem... See:

Figures :)

 
>> [ 1898.017614] mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880c3c815200)
>> [ 1898.025995] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814780)
>> [ 1898.034625] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c814b40), event = (0x12)
>> [ 1898.044803] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814b40)
>> [ 1898.053475] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815c80), event = (0x12)
>> [ 1898.063690] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815c80)
>> [ 1898.072316] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815200), event = (0x0f)
>> [ 1898.082544] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815200)
>> [ 1898.571426] sd 0:0:1:0: alua: port group 01 state S supports toluSnA
>> [ 1898.578635] device-mapper: multipath: Failing path 8:32.
>> [ 2041.345645] INFO: task kjournald:595 blocked for more than 120 seconds.
>> [ 2041.353075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 2041.361891] kjournald       D ffff88063acb9a90     0   595      2 0x00000000
>> [ 2041.369891]  ffff88063ace1c30 0000000000000046 ffff88063c282140 ffff880600000000
>> [ 2041.378416]  0000000000013cc0 ffff88063acb96e0 ffff88063acb9a90 ffff88063ace1fd8
>> [ 2041.386954]  ffff88063acb9a98 0000000000013cc0 ffff88063ace0010 0000000000013cc0
>>
>> [ 2041.395561] Call Trace:
>> [ 2041.398358]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50
>> [ 2041.404342]  [<ffffffff815d3120>] io_schedule+0x70/0xc0
>> [ 2041.410227]  [<ffffffff811923c5>] sync_buffer+0x45/0x50
>> [ 2041.416179]  [<ffffffff815d378f>] __wait_on_bit+0x5f/0x90
>> [ 2041.422258]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50
>> [ 2041.428275]  [<ffffffff815d3838>] out_of_line_wait_on_bit+0x78/0x90
>> [ 2041.435324]  [<ffffffff81086b90>] ? wake_bit_function+0x0/0x40
>> [ 2041.441958]  [<ffffffff8119237e>] __wait_on_buffer+0x2e/0x30
>> [ 2041.448333]  [<ffffffff8123ab14>] journal_commit_transaction+0x7e4/0xec0
>   So kjournald is committing a transaction and waiting for IO to complete.
> Which maybe never happens because of multipath being in transition? That
> would be a bug...
>

and it would be a new one for us. It's entirely possible the original deadlock
is resolved, and this is new. With only the tracebacks to consult, and general
unfamiliarity with this area, it looked like the same fault to me.
In 2.6.32 it's a dead ringer per the thread parent:

http://permalink.gmane.org/gmane.comp.file-systems.ext4/23171

[Ubuntu 10.04 - 2.6.32 crashdump]

crash-5.0> ps | grep UN
    992      2   7  ffff8802678a8000  UN   0.0       0      0  [flush-251:5]
  17295   2537   2  ffff880267be0000  UN   0.2   47060  17368  iozone
  17314   2477   5  ffff88026a952010  UN   0.2   47060  17364  iozone
  17447   2573   0  ffff880268bd2010  UN   0.2   47060  17340  iozone
  17460      1  13  ffff88026b3c4020  UN   0.0  191564   1992  rsyslogd
  17606  17597  15  ffff880268420000  UN   0.0   10436    808  kpartx
  17738   2268  13  ffff88016908a010  UN   0.0   17756   1616  dhclient-script
  17747   2223  15  ffff88026a950000  UN   0.0  151460   1596  multipathd
  17748   2284   1  ffff88016908c020  UN   0.0   49260    688  sshd
  17749   2284   1  ffff880169088000  UN   0.0   49260    692  sshd
  17750   2284   1  ffff88016a628000  UN   0.0   49260    688  sshd
  17751   2284   0  ffff88026a3cc020  UN   0.0   49260    688  sshd
  17752   2284   0  ffff88026a3ca010  UN   0.0   49260    688  sshd
  17753   2284   0  ffff88026a3c8000  UN   0.0   49260    688  sshd
  17754   2284   0  ffff880268f60000  UN   0.0   49260    692  sshd
  17755   2284   0  ffff880268f62010  UN   0.0   49260    688  sshd
crash-5.0> bt 17606
PID: 17606  TASK: ffff880268420000  CPU: 15  COMMAND: "kpartx"
 #0 [ffff88026aac3b18] schedule at ffffffff8158bcbd
 #1 [ffff88026aac3bd0] rwsem_down_failed_common at ffffffff8158df2d
 #2 [ffff88026aac3c30] rwsem_down_write_failed at ffffffff8158e0b3
 #3 [ffff88026aac3c70] call_rwsem_down_write_failed at ffffffff812d9903
 #4 [ffff88026aac3ce0] thaw_bdev at ffffffff81186d5a
 #5 [ffff88026aac3d40] unlock_fs at ffffffff8145e46d
 #6 [ffff88026aac3d60] dm_resume at ffffffff8145fb38
 #7 [ffff88026aac3db0] do_resume at ffffffff81465c98
 #8 [ffff88026aac3de0] dev_suspend at ffffffff81465d65
 #9 [ffff88026aac3e20] ctl_ioctl at ffffffff814665f5
#10 [ffff88026aac3e90] dm_ctl_ioctl at ffffffff814666a3
#11 [ffff88026aac3ea0] vfs_ioctl at ffffffff81165e92
#12 [ffff88026aac3ee0] do_vfs_ioctl at ffffffff81166140
#13 [ffff88026aac3f30] sys_ioctl at ffffffff811664b1
#14 [ffff88026aac3f80] system_call_fastpath at ffffffff810131b2
    RIP: 00007fa798b04197  RSP: 00007fff4cf1c6e8  RFLAGS: 00010202
    RAX: 0000000000000010  RBX: ffffffff810131b2  RCX: 0000000000000000
    RDX: 0000000000bcf310  RSI: 00000000c138fd06  RDI: 0000000000000004
    RBP: 0000000000bcf340   R8: 00007fa798dc2528   R9: 00007fff4cf1c640
    R10: 00007fa798dc1dc0  R11: 0000000000000246  R12: 00007fa798dc1dc0
    R13: 0000000000004000  R14: 0000000000bce0f0  R15: 00007fa798dc1dc0
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash-5.0> bt 992

PID: 992    TASK: ffff8802678a8000  CPU: 7   COMMAND: "flush-251:5"
 #0 [ffff880267bddb00] schedule at ffffffff8158bcbd
 #1 [ffff880267bddbb8] ext4_force_commit at ffffffff8120b16d
 #2 [ffff880267bddc18] ext4_write_inode at ffffffff811f29e5
 #3 [ffff880267bddc68] writeback_single_inode at ffffffff81178964
 #4 [ffff880267bddcb8] writeback_sb_inodes at ffffffff81178f09
 #5 [ffff880267bddd18] wb_writeback at ffffffff8117995c
 #6 [ffff880267bdddc8] wb_do_writeback at ffffffff81179b6b
 #7 [ffff880267bdde58] bdi_writeback_task at ffffffff81179cc3
 #8 [ffff880267bdde98] bdi_start_fn at ffffffff8111e816
 #9 [ffff880267bddec8] kthread at ffffffff81088a06
#10 [ffff880267bddf48] kernel_thread at ffffffff810142ea

crash-5.0> super_block.s_frozen ffff880268a4e000
  s_frozen = 0x2,

int ext4_force_commit(struct super_block *sb)
{
        journal_t *journal;
        int ret = 0;

        if (sb->s_flags & MS_RDONLY)
                return 0;

        journal = EXT4_SB(sb)->s_journal;
        if (journal) {
                vfs_check_frozen(sb, SB_FREEZE_TRANS); <=== this is where we sleep
                ret = ext4_journal_force_commit(journal);
        }

        return ret;
}
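
For reference, a small sketch of why the flush thread sleeps here with s_frozen = 0x2. It assumes, as I understand the 2.6.3x headers, that vfs_check_frozen(sb, level) waits until sb->s_frozen < level, and that the freeze levels are 0 (unfrozen), 1 (SB_FREEZE_WRITE) and 2 (SB_FREEZE_TRANS). The model_ names are hypothetical and the helper only evaluates the wait condition; it does not sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror the 2.6.3x freeze levels; the MODEL_ names are made up. */
enum model_freeze_level {
	MODEL_SB_UNFROZEN     = 0,
	MODEL_SB_FREEZE_WRITE = 1,
	MODEL_SB_FREEZE_TRANS = 2,
};

/*
 * Assumed semantics: vfs_check_frozen(sb, level) sleeps until
 * sb->s_frozen < level. This helper only evaluates that condition.
 */
bool model_check_frozen_would_sleep(int s_frozen, int level)
{
	return !(s_frozen < level);
}
```

With s_frozen = 2, as in the crash dump, the helper reports a sleep for both SB_FREEZE_TRANS and SB_FREEZE_WRITE callers; with s_frozen = 1 only SB_FREEZE_WRITE callers would sleep.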
 

I have tried the previous versions of the patch, backporting
to 2.6.32 without any success. I thought I would just go for it
this time with the latest.

 
>> [ 2041.670669] multipathd      D ffff88063e3303b0     0  1337      1 0x00000000
>> [ 2041.678746]  ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000
>> [ 2041.687219]  0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8
>> [ 2041.695818]  ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0
>> [ 2041.704369] Call Trace:
>> [ 2041.707128]  [<ffffffff815d349d>] schedule_timeout+0x21d/0x300
>> [ 2041.713679]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90
>> [ 2041.719846]  [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410
>> [ 2041.726301]  [<ffffffff815d2436>] wait_for_common+0xd6/0x180
>> [ 2041.732685]  [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20
>> [ 2041.739138]  [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
>> [ 2041.746079]  [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20
>> [ 2041.752716]  [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0
>> [ 2041.759853]  [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240
>> [ 2041.766503]  [<ffffffff8107e060>] __request_module+0x190/0x210
>> [ 2041.773054]  [<ffffffff812e0c28>] ? sscanf+0x38/0x40
>> [ 2041.778636]  [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240
>> [ 2041.785121]  [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0
>> [ 2041.791312]  [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140
>> [ 2041.797671]  [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250
>> [ 2041.804413]  [<ffffffff8149de3a>] table_load+0xca/0x2f0
>> [ 2041.810317]  [<ffffffff8149dd70>] ? table_load+0x0/0x2f0
>> [ 2041.816316]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
>> [ 2041.822184]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
>> [ 2041.828188]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
>> [ 2041.834250]  [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170
>> [ 2041.840219]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
>> [ 2041.845898]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
>   multipathd is hung waiting for module to be loaded? How come?

It shouldn't, dh_alua is already loaded. 


>> [ 2041.964575] INFO: task kpartx:1897 blocked for more than 120 seconds.
>> [ 2041.971801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 2041.980626] kpartx          D ffff88063d05df30     0  1897   1896 0x00000000
>> [ 2041.988607]  ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000
>> [ 2041.997056]  0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8
>> [ 2042.005496]  ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0
>> [ 2042.013939] Call Trace:
>> [ 2042.016702]  [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150
>> [ 2042.023089]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40
>> [ 2042.030321]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
>> [ 2042.036584]  [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70
>> [ 2042.042552]  [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330
>> [ 2042.049133]  [<ffffffff81115391>] ? do_writepages+0x21/0x40
>> [ 2042.055423]  [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60
>> [ 2042.062944]  [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90
>> [ 2042.069430]  [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70
>> [ 2042.075690]  [<ffffffff81166a85>] freeze_super+0x55/0x100
>> [ 2042.081754]  [<ffffffff811993b8>] freeze_bdev+0x98/0xe0
>> [ 2042.087625]  [<ffffffff81499001>] dm_suspend+0xa1/0x2e0
>> [ 2042.093495]  [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0
>> [ 2042.099948]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
>> [ 2042.105916]  [<ffffffff8149e29b>] do_resume+0x17b/0x1b0
>> [ 2042.111784]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
>> [ 2042.117753]  [<ffffffff8149e365>] dev_suspend+0x95/0xb0
>> [ 2042.123621]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
>> [ 2042.129591]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
>> [ 2042.135493]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
>> [ 2042.141770]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
>> [ 2042.147739]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
>> [ 2042.153801]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
>> [ 2042.159478]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
>   kpartx is waiting for kjournald to finish transaction commit and it is
> holding s_umount but that doesn't really seem to be a problem...
> 
> So as I say, find a reason why kjournald is not able to finish committing a
> transaction and you should solve this riddle ;).

Cool, thanks!

Peter

> 
> 								Honza


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-22 22:10                                             ` Jan Kara
@ 2011-04-25  6:28                                               ` Toshiyuki Okajima
  2011-05-03  8:06                                                 ` Surbhi Palande
  0 siblings, 1 reply; 121+ messages in thread
From: Toshiyuki Okajima @ 2011-04-25  6:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel, sandeen, toshi.okajima

Hi.

On Sat, 23 Apr 2011 00:10:25 +0200
Jan Kara <jack@suse.cz> wrote:
> On Fri 22-04-11 15:58:39, Toshiyuki Okajima wrote:
> > I have confirmed that the following patch works fine while my or
> > Mizuma-san's reproducer is running. Therefore,
> >  we can block to write the data, which is mmapped to a file, into a disk
> > by a page-fault while fsfreezing. 
> > 
> > I think this patch fixes the following two problems:
> > - A deadlock occurs between ext4_da_writepages() (called from
> > writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> > - We can also write the data, which is mmapped to a file,
> >   into a disk while fsfreezing (ext3/ext4).
> >                                        (reported by me)
> > 
> > Please examine this patch.
>   Thanks for the patch. The ext3 part is not as easy as this. You cannot
> really get i_alloc_sem in ext3_page_mkwrite() because mmap_sem is already
> held by page fault code and i_alloc_sem should be acquired before it (yes I
> know, ext4 already has this bug which should be fixed when I get to it).
> Also you'll find that performance of random writers via mmap (which is
> relatively common) is going to be rather bad with this patch (because the
> file will be heavily fragmented). We have to be more clever which is
> exactly why it's taking me so long with my patch :) But tests are already
> running so if everything goes fine, I should have patches to submit next
> week.
OK, I'll wait for your patch. :)

> 
> The ext4 part looks correct. I'd just also like to have some comments about
> how freeze handling is done because it's kind of subtle.

How about this?

Thanks,
Toshiyuki Okajima

----------------------------------------------------------------------------------------------------
Subject: [PATCH] ext4: prevent the mmapped page flushing to disk while fsfreezing

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
---
 fs/ext4/inode.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..411b177 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	ret = 0;
 	if (PageMappedToDisk(page))
-		goto out_unlock;
+		goto out_frozen;
 
 	if (page->index == size >> PAGE_CACHE_SHIFT)
 		len = size & ~PAGE_CACHE_MASK;
@@ -5830,6 +5830,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
 			unlock_page(page);
+out_frozen:
+			/*
+			 * We must wait here while the filesystem is frozen.
+			 * Otherwise the page is dirtied and a flusher thread
+			 * can write it to disk, i.e. the filesystem would be
+			 * modified even though it is frozen.
+			 */
+			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 			goto out_unlock;
 		}
 	}
-- 
1.5.5.6


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-06 11:21                         ` Dave Chinner
  2011-04-06 13:44                           ` Christoph Hellwig
  2011-04-06 17:40                           ` Jan Kara
@ 2011-05-02  9:07                           ` Surbhi Palande
  2011-05-02 10:56                             ` Jan Kara
  2 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-02  9:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

Hi,

On 04/06/2011 02:21 PM, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>> If you don't allow the page to be dirtied in the first place, then
>>>>> nothing needs to be done to the writeback path because there is
>>>>> nothing dirty for it to write back.
>>>>    Sure but that's only the problem he was able to hit. But generally,
>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>> clear there aren't other code paths which can block with s_umount held
>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>
>>> Holding the s_umount lock while checking if frozen and sleeping
>>> is essentially an ABBA lock inversion bug that can bite in many more
>>> places than just thawing the filesystem.  Anywhere this is done should
>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>> path is sufficient to avoid problems.
>>    That's easily said but hard to do - any transaction start in ext3/4 may
>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>> looking into the code) and transaction start traditionally nests inside
>> s_umount (and basically there's no way around that since sync() calls your
>> fs code with s_umount held).
>
> Sure, but the question must be asked - why is ext3/4 even starting a
> transaction on a clean filesystem during sync? A frozen filesystem,
> by definition, is a clean filesystem, and therefore sync calls of any
> kind should not be trying to write to the FS or start transactions.
> XFS does this just fine, so I'd consider such behaviour on a frozen
> filesystem a bug in ext3/4...

I had a look at the xfs code to see how this is done.
xfs_file_aio_write()
   xfs_wait_for_freeze()
     vfs_check_frozen()
So xfs_file_aio_write() writes to buffers when the FS is not frozen.

Now, I want to know what stops the following scenario from happening:
--------------------
xfs_file_aio_write()
   xfs_wait_for_freeze()
     vfs_check_frozen()
At this point F.S was not frozen, so the next instruction in the 
xfs_file_aio_write() will be executed next.
However at this point (i.e after checking if F.S is frozen) the write 
process gets pre-empted and say the _freeze_ process gets control.

Now the F.S freezes and the write process gets the control back. And so 
we end up writing to the page cache when the F.S is frozen.
--------------------

Can anyone please enlighten me on how & why this pre-emption is _not_
possible?
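
The interleaving in question can be replayed in a tiny single-threaded model. This is only a sketch under stated assumptions: the model_ names are hypothetical, the "check" stands in for the unlocked vfs_check_frozen() test, and nothing here says whether the real kernel closes the window by other means.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_sb {
	bool frozen;              /* stands in for sb->s_frozen != SB_UNFROZEN */
	int writes_while_frozen;  /* page-cache writes that landed after freeze */
};

/* Step 1 of the writer: the unlocked "is it frozen?" check. */
static bool model_writer_check_passes(const struct model_sb *sb)
{
	return !sb->frozen;
}

/* The freezer runs to completion between the writer's check and its write. */
static void model_freezer_run(struct model_sb *sb)
{
	sb->frozen = true;
}

/* Step 2 of the writer: dirty the page cache, trusting the earlier check. */
static void model_writer_dirty_page(struct model_sb *sb)
{
	if (sb->frozen)
		sb->writes_while_frozen++;
}

/* Replay the interleaving from the mail; returns the number of late writes. */
int model_replay_race(void)
{
	struct model_sb sb = { .frozen = false, .writes_while_frozen = 0 };

	if (model_writer_check_passes(&sb)) {	/* writer: not frozen, proceed */
		model_freezer_run(&sb);		/* writer pre-empted, freeze completes */
		model_writer_dirty_page(&sb);	/* writer resumes and writes anyway */
	}
	return sb.writes_while_frozen;
}
```

In this model model_replay_race() counts one write that lands after the freeze, because the check and the write are not atomic with respect to the freezer.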

If this pre-emption is _possible_, then can we use sb->s_umount to
prevent a freeze from happening while a write to the page cache buffers
is going on? E.g.:

* Before writing to the buffers in the page cache:

   down_write(&sb->s_umount);
   if (sb->s_frozen >= SB_FREEZE_WRITE) {
       /* do not sleep while holding the sb->s_umount semaphore */
       up_write(&sb->s_umount);
       vfs_check_frozen(sb, SB_FREEZE_WRITE);
       /* if you are here then the fs has been thawed */
       down_write(&sb->s_umount);
   }


Thanks!


Warm Regards,
Surbhi.




>
>> So I'm afraid we are not going to get rid of
>> this ABBA dependency unless we declare that s_umount ranks above filesystem
>> being frozen - but surely I'm open to suggestions.
>
> Not sure I understand what you are saying there - this is already
> the case, isn't it? i.e. it has to be held exclusive to freeze a
> filesystem...
>
>> Another possibility is just to hide the problem e.g. by checking for frozen
>> filesystem whenever we try to get s_umount. But that looks a bit ugly to
>> me.
>
> And not necessary, AFAICT.
>
> Cheers,
>
> Dave.



* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02  9:07                           ` Surbhi Palande
@ 2011-05-02 10:56                             ` Jan Kara
  2011-05-02 11:27                               ` Surbhi Palande
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-02 10:56 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Dave Chinner, Jan Kara, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
> On 04/06/2011 02:21 PM, Dave Chinner wrote:
> >On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> >>On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> >>>On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> >>>>On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> >>>>>If you don't allow the page to be dirtied in the first place, then
> >>>>>nothing needs to be done to the writeback path because there is
> >>>>>nothing dirty for it to write back.
> >>>>   Sure but that's only the problem he was able to hit. But generally,
> >>>>there's a problem with needing s_umount for unfreezing because it isn't
> >>>>clear there aren't other code paths which can block with s_umount held
> >>>>waiting for fs to get unfrozen. And these code paths would cause the same
> >>>>deadlock. That's why I chose to get rid of s_umount during thawing.
> >>>
> >>>Holding the s_umount lock while checking if frozen and sleeping
> >>>is essentially an ABBA lock inversion bug that can bite in many more
> >>>places than just thawing the filesystem.  Anywhere this is done should
> >>>be fixed, so I don't think just removing the s_umount lock from the thaw
> >>>path is sufficient to avoid problems.
> >>   That's easily said but hard to do - any transaction start in ext3/4 may
> >>block on filesystem being frozen (this seems to be similar for XFS as I'm
> >>looking into the code) and transaction start traditionally nests inside
> >>s_umount (and basically there's no way around that since sync() calls your
> >>fs code with s_umount held).
> >
> >Sure, but the question must be asked - why is ext3/4 even starting a
> >transaction on a clean filesystem during sync? A frozen filesystem,
> >by definition, is a clean filesystem, and therefore sync calls of any
> >kind should not be trying to write to the FS or start transactions.
> >XFS does this just fine, so I'd consider such behaviour on a frozen
> >filesystem a bug in ext3/4...
> 
> I had a look at the xfs code to see how this is done.
> xfs_file_aio_write()
>   xfs_wait_for_freeze()
>     vfs_check_frozen()
> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
> 
> Now, I want to know what stops the following scenario from happening:
> --------------------
> xfs_file_aio_write()
>   xfs_wait_for_freeze()
>     vfs_check_frozen()
> At this point F.S was not frozen, so the next instruction in the
> xfs_file_aio_write() will be executed next.
> However at this point (i.e after checking if F.S is frozen) the
> write process gets pre-empted and say the _freeze_ process gets
> control.
> 
> Now the F.S freezes and the write process gets the control back. And
> so we end up writing to the page cache when the F.S is frozen.
> --------------------
> 
> Can anyone please enlighten me on how & why this pre-emption is _not_
> possible?
  XFS works similarly to ext4 in this regard, I believe. They have the log
frozen in xfs_freeze(), so if the race you describe above happens, either
the writing process gets caught waiting for the log to unfreeze, or it
manages to start a transaction and then the freezing process waits for the
transaction to finish before it can proceed with freezing. I'm not sure why
the check is there in xfs_file_aio_write()...
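
A rough sketch of the two outcomes, as a single-threaded userspace model. All model_ names are hypothetical and this is not how XFS or jbd2 is actually implemented; it only assumes that freezing first raises a barrier against new transactions and then waits for the running ones to finish.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a log with a freeze barrier; names are illustrative. */
struct model_log {
	bool barrier;	/* freeze in progress or done: no new transactions */
	int active;	/* transactions started but not yet stopped */
};

/* Returns false where a real writer would sleep until the log is thawed. */
bool model_trans_start(struct model_log *log)
{
	if (log->barrier)
		return false;
	log->active++;
	return true;
}

void model_trans_stop(struct model_log *log)
{
	assert(log->active > 0);
	log->active--;
}

/* The freezer first blocks new transactions... */
void model_freeze_begin(struct model_log *log)
{
	log->barrier = true;
}

/* ...and the freeze is complete only once running transactions drain. */
bool model_freeze_done(const struct model_log *log)
{
	return log->barrier && log->active == 0;
}

/* Replays both outcomes; returns true if the model behaves as described. */
bool model_replay_freeze(void)
{
	struct model_log log = { .barrier = false, .active = 0 };
	bool ok = true;

	ok = ok && model_trans_start(&log);	/* writer won the race */
	model_freeze_begin(&log);
	ok = ok && !model_freeze_done(&log);	/* freezer has to wait for it */
	ok = ok && !model_trans_start(&log);	/* a later writer is caught */
	model_trans_stop(&log);
	ok = ok && model_freeze_done(&log);	/* now the freeze can finish */
	return ok;
}
```

In either case no transaction is running once model_freeze_done() reports completion, which is the property the paragraph above relies on.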

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 10:56                             ` Jan Kara
@ 2011-05-02 11:27                               ` Surbhi Palande
  2011-05-02 12:06                                 ` Surbhi Palande
  2011-05-02 12:20                                 ` Jan Kara
  0 siblings, 2 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-02 11:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On 05/02/2011 01:56 PM, Jan Kara wrote:
> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>> If you don't allow the page to be dirtied in the first place, then
>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>> nothing dirty for it to write back.
>>>>>>    Sure but that's only the problem he was able to hit. But generally,
>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>> places than just thawing the filesystem.  Anywhere this is done should
>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>> path is sufficient to avoid problems.
>>>>    That's easily said but hard to do - any transaction start in ext3/4 may
>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>> looking into the code) and transaction start traditionally nests inside
>>>> s_umount (and basically there's no way around that since sync() calls your
>>>> fs code with s_umount held).
>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>> by definition, is a clean filesytem, and therefore sync calls of any
>>> kind should not be trying to write to the FS or start transactions.
>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>> filesystem a bug in ext3/4...
>> I had a look at the xfs code for seeing how this is done.
>> xfs_file_aio_write()
>>    xfs_wait_for_freeze()
>>      vfs_check_frozen()
>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>
>> Now, I want to know what stops the following scenario from happening:
>> --------------------
>> xfs_file_aio_write()
>>    xfs_wait_for_freeze()
>>      vfs_check_frozen()
>> At this point F.S was not frozen, so the next instruction in the
>> xfs_file_aio_write() will be executed next.
>> However at this point (i.e after checking if F.S is frozen) the
>> write process gets pre-empted and say the _freeze_ process gets
>> control.
>>
>> Now the F.S freezes and the write process gets the control back. And
>> so we end up writing to the page cache when the F.S is frozen.
>> --------------------
>>
>> Can anyone please enlighten me on how&  why this premption is _not_
>> possible?
Thanks for your reply.
>    XFS works similarly as ext4 in this regard I believe. They have the log
> frozen in xfs_freeze() so if the race you describe above happens, either
> the writing process gets caught waiting for log to unfreeze
Agreed.
>   or it manages
> to start a transaction and then freezing process waits for transaction to
> finish before it can proceed with freezing. I'm not sure why is there the
> check in xfs_file_aio_write()...
>
> 			
I am sorry, but I don't understand how this will happen. What stops
freeze_super() (or ext4_freeze()) from freezing the superblock, given
that the write process stopped just before writing anything for this
transaction and has not taken any locks?

Thanks!

Warm Regards,
Surbhi.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 11:27                               ` Surbhi Palande
@ 2011-05-02 12:06                                 ` Surbhi Palande
  2011-05-02 12:20                                 ` Jan Kara
  1 sibling, 0 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-02 12:06 UTC (permalink / raw)
  To: surbhi.palande
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4

On 05/02/2011 02:27 PM, Surbhi Palande wrote:
> On 05/02/2011 01:56 PM, Jan Kara wrote:
>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>> nothing dirty for it to write back.
>>>>>>> Sure but that's only the problem he was able to hit. But generally,
>>>>>>> there's a problem with needing s_umount for unfreezing because it
>>>>>>> isn't
>>>>>>> clear there aren't other code paths which can block with s_umount
>>>>>>> held
>>>>>>> waiting for fs to get unfrozen. And these code paths would cause
>>>>>>> the same
>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>> places that just thawing the filesystem. Any where this is done
>>>>>> should
>>>>>> be fixed, so I don't think just removing the s_umount lock from
>>>>>> the thaw
>>>>>> path is sufficient to avoid problems.
>>>>> That's easily said but hard to do - any transaction start in ext3/4
>>>>> may
>>>>> block on filesystem being frozen (this seems to be similar for XFS
>>>>> as I'm
>>>>> looking into the code) and transaction start traditionally nests
>>>>> inside
>>>>> s_umount (and basically there's no way around that since sync()
>>>>> calls your
>>>>> fs code with s_umount held).
>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>> kind should not be trying to write to the FS or start transactions.
>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>> filesystem a bug in ext3/4...
>>> I had a look at the xfs code for seeing how this is done.
>>> xfs_file_aio_write()
>>> xfs_wait_for_freeze()
>>> vfs_check_frozen()
>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>
>>> Now, I want to know what stops the following scenario from happening:
>>> --------------------
>>> xfs_file_aio_write()
>>> xfs_wait_for_freeze()
>>> vfs_check_frozen()
>>> At this point F.S was not frozen, so the next instruction in the
>>> xfs_file_aio_write() will be executed next.
>>> However at this point (i.e after checking if F.S is frozen) the
>>> write process gets pre-empted and say the _freeze_ process gets
>>> control.
>>>
>>> Now the F.S freezes and the write process gets the control back. And
>>> so we end up writing to the page cache when the F.S is frozen.
>>> --------------------
>>>
>>> Can anyone please enlighten me on how& why this premption is _not_
>>> possible?
> Thanks for your reply.
>> XFS works similarly as ext4 in this regard I believe. They have the log
>> frozen in xfs_freeze() so if the race you describe above happens, either
>> the writing process gets caught waiting for log to unfreeze
> Agreed.
>> or it manages
>> to start a transaction and then freezing process waits for transaction to
>> finish before it can proceed with freezing. I'm not sure why is there the
>> check in xfs_file_aio_write()...
>>
>>
> I am sorry, but I don't understand how this will happen - i.e I can't
> understand what stops freeze_super() (or ext4_freeze) from freezing a
> superblock (as the write process stopped just before writing anything
> for this transaction and has not taken any locks?)

To make myself a little more coherent:

freeze_super()
   ext4_freeze()
     1) jbd2_journal_lock_updates(journal)
     2) jbd2_journal_flush(journal)
     3) jbd2_journal_unlock_updates(journal).
     4) return

Say the write process has just checked that the fs is not frozen (i.e. it
is thawed), so it is ready to write to the page cache. Just after it has
finished vfs_check_frozen() and before it starts any write (or
transaction), say the write process gets pre-empted and the freeze
process freezes the superblock. Won't ext4_freeze() simply lock the
current transactions, flush them to the log and then unlock the
transactions (so that new handles/transactions can be accepted later)?

Then, after fsfreeze finishes freezing the F.S, say the write process
gets control back. The write process assumes that once it is out of
vfs_check_frozen() the fs is thawed (or unfrozen), whereas in this case
it is not.

So I don't understand _what_ stops the writing process from starting a
transaction in this case, when the F.S is already frozen, and what would
make fsfreeze wait for the write process (when it has not yet started
the write)?
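
The interleaving described above can be replayed as a small single-threaded
C model. This is only a sketch: the toy_sb structure and the writer_*/freezer_*
helpers are hypothetical stand-ins for s_frozen, vfs_check_frozen() and the
freeze path, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy superblock: a frozen flag plus a counter of writes that landed
 * while the flag was set. Both fields are illustrative only. */
struct toy_sb {
    bool frozen;               /* stands in for s_frozen */
    int  dirtied_while_frozen; /* how many writes hit a frozen fs */
};

/* Writer, step 1: the check (stands in for vfs_check_frozen()). */
static bool writer_check_not_frozen(const struct toy_sb *sb)
{
    return !sb->frozen;
}

/* Writer, step 2: the modification itself. Nothing here re-checks the
 * flag, so the write goes ahead; we only record whether it was late. */
static void writer_dirty_page(struct toy_sb *sb)
{
    if (sb->frozen)
        sb->dirtied_while_frozen++;
}

/* Freezer: lock, flush and unlock are collapsed into setting the flag. */
static void freezer_freeze(struct toy_sb *sb)
{
    sb->frozen = true;
}

/* Replays the order from the mail: check, freeze, then write. */
static int replay_race(void)
{
    struct toy_sb sb = { .frozen = false, .dirtied_while_frozen = 0 };

    if (writer_check_not_frozen(&sb)) { /* writer: check passes */
        freezer_freeze(&sb);            /* writer pre-empted, freeze completes */
        writer_dirty_page(&sb);         /* writer resumes and writes anyway */
    }
    assert(sb.frozen);
    return sb.dirtied_while_frozen;
}
```

In this model replay_race() reports one write to a frozen filesystem, because
the check and the write are two separate steps with nothing tying them
together.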

Warm Regards,
Surbhi.




>
> Thanks!
>
> Warm Regards,
> Surbhi.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 11:27                               ` Surbhi Palande
  2011-05-02 12:06                                 ` Surbhi Palande
@ 2011-05-02 12:20                                 ` Jan Kara
  2011-05-02 12:30                                   ` Surbhi Palande
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-02 12:20 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Jan Kara, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
> On 05/02/2011 01:56 PM, Jan Kara wrote:
> >On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
> >>On 04/06/2011 02:21 PM, Dave Chinner wrote:
> >>>On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> >>>>On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> >>>>>On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> >>>>>>On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> >>>>>>>If you don't allow the page to be dirtied in the fist place, then
> >>>>>>>nothing needs to be done to the writeback path because there is
> >>>>>>>nothing dirty for it to write back.
> >>>>>>   Sure but that's only the problem he was able to hit. But generally,
> >>>>>>there's a problem with needing s_umount for unfreezing because it isn't
> >>>>>>clear there aren't other code paths which can block with s_umount held
> >>>>>>waiting for fs to get unfrozen. And these code paths would cause the same
> >>>>>>deadlock. That's why I chose to get rid of s_umount during thawing.
> >>>>>Holding the s_umount lock while checking if frozen and sleeping
> >>>>>is essentially an ABBA lock inversion bug that can bite in many more
> >>>>>places that just thawing the filesystem.  Any where this is done should
> >>>>>be fixed, so I don't think just removing the s_umount lock from the thaw
> >>>>>path is sufficient to avoid problems.
> >>>>   That's easily said but hard to do - any transaction start in ext3/4 may
> >>>>block on filesystem being frozen (this seems to be similar for XFS as I'm
> >>>>looking into the code) and transaction start traditionally nests inside
> >>>>s_umount (and basically there's no way around that since sync() calls your
> >>>>fs code with s_umount held).
> >>>Sure, but the question must be asked - why is ext3/4 even starting a
> >>>transaction on a clean filesystem during sync? A frozen filesystem,
> >>>by definition, is a clean filesytem, and therefore sync calls of any
> >>>kind should not be trying to write to the FS or start transactions.
> >>>XFS does this just fine, so I'd consider such behaviour on a frozen
> >>>filesystem a bug in ext3/4...
> >>I had a look at the xfs code for seeing how this is done.
> >>xfs_file_aio_write()
> >>   xfs_wait_for_freeze()
> >>     vfs_check_frozen()
> >>So xfs_file_aio_write() writes to buffers when the FS is not frozen.
> >>
> >>Now, I want to know what stops the following scenario from happening:
> >>--------------------
> >>xfs_file_aio_write()
> >>   xfs_wait_for_freeze()
> >>     vfs_check_frozen()
> >>At this point F.S was not frozen, so the next instruction in the
> >>xfs_file_aio_write() will be executed next.
> >>However at this point (i.e after checking if F.S is frozen) the
> >>write process gets pre-empted and say the _freeze_ process gets
> >>control.
> >>
> >>Now the F.S freezes and the write process gets the control back. And
> >>so we end up writing to the page cache when the F.S is frozen.
> >>--------------------
> >>
> >>Can anyone please enlighten me on how&  why this premption is _not_
> >>possible?
> Thanks for your reply.
> >   XFS works similarly as ext4 in this regard I believe. They have the log
> >frozen in xfs_freeze() so if the race you describe above happens, either
> >the writing process gets caught waiting for log to unfreeze
> Agreed.
> >  or it manages
> >to start a transaction and then freezing process waits for transaction to
> >finish before it can proceed with freezing. I'm not sure why is there the
> >check in xfs_file_aio_write()...
> >
> >			
> I am sorry, but I don't understand how this will happen - i.e I
> can't understand what stops freeze_super() (or ext4_freeze) from
> freezing a superblock (as the write process stopped just before
> writing anything for this transaction and has not taken any locks?)
  So ext4_freeze() calls
jbd2_journal_lock_updates(journal),
  which waits for all running transactions to finish and increments
j_barrier_count, which stops any new ones from proceeding (see
start_this_handle()).
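
A rough model of the barrier just described, under the assumption that it
behaves as stated in this mail (the toy_* names are hypothetical; where the
real jbd2 code sleeps, the model simply returns false):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy journal: only the two counters that matter for the barrier. */
struct toy_journal {
    int barrier_count;   /* models j_barrier_count */
    int running_handles; /* handles open in the running transaction */
};

/* Models the check in start_this_handle(): a raised barrier keeps new
 * handles out. The real code would sleep instead of returning false. */
static bool toy_start_handle(struct toy_journal *j)
{
    if (j->barrier_count > 0)
        return false;
    j->running_handles++;
    return true;
}

static void toy_stop_handle(struct toy_journal *j)
{
    assert(j->running_handles > 0);
    j->running_handles--;
}

/* Models jbd2_journal_lock_updates(): raise the barrier, then report
 * whether existing handles have drained (the real code waits for that). */
static bool toy_lock_updates(struct toy_journal *j)
{
    j->barrier_count++;
    return j->running_handles == 0;
}

/* Models jbd2_journal_unlock_updates(): drop the barrier again. */
static void toy_unlock_updates(struct toy_journal *j)
{
    assert(j->barrier_count > 0);
    j->barrier_count--;
}
```

While the barrier is raised toy_start_handle() fails; once
toy_unlock_updates() has run, new handles are admitted again.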

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 12:20                                 ` Jan Kara
@ 2011-05-02 12:30                                   ` Surbhi Palande
  2011-05-02 13:16                                     ` Jan Kara
  2011-05-02 14:01                                     ` Eric Sandeen
  0 siblings, 2 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-02 12:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On 05/02/2011 03:20 PM, Jan Kara wrote:
> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>    Sure but that's only the problem he was able to hit. But generally,
>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>> path is sufficient to avoid problems.
>>>>>>    That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>> fs code with s_umount held).
>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>> kind should not be trying to write to the FS or start transactions.
>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>> filesystem a bug in ext3/4...
>>>> I had a look at the xfs code for seeing how this is done.
>>>> xfs_file_aio_write()
>>>>    xfs_wait_for_freeze()
>>>>      vfs_check_frozen()
>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>
>>>> Now, I want to know what stops the following scenario from happening:
>>>> --------------------
>>>> xfs_file_aio_write()
>>>>    xfs_wait_for_freeze()
>>>>      vfs_check_frozen()
>>>> At this point F.S was not frozen, so the next instruction in the
>>>> xfs_file_aio_write() will be executed next.
>>>> However at this point (i.e after checking if F.S is frozen) the
>>>> write process gets pre-empted and say the _freeze_ process gets
>>>> control.
>>>>
>>>> Now the F.S freezes and the write process gets the control back. And
>>>> so we end up writing to the page cache when the F.S is frozen.
>>>> --------------------
>>>>
>>>> Can anyone please enlighten me on how&   why this premption is _not_
>>>> possible?
>> Thanks for your reply.
>>>    XFS works similarly as ext4 in this regard I believe. They have the log
>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>> the writing process gets caught waiting for log to unfreeze
>> Agreed.
>>>   or it manages
>>> to start a transaction and then freezing process waits for transaction to
>>> finish before it can proceed with freezing. I'm not sure why is there the
>>> check in xfs_file_aio_write()...
>>>
>>> 			
>> I am sorry, but I don't understand how this will happen - i.e I
>> can't understand what stops freeze_super() (or ext4_freeze) from
>> freezing a superblock (as the write process stopped just before
>> writing anything for this transaction and has not taken any locks?)
>    So ext4_freeze() does
> jbd2_journal_lock_updates(journal)
>    which waits for all running transactions to finish and updates
> j_barrier_count which stops any news ones from proceeding (check
> function start_this_handle()).
>
Yes, but ext4_freeze() also calls jbd2_journal_unlock_updates(journal)
before it returns, which decrements the j_barrier_count that was
incremented in jbd2_journal_lock_updates(). So after this call a new
transaction/handle can be accepted/started.

A comment in ext4_freeze(), just before the call to
jbd2_journal_unlock_updates(), says:
/* we rely on s_frozen to stop further updates */

Warm Regards,
Surbhi.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 12:30                                   ` Surbhi Palande
@ 2011-05-02 13:16                                     ` Jan Kara
  2011-05-02 13:22                                       ` Christoph Hellwig
  2011-05-02 13:22                                       ` Surbhi Palande
  2011-05-02 14:01                                     ` Eric Sandeen
  1 sibling, 2 replies; 121+ messages in thread
From: Jan Kara @ 2011-05-02 13:16 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Jan Kara, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel,
	Christoph Hellwig

On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
> On 05/02/2011 03:20 PM, Jan Kara wrote:
> >On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
> >>On 05/02/2011 01:56 PM, Jan Kara wrote:
> >>>On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
> >>>>On 04/06/2011 02:21 PM, Dave Chinner wrote:
> >>>>>On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> >>>>>>On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> >>>>>>>On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> >>>>>>>>On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> >>>>>>>>>If you don't allow the page to be dirtied in the fist place, then
> >>>>>>>>>nothing needs to be done to the writeback path because there is
> >>>>>>>>>nothing dirty for it to write back.
> >>>>>>>>   Sure but that's only the problem he was able to hit. But generally,
> >>>>>>>>there's a problem with needing s_umount for unfreezing because it isn't
> >>>>>>>>clear there aren't other code paths which can block with s_umount held
> >>>>>>>>waiting for fs to get unfrozen. And these code paths would cause the same
> >>>>>>>>deadlock. That's why I chose to get rid of s_umount during thawing.
> >>>>>>>Holding the s_umount lock while checking if frozen and sleeping
> >>>>>>>is essentially an ABBA lock inversion bug that can bite in many more
> >>>>>>>places that just thawing the filesystem.  Any where this is done should
> >>>>>>>be fixed, so I don't think just removing the s_umount lock from the thaw
> >>>>>>>path is sufficient to avoid problems.
> >>>>>>   That's easily said but hard to do - any transaction start in ext3/4 may
> >>>>>>block on filesystem being frozen (this seems to be similar for XFS as I'm
> >>>>>>looking into the code) and transaction start traditionally nests inside
> >>>>>>s_umount (and basically there's no way around that since sync() calls your
> >>>>>>fs code with s_umount held).
> >>>>>Sure, but the question must be asked - why is ext3/4 even starting a
> >>>>>transaction on a clean filesystem during sync? A frozen filesystem,
> >>>>>by definition, is a clean filesytem, and therefore sync calls of any
> >>>>>kind should not be trying to write to the FS or start transactions.
> >>>>>XFS does this just fine, so I'd consider such behaviour on a frozen
> >>>>>filesystem a bug in ext3/4...
> >>>>I had a look at the xfs code for seeing how this is done.
> >>>>xfs_file_aio_write()
> >>>>   xfs_wait_for_freeze()
> >>>>     vfs_check_frozen()
> >>>>So xfs_file_aio_write() writes to buffers when the FS is not frozen.
> >>>>
> >>>>Now, I want to know what stops the following scenario from happening:
> >>>>--------------------
> >>>>xfs_file_aio_write()
> >>>>   xfs_wait_for_freeze()
> >>>>     vfs_check_frozen()
> >>>>At this point F.S was not frozen, so the next instruction in the
> >>>>xfs_file_aio_write() will be executed next.
> >>>>However at this point (i.e after checking if F.S is frozen) the
> >>>>write process gets pre-empted and say the _freeze_ process gets
> >>>>control.
> >>>>
> >>>>Now the F.S freezes and the write process gets the control back. And
> >>>>so we end up writing to the page cache when the F.S is frozen.
> >>>>--------------------
> >>>>
> >>>>Can anyone please enlighten me on how&   why this premption is _not_
> >>>>possible?
> >>Thanks for your reply.
> >>>   XFS works similarly as ext4 in this regard I believe. They have the log
> >>>frozen in xfs_freeze() so if the race you describe above happens, either
> >>>the writing process gets caught waiting for log to unfreeze
> >>Agreed.
> >>>  or it manages
> >>>to start a transaction and then freezing process waits for transaction to
> >>>finish before it can proceed with freezing. I'm not sure why is there the
> >>>check in xfs_file_aio_write()...
> >>>
> >>>			
> >>I am sorry, but I don't understand how this will happen - i.e I
> >>can't understand what stops freeze_super() (or ext4_freeze) from
> >>freezing a superblock (as the write process stopped just before
> >>writing anything for this transaction and has not taken any locks?)
> >   So ext4_freeze() does
> >jbd2_journal_lock_updates(journal)
> >   which waits for all running transactions to finish and updates
> >j_barrier_count which stops any news ones from proceeding (check
> >function start_this_handle()).
> >
> Yes, but ext4_freeze() also calls
> jbd2_journal_unlock_updates(journal) which decrements the
> j_barrier_count (which was previously updated/incremented in
> jbd2_journal_lock_updates) ? before it returns. So after this call a
> new transaction/handle can be accepted/started.
> 
> A comment in ext4_freeze() says:
> /* we rely on s_frozen to stop further updates */
> (before calling jbd2_journal_unlock_updates())
  Ah, drat, you're right. I've missed this other part. It's the problem
that if you expect to see something, you'll see it regardless of the real
code ;).

The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
it's still racy (although the race window is relatively small) because the
filesystem can become frozen the instant after we check vfs_check_frozen().
Commit 6b0310fb broke it for ext4.

I guess the code was mostly copied from XFS, which seems to have had the
same problem in xfs_trans_alloc() since the beginning of git history. I see
two ways to fix this. One is to fix ext4/xfs to check s_frozen after
starting a transaction: if the filesystem is being frozen, we stop the
transaction, wait for the fs to get unfrozen, and restart. The other is
to create analogous logic using an atomic counter of write ops in the VFS
that could be used by all filesystems. We'd just have to replace
vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
appropriate places...
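
One possible shape of the second option, as a userspace sketch with C11
atomics. The toy_* functions are hypothetical and only illustrate the idea
behind the proposed vfs_start_write()/vfs_stop_write() pair. The writer
announces itself before checking the flag, and the freezer sets the flag
before checking the counter, so with sequentially consistent atomics at
least one side always notices the other.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_sb {
    atomic_int  writers; /* write operations currently in flight */
    atomic_bool frozen;  /* set once a freeze has been requested */
};

static void toy_sb_init(struct toy_sb *sb)
{
    atomic_init(&sb->writers, 0);
    atomic_init(&sb->frozen, false);
}

/* Hypothetical vfs_start_write(): count ourselves in first, then check.
 * Returns false where a real implementation would sleep until thaw. */
static bool toy_start_write(struct toy_sb *sb)
{
    atomic_fetch_add(&sb->writers, 1);
    if (atomic_load(&sb->frozen)) {
        atomic_fetch_sub(&sb->writers, 1); /* back off, freeze wins */
        return false;
    }
    return true;
}

/* Hypothetical vfs_stop_write(): the write operation is finished. */
static void toy_stop_write(struct toy_sb *sb)
{
    assert(atomic_load(&sb->writers) > 0);
    atomic_fetch_sub(&sb->writers, 1);
}

/* Freeze side: set the flag first, then look at the counter. Returns
 * true once no writer is in flight; real code would wait for that. */
static bool toy_try_freeze(struct toy_sb *sb)
{
    atomic_store(&sb->frozen, true);
    return atomic_load(&sb->writers) == 0;
}
```

Unlike a bare vfs_check_frozen(), a writer that passed toy_start_write() stays
visible to the freezer until it calls toy_stop_write(), so the freeze cannot
complete underneath it.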

Dave, Christoph, any opinions on this?
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 13:16                                     ` Jan Kara
@ 2011-05-02 13:22                                       ` Christoph Hellwig
  2011-05-02 14:20                                         ` Jan Kara
  2011-05-02 13:22                                       ` Surbhi Palande
  1 sibling, 1 reply; 121+ messages in thread
From: Christoph Hellwig @ 2011-05-02 13:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Surbhi Palande, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel,
	Christoph Hellwig

On Mon, May 02, 2011 at 03:16:19PM +0200, Jan Kara wrote:
> Dave, Christoph, any opinions on this?

The busy loop in xfs_quiesce_attr() which waits for all active transactions
to finish is supposed to fix this issue.

Note that XFS traditionally expects a two-stage freeze process where
we first freeze new VFS-level writes, then flush the caches, and then
stop transactions, wait for them to finish and do the remainder of
the freeze process, but I really messed that process up when moving
the sequence to generic code.  Funnily enough it seems to work
nevertheless.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 13:16                                     ` Jan Kara
  2011-05-02 13:22                                       ` Christoph Hellwig
@ 2011-05-02 13:22                                       ` Surbhi Palande
  2011-05-02 13:24                                         ` Christoph Hellwig
  2011-05-02 14:04                                         ` Eric Sandeen
  1 sibling, 2 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-02 13:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, Christoph Hellwig

On 05/02/2011 04:16 PM, Jan Kara wrote:
> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>    Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>    That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>> fs code with s_umount held).
>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>> filesystem a bug in ext3/4...
>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>> xfs_file_aio_write()
>>>>>>    xfs_wait_for_freeze()
>>>>>>      vfs_check_frozen()
>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>
>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>> --------------------
>>>>>> xfs_file_aio_write()
>>>>>>    xfs_wait_for_freeze()
>>>>>>      vfs_check_frozen()
>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>> xfs_file_aio_write() will be executed next.
>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>> control.
>>>>>>
>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>> --------------------
>>>>>>
>>>>>> Can anyone please enlighten me on how&    why this premption is _not_
>>>>>> possible?
>>>> Thanks for your reply.
>>>>>    XFS works similarly as ext4 in this regard I believe. They have the log
>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>> the writing process gets caught waiting for log to unfreeze
>>>> Agreed.
>>>>>   or it manages
>>>>> to start a transaction and then freezing process waits for transaction to
>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>> check in xfs_file_aio_write()...
>>>>>
>>>>> 			
>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>> freezing a superblock (as the write process stopped just before
>>>> writing anything for this transaction and has not taken any locks?)
>>>    So ext4_freeze() does
>>> jbd2_journal_lock_updates(journal)
>>>    which waits for all running transactions to finish and updates
>>> j_barrier_count which stops any news ones from proceeding (check
>>> function start_this_handle()).
>>>
>> Yes, but ext4_freeze() also calls
>> jbd2_journal_unlock_updates(journal) which decrements the
>> j_barrier_count (which was previously updated/incremented in
>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>> new transaction/handle can be accepted/started.
>>
>> A comment in ext4_freeze() says:
>> /* we rely on s_frozen to stop further updates */
>> (before calling jbd2_journal_unlock_updates())
>    Ah, drat, you're right. I've missed this other part. It's the problem
> that if you expect to see something, you'll see it regardless of the real
> code ;).
>
> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
> it's still racy (although the race window is relatively small) because the
> filesystem can become frozen the instant after we check vfs_check_frozen().
> Commit 6b0310fb broke it for ext4.
>
> I guess the code was mostly copied from XFS which seems to have the same
> problem in xfs_trans_alloc() since the git history beginning. I see two
> ways to fix this - either fix ext4/xfs to check s_frozen after starting
> a transaction and if the filesystem is being frozen, we stop the
> transaction, wait for fs to get unfrozen, and restart. Another option is
> to create an analogous logic using a atomic counter of write ops in vfs
> that could be used by all filesystems. We'd just have to replace
> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
> appropriate places...
How about calling jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal)
from ext4_unfreeze() instead?

That way no transactions can be started until unfreeze is called.

This has another advantage: it rightly prevents access time updates
while the F.S is frozen (touch_atime() called from a read path).
Otherwise we also need to fix this path.
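
To make the difference concrete, here is a minimal userspace sketch of the
barrier counter only. All fake_* names are hypothetical and not jbd2 code;
the model assumes that a handle may start only while the counter is zero,
and it leaves out all locking and waiting that the real functions do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a journal; only the barrier counter is kept. */
struct fake_journal {
	int barrier_count;	/* stands in for j_barrier_count */
};

/* Raise the barrier (the real call also waits for running transactions). */
static void fake_lock_updates(struct fake_journal *j)
{
	j->barrier_count++;
}

/* Drop the barrier again. */
static void fake_unlock_updates(struct fake_journal *j)
{
	assert(j->barrier_count > 0);
	j->barrier_count--;
}

/* Assumption: new handles may start only while no barrier is raised. */
static bool fake_can_start_handle(const struct fake_journal *j)
{
	return j->barrier_count == 0;
}

/* Behaviour described in this thread: barrier dropped before freeze returns. */
static void fake_freeze_current(struct fake_journal *j)
{
	fake_lock_updates(j);
	/* ... the journal would be flushed here ... */
	fake_unlock_updates(j);
}

/* Proposal: keep the barrier raised until unfreeze. */
static void fake_freeze_proposed(struct fake_journal *j)
{
	fake_lock_updates(j);
}

static void fake_unfreeze_proposed(struct fake_journal *j)
{
	fake_unlock_updates(j);
}
```

In this model fake_can_start_handle() is true again right after
fake_freeze_current(), but stays false between fake_freeze_proposed() and
fake_unfreeze_proposed().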

Warm Regards,
Surbhi.

> Dave, Christoph, any opinions on this?
> 								Honza


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 13:22                                       ` Surbhi Palande
@ 2011-05-02 13:24                                         ` Christoph Hellwig
  2011-05-02 13:27                                           ` Surbhi Palande
  2011-05-02 14:04                                         ` Eric Sandeen
  1 sibling, 1 reply; 121+ messages in thread
From: Christoph Hellwig @ 2011-05-02 13:24 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Jan Kara, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel,
	Christoph Hellwig

On Mon, May 02, 2011 at 04:22:45PM +0300, Surbhi Palande wrote:
> This has another advantage, that it rightfully does not let you
> update the access time when the F.S is frozen (touch_atime called
> from a read path when the F.S is frozen) Otherwise we also need to
> fix this path.

In most filesystems atime updates aren't transactional.  They just
get written into inode->i_atime, and at some later point when the
VFS tries to clean the inode it gets written back, either through
a transaction or not.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 13:24                                         ` Christoph Hellwig
@ 2011-05-02 13:27                                           ` Surbhi Palande
  2011-05-02 14:26                                             ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-02 13:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On 05/02/2011 04:24 PM, Christoph Hellwig wrote:
> On Mon, May 02, 2011 at 04:22:45PM +0300, Surbhi Palande wrote:
>> This has another advantage, that it rightfully does not let you
>> update the access time when the F.S is frozen (touch_atime called
>> from a read path when the F.S is frozen) Otherwise we also need to
>> fix this path.
> In most filesystens atime updates aren't transactional.  They just
> get written into inode->i_atime, and at some later point when the
> VFS tries to clean the inode it gets writtent back, either through
> a transaction or not.
>
Yes, agreed. But when a F.S is frozen the inode should not be dirtied,
right? So this has to be fixed?
Also, in ext4, I think that updating atime starts a transaction.

Warm Regards,
Surbhi


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 12:30                                   ` Surbhi Palande
  2011-05-02 13:16                                     ` Jan Kara
@ 2011-05-02 14:01                                     ` Eric Sandeen
  1 sibling, 0 replies; 121+ messages in thread
From: Eric Sandeen @ 2011-05-02 14:01 UTC (permalink / raw)
  To: surbhi.palande
  Cc: Jan Kara, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel

On 5/2/11 7:30 AM, Surbhi Palande wrote:

...

> Yes, but ext4_freeze() also calls jbd2_journal_unlock_updates(journal) which decrements the  j_barrier_count (which was previously updated/incremented in jbd2_journal_lock_updates) ? before it returns. So after this call a new transaction/handle can be accepted/started.
> 
> A comment in ext4_freeze() says:
> /* we rely on s_frozen to stop further updates */
> (before calling jbd2_journal_unlock_updates())

that was me; 

commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Sun May 16 02:00:00 2010 -0400

    ext4: don't return to userspace after freezing the fs with a mutex held


otherwise we return to userspace holding a mutex :(

-Eric

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 13:22                                       ` Surbhi Palande
  2011-05-02 13:24                                         ` Christoph Hellwig
@ 2011-05-02 14:04                                         ` Eric Sandeen
  2011-05-03  7:27                                           ` Surbhi Palande
  1 sibling, 1 reply; 121+ messages in thread
From: Eric Sandeen @ 2011-05-02 14:04 UTC (permalink / raw)
  To: surbhi.palande
  Cc: Jan Kara, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel,
	Christoph Hellwig

On 5/2/11 8:22 AM, Surbhi Palande wrote:
> On 05/02/2011 04:16 PM, Jan Kara wrote:
>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>>    Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>>    That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>> fs code with s_umount held).
>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>> filesystem a bug in ext3/4...
>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>> xfs_file_aio_write()
>>>>>>>    xfs_wait_for_freeze()
>>>>>>>      vfs_check_frozen()
>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>
>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>> --------------------
>>>>>>> xfs_file_aio_write()
>>>>>>>    xfs_wait_for_freeze()
>>>>>>>      vfs_check_frozen()
>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>> control.
>>>>>>>
>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>> --------------------
>>>>>>>
>>>>>>> Can anyone please enlighten me on how&    why this premption is _not_
>>>>>>> possible?
>>>>> Thanks for your reply.
>>>>>>    XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>> Agreed.
>>>>>>   or it manages
>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>> check in xfs_file_aio_write()...
>>>>>>
>>>>>>            
>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>> freezing a superblock (as the write process stopped just before
>>>>> writing anything for this transaction and has not taken any locks?)
>>>>    So ext4_freeze() does
>>>> jbd2_journal_lock_updates(journal)
>>>>    which waits for all running transactions to finish and updates
>>>> j_barrier_count which stops any news ones from proceeding (check
>>>> function start_this_handle()).
>>>>
>>> Yes, but ext4_freeze() also calls
>>> jbd2_journal_unlock_updates(journal) which decrements the
>>> j_barrier_count (which was previously updated/incremented in
>>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>>> new transaction/handle can be accepted/started.
>>>
>>> A comment in ext4_freeze() says:
>>> /* we rely on s_frozen to stop further updates */
>>> (before calling jbd2_journal_unlock_updates())
>>    Ah, drat, you're right. I've missed this other part. It's the problem
>> that if you expect to see something, you'll see it regardless of the real
>> code ;).
>>
>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>> it's still racy (although the race window is relatively small) because the
>> filesystem can become frozen the instant after we check vfs_check_frozen().
>> Commit 6b0310fb broke it for ext4.
>>
>> I guess the code was mostly copied from XFS which seems to have the same
>> problem in xfs_trans_alloc() since the git history beginning. I see two
>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>> a transaction and if the filesystem is being frozen, we stop the
>> transaction, wait for fs to get unfrozen, and restart. Another option is
>> to create an analogous logic using a atomic counter of write ops in vfs
>> that could be used by all filesystems. We'd just have to replace
>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>> appropriate places...
> How about calling  jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> from ext4_unfreeze()?

we used to have that, but holding it locked until then means we exit the kernel
with a mutex held, which is pretty icky.

    ================================================
    [ BUG: lock held when returning to user space! ]
    ------------------------------------------------
    lvcreate/1075 is leaving the kernel with locks still held!
    1 lock held by lvcreate/1075:
     #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
    jbd2_journal_lock_updates+0xe1/0xf0


-Eric

> So that indeed no transactions can be started before unfreeze is called.
> 
> This has another advantage, that it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen) Otherwise we also need to fix this path.
> 
> Warm Regards,
> Surbhi.
> 
>> Dave, Christoph, any opinions on this?
>>                                 Honza
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 13:22                                       ` Christoph Hellwig
@ 2011-05-02 14:20                                         ` Jan Kara
  2011-05-02 14:41                                           ` Christoph Hellwig
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-02 14:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Surbhi Palande, Dave Chinner, Toshiyuki Okajima,
	Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel

On Mon 02-05-11 09:22:04, Christoph Hellwig wrote:
> On Mon, May 02, 2011 at 03:16:19PM +0200, Jan Kara wrote:
> > Dave, Christoph, any opinions on this?
> 
> The busyloop in xfs_quiesce_attr which waits for all active transactions
> to finish is supposed to fix this issue.
  Hmm, but what prevents the following race?

  Thread 1					Thread 2
..
xfs_trans_alloc()
  xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
						freeze_super()
						  ...
						  xfs_fs_freeze()
						    ...
						    xfs_quiesce_attr()
						    ...
  _xfs_trans_alloc()
    atomic_inc(&mp->m_active_trans);
    ... goes on modifying the filesystem

  It seems to be a similar problem to the one in ext4 - the atomic_inc()
and vfs_check_frozen() are in the wrong order...

> Note that XFS traditionally expects a two stage freeze process where
> we first freeze new VFS-level writes, then flush the caches and then
> stop transactions, wait for them to finish and do the remainder of
> the freeze process, but I really messed that process up when moving
> the sequence to generic code.  Funnily enough it seems to work
> neverless.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 13:27                                           ` Surbhi Palande
@ 2011-05-02 14:26                                             ` Jan Kara
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Kara @ 2011-05-02 14:26 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Christoph Hellwig, Jan Kara, Dave Chinner, Toshiyuki Okajima,
	Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel

On Mon 02-05-11 16:27:39, Surbhi Palande wrote:
> On 05/02/2011 04:24 PM, Christoph Hellwig wrote:
> >On Mon, May 02, 2011 at 04:22:45PM +0300, Surbhi Palande wrote:
> >>This has another advantage, that it rightfully does not let you
> >>update the access time when the F.S is frozen (touch_atime called
> >>from a read path when the F.S is frozen) Otherwise we also need to
> >>fix this path.
> >In most filesystens atime updates aren't transactional.  They just
> >get written into inode->i_atime, and at some later point when the
> >VFS tries to clean the inode it gets writtent back, either through
> >a transaction or not.
> >
> Yes, agreed. But then when a F.S is frozen the inode should not be
> dirtied? Right? So this has to be fixed?
> Also, in ext4, I think that updating atime starts a transaction.
  Yes, it does. Any mark_inode_dirty() call causes a transaction update.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 14:20                                         ` Jan Kara
@ 2011-05-02 14:41                                           ` Christoph Hellwig
  2011-05-02 16:23                                             ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Christoph Hellwig @ 2011-05-02 14:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Surbhi Palande, Dave Chinner,
	Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Mon, May 02, 2011 at 04:20:55PM +0200, Jan Kara wrote:
>   Hmm, but what prevents the following race?
> 
>   Thread 1					Thread 2
> ..
> xfs_trans_alloc()
>   xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> 						freeze_super()

                                                  sb->s_frozen = SB_FREEZE_TRANS;

> 						  ...
> 						  xfs_fs_freeze()
> 						    ...
> 						    xfs_quiesce_attr()

						      waits for all active
						      transactions

> 						    ...

   xfs_trans_alloc
     -> blocks in xfs_wait_for_freeze
     (thus doesn't get to _xfs_trans_alloc)

>   _xfs_trans_alloc()
>     atomic_inc(&mp->m_active_trans);
>     ... goes on modifying the filesystem
> 
>   It seems to be a similar problem as in ext4 - the atomic_inc() and
> vfs_check_frozen() are in the wrong order...

I can't see the problem in this scheme.  Note that we want
_xfs_trans_alloc to be able to create a transaction for
xfs_fs_log_dummy, so that we can write the dummy log record after
freezing out all other transactions, so that one is special cased
and doesn't do the xfs_wait_for_freeze.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 14:41                                           ` Christoph Hellwig
@ 2011-05-02 16:23                                             ` Jan Kara
  2011-05-02 16:38                                               ` Christoph Hellwig
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-02 16:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Surbhi Palande, Dave Chinner, Toshiyuki Okajima,
	Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger, linux-ext4,
	linux-fsdevel

On Mon 02-05-11 10:41:55, Christoph Hellwig wrote:
> On Mon, May 02, 2011 at 04:20:55PM +0200, Jan Kara wrote:
> >   Hmm, but what prevents the following race?
> > 
> >   Thread 1					Thread 2
> > ..
> > xfs_trans_alloc()
> >   xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > 						freeze_super()
> 
>                                                   sb->s_frozen = SB_FREEZE_TRANS;
> > 						  ...
> > 						  xfs_fs_freeze()
> > 						    ...
> > 						    xfs_quiesce_attr()
> 
> 						      waits for all active
> 						      transactions
> 
> > 						    ...
> 
>    xfs_trans_alloc
>      -> blocks in xfs_wait_for_freeze
  But why should it block when xfs_wait_for_freeze() gets called before
freeze_super() gets called? The other thread calls freeze_super() just
after xfs_wait_for_freeze() in thread 1 and before _xfs_trans_alloc() gets
called.  Or am I missing some serialization somewhere?

>      (thus doesn't get to _xfs_trans_alloc)
> 
> >   _xfs_trans_alloc()
> >     atomic_inc(&mp->m_active_trans);
> >     ... goes on modifying the filesystem
> > 
> >   It seems to be a similar problem as in ext4 - the atomic_inc() and
> > vfs_check_frozen() are in the wrong order...
> 
> I can't see the problem in this scheme.  Note that we want
> _xfs_trans_alloc to be able to create a transaction for
> xfs_fs_log_dummy, so that we can write the dummy log record after
> freezing out all other transactions, so that one is special cased
> and doesn't do the xfs_wait_for_freeze.
  OK.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 16:23                                             ` Jan Kara
@ 2011-05-02 16:38                                               ` Christoph Hellwig
  0 siblings, 0 replies; 121+ messages in thread
From: Christoph Hellwig @ 2011-05-02 16:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Surbhi Palande, Dave Chinner,
	Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel

On Mon, May 02, 2011 at 06:23:34PM +0200, Jan Kara wrote:
>   But why should it block when xfs_wait_for_freeze() gets called before
> freeze_super() gets called? The other thread calls freeze_super() just
> after xfs_wait_for_freeze() in thread 1 and before _xfs_trans_alloc() gets
> called.  Or am I missing some serialization somewhere?

Oh, now I get the race window you mean.  It's the single instruction
window between doing the frozen check and incrementing m_active_trans.

Yes, that one looks real, although very unlikely to hit.  Could be fixed
relatively easily by moving the check after the increment.
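
A rough userspace sketch of that ordering, using C11 atomics. This is not
the XFS code: struct fake_mount and the fake_* helpers are hypothetical
stand-ins for sb->s_frozen and mp->m_active_trans, and the sleeping/retry
logic is left to the caller. It only illustrates "increment first, then
check" on the writer side against "set flag first, then look at the
counter" on the freezer side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for sb->s_frozen and mp->m_active_trans. */
struct fake_mount {
	atomic_bool frozen;
	atomic_int active_trans;
};

static void fake_mount_init(struct fake_mount *mp)
{
	atomic_init(&mp->frozen, false);
	atomic_init(&mp->active_trans, 0);
}

/* Writer: announce the transaction first, then look at the frozen flag. */
static bool fake_trans_try_start(struct fake_mount *mp)
{
	atomic_fetch_add(&mp->active_trans, 1);
	if (atomic_load(&mp->frozen)) {
		/* Freeze in progress: back out so the counter can drain. */
		atomic_fetch_sub(&mp->active_trans, 1);
		return false;	/* caller would wait for thaw and retry */
	}
	return true;
}

static void fake_trans_end(struct fake_mount *mp)
{
	int prev = atomic_fetch_sub(&mp->active_trans, 1);

	assert(prev > 0);
}

/*
 * Freezer: set the flag first, then read the counter. Returns true when
 * no transaction is in flight. With the default sequentially consistent
 * atomics, a writer that missed the flag is always visible in the counter.
 */
static bool fake_freeze_begin(struct fake_mount *mp)
{
	atomic_store(&mp->frozen, true);
	return atomic_load(&mp->active_trans) == 0;
}
```

A freezer built on this would keep polling (or sleep) until the counter
reads zero, much like the wait for active transactions mentioned for
xfs_quiesce_attr above.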

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-02 14:04                                         ` Eric Sandeen
@ 2011-05-03  7:27                                           ` Surbhi Palande
  2011-05-03 20:14                                             ` Eric Sandeen
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-03  7:27 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel,
	Christoph Hellwig

On 05/02/2011 05:04 PM, Eric Sandeen wrote:
> On 5/2/11 8:22 AM, Surbhi Palande wrote:
>> On 05/02/2011 04:16 PM, Jan Kara wrote:
>>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>>>     Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>>>     That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>>> fs code with s_umount held).
>>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>>> filesystem a bug in ext3/4...
>>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>>> xfs_file_aio_write()
>>>>>>>>     xfs_wait_for_freeze()
>>>>>>>>       vfs_check_frozen()
>>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>>
>>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>>> --------------------
>>>>>>>> xfs_file_aio_write()
>>>>>>>>     xfs_wait_for_freeze()
>>>>>>>>       vfs_check_frozen()
>>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>>> control.
>>>>>>>>
>>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>>> --------------------
>>>>>>>>
>>>>>>>> Can anyone please enlighten me on how&     why this premption is _not_
>>>>>>>> possible?
>>>>>> Thanks for your reply.
>>>>>>>     XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>>> Agreed.
>>>>>>>    or it manages
>>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>>> check in xfs_file_aio_write()...
>>>>>>>
>>>>>>>
>>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>>> freezing a superblock (as the write process stopped just before
>>>>>> writing anything for this transaction and has not taken any locks?)
>>>>>     So ext4_freeze() does
>>>>> jbd2_journal_lock_updates(journal)
>>>>>     which waits for all running transactions to finish and updates
>>>>> j_barrier_count which stops any news ones from proceeding (check
>>>>> function start_this_handle()).
>>>>>
>>>> Yes, but ext4_freeze() also calls
>>>> jbd2_journal_unlock_updates(journal) which decrements the
>>>> j_barrier_count (which was previously updated/incremented in
>>>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>>>> new transaction/handle can be accepted/started.
>>>>
>>>> A comment in ext4_freeze() says:
>>>> /* we rely on s_frozen to stop further updates */
>>>> (before calling jbd2_journal_unlock_updates())
>>>     Ah, drat, you're right. I've missed this other part. It's the problem
>>> that if you expect to see something, you'll see it regardless of the real
>>> code ;).
>>>
>>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>>> it's still racy (although the race window is relatively small) because the
>>> filesystem can become frozen the instant after we check vfs_check_frozen().
>>> Commit 6b0310fb broke it for ext4.
>>>
>>> I guess the code was mostly copied from XFS which seems to have the same
>>> problem in xfs_trans_alloc() since the git history beginning. I see two
>>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>>> a transaction and if the filesystem is being frozen, we stop the
>>> transaction, wait for fs to get unfrozen, and restart. Another option is
>>> to create an analogous logic using a atomic counter of write ops in vfs
>>> that could be used by all filesystems. We'd just have to replace
>>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>>> appropriate places...
>> How about calling  jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>> from ext4_unfreeze()?
> we used to have that, but holding it locked until then means we exit the kernel
> with a mutex held, which is pretty icky.
>
>      ================================================
>      [ BUG: lock held when returning to user space! ]
>      ------------------------------------------------
>      lvcreate/1075 is leaving the kernel with locks still held!
>      1 lock held by lvcreate/1075:
>       #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>      jbd2_journal_lock_updates+0xe1/0xf0
>
>
> -Eric
Should this not be reverted? I think that it's a lot easier to stop a
transaction between a freeze and a thaw that way! If you agree, can I
send a patch for that?

Thanks!

Warm Regards,
Surbhi.


>> So that indeed no transactions can be started before unfreeze is called.
>>
>> This has another advantage, that it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen) Otherwise we also need to fix this path.
>>
>> Warm Regards,
>> Surbhi.
>>
>>> Dave, Christoph, any opinions on this?
>>>                                  Honza
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-25  6:28                                               ` Toshiyuki Okajima
@ 2011-05-03  8:06                                                 ` Surbhi Palande
  0 siblings, 0 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-03  8:06 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

On 04/25/2011 09:28 AM, Toshiyuki Okajima wrote:
> Hi.
>
> On Sat, 23 Apr 2011 00:10:25 +0200
> Jan Kara<jack@suse.cz>  wrote:
>> On Fri 22-04-11 15:58:39, Toshiyuki Okajima wrote:
>>> I have confirmed that the following patch works fine while my or
>>> Mizuma-san's reproducer is running. Therefore,
>>>   we can block to write the data, which is mmapped to a file, into a disk
>>> by a page-fault while fsfreezing.
>>>
>>> I think this patch fixes the following two problems:
>>> - A deadlock occurs between ext4_da_writepages() (called from
>>> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
>>> - We can also write the data, which is mmapped to a file,
>>>    into a disk while fsfreezing (ext3/ext4).
>>>                                         (reported by me)
>>>
>>> Please examine this patch.
>>    Thanks for the patch. The ext3 part is not as easy as this. You cannot
>> really get i_alloc_sem in ext3_page_mkwrite() because mmap_sem is already
>> held by page fault code and i_alloc_sem should be acquired before it (yes I
>> know, ext4 already has this bug which should be fixed when I get to it).
>> Also you'll find that performance of random writers via mmap (which is
>> relatively common) is going to be rather bad with this patch (because the
>> file will be heavily fragmented). We have to be more clever which is
>> exactly why it's taking me so long with my patch :) But tests are already
>> running so if everything goes fine, I should have patches to submit next
>> week.
> OK, I'll wait your patch. :)
>
>>
>> The ext4 part looks correct. I'd just also like to have some comments about
>> how freeze handling is done because it's kind of subtle.
>
> How about this?


We can have a race here too - since we are only checking if the F.S is 
in a frozen state or not at _that_ point. We are _not_ preventing a F.S 
freeze from happening _after_ this point. So here is what can happen:

Key:
(tx: time at xth unit)

Scenario:

Task 1: mmapped write - (case: page mapped to disk and is in page cache)
t1) ext4_page_mkwrite()
t2)   vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
t3) ---- Preempted ----


Task 2: Freeze Task
t4) freezes the super block...
...(continues)....
tn) the page cache is clean and the F.S is frozen. Freeze has completed 
execution.

Task 1: mmapped write - (case: page mapped to disk and is in page cache)
tn+1) ext4_page_mkwrite() returns 0. The write to the mmapped page 
continues with writing to a page in the page cache when the F.S is 
frozen! So after the vfs_check_frozen() we _are_ susceptible to 
"dirtying the page cache when F.S is frozen".

In this case we are not protected by a transaction. Are we?

Warm Regards,
Surbhi.

>
> Thanks,
> Toshiyuki Okajima
>
> ----------------------------------------------------------------------------------------------------
> Subject: [PATCH] ext4: prevent the mmapped page flushing to disk while fsfreezing
>
> Signed-off-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
> ---
>   fs/ext4/inode.c |   10 +++++++++-
>   1 files changed, 9 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f2fa5e8..411b177 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>   	}
>   	ret = 0;
>   	if (PageMappedToDisk(page))
> -		goto out_unlock;
> +		goto out_frozen;
>
>   	if (page->index == size>>  PAGE_CACHE_SHIFT)
>   		len = size&  ~PAGE_CACHE_MASK;
> @@ -5830,6 +5830,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>   		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
>   					ext4_bh_unmapped)) {
>   			unlock_page(page);
> +out_frozen:
> +			/*
> +			 * We must wait here while the filesystem is being
> +			 * frozen otherwise a flushing thread can write this
> +			 * page to the disk (we can update the filesystem even
> +			 * if it is frozen).
> +			 */
> +			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
>   			goto out_unlock;
>   		}
>   	}


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-04-18  9:05                                     ` Toshiyuki Okajima
  2011-04-18 10:51                                       ` Jan Kara
@ 2011-05-03 11:01                                       ` Surbhi Palande
  2011-05-03 13:08                                         ` (unknown), Surbhi Palande
                                                           ` (2 more replies)
  1 sibling, 3 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-03 11:01 UTC (permalink / raw)
  To: Toshiyuki Okajima
  Cc: Jan Kara, Ted Ts'o, Masayoshi MIZUMA, Andreas Dilger,
	linux-ext4, linux-fsdevel, sandeen

On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
> Hi,
>
> (2011/04/16 2:13), Jan Kara wrote:
>> Hello,
>>
>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>> probably
>>>> get modified to block while minor-faulting the page on frozen fs
>>>> because
>>>> when blocks are already allocated we may skip starting a transaction
>>>> and so
>>>> we could possibly modify the filesystem.
>>> OK. I think ->page_mkwrite() should also block writing the
>>> minor-faulting pages.
>>>
>>> (minor-pagefault)
>>> -> do_wp_page()
>>> -> page_mkwrite(= ext4_mkwrite())
>>> => BLOCK!
>>>
>>> (major-pagefault)
>>> -> do_liner_fault()
>>> -> page_mkwrite(= ext4_mkwrite())
>>> => BLOCK!
>>>
>>>>
>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>> file (mmap).
>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>> while
>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>> operation
>>>>>>> while fsfreezing.
>>>>>> Technically speaking, we block all the transaction starts which
>>>>>> means we
>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>> not mean
>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>> properly
>>>>>> note the mmap case is one of such exceptions.
>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>> can't allow
>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>> path can
>>>>> write to disk while fsfreezing because this deadlock problem
>>>>> happens after
>>>>> fsfreeze operation is done...
>>>> I'm sorry I don't understand now - are you speaking about the case
>>>> above
>>>> when writepage() does not wait for filesystem being frozen or something
>>>> else?
>>> Sorry, I didn't understand around the page fault path.
>>> So, I had read the kernel source code around it, then I maybe
>>> understand...
>>>
>>> I worry whether we can update the file data in mmap case while
>>> fsfreezing.
>>> Of course, I understand that we can write to in-memory cache, and it
>>> is not a
>>> problem. However, if we can write to disk while fsfreezing, it is a
>>> problem.
>>> So, I summarize the cases whether we can write to disk or not.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> Cases (Whether we can write the data mmapped to the file on the disk
>>> while fsfreezing)
>>>
>>> [1] One of the page which has been mmapped is not bound. And
>>> the page is not allocated yet. (major fault?)
>>>
>>> (1) user dirtys a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) __do_falut is called.
>>> (4) ext4_page_mkwrite is called
>>> (5) ext4_write_begin is called
>>> (6) ext4_journal_start_sb => We can STOP!
>>>
>>> [2] One of the page which has been mmapped is not bound. But
>>> the page is already allocated, and the buffer_heads of the page
>>> are not mapped (BH_Mapped). (minor fault?)
>>>
>>> (1) user dirtys a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) do_wp_page is called.
>>> (4) ext4_page_mkwrite is called
>>> (5) ext4_write_begin is called
>>> (6) ext4_journal_start_sb => We can STOP!

What happens in the case as follows:

Task 1: Mmapped writes
t1)ext4_page_mkwrite()
   t2) ext4_write_begin() (FS is thawed so we proceed)
   t3) ext4_write_end() (journal is stopped now)
-----Pre-empted-----


Task 2: Freeze Task
t4) freezes the super block...
...(continues)....
tn) the page cache is clean and the F.S is frozen. Freeze has completed 
execution.

Task 1: Mmapped writes
tn+1) ext4_page_mkwrite() returns 0.
tn+2) __do_fault() gets control, code gets executed.
tn+3) __do_fault() marks the page dirty if the intent is to write to a 
file-backed page which faulted.

So you end up dirtying the page cache when the F.S is frozen? No?


Warm Regards,
Surbhi.

>>>
>>> [3] One of the page which has been mmapped is not bound. But
>>> the page is already allocated, and the buffer_heads of the page
>>> are mapped (BH_Mapped). (minor fault?)
>>>
>>> (1) user dirtys a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) do_wp_page is called.
>>> (4) ext4_page_mkwrite is called
>>> * Cannot block the dirty page to be written because all bh is mapped.
>>> (5) user munmaps the page (munmap)
>>> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>> (7) writeback thread writes the page (struct page) to disk
>>> => We cannot STOP!
>>>
>>> [4] One of the page which has been mmapped is bound. And
>>> the page is already allocated.
>>>
>>> (1) user dirtys a page
>>> ( ) no page fault occurs
>>> (2) user munmaps the page (munmap)
>>> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>> (4) writeback thread writes the page (struct page) to disk
>>> => We cannot STOP!
>>> --------------------------------------------------------------------------
>>>
>>>
>>> So, we can block the cases [1], [2].
>>> But I think we cannot block the cases [3], [4] now.
>>> If fixing the page_mkwrite, we can also block the case [3].
>>> But the case [4] is not blocked because no page fault occurs
>>> when we dirty the mmapped page.
>>>
>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>>> I think we must modify the writeback thread to fix the case [4].
>> The trick here is that when we write a page to disk, we write-protect
>> the page (you seem to call this that "the page is bound", I'm not sure
>> why).
> Hm, I want to understand how to write-protect the page under fsfreezing.
> But, anyway, I understand we don't need to consider the case [4].
>
>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
>> modify a page after we finish writeback while freezing the filesystem.
>> So principially all we need to do is just wait in ext4_page_mkwrite().
> OK. I understand.
> Are there any concrete ideas to fix this?
> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent
> it?
>
> Thanks,
> Toshiyuki Okajima
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 121+ messages in thread

* (unknown), 
  2011-05-03 11:01                                       ` Surbhi Palande
@ 2011-05-03 13:08                                         ` Surbhi Palande
  2011-05-03 13:46                                           ` your mail Jan Kara
  2011-05-03 13:08                                         ` [PATCH] Prevent dirtying a page when ext4 F.S is frozen Surbhi Palande
  2011-05-03 15:19                                         ` [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock Jan Kara
  2 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-03 13:08 UTC (permalink / raw)
  To: jack
  Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen


On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
Toshiyuki pointed out.

zap_pte_range()
  mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)  

So, I think that it is here that we should do the checking for a ext4 F.S
frozen state and also prevent a parallel ext4 F.S freeze from happening.

Attaching a patch for initial review. Please do let me know your thoughts! 

Thanks a lot!

Warm Regards,
Surbhi.



^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH] Prevent dirtying a page when ext4 F.S is frozen
  2011-05-03 11:01                                       ` Surbhi Palande
  2011-05-03 13:08                                         ` (unknown), Surbhi Palande
@ 2011-05-03 13:08                                         ` Surbhi Palande
  2011-05-03 15:19                                         ` [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock Jan Kara
  2 siblings, 0 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-03 13:08 UTC (permalink / raw)
  To: jack
  Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen

Prevent dirtying a page when ext4 F.S is frozen. Also take the write semaphore
sb->s_umount to prevent a F.S freeze from racing with the page dirtying
process. Without this we can end up dirtying a page while a F.S freeze
happened because of preemption.

Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
---
 fs/ext4/inode.c |   35 ++++++++++++++++++++++++++++++++++-
 1 files changed, 34 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..db3f99d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3827,8 +3827,41 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
  */
 static int ext4_journalled_set_page_dirty(struct page *page)
 {
+	int ret=0;
+	struct inode * inode = NULL;
+	 struct super_block * sb = NULL;
+
+	if(likely((page->mapping) && (page->mapping->host))){
+		inode = page->mapping->host;
+		if(likely(inode->i_sb)){
+			sb = inode->i_sb;
+			/* we do not want a freeze to start now if F.S is not
+			 * already frozen*/
+			down_write(&sb->s_umount);
+			if(sb->s_frozen != SB_UNFROZEN) {
+				/* F.S is frozen.
+				 * we dont want to sleep with s_umount held.
+				 * Or else we might race with thaw_super */
+				up_write(&sb->s_umount);
+				vfs_check_frozen(sb, SB_FREEZE_WRITE);
+				/* F.S is no more frozen. We do not want the
+				 * FS freeze to begin after this point
+				 */
+				down_write(&sb->s_umount);
+			}
+		}
+	}
 	SetPageChecked(page);
-	return __set_page_dirty_nobuffers(page);
+	ret = __set_page_dirty_nobuffers(page);
+	if(likely((page->mapping) && (page->mapping->host))){
+		if(likely(inode->i_sb)){
+			up_write(&sb->s_umount);
+			/* If we freeze after this point, the dirtied page can
+			 * be flushed out!
+			 */
+		}
+	}
+	return ret;
 }
 
 static const struct address_space_operations ext4_ordered_aops = {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: your mail
  2011-05-03 13:08                                         ` (unknown), Surbhi Palande
@ 2011-05-03 13:46                                           ` Jan Kara
  2011-05-03 13:56                                             ` Surbhi Palande
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-03 13:46 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: jack, toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen

On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> Toshiyuki pointed out.
> 
> zap_pte_range()
>   mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)  
> 
> So, I think that it is here that we should do the checking for a ext4 F.S
> frozen state and also prevent a parallel ext4 F.S freeze from happening.
> 
> Attaching a patch for initial review. Please do let me know your thoughts! 
  This is definitely the wrong place. ->set_page_dirty() callbacks are
called with various locks held and the page need not be locked (thus
dereferencing page->mapping is oopsable). Moreover this particular callback
is called only in data=journal mode.

Believe me, the right place is page_mkwrite() - you have to catch the
read-only => read-write page transition. Once the page is mapped
read-write, you've already lost the race.
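
The point about the read-only => read-write transition can be illustrated
with a small user-space model. This is only a sketch, not kernel code, and
every name in it is hypothetical. It assumes, as described earlier in the
thread, that writeback write-protects the page again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one mmapped page; not a kernel structure. */
struct model_page {
	bool writable;		/* the PTE currently allows writes */
	bool dirty;		/* the page holds unwritten data */
	int mkwrite_calls;	/* how often the ->page_mkwrite() stand-in ran */
};

/* Stand-in for ->page_mkwrite(): runs only on the RO => RW transition. */
static void model_page_mkwrite(struct model_page *p)
{
	assert(!p->writable);	/* never reached for an already writable page */
	p->mkwrite_calls++;
	p->writable = true;
}

/* A user store: it faults only while the page is write-protected. */
static void model_store(struct model_page *p)
{
	if (!p->writable)
		model_page_mkwrite(p);
	p->dirty = true;
}

/* Writeback cleans the page and write-protects it again. */
static void model_writeback(struct model_page *p)
{
	p->dirty = false;
	p->writable = false;
}

/* Sequence: store, store, writeback, store. Returns the hook call count. */
static int model_count_hooks(void)
{
	struct model_page p = { false, false, 0 };

	model_store(&p);	/* faults: hook runs */
	model_store(&p);	/* already writable: no fault, no hook */
	model_writeback(&p);
	model_store(&p);	/* write-protected again: hook runs */
	return p.mkwrite_calls;
}
```

In this model the second store never reaches the hook, so any freeze check
has to happen in the hook itself; a check placed after the transition is
never consulted for later stores.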

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: your mail
  2011-05-03 13:46                                           ` your mail Jan Kara
@ 2011-05-03 13:56                                             ` Surbhi Palande
  2011-05-03 15:26                                               ` Surbhi Palande
  2011-05-03 15:36                                               ` Jan Kara
  0 siblings, 2 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-03 13:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen

On 05/03/2011 04:46 PM, Jan Kara wrote:
> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:

Sorry for missing the subject line :(
>> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
>> Toshiyuki pointed out.
>>
>> zap_pte_range()
>>    mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>
>> So, I think that it is here that we should do the checking for a ext4 F.S
>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>
>> Attaching a patch for initial review. Please do let me know your thoughts!
>    This is definitely the wrong place. ->set_page_dirty() callbacks are
> called with various locks held and the page need not be locked (thus
> dereferencing page->mapping is oopsable). Moreover this particular callback
> is called only in data=journal mode.
Ok! Thanks for that!

>
> Believe me, the right place is page_mkwrite() - you have to catch the
> read-only =>  read-write page transition. Once the page is mapped
> read-write, you've already lost the race.

My only point is:
1) something should prevent the freeze from happening. We can't merely 
check with vfs_check_frozen()?

And this should be done where the page is marked dirty. Also, I thought 
that the page is marked read-write only in the page table in 
__do_page_fault()? i.e. zap_pte_range() marks them dirty in the page 
cache? Is this understanding right?

IMHO, whatever code dirties the page in the page cache should call a F.S 
specific function and let it _prevent_ a fsfreeze while the page is 
getting dirtied, so that a freeze called after this point flushes this page!

Warm Regards,
Surbhi.

>
> 								Honza


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-03 11:01                                       ` Surbhi Palande
  2011-05-03 13:08                                         ` (unknown), Surbhi Palande
  2011-05-03 13:08                                         ` [PATCH] Prevent dirtying a page when ext4 F.S is frozen Surbhi Palande
@ 2011-05-03 15:19                                         ` Jan Kara
  2011-05-04 12:09                                           ` Surbhi Palande
  2 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-03 15:19 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Toshiyuki Okajima, Jan Kara, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, sandeen

[-- Attachment #1: Type: text/plain, Size: 5056 bytes --]

On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
> >(2011/04/16 2:13), Jan Kara wrote:
> >>Hello,
> >>
> >>On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>>For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>>function. But as I wrote to Dave Chinner, ->page_mkwrite() should
> >>>>probably
> >>>>get modified to block while minor-faulting the page on frozen fs
> >>>>because
> >>>>when blocks are already allocated we may skip starting a transaction
> >>>>and so
> >>>>we could possibly modify the filesystem.
> >>>OK. I think ->page_mkwrite() should also block writing the
> >>>minor-faulting pages.
> >>>
> >>>(minor-pagefault)
> >>>-> do_wp_page()
> >>>-> page_mkwrite(= ext4_mkwrite())
> >>>=> BLOCK!
> >>>
> >>>(major-pagefault)
> >>>-> do_liner_fault()
> >>>-> page_mkwrite(= ext4_mkwrite())
> >>>=> BLOCK!
> >>>
> >>>>
> >>>>>>>Mizuma-san's reproducer also writes the data which maps to the
> >>>>>>>file (mmap).
> >>>>>>>The original problem happens after the fsfreeze operation is done.
> >>>>>>>I understand the normal write operation (not mmap) can be blocked
> >>>>>>>while
> >>>>>>>fsfreezing. So, I guess we don't always block all the write
> >>>>>>>operation
> >>>>>>>while fsfreezing.
> >>>>>>Technically speaking, we block all the transaction starts which
> >>>>>>means we
> >>>>>>end up blocking all the writes from going to disk. But that does
> >>>>>>not mean
> >>>>>>we block all the writes from going to in-memory cache - as you
> >>>>>>properly
> >>>>>>note the mmap case is one of such exceptions.
> >>>>>Hm, I also think we can allow the writes to in-memory cache but we
> >>>>>can't allow
> >>>>>the writes to disk while fsfreezing. I am considering that mmap
> >>>>>path can
> >>>>>write to disk while fsfreezing because this deadlock problem
> >>>>>happens after
> >>>>>fsfreeze operation is done...
> >>>>I'm sorry I don't understand now - are you speaking about the case
> >>>>above
> >>>>when writepage() does not wait for filesystem being frozen or something
> >>>>else?
> >>>Sorry, I didn't understand around the page fault path.
> >>>So, I had read the kernel source code around it, then I maybe
> >>>understand...
> >>>
> >>>I worry whether we can update the file data in mmap case while
> >>>fsfreezing.
> >>>Of course, I understand that we can write to in-memory cache, and it
> >>>is not a
> >>>problem. However, if we can write to disk while fsfreezing, it is a
> >>>problem.
> >>>So, I summarize the cases whether we can write to disk or not.
> >>>
> >>>--------------------------------------------------------------------------
> >>>
> >>>Cases (Whether we can write the data mmapped to the file on the disk
> >>>while fsfreezing)
> >>>
> >>>[1] One of the page which has been mmapped is not bound. And
> >>>the page is not allocated yet. (major fault?)
> >>>
> >>>(1) user dirtys a page
> >>>(2) a page fault occurs (do_page_fault)
> >>>(3) __do_falut is called.
> >>>(4) ext4_page_mkwrite is called
> >>>(5) ext4_write_begin is called
> >>>(6) ext4_journal_start_sb => We can STOP!
> >>>
> >>>[2] One of the page which has been mmapped is not bound. But
> >>>the page is already allocated, and the buffer_heads of the page
> >>>are not mapped (BH_Mapped). (minor fault?)
> >>>
> >>>(1) user dirtys a page
> >>>(2) a page fault occurs (do_page_fault)
> >>>(3) do_wp_page is called.
> >>>(4) ext4_page_mkwrite is called
> >>>(5) ext4_write_begin is called
> >>>(6) ext4_journal_start_sb => We can STOP!
> 
> What happens in the case as follows:
> 
> Task 1: Mmapped writes
> t1)ext4_page_mkwrite()
>   t2) ext4_write_begin() (FS is thawed so we proceed)
>   t3) ext4_write_end() (journal is stopped now)
> -----Pre-empted-----
> 
> 
> Task 2: Freeze Task
> t4) freezes the super block...
> ...(continues)....
> tn) the page cache is clean and the F.S is frozen. Freeze has
> completed execution.
> 
> Task 1: Mmapped writes
> tn+1) ext4_page_mkwrite() returns 0.
> tn+2) __do_fault() gets control, code gets executed.
> tn+3) _do_fault() marks the page dirty if the intent is to write to
> a file based page which faulted.
> 
> So you end up dirtying the page cache when the F.S is frozen? No?
  You are right, ext4_page_mkwrite() as currently implemented has problems.
You have to return the page locked (and check for frozen fs with page lock
held) to avoid races.

If you check for frozen fs with page lock held, you are guaranteed that
freezing code must wait for the page to get unlocked before proceeding. And
before the page is unlocked, it is marked dirty by the pagefault code which
makes freezing code write the page and writeprotect it again. So everything
will be safe.
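
The ordering argument above can be modelled in user space. This is only a
sketch under simplifying assumptions: one page, a freeze that cannot proceed
while the page is locked, and sequential calls standing in for the two tasks.
All names are hypothetical; none of them is a kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a filesystem with a single page-cache page. */
struct model_fs {
	bool frozen;		/* freeze has completed */
	bool page_dirty;	/* the page is dirty in the page cache */
	bool page_locked;	/* the page lock is held by the fault path */
};

/*
 * Freeze attempt: write back the page, then mark the fs frozen.
 * Returns false when it would have to wait for the page lock.
 */
static bool model_freeze(struct model_fs *fs)
{
	if (fs->page_locked)
		return false;
	fs->page_dirty = false;	/* writeback cleans the page */
	fs->frozen = true;
	return true;
}

/* Check without the page lock; freeze slips in before the page is dirtied. */
static bool racy_fault_dirties_frozen_fs(void)
{
	struct model_fs fs = { false, false, false };
	bool saw_frozen = fs.frozen;	/* t2: check passes */

	model_freeze(&fs);		/* t4..tn: freeze completes */
	if (!saw_frozen)
		fs.page_dirty = true;	/* tn+1: fault path dirties the page */
	return fs.frozen && fs.page_dirty;
}

/* Check under the page lock; the page is dirtied before the lock is dropped. */
static bool locked_fault_dirties_frozen_fs(void)
{
	struct model_fs fs = { false, false, false };
	bool saw_frozen, freeze_done;

	fs.page_locked = true;
	saw_frozen = fs.frozen;		/* check with the page lock held */
	freeze_done = model_freeze(&fs);
	assert(!freeze_done);		/* freeze has to wait for the lock */
	if (!saw_frozen)
		fs.page_dirty = true;	/* dirtied while still locked */
	fs.page_locked = false;
	model_freeze(&fs);		/* freeze resumes and writes the page */
	return fs.frozen && fs.page_dirty;
}
```

In this model racy_fault_dirties_frozen_fs() ends with a dirty page on a
frozen filesystem, while locked_fault_dirties_frozen_fs() does not, because
the waiting freeze writes the page out once the lock is released.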

Doing this cleanly requires some cleanups to ext4_page_mkwrite() (but
stable pages during writeback need that as well so it's a reasonable thing
to do). So something like attached patches should do what's needed - it's
lightly tested with fsx in delalloc, nodelalloc, and data=journal configs.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-fs-Create-__block_page_mkwrite-helper-passing-error-.patch --]
[-- Type: text/x-patch, Size: 2511 bytes --]

>From dc11ee148c44dda41c0c2315bdaf26b86f34b201 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 26 Apr 2011 21:12:53 +0200
Subject: [PATCH 1/3] fs: Create __block_page_mkwrite() helper passing error values back

Create __block_page_mkwrite() helper which does all what block_page_mkwrite()
does except that it passes back errors from __block_write_begin /
block_commit_write calls.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c                 |   26 ++++++++++++++++++--------
 include/linux/buffer_head.h |    2 ++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..469c832 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2333,7 +2333,7 @@ EXPORT_SYMBOL(block_commit_write);
  * unlock the page.
  */
 int
-block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+__block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 		   get_block_t get_block)
 {
 	struct page *page = vmf->page;
@@ -2361,18 +2361,28 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (!ret)
 		ret = block_commit_write(page, 0, end);
 
-	if (unlikely(ret)) {
+	if (unlikely(ret < 0))
 		unlock_page(page);
-		if (ret == -ENOMEM)
-			ret = VM_FAULT_OOM;
-		else /* -ENOSPC, -EIO, etc */
-			ret = VM_FAULT_SIGBUS;
-	} else
+	else
 		ret = VM_FAULT_LOCKED;
-
 out:
 	return ret;
 }
+EXPORT_SYMBOL(__block_page_mkwrite);
+
+int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+		   get_block_t get_block)
+{
+	int ret = __block_page_mkwrite(vma, vmf, get_block);
+
+	if (unlikely(ret < 0)) {
+		if (ret == -ENOMEM)
+			return VM_FAULT_OOM;
+		/* -ENOSPC, -EIO, etc */
+		return VM_FAULT_SIGBUS;
+	}
+	return ret;
+}
 EXPORT_SYMBOL(block_page_mkwrite);
 
 /*
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index f5df235..0b719b0 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -217,6 +217,8 @@ int cont_write_begin(struct file *, struct address_space *, loff_t,
 			get_block_t *, loff_t *);
 int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
+int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+				get_block_t get_block);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 				get_block_t get_block);
 sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
-- 
1.7.1


[-- Attachment #3: 0002-ext4-Rewrite-ext4_page_mkwrite-to-return-locked-page.patch --]
[-- Type: text/x-patch, Size: 4876 bytes --]

>From 6c333a1a5a577672f4ea0114e0fc430531097788 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 26 Apr 2011 20:48:13 +0200
Subject: [PATCH 2/3] ext4: Rewrite ext4_page_mkwrite() to return locked page

ext4_page_mkwrite() does not return page locked. This makes it hard
to avoid races with filesystem freezing code (so that we don't leave
writeable page on a frozen fs) or writeback code (so that we allow page
to be stable during writeback).

Also the current code uses i_alloc_sem to avoid races with truncate but that
seems to be the wrong locking order according to lock ordering documented in
mm/rmap.c.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |  101 ++++++++++++++++++++++++++++++++++--------------------
 1 files changed, 63 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..377fed0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5788,40 +5788,53 @@ static int ext4_bh_unmapped(handle_t *handle, struct buffer_head *bh)
 	return !buffer_mapped(bh);
 }
 
+static int ext4_journalled_fault_fn(handle_t *handle, struct buffer_head *bh)
+{
+	if (!buffer_dirty(bh))
+		return 0;
+	return ext4_handle_dirty_metadata(handle, NULL, bh);
+}
+
 int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
 	loff_t size;
 	unsigned long len;
-	int ret = -EINVAL;
-	void *fsdata;
+	int ret;
 	struct file *file = vma->vm_file;
 	struct inode *inode = file->f_path.dentry->d_inode;
 	struct address_space *mapping = inode->i_mapping;
+	handle_t handle;
+	get_block_t get_block;
+	int retries = 0;
 
-	/*
-	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
-	 * get i_mutex because we are already holding mmap_sem.
-	 */
-	down_read(&inode->i_alloc_sem);
+	/* Delalloc case is easy... */
+	if (test_opt(inode->i_sb, DELALLOC) &&
+	    !ext4_should_journal_data(inode) &&
+	    !ext4_nonda_switch(inode->i_sb)) {
+		do {
+			ret = __block_page_mkwrite(vma, vmf,
+						   ext4_da_get_block_prep);
+		} while (ret == -ENOSPC &&
+		       ext4_should_retry_alloc(inode->i_sb, &retries));
+		goto out_ret;
+	}
+
+	lock_page(page);
 	size = i_size_read(inode);
-	if (page->mapping != mapping || size <= page_offset(page)
-	    || !PageUptodate(page)) {
-		/* page got truncated from under us? */
-		goto out_unlock;
+	/* Page got truncated from under us? */
+	if (page->mapping != mapping || page_offset(page) > size) {
+		unlock_page(page);
+ 		ret = VM_FAULT_NOPAGE;
+		goto out;
 	}
-	ret = 0;
-	if (PageMappedToDisk(page))
-		goto out_unlock;
 
 	if (page->index == size >> PAGE_CACHE_SHIFT)
 		len = size & ~PAGE_CACHE_MASK;
 	else
 		len = PAGE_CACHE_SIZE;
-
-	lock_page(page);
 	/*
-	 * return if we have all the buffers mapped. This avoid
+	 * Return if we have all the buffers mapped. This avoid
 	 * the need to call write_begin/write_end which does a
 	 * journal_start/journal_stop which can block and take
 	 * long time
@@ -5829,30 +5842,42 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
-			unlock_page(page);
-			goto out_unlock;
+			ret = VM_FAULT_LOCKED;
+			goto out;
 		}
 	}
 	unlock_page(page);
-	/*
-	 * OK, we need to fill the hole... Do write_begin write_end
-	 * to do block allocation/reservation.We are not holding
-	 * inode.i__mutex here. That allow * parallel write_begin,
-	 * write_end call. lock_page prevent this from happening
-	 * on the same page though
-	 */
-	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
-			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
-	if (ret < 0)
-		goto out_unlock;
-	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
-			len, len, page, fsdata);
-	if (ret < 0)
-		goto out_unlock;
-	ret = 0;
-out_unlock:
-	if (ret)
+	/* OK, we need to fill the hole... */
+	if (ext4_should_dioread_nolock(inode))
+		get_block = ext4_get_block_write;
+	else
+		get_block = ext4_get_block;
+retry:
+	handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
+	if (IS_ERR(handle)) {
 		ret = VM_FAULT_SIGBUS;
-	up_read(&inode->i_alloc_sem);
+		goto out;
+	}
+	ret = __block_page_mkwrite(vma, vmf, get_block);
+	if (ret == VM_FAULT_LOCKED && ext4_should_journal_data(inode)) {
+		if (walk_page_buffers(handle, page_buffers(page), 0,
+		 	  PAGE_CACHE_SIZE, NULL, ext4_journalled_fault_fn)) {
+			unlock_page(page);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+	}
+	ext4_journal_end(handle);
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry;
+out_ret:
+	if (ret < 0) {
+		if (ret == -ENOMEM)
+			ret = VM_FAULT_OOM;
+		else
+			ret = VM_FAULT_SIGBUS;
+	}
+out:
 	return ret;
 }
-- 
1.7.1


[-- Attachment #4: 0003-ext4-Block-mmapped-writes-while-the-fs-is-frozen.patch --]
[-- Type: text/x-patch, Size: 4146 bytes --]

>From ee1f2f8cdea23cf19b34e51b4f78e040ce898976 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 3 May 2011 17:00:35 +0200
Subject: [PATCH 3/3] ext4: Block mmapped writes while the fs is frozen

We should not allow file modification via mmap while the filesystem is
frozen. So block in ext4_page_mkwrite() while the filesystem is frozen.

We have to check for frozen filesystem under page lock with which we then
return from ext4_page_mkwrite(). Only that way we cannot race with writeback
done by freezing code - either we lock the page after the writeback has
started, see freezing in progress and block, or writeback will wait for our
page lock which is released only when the fault is done and then writeback
will writeout and writeprotect the page again.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |   41 ++++++++++++++++++++++++-----------------
 1 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 377fed0..6faadaf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5788,13 +5788,6 @@ static int ext4_bh_unmapped(handle_t *handle, struct buffer_head *bh)
 	return !buffer_mapped(bh);
 }
 
-static int ext4_journalled_fault_fn(handle_t *handle, struct buffer_head *bh)
-{
-	if (!buffer_dirty(bh))
-		return 0;
-	return ext4_handle_dirty_metadata(handle, NULL, bh);
-}
-
 int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
@@ -5804,10 +5797,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct file *file = vma->vm_file;
 	struct inode *inode = file->f_path.dentry->d_inode;
 	struct address_space *mapping = inode->i_mapping;
-	handle_t handle;
-	get_block_t get_block;
+	handle_t *handle;
+	get_block_t *get_block;
 	int retries = 0;
 
+restart:
+	/*
+	 * This check is racy but catches the common case. The check at the
+	 * end of this function is reliable.
+	 */
+	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
 	    !ext4_should_journal_data(inode) &&
@@ -5834,10 +5833,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	else
 		len = PAGE_CACHE_SIZE;
 	/*
-	 * Return if we have all the buffers mapped. This avoid
-	 * the need to call write_begin/write_end which does a
-	 * journal_start/journal_stop which can block and take
-	 * long time
+	 * Return if we have all the buffers mapped. This avoids the need to do
+	 * journal_start/journal_stop which can block and take a long time
 	 */
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
@@ -5852,7 +5849,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		get_block = ext4_get_block_write;
 	else
 		get_block = ext4_get_block;
-retry:
+retry_alloc:
 	handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
 	if (IS_ERR(handle)) {
 		ret = VM_FAULT_SIGBUS;
@@ -5861,16 +5858,16 @@ retry:
 	ret = __block_page_mkwrite(vma, vmf, get_block);
 	if (ret == VM_FAULT_LOCKED && ext4_should_journal_data(inode)) {
 		if (walk_page_buffers(handle, page_buffers(page), 0,
-		 	  PAGE_CACHE_SIZE, NULL, ext4_journalled_fault_fn)) {
+		 	  PAGE_CACHE_SIZE, NULL, do_journal_get_write_access)) {
 			unlock_page(page);
 			ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
 		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
 	}
-	ext4_journal_end(handle);
+	ext4_journal_stop(handle);
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-		goto retry;
+		goto retry_alloc;
 out_ret:
 	if (ret < 0) {
 		if (ret == -ENOMEM)
@@ -5879,5 +5876,15 @@ out_ret:
 			ret = VM_FAULT_SIGBUS;
 	}
 out:
+	/*
+	 * Freezing in progress? We check with page lock held so if the test
+	 * here fails, we are sure freezing code will wait until the page
+	 * fault is done - at that point page will be dirty and unlocked so
+	 * freezing code will writeprotect it again.
+	 */
+	if (ret == VM_FAULT_LOCKED && inode->i_sb->s_frozen != SB_UNFROZEN) {
+		unlock_page(page);
+		goto restart;
+	}
 	return ret;
 }
-- 
1.7.1
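
The ordering argued for in the commit message above (a cheap early check,
then a second check made only once the page lock is held) can be modelled in
userspace. This is a minimal sketch under stated assumptions, not the ext4
code: every toy_* name is invented for illustration, the "page lock" is a
plain flag, and TOY_WOULD_BLOCK stands in for the places where the patch
sleeps in vfs_check_frozen() or jumps back to restart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are kernel types or kernel APIs. */
enum toy_frozen { TOY_UNFROZEN, TOY_FREEZE_WRITE };
enum toy_ret    { TOY_FAULT_LOCKED, TOY_WOULD_BLOCK };

struct toy_sb   { enum toy_frozen s_frozen; };
struct toy_page { bool locked; };

/* Hypothetical hook: lets a caller run "the freezer" inside the race window. */
typedef void (*toy_race_hook)(struct toy_sb *sb);

static void toy_start_freeze(struct toy_sb *sb)
{
    sb->s_frozen = TOY_FREEZE_WRITE;
}

static enum toy_ret toy_page_mkwrite(struct toy_sb *sb, struct toy_page *page,
                                     toy_race_hook hook)
{
    assert(sb && page);

    /* Early check: catches the common case, but the state may change next. */
    if (sb->s_frozen != TOY_UNFROZEN)
        return TOY_WOULD_BLOCK;      /* the patch sleeps here instead */

    if (hook)
        hook(sb);                    /* a freeze may begin in this window */

    page->locked = true;             /* the fault path now holds the page lock */

    /* Reliable check: made while the page lock is held. */
    if (sb->s_frozen != TOY_UNFROZEN) {
        page->locked = false;        /* mirrors unlock_page() + goto restart */
        return TOY_WOULD_BLOCK;
    }
    return TOY_FAULT_LOCKED;         /* caller gets the page back locked */
}
```

Calling toy_page_mkwrite() with a NULL hook returns TOY_FAULT_LOCKED with
the page locked; passing toy_start_freeze as the hook makes the second check
fire, so the page is unlocked again and the caller is told to wait.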



* Re: your mail
  2011-05-03 13:56                                             ` Surbhi Palande
@ 2011-05-03 15:26                                               ` Surbhi Palande
  2011-05-03 15:36                                               ` Jan Kara
  1 sibling, 0 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-03 15:26 UTC (permalink / raw)
  To: surbhi.palande
  Cc: Jan Kara, toshi.okajima, tytso, m.mizuma, adilger.kernel,
	linux-ext4, linux-fsdevel, sandeen

On 05/03/2011 04:56 PM, Surbhi Palande wrote:
> On 05/03/2011 04:46 PM, Jan Kara wrote:
>> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>
> Sorry for missing the subject line :(
>>> On munmap() zap_pte_range() is called which dirties the PTE dirty
>>> pages as
>>> Toshiyuki pointed out.
>>>
>>> zap_pte_range()
>>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>>
>>> So, I think that it is here that we should do the checking for a ext4
>>> F.S
>>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>>
>>> Attaching a patch for initial review. Please do let me know your
>>> thoughts!
>> This is definitely the wrong place. ->set_page_dirty() callbacks are
>> called with various locks held and the page need not be locked (thus
>> dereferencing page->mapping is oopsable). Moreover this particular
>> callback
>> is called only in data=journal mode.
> Ok! Thanks for that!
>
>>
>> Believe me, the right place is page_mkwrite() - you have to catch the
>> read-only => read-write page transition. Once the page is mapped
>> read-write, you've already lost the race.
Also, do we then need to prevent a munmap()/zap_pte_range() call from
dirtying an mmapped file page when the F.S is frozen?

Warm Regards,
Surbhi.

>
> My only point is:
> 1) something should prevent the freeze from happening. We cant merely
> check the vfs_check_frozen()?
>
> And this should be done where the page is marked dirty.Also, I thought
> that the page is marked read-write only in the page table in the
> __do_page_fault()? i.e the zap_pte_range() marks them dirty in the page
> cache? Is this understanding right?
>
> IMHO, whatever code dirties the page in the page cache should call a F.S
> specific function and let it _prevent_ a fsfreeze while the page is
> getting dirtied, so that a freeze called after this point flushes this
> page!
>
> Warm Regards,
> Surbhi.
>
>>
>> Honza
>



* Re: your mail
  2011-05-03 13:56                                             ` Surbhi Palande
  2011-05-03 15:26                                               ` Surbhi Palande
@ 2011-05-03 15:36                                               ` Jan Kara
  2011-05-03 15:43                                                 ` Surbhi Palande
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-03 15:36 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Jan Kara, toshi.okajima, tytso, m.mizuma, adilger.kernel,
	linux-ext4, linux-fsdevel, sandeen

On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
> On 05/03/2011 04:46 PM, Jan Kara wrote:
> >On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> 
> Sorry for missing the subject line :(
> >>On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> >>Toshiyuki pointed out.
> >>
> >>zap_pte_range()
> >>   mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
> >>
> >>So, I think that it is here that we should do the checking for a ext4 F.S
> >>frozen state and also prevent a parallel ext4 F.S freeze from happening.
> >>
> >>Attaching a patch for initial review. Please do let me know your thoughts!
> >   This is definitely the wrong place. ->set_page_dirty() callbacks are
> >called with various locks held and the page need not be locked (thus
> >dereferencing page->mapping is oopsable). Moreover this particular callback
> >is called only in data=journal mode.
> Ok! Thanks for that!
> 
> >
> >Believe me, the right place is page_mkwrite() - you have to catch the
> >read-only =>  read-write page transition. Once the page is mapped
> >read-write, you've already lost the race.
> 
> My only point is:
> 1) something should prevent the freeze from happening. We cant
> merely check the vfs_check_frozen()?
  Yes, I agree - see my other email with patches.

> And this should be done where the page is marked dirty.Also, I
> thought that the page is marked read-write only in the page table in
> the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
> the page cache? Is this understanding right?
  The page can become dirty either because it was written via a standard
write - write_begin is responsible for a reliable check here - or because
it was written via mmap - here we rely on page_mkwrite to do a reliable
check; it is analogous to the write_begin callback. There should be no
other way to dirty a page.

With dirty bits it is a bit complicated. We in fact have two of them: one
in the page table entry, maintained by the MMU, and one in struct page,
maintained by the kernel. Some functions (such as zap_pte_range()) copy the
dirty bits from the page table into struct page. This is a lazy process, so
a page can in principle have new data without a dirty bit set in struct page
because we have not yet copied the dirty bit from the page table. Only at
moments where it is important (like when we want to unmap the page, or throw
away the page, or so) do we make sure the struct page and page table bits
are in sync.

Another subtle thing you need to be aware of is that when we clear the page
dirty bit, we also writeprotect the page. So we are guaranteed to get a
page fault when the page is written to again.

> IMHO, whatever code dirties the page in the page cache should call a
> F.S specific function and let it _prevent_ a fsfreeze while the page
> is getting dirtied, so that a freeze called after this point flushes
> this page!
  Agreed, that's what code in write_begin() and page_mkwrite() should
achieve.
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
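
The two dirty bits and the writeprotect-on-clean rule explained in the
message above can be modelled in a few lines of userspace C. This is only a
sketch under stated assumptions: struct toy_pte, struct toy_page and the
toy_* helpers are invented for illustration and are not kernel types or
functions; mkwrite_calls simply counts how often a write fault (the point
where ->page_mkwrite() would run) is taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins, not kernel structures. */
struct toy_pte  { bool writable; bool dirty; };   /* bits kept by the "MMU" */
struct toy_page { bool dirty; int mkwrite_calls; };

/* A store through the mapping. */
static void toy_store(struct toy_pte *pte, struct toy_page *page)
{
    assert(pte && page);
    if (!pte->writable) {
        /* Write fault: this is where ->page_mkwrite() would be called. */
        page->mkwrite_calls++;
        pte->writable = true;
    }
    /* Only the page table bit is set; struct page is not touched yet. */
    pte->dirty = true;
}

/* Lazy sync (for example at unmap time): copy the PTE bit into the page. */
static void toy_sync_dirty(struct toy_pte *pte, struct toy_page *page)
{
    assert(pte && page);
    if (pte->dirty) {
        page->dirty = true;
        pte->dirty = false;
    }
}

/* Cleaning for writeback: clear the dirty state and writeprotect again. */
static void toy_clean_for_writeback(struct toy_pte *pte, struct toy_page *page)
{
    toy_sync_dirty(pte, page);
    page->dirty = false;
    pte->writable = false;   /* the next store must fault again */
}
```

In this model a store after toy_clean_for_writeback() always takes a new
write fault, which is the property that makes page_mkwrite() a reliable
place for the frozen check.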


* Re: your mail
  2011-05-03 15:36                                               ` Jan Kara
@ 2011-05-03 15:43                                                 ` Surbhi Palande
  2011-05-04 19:24                                                   ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-03 15:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen

On 05/03/2011 06:36 PM, Jan Kara wrote:
> On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
>> On 05/03/2011 04:46 PM, Jan Kara wrote:
>>> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>>
>> Sorry for missing the subject line :(
>>>> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
>>>> Toshiyuki pointed out.
>>>>
>>>> zap_pte_range()
>>>>    mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>>>
>>>> So, I think that it is here that we should do the checking for a ext4 F.S
>>>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>>>
>>>> Attaching a patch for initial review. Please do let me know your thoughts!
>>>    This is definitely the wrong place. ->set_page_dirty() callbacks are
>>> called with various locks held and the page need not be locked (thus
>>> dereferencing page->mapping is oopsable). Moreover this particular callback
>>> is called only in data=journal mode.
>> Ok! Thanks for that!
>>
>>>
>>> Believe me, the right place is page_mkwrite() - you have to catch the
>>> read-only =>   read-write page transition. Once the page is mapped
>>> read-write, you've already lost the race.
>>
>> My only point is:
>> 1) something should prevent the freeze from happening. We cant
>> merely check the vfs_check_frozen()?
>    Yes, I agree - see my other email with patches.
>
>> And this should be done where the page is marked dirty.Also, I
>> thought that the page is marked read-write only in the page table in
>> the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
>> the page cache? Is this understanding right?
>    The page can become dirty either because it was written via standard
> write - write_begin is responsible for reliable check here - or it was
> written via mmap - here we rely on page_mkwrite to do a reliable check -
> it is analogous to write_begin callback. There should be no other way
> to dirty a page.
>
> With dirty bits it is a bit complicated. We have two of them in fact. One
> in page table entry maintained by mmu and one in page structure maintained
> by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
> from page table into struct page. This is a lazy process so page can in
> principle have new data without a dirty bit set in struct page because we
> have not yet copied the dirty bit from page table. Only at moments where it
> is important (like when we want to unmap the page, or throw away the page,
> or so), we make sure struct page and page table bits are in sync.
>
> Another subtle thing you need to be aware of is that when we clear the page
> dirty bit, we also writeprotect the page. So we are guaranteed to get a
> page fault when the page is written to again.
>
>> IMHO, whatever code dirties the page in the page cache should call a
>> F.S specific function and let it _prevent_ a fsfreeze while the page
>> is getting dirtied, so that a freeze called after this point flushes
>> this page!
>    Agreed, that's what code in write_begin() and page_mkwrite() should
> achieve.
> 								Honza
Thanks a lot for the wonderful explanation :)

How about the revert, i.e. calling jbd2_journal_unlock_updates() from
ext4_unfreeze() instead of from ext4_freeze()? Do you agree with that?


Warm Regards,
Surbhi.



* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-03  7:27                                           ` Surbhi Palande
@ 2011-05-03 20:14                                             ` Eric Sandeen
  2011-05-04  8:26                                               ` Surbhi Palande
  0 siblings, 1 reply; 121+ messages in thread
From: Eric Sandeen @ 2011-05-03 20:14 UTC (permalink / raw)
  To: surbhi.palande
  Cc: Jan Kara, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel,
	Christoph Hellwig

On 5/3/11 2:27 AM, Surbhi Palande wrote:
> On 05/02/2011 05:04 PM, Eric Sandeen wrote:
>> On 5/2/11 8:22 AM, Surbhi Palande wrote:
>>> On 05/02/2011 04:16 PM, Jan Kara wrote:
>>>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>>>> If you don't allow the page to be dirtied in the first place, then
>>>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>>>>     Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>>>> places than just thawing the filesystem.  Anywhere this is done should
>>>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>>>>     That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>>>> fs code with s_umount held).
>>>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>>>> by definition, is a clean filesystem, and therefore sync calls of any
>>>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>>>> filesystem a bug in ext3/4...
>>>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>>>> xfs_file_aio_write()
>>>>>>>>>     xfs_wait_for_freeze()
>>>>>>>>>       vfs_check_frozen()
>>>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>>>
>>>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>>>> --------------------
>>>>>>>>> xfs_file_aio_write()
>>>>>>>>>     xfs_wait_for_freeze()
>>>>>>>>>       vfs_check_frozen()
>>>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>>>> control.
>>>>>>>>>
>>>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>>>> --------------------
>>>>>>>>>
>>>>>>>>> Can anyone please enlighten me on how & why this preemption is _not_
>>>>>>>>> possible?
>>>>>>> Thanks for your reply.
>>>>>>>>     XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>>>> Agreed.
>>>>>>>>    or it manages
>>>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>>>> check in xfs_file_aio_write()...
>>>>>>>>
>>>>>>>>
>>>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>>>> freezing a superblock (as the write process stopped just before
>>>>>>> writing anything for this transaction and has not taken any locks?)
>>>>>>     So ext4_freeze() does
>>>>>> jbd2_journal_lock_updates(journal)
>>>>>>     which waits for all running transactions to finish and updates
>>>>>> j_barrier_count which stops any new ones from proceeding (check
>>>>>> function start_this_handle()).
>>>>>>
>>>>> Yes, but ext4_freeze() also calls
>>>>> jbd2_journal_unlock_updates(journal) which decrements the
>>>>> j_barrier_count (which was previously updated/incremented in
>>>>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>>>>> new transaction/handle can be accepted/started.
>>>>>
>>>>> A comment in ext4_freeze() says:
>>>>> /* we rely on s_frozen to stop further updates */
>>>>> (before calling jbd2_journal_unlock_updates())
>>>>     Ah, drat, you're right. I've missed this other part. It's the problem
>>>> that if you expect to see something, you'll see it regardless of the real
>>>> code ;).
>>>>
>>>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>>>> it's still racy (although the race window is relatively small) because the
>>>> filesystem can become frozen the instant after we check vfs_check_frozen().
>>>> Commit 6b0310fb broke it for ext4.
>>>>
>>>> I guess the code was mostly copied from XFS which seems to have the same
>>>> problem in xfs_trans_alloc() since the git history beginning. I see two
>>>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>>>> a transaction and if the filesystem is being frozen, we stop the
>>>> transaction, wait for fs to get unfrozen, and restart. Another option is
>>>> to create an analogous logic using a atomic counter of write ops in vfs
>>>> that could be used by all filesystems. We'd just have to replace
>>>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>>>> appropriate places...
>>> How about calling  jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>>> from ext4_unfreeze()?
>> we used to have that, but holding it locked until then means we exit the kernel
>> with a mutex held, which is pretty icky.
>>
>>      ================================================
>>      [ BUG: lock held when returning to user space! ]
>>      ------------------------------------------------
>>      lvcreate/1075 is leaving the kernel with locks still held!
>>      1 lock held by lvcreate/1075:
>>       #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>>      jbd2_journal_lock_updates+0xe1/0xf0
>>
>>
>> -Eric
> Should this not be reverted? I think that it's a lot easier to stop a transaction between a freeze and a thaw that way! If you agree, can I send a patch for the same?

Only if you want the kernel to start spewing "BUG!" messages again...

-Eric

> Thanks!
> 
> Warm Regards,
> Surbhi.
> 
> 
>>> So that indeed no transactions can be started before unfreeze is called.
>>>
>>> This has another advantage, that it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen) Otherwise we also need to fix this path.
>>>
>>> Warm Regards,
>>> Surbhi.
>>>
>>>> Dave, Christoph, any opinions on this?
>>>>                                  Honza
> 
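
The race discussed in the quoted exchange (the barrier is raised and then
dropped again inside the freeze, so only the s_frozen check is left to stop
new handles) can be reproduced deterministically in a toy model. This is a
sketch under stated assumptions: the toy_* names are invented, the journal
is reduced to a counter, and the hook parameter is a hypothetical device for
running the freezer exactly between the check and the handle start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; not jbd2 or ext4 structures. */
struct toy_journal { int j_barrier_count; };
struct toy_sb      { struct toy_journal journal; bool s_frozen; };

static void toy_lock_updates(struct toy_journal *j)
{
    j->j_barrier_count++;
}

static void toy_unlock_updates(struct toy_journal *j)
{
    assert(j->j_barrier_count > 0);
    j->j_barrier_count--;
}

/* A new handle may start only while no barrier is raised. */
static bool toy_start_this_handle(const struct toy_journal *j)
{
    return j->j_barrier_count == 0;
}

/* Freeze as described in the thread: the barrier is dropped before return. */
static void toy_freeze(struct toy_sb *sb)
{
    toy_lock_updates(&sb->journal);
    sb->s_frozen = true;             /* "we rely on s_frozen" from here on */
    toy_unlock_updates(&sb->journal);
}

typedef void (*toy_race_hook)(struct toy_sb *sb);

/* Returns true if a handle was started. */
static bool toy_journal_start_sb(struct toy_sb *sb, toy_race_hook hook)
{
    if (sb->s_frozen)
        return false;                /* the kernel would sleep here */
    if (hook)
        hook(sb);                    /* freezer runs to completion here */
    return toy_start_this_handle(&sb->journal);
}
```

With toy_freeze passed as the hook, toy_journal_start_sb() returns true even
though s_frozen is already set when it returns, which is the window the
thread is arguing about. Keeping the barrier raised until thaw would close
it in this model, at the cost Eric describes (a lock held across the return
to user space).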



* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-03 20:14                                             ` Eric Sandeen
@ 2011-05-04  8:26                                               ` Surbhi Palande
  2011-05-04 14:30                                                 ` Eric Sandeen
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-04  8:26 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel,
	Christoph Hellwig

On 05/03/2011 11:14 PM, Eric Sandeen wrote:
> On 5/3/11 2:27 AM, Surbhi Palande wrote:
>> On 05/02/2011 05:04 PM, Eric Sandeen wrote:
>>> On 5/2/11 8:22 AM, Surbhi Palande wrote:
>>>> On 05/02/2011 04:16 PM, Jan Kara wrote:
>>>>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>>>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>>>>> If you don't allow the page to be dirtied in the first place, then
>>>>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>>>>>      Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>>>>> places than just thawing the filesystem.  Anywhere this is done should
>>>>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>>>>>      That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>>>>> fs code with s_umount held).
>>>>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>>>>> by definition, is a clean filesystem, and therefore sync calls of any
>>>>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>>>>> filesystem a bug in ext3/4...
>>>>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>>>>> xfs_file_aio_write()
>>>>>>>>>>      xfs_wait_for_freeze()
>>>>>>>>>>        vfs_check_frozen()
>>>>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>>>>
>>>>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>>>>> --------------------
>>>>>>>>>> xfs_file_aio_write()
>>>>>>>>>>      xfs_wait_for_freeze()
>>>>>>>>>>        vfs_check_frozen()
>>>>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>>>>> control.
>>>>>>>>>>
>>>>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>>>>> --------------------
>>>>>>>>>>
>>>>>>>>>> Can anyone please enlighten me on how & why this preemption is _not_
>>>>>>>>>> possible?
>>>>>>>> Thanks for your reply.
>>>>>>>>>      XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>>>>> Agreed.
>>>>>>>>>     or it manages
>>>>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>>>>> check in xfs_file_aio_write()...
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>>>>> freezing a superblock (as the write process stopped just before
>>>>>>>> writing anything for this transaction and has not taken any locks?)
>>>>>>>      So ext4_freeze() does
>>>>>>> jbd2_journal_lock_updates(journal)
>>>>>>>      which waits for all running transactions to finish and updates
>>>>>>> j_barrier_count which stops any new ones from proceeding (check
>>>>>>> function start_this_handle()).
>>>>>>>
>>>>>> Yes, but ext4_freeze() also calls
>>>>>> jbd2_journal_unlock_updates(journal) which decrements the
>>>>>> j_barrier_count (which was previously updated/incremented in
>>>>>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>>>>>> new transaction/handle can be accepted/started.
>>>>>>
>>>>>> A comment in ext4_freeze() says:
>>>>>> /* we rely on s_frozen to stop further updates */
>>>>>> (before calling jbd2_journal_unlock_updates())
>>>>>      Ah, drat, you're right. I've missed this other part. It's the problem
>>>>> that if you expect to see something, you'll see it regardless of the real
>>>>> code ;).
>>>>>
>>>>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>>>>> it's still racy (although the race window is relatively small) because the
>>>>> filesystem can become frozen the instant after we check vfs_check_frozen().
>>>>> Commit 6b0310fb broke it for ext4.
>>>>>
>>>>> I guess the code was mostly copied from XFS which seems to have the same
>>>>> problem in xfs_trans_alloc() since the git history beginning. I see two
>>>>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>>>>> a transaction and if the filesystem is being frozen, we stop the
>>>>> transaction, wait for fs to get unfrozen, and restart. Another option is
>>>>> to create an analogous logic using a atomic counter of write ops in vfs
>>>>> that could be used by all filesystems. We'd just have to replace
>>>>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>>>>> appropriate places...
>>>> How about calling  jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>>>> from ext4_unfreeze()?
>>> we used to have that, but holding it locked until then means we exit the kernel
>>> with a mutex held, which is pretty icky.
>>>
>>>       ================================================
>>>       [ BUG: lock held when returning to user space! ]
>>>       ------------------------------------------------
>>>       lvcreate/1075 is leaving the kernel with locks still held!
>>>       1 lock held by lvcreate/1075:
>>>        #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>>>       jbd2_journal_lock_updates+0xe1/0xf0
>>>
>>>
>>> -Eric
>> Should this not be reverted? I think that it's a lot easier to stop a transaction between a freeze and a thaw that way! If you agree, can I send a patch for the same?
> Only if you want the kernel to start spewing "BUG!" messages again...
>
> -Eric
But then you need a much more complicated way to stop accepting
transactions and writes between the freeze and the thaw (in both the
write path and the read path). Is this not much simpler?

Warm Regards,
Surbhi.
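
For comparison, the counter-based alternative mentioned in the quoted text
(vfs_start_write() / vfs_stop_write() were only proposed names there) might
look roughly like the following. This is a sketch under stated assumptions,
not an existing kernel interface: all toy_* names are invented, C11 atomics
stand in for whatever primitives the VFS would use, and returning false
stands in for sleeping until the filesystem is thawed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy superblock state; not a kernel structure. */
struct toy_sb {
    atomic_int  writers;    /* write operations currently in flight */
    atomic_bool freezing;   /* set once a freeze has been requested */
};

static void toy_sb_init(struct toy_sb *sb)
{
    atomic_init(&sb->writers, 0);
    atomic_init(&sb->freezing, false);
}

/* Returns true if the caller may modify the fs; false means "wait for thaw". */
static bool toy_start_write(struct toy_sb *sb)
{
    /* Announce ourselves first, then look at the flag. */
    atomic_fetch_add(&sb->writers, 1);
    if (atomic_load(&sb->freezing)) {
        /* A freeze was requested: back out instead of racing with it. */
        atomic_fetch_sub(&sb->writers, 1);
        return false;
    }
    return true;
}

static void toy_stop_write(struct toy_sb *sb)
{
    int prev = atomic_fetch_sub(&sb->writers, 1);
    assert(prev > 0);
    (void)prev;
}

/* The freezer sets the flag first, then waits for the counter to drain. */
static void toy_freeze_begin(struct toy_sb *sb)
{
    atomic_store(&sb->freezing, true);
}

static bool toy_freeze_may_proceed(struct toy_sb *sb)
{
    return atomic_load(&sb->writers) == 0;
}

static void toy_thaw(struct toy_sb *sb)
{
    atomic_store(&sb->freezing, false);
}
```

The point of the ordering is that a writer increments before it checks and
the freezer sets the flag before it checks, so with sequentially consistent
atomics at least one side always notices the other. No lock is held across
the return to user space in this scheme.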

>> Thanks!
>>
>> Warm Regards,
>> Surbhi.
>>
>>
>>>> So that indeed no transactions can be started before unfreeze is called.
>>>>
>>>> This has another advantage, that it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen) Otherwise we also need to fix this path.
>>>>
>>>> Warm Regards,
>>>> Surbhi.
>>>>
>>>>> Dave, Christoph, any opinions on this?
>>>>>                                   Honza



* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-03 15:19                                         ` [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock Jan Kara
@ 2011-05-04 12:09                                           ` Surbhi Palande
  2011-05-04 19:19                                             ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-04 12:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, sandeen

On 05/03/2011 06:19 PM, Jan Kara wrote:
> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>> On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
>>> (2011/04/16 2:13), Jan Kara wrote:
>>>> Hello,
>>>>
>>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>>>> probably
>>>>>> get modified to block while minor-faulting the page on frozen fs
>>>>>> because
>>>>>> when blocks are already allocated we may skip starting a transaction
>>>>>> and so
>>>>>> we could possibly modify the filesystem.
>>>>> OK. I think ->page_mkwrite() should also block writing the
>>>>> minor-faulting pages.
>>>>>
>>>>> (minor-pagefault)
>>>>> ->  do_wp_page()
>>>>> ->  page_mkwrite(= ext4_mkwrite())
>>>>> =>  BLOCK!
>>>>>
>>>>> (major-pagefault)
>>>>> ->  do_linear_fault()
>>>>> ->  page_mkwrite(= ext4_mkwrite())
>>>>> =>  BLOCK!
>>>>>
>>>>>>
>>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>>>> file (mmap).
>>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>>>> while
>>>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>>>> operation
>>>>>>>>> while fsfreezing.
>>>>>>>> Technically speaking, we block all the transaction starts which
>>>>>>>> means we
>>>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>>>> not mean
>>>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>>>> properly
>>>>>>>> note the mmap case is one of such exceptions.
>>>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>>>> can't allow
>>>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>>>> path can
>>>>>>> write to disk while fsfreezing because this deadlock problem
>>>>>>> happens after
>>>>>>> fsfreeze operation is done...
>>>>>> I'm sorry I don't understand now - are you speaking about the case
>>>>>> above
>>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>>> else?
>>>>> Sorry, I didn't understand around the page fault path.
>>>>> So, I had read the kernel source code around it, then I maybe
>>>>> understand...
>>>>>
>>>>> I worry whether we can update the file data in mmap case while
>>>>> fsfreezing.
>>>>> Of course, I understand that we can write to in-memory cache, and it
>>>>> is not a
>>>>> problem. However, if we can write to disk while fsfreezing, it is a
>>>>> problem.
>>>>> So, I summarize the cases whether we can write to disk or not.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>>> while fsfreezing)
>>>>>
>>>>> [1] One of the page which has been mmapped is not bound. And
>>>>> the page is not allocated yet. (major fault?)
>>>>>
>>>>> (1) user dirtys a page
>>>>> (2) a page fault occurs (do_page_fault)
>>>>> (3) __do_fault is called.
>>>>> (4) ext4_page_mkwrite is called
>>>>> (5) ext4_write_begin is called
>>>>> (6) ext4_journal_start_sb =>  We can STOP!
>>>>>
>>>>> [2] One of the page which has been mmapped is not bound. But
>>>>> the page is already allocated, and the buffer_heads of the page
>>>>> are not mapped (BH_Mapped). (minor fault?)
>>>>>
>>>>> (1) user dirtys a page
>>>>> (2) a page fault occurs (do_page_fault)
>>>>> (3) do_wp_page is called.
>>>>> (4) ext4_page_mkwrite is called
>>>>> (5) ext4_write_begin is called
>>>>> (6) ext4_journal_start_sb =>  We can STOP!
>>
>> What happens in the case as follows:
>>
>> Task 1: Mmapped writes
>> t1)ext4_page_mkwrite()
>>    t2) ext4_write_begin() (FS is thawed so we proceed)
>>    t3) ext4_write_end() (journal is stopped now)
>> -----Pre-empted-----
>>
>>
>> Task 2: Freeze Task
>> t4) freezes the super block...
>> ...(continues)....
>> tn) the page cache is clean and the F.S is frozen. Freeze has
>> completed execution.
>>
>> Task 1: Mmapped writes
>> tn+1) ext4_page_mkwrite() returns 0.
>> tn+2) __do_fault() gets control, code gets executed.
>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>> a file based page which faulted.
>>
>> So you end up dirtying the page cache when the F.S is frozen? No?
>    You are right, ext4_page_mkwrite() as currently implemented has problems.
> You have to return the page locked (and check for frozen fs with page lock
> held) to avoid races.
>
> If you check for frozen fs with page lock held, you are guaranteed that
> freezing code must wait for the page to get unlocked before proceeding. And
> before the page is unlocked, it is marked dirty by the pagefault code which
> makes freezing code write the page and writeprotect it again. So everything
> will be safe.
For the locked page to be a part of the freeze initiated sync, should 
its owner inode not be dirtied? The page fault handler dirties the page, 
but who ensures that the inode is dirtied at this point?

Thanks!

Warm Regards,
Surbhi.



>
> Doing this cleanly requires some cleanups to ext4_page_mkwrite() (but
> stable pages during writeback need that as well so it's a reasonable thing
> to do). So something like attached patches should do what's needed - it's
> lightly tested with fsx in delalloc, nodelalloc, and data=journal configs.
>
> 								Honza


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-04  8:26                                               ` Surbhi Palande
@ 2011-05-04 14:30                                                 ` Eric Sandeen
  0 siblings, 0 replies; 121+ messages in thread
From: Eric Sandeen @ 2011-05-04 14:30 UTC (permalink / raw)
  To: surbhi.palande
  Cc: Jan Kara, Dave Chinner, Toshiyuki Okajima, Ted Ts'o,
	Masayoshi MIZUMA, Andreas Dilger, linux-ext4, linux-fsdevel,
	Christoph Hellwig

On 5/4/11 3:26 AM, Surbhi Palande wrote:
> On 05/03/2011 11:14 PM, Eric Sandeen wrote:
>> On 5/3/11 2:27 AM, Surbhi Palande wrote:

...

>>> Should this not be reverted? I think that its a lot easier to
>>> stop a transaction between a freeze and a thaw that way! If you
>>> agree, can I send a patch for the same?
>> Only if you want the kernel to start spewing "BUG!" messages
>> again...
>> 
>> -Eric
> But, then you need a much more complicated way to stop accepting the
> transactions and the writes between the freeze and the thaw? (in the
> write path and the read path)? Is this not much simpler?

I just cannot see how a solution which leads to:

>>      ================================================
>>      [ BUG: lock held when returning to user space! ]
>>      ------------------------------------------------
>>      lvcreate/1075 is leaving the kernel with locks still held!
>>      1 lock held by lvcreate/1075:
>>       #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>>      jbd2_journal_lock_updates+0xe1/0xf0


can be considered viable.

You are welcome to send the patch, and if other ext4 devs concur with it then I'll be outvoted. :)

-Eric


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-04 12:09                                           ` Surbhi Palande
@ 2011-05-04 19:19                                             ` Jan Kara
  2011-05-04 21:34                                               ` Surbhi Palande
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-04 19:19 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, sandeen

On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
> On 05/03/2011 06:19 PM, Jan Kara wrote:
> >On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> >>On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
> >>>(2011/04/16 2:13), Jan Kara wrote:
> >>>>Hello,
> >>>>
> >>>>On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>>>>For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>>>>function. But as I wrote to Dave Chinner, ->page_mkwrite() should
> >>>>>>probably
> >>>>>>get modified to block while minor-faulting the page on frozen fs
> >>>>>>because
> >>>>>>when blocks are already allocated we may skip starting a transaction
> >>>>>>and so
> >>>>>>we could possibly modify the filesystem.
> >>>>>OK. I think ->page_mkwrite() should also block writing the
> >>>>>minor-faulting pages.
> >>>>>
> >>>>>(minor-pagefault)
> >>>>>->  do_wp_page()
> >>>>>->  page_mkwrite(= ext4_mkwrite())
> >>>>>=>  BLOCK!
> >>>>>
> >>>>>(major-pagefault)
> >>>>>->  do_linear_fault()
> >>>>>->  page_mkwrite(= ext4_mkwrite())
> >>>>>=>  BLOCK!
> >>>>>
> >>>>>>
> >>>>>>>>>Mizuma-san's reproducer also writes the data which maps to the
> >>>>>>>>>file (mmap).
> >>>>>>>>>The original problem happens after the fsfreeze operation is done.
> >>>>>>>>>I understand the normal write operation (not mmap) can be blocked
> >>>>>>>>>while
> >>>>>>>>>fsfreezing. So, I guess we don't always block all the write
> >>>>>>>>>operation
> >>>>>>>>>while fsfreezing.
> >>>>>>>>Technically speaking, we block all the transaction starts which
> >>>>>>>>means we
> >>>>>>>>end up blocking all the writes from going to disk. But that does
> >>>>>>>>not mean
> >>>>>>>>we block all the writes from going to in-memory cache - as you
> >>>>>>>>properly
> >>>>>>>>note the mmap case is one of such exceptions.
> >>>>>>>Hm, I also think we can allow the writes to in-memory cache but we
> >>>>>>>can't allow
> >>>>>>>the writes to disk while fsfreezing. I am considering that mmap
> >>>>>>>path can
> >>>>>>>write to disk while fsfreezing because this deadlock problem
> >>>>>>>happens after
> >>>>>>>fsfreeze operation is done...
> >>>>>>I'm sorry I don't understand now - are you speaking about the case
> >>>>>>above
> >>>>>>when writepage() does not wait for filesystem being frozen or something
> >>>>>>else?
> >>>>>Sorry, I didn't understand around the page fault path.
> >>>>>So, I had read the kernel source code around it, then I maybe
> >>>>>understand...
> >>>>>
> >>>>>I worry whether we can update the file data in mmap case while
> >>>>>fsfreezing.
> >>>>>Of course, I understand that we can write to in-memory cache, and it
> >>>>>is not a
> >>>>>problem. However, if we can write to disk while fsfreezing, it is a
> >>>>>problem.
> >>>>>So, I summarize the cases whether we can write to disk or not.
> >>>>>
> >>>>>--------------------------------------------------------------------------
> >>>>>
> >>>>>Cases (Whether we can write the data mmapped to the file on the disk
> >>>>>while fsfreezing)
> >>>>>
> >>>>>[1] One of the page which has been mmapped is not bound. And
> >>>>>the page is not allocated yet. (major fault?)
> >>>>>
> >>>>>(1) user dirtys a page
> >>>>>(2) a page fault occurs (do_page_fault)
> >>>>>(3) __do_fault is called.
> >>>>>(4) ext4_page_mkwrite is called
> >>>>>(5) ext4_write_begin is called
> >>>>>(6) ext4_journal_start_sb =>  We can STOP!
> >>>>>
> >>>>>[2] One of the page which has been mmapped is not bound. But
> >>>>>the page is already allocated, and the buffer_heads of the page
> >>>>>are not mapped (BH_Mapped). (minor fault?)
> >>>>>
> >>>>>(1) user dirtys a page
> >>>>>(2) a page fault occurs (do_page_fault)
> >>>>>(3) do_wp_page is called.
> >>>>>(4) ext4_page_mkwrite is called
> >>>>>(5) ext4_write_begin is called
> >>>>>(6) ext4_journal_start_sb =>  We can STOP!
> >>
> >>What happens in the case as follows:
> >>
> >>Task 1: Mmapped writes
> >>t1)ext4_page_mkwrite()
> >>   t2) ext4_write_begin() (FS is thawed so we proceed)
> >>   t3) ext4_write_end() (journal is stopped now)
> >>-----Pre-empted-----
> >>
> >>
> >>Task 2: Freeze Task
> >>t4) freezes the super block...
> >>...(continues)....
> >>tn) the page cache is clean and the F.S is frozen. Freeze has
> >>completed execution.
> >>
> >>Task 1: Mmapped writes
> >>tn+1) ext4_page_mkwrite() returns 0.
> >>tn+2) __do_fault() gets control, code gets executed.
> >>tn+3) __do_fault() marks the page dirty if the intent is to write to
> >>a file based page which faulted.
> >>
> >>So you end up dirtying the page cache when the F.S is frozen? No?
> >   You are right, ext4_page_mkwrite() as currently implemented has problems.
> >You have to return the page locked (and check for frozen fs with page lock
> >held) to avoid races.
> >
> >If you check for frozen fs with page lock held, you are guaranteed that
> >freezing code must wait for the page to get unlocked before proceeding. And
> >before the page is unlocked, it is marked dirty by the pagefault code which
> >makes freezing code write the page and writeprotect it again. So everything
> >will be safe.
> For the locked page to be a part of the freeze initiated sync,
> should its owner inode not be dirtied? The page fault handler
> dirties the page, but who ensures that the inode is dirtied at this
> point?
  Follow the path from set_page_dirty() -> __set_page_dirty_buffers()
-> __set_page_dirty() -> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

  More code reading would save you (and me) some typing ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: your mail
  2011-05-03 15:43                                                 ` Surbhi Palande
@ 2011-05-04 19:24                                                   ` Jan Kara
  2011-05-06 15:20                                                     ` [RFC][PATCH] Do not accept a new handle when the F.S is frozen Surbhi Palande
  2011-05-06 15:20                                                     ` [PATCH] Adding support to freeze and unfreeze a journal Surbhi Palande
  0 siblings, 2 replies; 121+ messages in thread
From: Jan Kara @ 2011-05-04 19:24 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Jan Kara, toshi.okajima, tytso, m.mizuma, adilger.kernel,
	linux-ext4, linux-fsdevel, sandeen

On Tue 03-05-11 18:43:48, Surbhi Palande wrote:
> On 05/03/2011 06:36 PM, Jan Kara wrote:
> >On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
> >>On 05/03/2011 04:46 PM, Jan Kara wrote:
> >>>On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> >>
> >>Sorry for missing the subject line :(
> >>>>On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> >>>>Toshiyuki pointed out.
> >>>>
> >>>>zap_pte_range()
> >>>>   mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
> >>>>
> >>>>So, I think that it is here that we should do the checking for a ext4 F.S
> >>>>frozen state and also prevent a parallel ext4 F.S freeze from happening.
> >>>>
> >>>>Attaching a patch for initial review. Please do let me know your thoughts!
> >>>   This is definitely the wrong place. ->set_page_dirty() callbacks are
> >>>called with various locks held and the page need not be locked (thus
> >>>dereferencing page->mapping is oopsable). Moreover this particular callback
> >>>is called only in data=journal mode.
> >>Ok! Thanks for that!
> >>
> >>>
> >>>Believe me, the right place is page_mkwrite() - you have to catch the
> >>>read-only =>   read-write page transition. Once the page is mapped
> >>>read-write, you've already lost the race.
> >>
> >>My only point is:
> >>1) something should prevent the freeze from happening. We cant
> >>merely check the vfs_check_frozen()?
> >   Yes, I agree - see my other email with patches.
> >
> >>And this should be done where the page is marked dirty.Also, I
> >>thought that the page is marked read-write only in the page table in
> >>the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
> >>the page cache? Is this understanding right?
> >   The page can become dirty either because it was written via standard
> >write - write_begin is responsible for reliable check here - or it was
> >written via mmap - here we rely on page_mkwrite to do a reliable check -
> >it is analogous to write_begin callback. There should be no other way
> >to dirty a page.
> >
> >With dirty bits it is a bit complicated. We have two of them in fact. One
> >in page table entry maintained by mmu and one in page structure maintained
> >by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
> >from page table into struct page. This is a lazy process so page can in
> >principle have new data without a dirty bit set in struct page because we
> >have not yet copied the dirty bit from page table. Only at moments where it
> >is important (like when we want to unmap the page, or throw away the page,
> >or so), we make sure struct page and page table bits are in sync.
> >
> >Another subtle thing you need not be aware of it that when we clear page
> >dirty bit, we also writeprotect the page. So we are guaranteed to get a
> >page fault when the page is written to again.
> >
> >>IMHO, whatever code dirties the page in the page cache should call a
> >>F.S specific function and let it _prevent_ a fsfreeze while the page
> >>is getting dirtied, so that a freeze called after this point flushes
> >>this page!
> >   Agreed, that's what code in write_begin() and page_mkwrite() should
> >achieve.
> >								Honza
> Thanks a lot for the wonderful explanation :)
> 
> How about the revert : i.e calling  jbd2_journal_unlock_updates()
> from ext4_unfreeze() instead of the ext4_freeze()? Do you agree to
> that?
  Sorry, I don't agree with the revert. We could talk about changing
jbd2_journal_lock_updates() to not return with the mutex held (and handle
synchronization of locked journal operations differently) as an alternative
to doing "freeze" reference counting. But returning to user space with a
mutex held is a no-go. It causes problems in lockdep, violates kernel
locking rules, and is generally bad programming ;).
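
To make the reference-counting alternative concrete, here is a minimal
userspace sketch. It is only an illustration under stated assumptions: the
`toy_journal` type and all `toy_*` helpers are hypothetical names, pthreads
stand in for kernel primitives, and none of this is jbd2 code. The point it
shows is that every helper drops the lock before returning, so nothing stays
locked across a return to the caller, and new handles are held off by a
counter instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical toy journal; this is not the jbd2 journal_t. */
struct toy_journal {
    pthread_mutex_t lock;         /* held only inside the helpers below */
    pthread_cond_t  thawed;       /* broadcast when freeze_count hits 0 */
    int             freeze_count; /* number of outstanding freezes */
};

#define TOY_JOURNAL_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Freeze: bump the counter and return with no lock held. */
void toy_journal_freeze(struct toy_journal *j)
{
    pthread_mutex_lock(&j->lock);
    j->freeze_count++;
    pthread_mutex_unlock(&j->lock);
}

/* Thaw: drop one reference and wake waiters once fully thawed. */
void toy_journal_thaw(struct toy_journal *j)
{
    pthread_mutex_lock(&j->lock);
    assert(j->freeze_count > 0);  /* an unbalanced thaw is a caller bug */
    if (--j->freeze_count == 0)
        pthread_cond_broadcast(&j->thawed);
    pthread_mutex_unlock(&j->lock);
}

/* Blocking handle start: sleeps while any freeze is outstanding. */
void toy_start_handle(struct toy_journal *j)
{
    pthread_mutex_lock(&j->lock);
    while (j->freeze_count > 0)
        pthread_cond_wait(&j->thawed, &j->lock);
    pthread_mutex_unlock(&j->lock);
}

/* Non-blocking variant: reports whether a handle could start right now. */
bool toy_try_start_handle(struct toy_journal *j)
{
    pthread_mutex_lock(&j->lock);
    bool ok = (j->freeze_count == 0);
    pthread_mutex_unlock(&j->lock);
    return ok;
}
```

With a journal initialised from `TOY_JOURNAL_INIT`, two freezes followed by
one thaw still refuse new handles; only the second thaw lets them through.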

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-04 19:19                                             ` Jan Kara
@ 2011-05-04 21:34                                               ` Surbhi Palande
  2011-05-04 22:48                                                 ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-04 21:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, sandeen

On 05/04/2011 10:19 PM, Jan Kara wrote:
> On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
>> On 05/03/2011 06:19 PM, Jan Kara wrote:
>>> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>>>> On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
>>>>> (2011/04/16 2:13), Jan Kara wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>>>>>> probably
>>>>>>>> get modified to block while minor-faulting the page on frozen fs
>>>>>>>> because
>>>>>>>> when blocks are already allocated we may skip starting a transaction
>>>>>>>> and so
>>>>>>>> we could possibly modify the filesystem.
>>>>>>> OK. I think ->page_mkwrite() should also block writing the
>>>>>>> minor-faulting pages.
>>>>>>>
>>>>>>> (minor-pagefault)
>>>>>>> ->   do_wp_page()
>>>>>>> ->   page_mkwrite(= ext4_mkwrite())
>>>>>>> =>   BLOCK!
>>>>>>>
>>>>>>> (major-pagefault)
>>>>>>> ->   do_linear_fault()
>>>>>>> ->   page_mkwrite(= ext4_mkwrite())
>>>>>>> =>   BLOCK!
>>>>>>>
>>>>>>>>
>>>>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>>>>>> file (mmap).
>>>>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>>>>>> while
>>>>>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>>>>>> operation
>>>>>>>>>>> while fsfreezing.
>>>>>>>>>> Technically speaking, we block all the transaction starts which
>>>>>>>>>> means we
>>>>>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>>>>>> not mean
>>>>>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>>>>>> properly
>>>>>>>>>> note the mmap case is one of such exceptions.
>>>>>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>>>>>> can't allow
>>>>>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>>>>>> path can
>>>>>>>>> write to disk while fsfreezing because this deadlock problem
>>>>>>>>> happens after
>>>>>>>>> fsfreeze operation is done...
>>>>>>>> I'm sorry I don't understand now - are you speaking about the case
>>>>>>>> above
>>>>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>>>>> else?
>>>>>>> Sorry, I didn't understand around the page fault path.
>>>>>>> So, I had read the kernel source code around it, then I maybe
>>>>>>> understand...
>>>>>>>
>>>>>>> I worry whether we can update the file data in mmap case while
>>>>>>> fsfreezing.
>>>>>>> Of course, I understand that we can write to in-memory cache, and it
>>>>>>> is not a
>>>>>>> problem. However, if we can write to disk while fsfreezing, it is a
>>>>>>> problem.
>>>>>>> So, I summarize the cases whether we can write to disk or not.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>>>>> while fsfreezing)
>>>>>>>
>>>>>>> [1] One of the page which has been mmapped is not bound. And
>>>>>>> the page is not allocated yet. (major fault?)
>>>>>>>
>>>>>>> (1) user dirtys a page
>>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>>> (3) __do_fault is called.
>>>>>>> (4) ext4_page_mkwrite is called
>>>>>>> (5) ext4_write_begin is called
>>>>>>> (6) ext4_journal_start_sb =>   We can STOP!
>>>>>>>
>>>>>>> [2] One of the page which has been mmapped is not bound. But
>>>>>>> the page is already allocated, and the buffer_heads of the page
>>>>>>> are not mapped (BH_Mapped). (minor fault?)
>>>>>>>
>>>>>>> (1) user dirtys a page
>>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>>> (3) do_wp_page is called.
>>>>>>> (4) ext4_page_mkwrite is called
>>>>>>> (5) ext4_write_begin is called
>>>>>>> (6) ext4_journal_start_sb =>   We can STOP!
>>>>
>>>> What happens in the case as follows:
>>>>
>>>> Task 1: Mmapped writes
>>>> t1)ext4_page_mkwrite()
>>>>    t2) ext4_write_begin() (FS is thawed so we proceed)
>>>>    t3) ext4_write_end() (journal is stopped now)
>>>> -----Pre-empted-----
>>>>
>>>>
>>>> Task 2: Freeze Task
>>>> t4) freezes the super block...
>>>> ...(continues)....
>>>> tn) the page cache is clean and the F.S is frozen. Freeze has
>>>> completed execution.
>>>>
>>>> Task 1: Mmapped writes
>>>> tn+1) ext4_page_mkwrite() returns 0.
>>>> tn+2) __do_fault() gets control, code gets executed.
>>>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>>>> a file based page which faulted.
>>>>
>>>> So you end up dirtying the page cache when the F.S is frozen? No?
>>>    You are right, ext4_page_mkwrite() as currently implemented has problems.
>>> You have to return the page locked (and check for frozen fs with page lock
>>> held) to avoid races.
>>>
>>> If you check for frozen fs with page lock held, you are guaranteed that
>>> freezing code must wait for the page to get unlocked before proceeding. And
>>> before the page is unlocked, it is marked dirty by the pagefault code which
>>> makes freezing code write the page and writeprotect it again. So everything
>>> will be safe.
>> For the locked page to be a part of the freeze initiated sync,
>> should its owner inode not be dirtied? The page fault handler
>> dirties the page, but who ensures that the inode is dirtied at this
>> point?
Well, I mean it as follows:

Doesn't the writeback code (invoked via sync_filesystem(sb)) write all 
the dirty pages of all the _dirty_ inodes of a superblock?

So in the window from the point where ext4_page_mkwrite() returns to
__do_fault() _till_ the inode is marked dirty (in __mark_inode_dirty()),
there can be a race with freeze. That is, if a freeze happens meanwhile,
the sync initiated by the freeze will not consider this locked page, as
the owner inode is _clean_ (not dirtied yet) at that point?

Key: tx: time at unit x

P1: mmapped writes
t1) __do_page_fault()
    t2) ext4_page_mkwrite()
       // owner inode of the page is in _clean_ state - not yet dirtied
    --- pre-empted---

P2: Freeze_super
tn) freeze_super gets control
freezes the F.S, skips the owner inode as it is in the clean state. 
syncs all the other dirty inodes. page cache is now clean.


P1: mmapped writes (resume)
tn+x)__do_page_fault() gets control back:
    tn+x+1) set_page_dirty()
      tn+x+2) __set_page_dirty_buffers()
         tn+x+3) __set_page_dirty()
  	   tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)

So don't we end up dirtying the page cache when the F.S is frozen?
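
The interleaving above can be replayed in a small userspace model. This is
only a sketch under assumptions: the `toy_*` types and functions are
hypothetical stand-ins, each step is treated as atomic, and the page lock is
not modelled. It shows the check-then-dirty ordering letting a dirty page
appear after the freeze has completed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; these are not the kernel's super_block or inode. */
struct toy_sb    { bool frozen; };
struct toy_inode { bool dirty; bool page_dirty; };

/* Models the freeze: mark the fs frozen, then sync only dirty inodes. */
void toy_freeze(struct toy_sb *sb, struct toy_inode *inode)
{
    assert(!sb->frozen);            /* nested freezes are not modelled */
    sb->frozen = true;
    if (inode->dirty) {             /* a clean inode is skipped entirely */
        inode->page_dirty = false;  /* "write back" its page */
        inode->dirty = false;
    }
}

/* Models set_page_dirty(): dirties the page and, through it, the inode. */
void toy_set_page_dirty(struct toy_inode *inode)
{
    inode->page_dirty = true;
    inode->dirty = true;
}

/* Replays P1/P2; returns true if a dirty page exists on a frozen fs. */
bool toy_replay_race(void)
{
    struct toy_sb sb = { false };
    struct toy_inode inode = { false, false };

    /* P1: page_mkwrite checks the frozen state, inode is still clean */
    bool admitted = !sb.frozen;

    /* P1 is pre-empted; P2 runs the whole freeze and skips the inode */
    toy_freeze(&sb, &inode);

    /* P1 resumes: the fault handler dirties the page */
    if (admitted)
        toy_set_page_dirty(&inode);

    return sb.frozen && inode.page_dirty;
}
```

In this model `toy_replay_race()` reports a dirty page on a frozen
filesystem, which is the outcome the timeline describes.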

Again, apologies if I understood the writeback code or something else wrong!

Warm Regards,
Surbhi.

>    Follow the path from set_page_dirty() ->  __set_page_dirty_buffers()
> ->  __set_page_dirty() ->  __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);


>
>    More code reading would save you (and me) some typing ;).
P.S.: Sorry about that!

>
> 								Honza


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-04 21:34                                               ` Surbhi Palande
@ 2011-05-04 22:48                                                 ` Jan Kara
  2011-05-05  6:06                                                   ` Surbhi Palande
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-04 22:48 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, sandeen

On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
> On 05/04/2011 10:19 PM, Jan Kara wrote:
> >On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
> >>On 05/03/2011 06:19 PM, Jan Kara wrote:
> >>>On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> >>>>What happens in the case as follows:
> >>>>
> >>>>Task 1: Mmapped writes
> >>>>t1)ext4_page_mkwrite()
> >>>>   t2) ext4_write_begin() (FS is thawed so we proceed)
> >>>>   t3) ext4_write_end() (journal is stopped now)
> >>>>-----Pre-empted-----
> >>>>
> >>>>
> >>>>Task 2: Freeze Task
> >>>>t4) freezes the super block...
> >>>>...(continues)....
> >>>>tn) the page cache is clean and the F.S is frozen. Freeze has
> >>>>completed execution.
> >>>>
> >>>>Task 1: Mmapped writes
> >>>>tn+1) ext4_page_mkwrite() returns 0.
> >>>>tn+2) __do_fault() gets control, code gets executed.
> >>>>tn+3) __do_fault() marks the page dirty if the intent is to write to
> >>>>a file based page which faulted.
> >>>>
> >>>>So you end up dirtying the page cache when the F.S is frozen? No?
> >>>   You are right, ext4_page_mkwrite() as currently implemented has problems.
> >>>You have to return the page locked (and check for frozen fs with page lock
> >>>held) to avoid races.
> >>>
> >>>If you check for frozen fs with page lock held, you are guaranteed that
> >>>freezing code must wait for the page to get unlocked before proceeding. And
> >>>before the page is unlocked, it is marked dirty by the pagefault code which
> >>>makes freezing code write the page and writeprotect it again. So everything
> >>>will be safe.
> >>For the locked page to be a part of the freeze initiated sync,
> >>should its owner inode not be dirtied? The page fault handler
> >>dirties the page, but who ensures that the inode is dirtied at this
> >>point?
> Well, I mean it as follows:
> 
> Doesn't the writeback code (invoked via sync_filesystem(sb)) write
> all the dirty pages of all the _dirty_ inodes of a superblock?
> 
> So in the window from the point where ext4_page_mkwrite returns to
> __do_fault() _till_ you mark the inode dirty (in
> __mark_inode_dirty()), you can have a race with freeze i.e if freeze
> happens meanwhile, then the sync initiated by freeze will not
> consider this locked page as the owner inode is _clean_ (or not
> dirtied yet) at that point?
  Ah, I see. That's actually a good point! Thanks for your persistence. So
we should also dirty the page before checking for frozen fs.
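
A toy model of that ordering (dirty first, then check the frozen state) can
be checked exhaustively in userspace. Everything here is an assumption made
for illustration: the `toy_*` names are hypothetical, each step is atomic,
the freeze is reduced to "set the frozen flag, then sync dirty inodes", and
the page lock that makes the sync wait for the fault is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state shared by the faulting task (P1) and the freezer (P2). */
struct toy_state {
    bool frozen;
    bool inode_dirty;
    bool page_dirty;
    bool write_admitted;  /* P1's frozen check passed */
};

/* P1, proposed order: step 0 dirties the page, step 1 checks frozen. */
void toy_p1_step(struct toy_state *s, int n)
{
    if (n == 0) {
        s->page_dirty = true;
        s->inode_dirty = true;
    } else {
        s->write_admitted = !s->frozen;
    }
}

/* P2: step 0 sets the frozen flag, step 1 syncs dirty inodes. */
void toy_p2_step(struct toy_state *s, int n)
{
    if (n == 0) {
        s->frozen = true;
    } else if (s->inode_dirty) {
        s->page_dirty = false;
        s->inode_dirty = false;
    }
}

/*
 * Tries all 6 interleavings of P1's two steps with P2's two steps and
 * counts those where a write was admitted yet its page is still dirty
 * once the freeze has finished.
 */
int toy_count_bad_interleavings(void)
{
    int bad = 0;

    for (unsigned mask = 0; mask < 16; mask++) {
        int bits = 0;
        for (int b = 0; b < 4; b++)
            bits += (mask >> b) & 1u;
        if (bits != 2)
            continue;  /* P1 must occupy exactly two of the four slots */

        struct toy_state s = { false, false, false, false };
        int p1 = 0, p2 = 0;
        for (int slot = 0; slot < 4; slot++) {
            if (mask & (1u << slot))
                toy_p1_step(&s, p1++);
            else
                toy_p2_step(&s, p2++);
        }
        assert(p1 == 2 && p2 == 2);

        if (s.write_admitted && s.page_dirty)
            bad++;
    }
    return bad;
}
```

In this model a write is admitted only when both P1 steps run before the
freeze starts, and in that case the sync sees the dirty inode, so
`toy_count_bad_interleavings()` finds no bad interleaving.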

> Key: tx: time at unit x
> 
> P1: mmapped writes
> t1) __do_page_fault()
>    t2) ext4_page_mkwrite()
>       // owner inode of the page is in _clean_ state - not yet dirtied
>    --- pre-empted---
> 
> P2: Freeze_super
> tn) freeze_super gets control
> freezes the F.S, skips the owner inode as it is in the clean state.
> syncs all the other dirty inodes. page cache is now clean.
> 
> 
> P1: mmapped writes (resume)
> tn+x)__do_page_fault() gets control back:
>    tn+x+1) set_page_dirty()
>      tn+x+2) __set_page_dirty_buffers()
>         tn+x+3) __set_page_dirty()
>  	   tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)
> 
> So don't we end up dirtying the page cache when the F.S is frozen?
> 
> Again, apologies if I understood the writeback code or something else wrong!
  No, you understood it right. Your previous email was just too generic, so
I had not thought about this particular race.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-04 22:48                                                 ` Jan Kara
@ 2011-05-05  6:06                                                   ` Surbhi Palande
  2011-05-05 11:18                                                     ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-05  6:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, sandeen

On 05/05/2011 01:48 AM, Jan Kara wrote:
> On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
>> On 05/04/2011 10:19 PM, Jan Kara wrote:
>>> On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
>>>> On 05/03/2011 06:19 PM, Jan Kara wrote:
>>>>> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>>>>>> What happens in the case as follows:
>>>>>>
>>>>>> Task 1: Mmapped writes
>>>>>> t1)ext4_page_mkwrite()
>>>>>>    t2) ext4_write_begin() (FS is thawed so we proceed)
>>>>>>    t3) ext4_write_end() (journal is stopped now)
>>>>>> -----Pre-empted-----
>>>>>>
>>>>>>
>>>>>> Task 2: Freeze Task
>>>>>> t4) freezes the super block...
>>>>>> ...(continues)....
>>>>>> tn) the page cache is clean and the F.S is frozen. Freeze has
>>>>>> completed execution.
>>>>>>
>>>>>> Task 1: Mmapped writes
>>>>>> tn+1) ext4_page_mkwrite() returns 0.
>>>>>> tn+2) __do_fault() gets control, code gets executed.
>>>>>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>>>>>> a file based page which faulted.
>>>>>>
>>>>>> So you end up dirtying the page cache when the F.S is frozen? No?
>>>>>    You are right, ext4_page_mkwrite() as currently implemented has problems.
>>>>> You have to return the page locked (and check for frozen fs with page lock
>>>>> held) to avoid races.
>>>>>
>>>>> If you check for frozen fs with page lock held, you are guaranteed that
>>>>> freezing code must wait for the page to get unlocked before proceeding. And
>>>>> before the page is unlocked, it is marked dirty by the pagefault code which
>>>>> makes freezing code write the page and writeprotect it again. So everything
>>>>> will be safe.
>>>> For the locked page to be a part of the freeze initiated sync,
>>>> should its owner inode not be dirtied? The page fault handler
>>>> dirties the page, but who ensures that the inode is dirtied at this
>>>> point?
>> Well, I mean it as follows:
>>
>> Doesn't the writeback code (invoked via sync_filesystem(sb)) write
>> all the dirty pages of all the _dirty_ inodes of a superblock?
>>
>> So in the window from the point where ext4_page_mkwrite returns to
>> __do_fault() _till_ you mark the inode dirty (in
>> __mark_inode_dirty()), you can have a race with freeze i.e if freeze
>> happens meanwhile, then the sync initiated by freeze will not
>> consider this locked page as the owner inode is _clean_ (or not
>> dirtied yet) at that point?
>    Ah, I see. That's actually a good point! Thanks for persistence. So we
> should also dirty the page before checking for frozen fs.

Should we not also dirty the inode? IMHO, marking an inode will be racy 
as well!

Warm Regards,
Surbhi.

>
>> Key: tx: time at unit x
>>
>> P1: mmapped writes
>> t1) __do_page_fault()
>>     t2) ext4_page_mkwrite()
>>        // owner inode of the page is in _clean_ state - not yet dirtied
>>     --- pre-empted---
>>
>> P2: Freeze_super
>> tn) freeze_super gets control
>> freezes the F.S, skips the owner inode as it is in the clean state.
>> syncs all the other dirty inodes. page cache is now clean.
>>
>>
>> P1: mmapped writes (resume)
>> tn+x)__do_page_fault() gets control back:
>>     tn+x+1) set_page_dirty()
>>       tn+x+2) __set_page_dirty_buffers()
>>          tn+x+3) __set_page_dirty()
>>   	   tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)
>>
>> So don't we end up dirtying the page cache when the F.S is frozen?
>>
>> Again, apologies if I understood the writeback code or something else wrong!
>    No, you understood it right. Just your previous email was too generic so
> I have not thought about this particular race.
>
> 									Honza



* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-05  6:06                                                   ` Surbhi Palande
@ 2011-05-05 11:18                                                     ` Jan Kara
  2011-05-05 14:01                                                       ` Surbhi Palande
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-05 11:18 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Jan Kara, Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, sandeen

On Thu 05-05-11 09:06:29, Surbhi Palande wrote:
> On 05/05/2011 01:48 AM, Jan Kara wrote:
> >On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
> >>On 05/04/2011 10:19 PM, Jan Kara wrote:
> >>>On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
> >>>>On 05/03/2011 06:19 PM, Jan Kara wrote:
> >>>>>On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> >>>>>>What happens in the case as follows:
> >>>>>>
> >>>>>>Task 1: Mmapped writes
> >>>>>>t1)ext4_page_mkwrite()
> >>>>>>   t2) ext4_write_begin() (FS is thawed so we proceed)
> >>>>>>   t3) ext4_write_end() (journal is stopped now)
> >>>>>>-----Pre-empted-----
> >>>>>>
> >>>>>>
> >>>>>>Task 2: Freeze Task
> >>>>>>t4) freezes the super block...
> >>>>>>...(continues)....
> >>>>>>tn) the page cache is clean and the F.S is frozen. Freeze has
> >>>>>>completed execution.
> >>>>>>
> >>>>>>Task 1: Mmapped writes
> >>>>>>tn+1) ext4_page_mkwrite() returns 0.
> >>>>>>tn+2) __do_fault() gets control, code gets executed.
> >>>>>>tn+3) __do_fault() marks the page dirty if the intent is to write to
> >>>>>>a file based page which faulted.
> >>>>>>
> >>>>>>So you end up dirtying the page cache when the F.S is frozen? No?
> >>>>>   You are right ext4_page_mkwrite() as currently implemented has problems.
> >>>>>You have to return the page locked (and check for frozen fs with page lock
> >>>>>held) to avoid races.
> >>>>>
> >>>>>If you check for frozen fs with page lock held, you are guaranteed that
> >>>>>freezing code must wait for the page to get unlocked before proceeding. And
> >>>>>before the page is unlocked, it is marked dirty by the pagefault code which
> >>>>>makes freezing code write the page and writeprotect it again. So everything
> >>>>>will be safe.
> >>>>For the locked page to be a part of the freeze initiated sync,
> >>>>should its owner inode not be dirtied? The page fault handler
> >>>>dirties the page, but who ensures that the inode is dirtied at this
> >>>>point?
> >>Well, I mean it as follows:
> >>
> >>Doesn't the writeback code (invoked via sync_filesystem(sb)) write
> >>all the dirty pages of all the _dirty_ inodes of a superblock?
> >>
> >>So in the window from the point where ext4_page_mkwrite returns to
> >>__do_fault() _till_ you mark the inode dirty (in
> >>__mark_inode_dirty()), you can have a race with freeze i.e if freeze
> >>happens meanwhile, then the sync initiated by freeze will not
> >>consider this locked page as the owner inode is _clean_ (or not
> >>dirtied yet) at that point?
> >   Ah, I see. That's actually a good point! Thanks for persistence. So we
> >should also dirty the page before checking for frozen fs.
> 
> Should we not also dirty the inode? IMHO, marking an inode will be
> racy as well!
  Marking the page dirty marks the inode dirty as well as I've explained in my
previous emails. So I'm missing what you are concerned about...
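
To make the point concrete, here is a toy userspace model (a sketch only, not
kernel code). The names toy_inode, toy_page, toy_set_page_dirty(),
toy_dirty_page_only() and toy_sync() are made up for illustration. It assumes
what is described above: dirtying a page also dirties its owning inode, and
the sync run by freeze only writes pages that belong to dirty inodes.
toy_dirty_page_only() is a purely hypothetical path that dirties the page
without the inode, to show what such a sync would miss.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins, NOT the kernel's struct inode / struct page. */
struct toy_inode {
	bool dirty;			/* inode would be visited by sync */
};

struct toy_page {
	struct toy_inode *host;		/* owning inode */
	bool dirty;
};

/* Models the behaviour described above: page dirty implies inode dirty. */
void toy_set_page_dirty(struct toy_page *page)
{
	assert(page->host != NULL);
	page->dirty = true;
	page->host->dirty = true;
}

/* Hypothetical broken path: the page is dirtied but its inode is not. */
void toy_dirty_page_only(struct toy_page *page)
{
	page->dirty = true;
}

/*
 * Models the sync done at freeze time: only pages of dirty inodes are
 * written back. Returns the number of pages written.
 */
int toy_sync(struct toy_page *pages, size_t n)
{
	int written = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (pages[i].host->dirty && pages[i].dirty) {
			pages[i].dirty = false;	/* "written back" */
			written++;
		}
	}
	for (i = 0; i < n; i++)
		pages[i].host->dirty = false;	/* inodes are clean again */
	return written;
}
```

In this model a page dirtied through toy_set_page_dirty() is always picked up
by toy_sync(), while one dirtied through toy_dirty_page_only() stays dirty
after the sync. That is the situation the question was worried about.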

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-05-05 11:18                                                     ` Jan Kara
@ 2011-05-05 14:01                                                       ` Surbhi Palande
  0 siblings, 0 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-05 14:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Toshiyuki Okajima, Ted Ts'o, Masayoshi MIZUMA,
	Andreas Dilger, linux-ext4, linux-fsdevel, sandeen

On 05/05/2011 02:18 PM, Jan Kara wrote:
> On Thu 05-05-11 09:06:29, Surbhi Palande wrote:
>> On 05/05/2011 01:48 AM, Jan Kara wrote:
>>> On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
>>>> On 05/04/2011 10:19 PM, Jan Kara wrote:
>>>>> On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
>>>>>> On 05/03/2011 06:19 PM, Jan Kara wrote:
>>>>>>> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>>>>>>>> What happens in the case as follows:
>>>>>>>>
>>>>>>>> Task 1: Mmapped writes
>>>>>>>> t1)ext4_page_mkwrite()
>>>>>>>>    t2) ext4_write_begin() (FS is thawed so we proceed)
>>>>>>>>    t3) ext4_write_end() (journal is stopped now)
>>>>>>>> -----Pre-empted-----
>>>>>>>>
>>>>>>>>
>>>>>>>> Task 2: Freeze Task
>>>>>>>> t4) freezes the super block...
>>>>>>>> ...(continues)....
>>>>>>>> tn) the page cache is clean and the F.S is frozen. Freeze has
>>>>>>>> completed execution.
>>>>>>>>
>>>>>>>> Task 1: Mmapped writes
>>>>>>>> tn+1) ext4_page_mkwrite() returns 0.
>>>>>>>> tn+2) __do_fault() gets control, code gets executed.
>>>>>>>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>>>>>>>> a file based page which faulted.
>>>>>>>>
>>>>>>>> So you end up dirtying the page cache when the F.S is frozen? No?
>>>>>>>    You are right ext4_page_mkwrite() as currently implemented has problems.
>>>>>>> You have to return the page locked (and check for frozen fs with page lock
>>>>>>> held) to avoid races.
>>>>>>>
>>>>>>> If you check for frozen fs with page lock held, you are guaranteed that
>>>>>>> freezing code must wait for the page to get unlocked before proceeding. And
>>>>>>> before the page is unlocked, it is marked dirty by the pagefault code which
>>>>>>> makes freezing code write the page and writeprotect it again. So everything
>>>>>>> will be safe.
>>>>>> For the locked page to be a part of the freeze initiated sync,
>>>>>> should its owner inode not be dirtied? The page fault handler
>>>>>> dirties the page, but who ensures that the inode is dirtied at this
>>>>>> point?
>>>> Well, I mean it as follows:
>>>>
>>>> Doesn't the writeback code (invoked via sync_filesystem(sb)) write
>>>> all the dirty pages of all the _dirty_ inodes of a superblock?
>>>>
>>>> So in the window from the point where ext4_page_mkwrite returns to
>>>> __do_fault() _till_ you mark the inode dirty (in
>>>> __mark_inode_dirty()), you can have a race with freeze i.e if freeze
>>>> happens meanwhile, then the sync initiated by freeze will not
>>>> consider this locked page as the owner inode is _clean_ (or not
>>>> dirtied yet) at that point?
>>>    Ah, I see. That's actually a good point! Thanks for persistence. So we
>>> should also dirty the page before checking for frozen fs.
>>
>> Should we not also dirty the inode? IMHO, marking an inode will be
>> racy as well!
>    Marking the page dirty marks the inode dirty as well as I've explained in my
> previous emails. So I'm missing what you are concerned about...

Yes, you are right! There is no other concern: only setting the page dirty
is racy.

Warm Regards,
Surbhi.


* [RFC][PATCH] Do not accept a new handle when the F.S is frozen
  2011-05-04 19:24                                                   ` Jan Kara
@ 2011-05-06 15:20                                                     ` Surbhi Palande
  2011-05-06 15:20                                                     ` [PATCH] Adding support to freeze and unfreeze a journal Surbhi Palande
  1 sibling, 0 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-06 15:20 UTC (permalink / raw)
  To: jack
  Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen

This patch adds a flag that indicates whether the journal is frozen and
introduces two new functions, jbd2_journal_freeze() and jbd2_journal_thaw(),
which should be called when the F.S freezes and thaws respectively.
A new handle can now be started only when the barrier count is 0 and the
journal is not frozen. While the journal is frozen, a process trying to start
a new handle is put on a wait queue. Thawing the journal wakes up all the
processes waiting on this wait queue.

I have lightly tested this patch. Sending it here for initial review. Please
do let me know your inputs. Thanks a lot!

Warm Regards,
Surbhi.




* [PATCH] Adding support to freeze and unfreeze a journal
  2011-05-04 19:24                                                   ` Jan Kara
  2011-05-06 15:20                                                     ` [RFC][PATCH] Do not accept a new handle when the F.S is frozen Surbhi Palande
@ 2011-05-06 15:20                                                     ` Surbhi Palande
  2011-05-06 20:56                                                       ` Andreas Dilger
  1 sibling, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-06 15:20 UTC (permalink / raw)
  To: jack
  Cc: toshi.okajima, tytso, m.mizuma, adilger.kernel, linux-ext4,
	linux-fsdevel, sandeen

The journal should be frozen when a F.S freezes. This means that until
the F.S is thawed again, no new transactions should be accepted by the
journal. When the F.S thaws, it should in turn thaw the journal, which
allows the journal to resume accepting new transactions.
While the F.S has frozen the journal, clients of the journal that call
jbd2_journal_start() will sleep on a wait queue. Thawing the journal wakes
up the sleeping clients and journalling can progress normally.

Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
---
 fs/ext4/super.c       |   20 ++++++--------------
 fs/jbd2/journal.c     |    1 +
 fs/jbd2/transaction.c |   36 ++++++++++++++++++++++++++++++++++++
 include/linux/jbd2.h  |    9 +++++++++
 4 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..796aa4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
 
 	journal = EXT4_SB(sb)->s_journal;
 
-	/* Now we set up the journal barrier. */
-	jbd2_journal_lock_updates(journal);
-
+	error = jbd2_journal_freeze(journal);
 	/*
-	 * Don't clear the needs_recovery flag if we failed to flush
+	 * Don't clear the needs_recovery flag if we failed to freeze
 	 * the journal.
 	 */
-	error = jbd2_journal_flush(journal);
-	if (error < 0)
-		goto out;
-
-	/* Journal blocked and flushed, clear needs_recovery flag. */
-	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
-	error = ext4_commit_super(sb, 1);
-out:
-	/* we rely on s_frozen to stop further updates */
-	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+	if (error >= 0) {
+		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+		error = ext4_commit_super(sb, 1);
+	}
 	return error;
 }
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..5e46333 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
 	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_frozen);
 	mutex_init(&journal->j_barrier);
 	mutex_init(&journal->j_checkpoint_mutex);
 	spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..ad5a5df 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -171,6 +171,13 @@ repeat:
 				journal->j_barrier_count == 0);
 		goto repeat;
 	}
+	/* dont let a new handle start when a journal is frozen.
+	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+	 * the jflags indicate that the journal is frozen. So if the
+	 * j_barrier_count is 0, then check if this was made 0 by the freezing
+	 * process
+	 */
+	jbd2_check_frozen(journal);
 
 	if (!journal->j_running_transaction) {
 		read_unlock(&journal->j_state_lock);
@@ -489,6 +496,35 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
 }
 EXPORT_SYMBOL(jbd2_journal_restart);
 
+int jbd2_journal_freeze(journal_t *journal)
+{
+	int error = 0;
+	/* Now we set up the journal barrier. */
+	jbd2_journal_lock_updates(journal);
+
+	/*
+	 * Don't clear the needs_recovery flag if we failed to flush
+	 * the journal.
+	 */
+	error = jbd2_journal_flush(journal);
+	if (error >= 0) {
+		write_lock(&journal->j_state_lock);
+		journal->j_flags = journal->j_flags | JBD2_FROZEN;
+		write_unlock(&journal->j_state_lock);
+	}
+	jbd2_journal_unlock_updates(journal);
+	return error;
+}
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+	write_lock(&journal->j_state_lock);
+	journal->j_flags &= ~JBD2_FROZEN;
+	write_unlock(&journal->j_state_lock);
+	wake_up(&journal->j_wait_frozen);
+}
+
+
 /**
  * void jbd2_journal_lock_updates () - establish a transaction barrier.
  * @journal:  Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..31d6c23 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -835,6 +835,9 @@ struct journal_s
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for journal to thaw*/
+	wait_queue_head_t	j_wait_frozen;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct mutex		j_checkpoint_mutex;
 
@@ -1013,7 +1016,11 @@ struct journal_s
 #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
 						 * data write error in ordered
 						 * mode */
+#define JBD2_FROZEN	0x080   /* Journal thread is frozen as the filesystem is frozen */
+
 
+#define jbd2_check_frozen(journal)	\
+		wait_event(journal->j_wait_frozen, ((journal->j_flags & JBD2_FROZEN) != JBD2_FROZEN))
 /*
  * Function declarations for the journaling transaction and buffer
  * management
@@ -1121,6 +1128,8 @@ extern void	 jbd2_journal_invalidatepage(journal_t *,
 				struct page *, unsigned long);
 extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
 extern int	 jbd2_journal_stop(handle_t *);
+extern int	 jbd2_journal_freeze(journal_t *);
+extern void	 jbd2_journal_thaw(journal_t *);
 extern int	 jbd2_journal_flush (journal_t *);
 extern void	 jbd2_journal_lock_updates (journal_t *);
 extern void	 jbd2_journal_unlock_updates (journal_t *);
-- 
1.7.1



* Re: [PATCH] Adding support to freeze and unfreeze a journal
  2011-05-06 15:20                                                     ` [PATCH] Adding support to freeze and unfreeze a journal Surbhi Palande
@ 2011-05-06 20:56                                                       ` Andreas Dilger
  2011-05-07 20:04                                                         ` [PATCH v2] " Surbhi Palande
  0 siblings, 1 reply; 121+ messages in thread
From: Andreas Dilger @ 2011-05-06 20:56 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: jack, toshi.okajima, tytso, m.mizuma, linux-ext4, linux-fsdevel, sandeen

On May 6, 2011, at 09:20, Surbhi Palande wrote:
> The journal should be frozen when a F.S freezes. What this means is that till
> the F.S is thawed again, no new transactions should be accepted by the
> journal. When the F.S thaws, inturn it should thaw the journal and this should
> allow the journal to resume accepting new transactions.
> While the F.S has frozen the journal, the clients of journal on calling
> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
> up the sleeping clients and journalling can progress normally.
> 
> Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>

I think this is not a good patch as-is, see below.

> ---
> fs/ext4/super.c       |   20 ++++++--------------
> fs/jbd2/journal.c     |    1 +
> fs/jbd2/transaction.c |   36 ++++++++++++++++++++++++++++++++++++
> include/linux/jbd2.h  |    9 +++++++++
> 4 files changed, 52 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8553dfb..796aa4c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
> 
> 	journal = EXT4_SB(sb)->s_journal;
> 
> -	/* Now we set up the journal barrier. */
> -	jbd2_journal_lock_updates(journal);
> -
> +	error = jbd2_journal_freeze(journal);
> 	/*
> -	 * Don't clear the needs_recovery flag if we failed to flush
> +	 * Don't clear the needs_recovery flag if we failed to freeze
> 	 * the journal.
> 	 */
> -	error = jbd2_journal_flush(journal);
> -	if (error < 0)
> -		goto out;
> -
> -	/* Journal blocked and flushed, clear needs_recovery flag. */
> -	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> -	error = ext4_commit_super(sb, 1);
> -out:
> -	/* we rely on s_frozen to stop further updates */
> -	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> +	if (error >= 0) {
> +		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> +		error = ext4_commit_super(sb, 1);
> +	}
> 	return error;
> }
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0ec3db..5e46333 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
> 	init_waitqueue_head(&journal->j_wait_checkpoint);
> 	init_waitqueue_head(&journal->j_wait_commit);
> 	init_waitqueue_head(&journal->j_wait_updates);
> +	init_waitqueue_head(&journal->j_wait_frozen);
> 	mutex_init(&journal->j_barrier);
> 	mutex_init(&journal->j_checkpoint_mutex);
> 	spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 05fa77a..ad5a5df 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -171,6 +171,13 @@ repeat:
> 				journal->j_barrier_count == 0);
> 		goto repeat;
> 	}
> +	/* dont let a new handle start when a journal is frozen.
> +	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> +	 * the jflags indicate that the journal is frozen. So if the
> +	 * j_barrier_count is 0, then check if this was made 0 by the freezing
> +	 * process
> +	 */
> +	jbd2_check_frozen(journal);

This is called with read_lock(&journal->j_state_lock) held, so it seems that
the thread entering start_this_handle() will hold the j_state_lock the entire
time the journal is frozen.  But, see below in jbd2_journal_thaw()...

> 	if (!journal->j_running_transaction) {
> 		read_unlock(&journal->j_state_lock);
> @@ -489,6 +496,35 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
> }
> EXPORT_SYMBOL(jbd2_journal_restart);
> 
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> +	int error = 0;
> +	/* Now we set up the journal barrier. */
> +	jbd2_journal_lock_updates(journal);
> +
> +	/*
> +	 * Don't clear the needs_recovery flag if we failed to flush
> +	 * the journal.
> +	 */
> +	error = jbd2_journal_flush(journal);
> +	if (error >= 0) {
> +		write_lock(&journal->j_state_lock);
> +		journal->j_flags = journal->j_flags | JBD2_FROZEN;
> +		write_unlock(&journal->j_state_lock);
> +	}
> +	jbd2_journal_unlock_updates(journal);
> +	return error;
> +}
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> +	write_lock(&journal->j_state_lock);
> +	journal->j_flags = journal->j_flags &= ~JBD2_FROZEN;
> +	write_unlock(&journal->j_state_lock);

... this code needs to get a write lock j_state_lock in order to unfreeze the
journal.  It seems that this would deadlock as soon as you actually had some
situation where a transaction is being started while the journal is frozen?

> +	wake_up(&journal->j_wait_frozen);
> +}
> +
> +
> /**
>  * void jbd2_journal_lock_updates () - establish a transaction barrier.
>  * @journal:  Journal to establish a barrier on.
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index a32dcae..31d6c23 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -835,6 +835,9 @@ struct journal_s
> 	/* Wait queue to wait for updates to complete */
> 	wait_queue_head_t	j_wait_updates;
> 
> +	/* Wait queue to wait for journal to thaw*/
> +	wait_queue_head_t	j_wait_frozen;
> +
> 	/* Semaphore for locking against concurrent checkpoints */
> 	struct mutex		j_checkpoint_mutex;
> 
> @@ -1013,7 +1016,11 @@ struct journal_s
> #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
> 						 * data write error in ordered
> 						 * mode */
> +#define JBD2_FROZEN	0x080   /* Journal thread is frozen as the filesystem is frozen */
> +
> 
> +#define jbd2_check_frozen(journal)	\
> +		wait_event(journal->j_wait_frozen, ((journal->j_flags & JBD2_FROZEN) != JBD2_FROZEN))

It seems that what is needed here is to check the JBD2_FROZEN state under
lock (so that it is not racing with the flag being set) but drop the lock,
wait, and retry if the journal was actually frozen.  Since "goto label" is
ugly from within a macro, this probably needs to be open-coded at the caller
in start_this_handle(), something like:

	if (journal->j_flags & JBD2_FROZEN) {
		read_unlock(&journal->j_state_lock);
		wait_event(journal->j_wait_frozen,
			   !(journal->j_flags & JBD2_FROZEN));
		goto repeat;
	}

This opens the question of whether it is SMP safe to check j_flags without
any kind of barrier or lock held.  I think in this case we are fine, since
the above code jumps back to "repeat" to re-verify j_flags under
j_state_lock, so there is no race on the state.

>  * Function declarations for the journaling transaction and buffer
>  * management
> @@ -1121,6 +1128,8 @@ extern void	 jbd2_journal_invalidatepage(journal_t *,
> 				struct page *, unsigned long);
> extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
> extern int	 jbd2_journal_stop(handle_t *);
> +extern int	 jbd2_journal_freeze(journal_t *);
> +extern void	 jbd2_journal_thaw(journal_t *);
> extern int	 jbd2_journal_flush (journal_t *);
> extern void	 jbd2_journal_lock_updates (journal_t *);
> extern void	 jbd2_journal_unlock_updates (journal_t *);
> -- 
> 1.7.1
> 


Cheers, Andreas

* [PATCH v2] Adding support to freeze and unfreeze a journal
  2011-05-06 20:56                                                       ` Andreas Dilger
@ 2011-05-07 20:04                                                         ` Surbhi Palande
  2011-05-08  8:24                                                           ` Marco Stornelli
  2011-05-09  9:53                                                           ` Jan Kara
  0 siblings, 2 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-07 20:04 UTC (permalink / raw)
  To: adilger.kernel
  Cc: jack, toshi.okajima, tytso, m.mizuma, linux-ext4, linux-fsdevel, sandeen

On freezing the F.S, the journal should be frozen as well. This means that no
new transactions can be started on a frozen journal. Similarly, thawing a F.S
should thaw the journal, which should then start accepting new transactions
again. While the F.S is frozen, any request to start a new handle should end
up on a wait queue until the F.S is thawed again.

Add support to freeze and thaw a journal and use this support when
freezing and thawing ext4.

Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
---
Changes since v1:
* Check for the j_flag and if frozen release the j_state_lock before sleeping
   on wait queue
* Export the freeze, thaw routines
* Added a barrier after clearing JBD2_FROZEN in j_flags (in jbd2_journal_thaw())

 fs/ext4/super.c       |   20 ++++++--------------
 fs/jbd2/journal.c     |    1 +
 fs/jbd2/transaction.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/jbd2.h  |    9 +++++++++
 4 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..796aa4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
 
 	journal = EXT4_SB(sb)->s_journal;
 
-	/* Now we set up the journal barrier. */
-	jbd2_journal_lock_updates(journal);
-
+	error = jbd2_journal_freeze(journal);
 	/*
-	 * Don't clear the needs_recovery flag if we failed to flush
+	 * Don't clear the needs_recovery flag if we failed to freeze
 	 * the journal.
 	 */
-	error = jbd2_journal_flush(journal);
-	if (error < 0)
-		goto out;
-
-	/* Journal blocked and flushed, clear needs_recovery flag. */
-	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
-	error = ext4_commit_super(sb, 1);
-out:
-	/* we rely on s_frozen to stop further updates */
-	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+	if (error >= 0) {
+		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+		error = ext4_commit_super(sb, 1);
+	}
 	return error;
 }
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..5e46333 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
 	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_frozen);
 	mutex_init(&journal->j_barrier);
 	mutex_init(&journal->j_checkpoint_mutex);
 	spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..3283c77 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -171,6 +171,17 @@ repeat:
 				journal->j_barrier_count == 0);
 		goto repeat;
 	}
+	/* dont let a new handle start when a journal is frozen.
+	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+	 * the jflags indicate that the journal is frozen. So if the
+	 * j_barrier_count is 0, then check if this was made 0 by the freezing
+	 * process
+	 */
+	if (journal->j_flags & JBD2_FROZEN) {
+		read_unlock(&journal->j_state_lock);
+		jbd2_check_frozen(journal);
+		goto repeat;
+	}
 
 	if (!journal->j_running_transaction) {
 		read_unlock(&journal->j_state_lock);
@@ -489,6 +500,38 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
 }
 EXPORT_SYMBOL(jbd2_journal_restart);
 
+int jbd2_journal_freeze(journal_t *journal)
+{
+	int error = 0;
+	/* Now we set up the journal barrier. */
+	jbd2_journal_lock_updates(journal);
+
+	/*
+	 * Don't clear the needs_recovery flag if we failed to flush
+	 * the journal.
+	 */
+	error = jbd2_journal_flush(journal);
+	if (error >= 0) {
+		write_lock(&journal->j_state_lock);
+		journal->j_flags = journal->j_flags | JBD2_FROZEN;
+		write_unlock(&journal->j_state_lock);
+	}
+	jbd2_journal_unlock_updates(journal);
+	return error;
+}
+EXPORT_SYMBOL(jbd2_journal_freeze);
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+	write_lock(&journal->j_state_lock);
+	journal->j_flags &= ~JBD2_FROZEN;
+	write_unlock(&journal->j_state_lock);
+	smp_wmb();
+	wake_up(&journal->j_wait_frozen);
+}
+EXPORT_SYMBOL(jbd2_journal_thaw);
+
+
 /**
  * void jbd2_journal_lock_updates () - establish a transaction barrier.
  * @journal:  Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..31d6c23 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -835,6 +835,9 @@ struct journal_s
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for journal to thaw*/
+	wait_queue_head_t	j_wait_frozen;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct mutex		j_checkpoint_mutex;
 
@@ -1013,7 +1016,11 @@ struct journal_s
 #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
 						 * data write error in ordered
 						 * mode */
+#define JBD2_FROZEN	0x080   /* Journal thread is frozen as the filesystem is frozen */
+
 
+#define jbd2_check_frozen(journal)	\
+		wait_event(journal->j_wait_frozen, ((journal->j_flags & JBD2_FROZEN) != JBD2_FROZEN))
 /*
  * Function declarations for the journaling transaction and buffer
  * management
@@ -1121,6 +1128,8 @@ extern void	 jbd2_journal_invalidatepage(journal_t *,
 				struct page *, unsigned long);
 extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
 extern int	 jbd2_journal_stop(handle_t *);
+extern int	 jbd2_journal_freeze(journal_t *);
+extern void	 jbd2_journal_thaw(journal_t *);
 extern int	 jbd2_journal_flush (journal_t *);
 extern void	 jbd2_journal_lock_updates (journal_t *);
 extern void	 jbd2_journal_unlock_updates (journal_t *);
-- 
1.7.1



* Re: [PATCH v2] Adding support to freeze and unfreeze a journal
  2011-05-07 20:04                                                         ` [PATCH v2] " Surbhi Palande
@ 2011-05-08  8:24                                                           ` Marco Stornelli
  2011-05-09  9:04                                                             ` Surbhi Palande
  2011-05-09  9:53                                                           ` Jan Kara
  1 sibling, 1 reply; 121+ messages in thread
From: Marco Stornelli @ 2011-05-08  8:24 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: adilger.kernel, jack, toshi.okajima, tytso, m.mizuma, linux-ext4,
	linux-fsdevel, sandeen

On 07/05/2011 22:04, Surbhi Palande wrote:
> On freezing the F.S, the journal should be frozen as well. This implies not
> being able to start any new transactions on a frozen journal. Similarly,
> thawing a F.S should thaw a journal and this should conversely start accepting
> new transactions. While the F.S is frozen any request to start a new
> handle should end up on a wait queue till the F.S is thawed back again.
>
> Adding support to freeze and thaw a journal and leveraging this support in
> freezing and thawing ext4.
>
> Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
> ---
> Changes since v1:
> * Check for the j_flag and if frozen release the j_state_lock before sleeping
>     on wait queue
> * Export the freeze, thaw routines
> * Added a barrier after setting the j_flags = JBD2_FROZEN
>
>   fs/ext4/super.c       |   20 ++++++--------------
>   fs/jbd2/journal.c     |    1 +
>   fs/jbd2/transaction.c |   43 +++++++++++++++++++++++++++++++++++++++++++
>   include/linux/jbd2.h  |    9 +++++++++
>   4 files changed, 59 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8553dfb..796aa4c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>
>   	journal = EXT4_SB(sb)->s_journal;
>
> -	/* Now we set up the journal barrier. */
> -	jbd2_journal_lock_updates(journal);
> -
> +	error = jbd2_journal_freeze(journal);
>   	/*
> -	 * Don't clear the needs_recovery flag if we failed to flush
> +	 * Don't clear the needs_recovery flag if we failed to freeze
>   	 * the journal.
>   	 */
> -	error = jbd2_journal_flush(journal);
> -	if (error < 0)
> -		goto out;
> -
> -	/* Journal blocked and flushed, clear needs_recovery flag. */
> -	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> -	error = ext4_commit_super(sb, 1);
> -out:
> -	/* we rely on s_frozen to stop further updates */
> -	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> +	if (error >= 0) {
> +		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> +		error = ext4_commit_super(sb, 1);
> +	}
>   	return error;
>   }
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0ec3db..5e46333 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>   	init_waitqueue_head(&journal->j_wait_checkpoint);
>   	init_waitqueue_head(&journal->j_wait_commit);
>   	init_waitqueue_head(&journal->j_wait_updates);
> +	init_waitqueue_head(&journal->j_wait_frozen);
>   	mutex_init(&journal->j_barrier);
>   	mutex_init(&journal->j_checkpoint_mutex);
>   	spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 05fa77a..3283c77 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -171,6 +171,17 @@ repeat:
>   				journal->j_barrier_count == 0);
>   		goto repeat;
>   	}
> +	/* dont let a new handle start when a journal is frozen.
> +	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> +	 * the jflags indicate that the journal is frozen. So if the
> +	 * j_barrier_count is 0, then check if this was made 0 by the freezing
> +	 * process
> +	 */
> +	if (journal->j_flags & JBD2_FROZEN) {
> +		read_unlock(&journal->j_state_lock);
> +		jbd2_check_frozen(journal);
> +		goto repeat;
> +	}
>
>   	if (!journal->j_running_transaction) {
>   		read_unlock(&journal->j_state_lock);
> @@ -489,6 +500,38 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
>   }
>   EXPORT_SYMBOL(jbd2_journal_restart);
>
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> +	int error = 0;
> +	/* Now we set up the journal barrier. */
> +	jbd2_journal_lock_updates(journal);
> +
> +	/*
> +	 * Don't clear the needs_recovery flag if we failed to flush
> +	 * the journal.
> +	 */
> +	error = jbd2_journal_flush(journal);
> +	if (error >= 0) {
> +		write_lock(&journal->j_state_lock);
> +		journal->j_flags = journal->j_flags | JBD2_FROZEN;

Better journal->j_flags |= JBD2_FROZEN;

> +		write_unlock(&journal->j_state_lock);
> +	}
> +	jbd2_journal_unlock_updates(journal);
> +	return error;
> +}
> +EXPORT_SYMBOL(jbd2_journal_freeze);
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> +	write_lock(&journal->j_state_lock);
> +	journal->j_flags = journal->j_flags &= ~JBD2_FROZEN;

What? Maybe journal->j_flags &= ~JBD2_FROZEN;

> +	write_unlock(&journal->j_state_lock);
> +	smp_wmb();

It'd be better to add a comment here for this barrier.

> +	wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);
> +
> +

Marco


* Re: [PATCH v2] Adding support to freeze and unfreeze a journal
  2011-05-08  8:24                                                           ` Marco Stornelli
@ 2011-05-09  9:04                                                             ` Surbhi Palande
  2011-05-09  9:24                                                               ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-09  9:04 UTC (permalink / raw)
  To: Marco Stornelli
  Cc: adilger.kernel, jack, toshi.okajima, tytso, m.mizuma, linux-ext4,
	linux-fsdevel, sandeen

On 05/08/2011 11:24 AM, Marco Stornelli wrote:

Thanks for your review!
> On 07/05/2011 22:04, Surbhi Palande wrote:
>> On freezing the F.S, the journal should be frozen as well. This
>> implies not
>> being able to start any new transactions on a frozen journal. Similarly,
>> thawing a F.S should thaw a journal and this should conversely start
>> accepting
>> new transactions. While the F.S is frozen any request to start a new
>> handle should end up on a wait queue till the F.S is thawed back again.
>>
>> Adding support to freeze and thaw a journal and leveraging this
>> support in
>> freezing and thawing ext4.
>>
>> Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
>> ---
>> Changes since v1:
>> * Check for the j_flag and if frozen release the j_state_lock before
>> sleeping
>> on wait queue
>> * Export the freeze, thaw routines
>> * Added a barrier after setting the j_flags = JBD2_FROZEN
>>
>> fs/ext4/super.c | 20 ++++++--------------
>> fs/jbd2/journal.c | 1 +
>> fs/jbd2/transaction.c | 43 +++++++++++++++++++++++++++++++++++++++++++
>> include/linux/jbd2.h | 9 +++++++++
>> 4 files changed, 59 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 8553dfb..796aa4c 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>>
>> journal = EXT4_SB(sb)->s_journal;
>>
>> - /* Now we set up the journal barrier. */
>> - jbd2_journal_lock_updates(journal);
>> -
>> + error = jbd2_journal_freeze(journal);
>> /*
>> - * Don't clear the needs_recovery flag if we failed to flush
>> + * Don't clear the needs_recovery flag if we failed to freeze
>> * the journal.
>> */
>> - error = jbd2_journal_flush(journal);
>> - if (error < 0)
>> - goto out;
>> -
>> - /* Journal blocked and flushed, clear needs_recovery flag. */
>> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>> - error = ext4_commit_super(sb, 1);
>> -out:
>> - /* we rely on s_frozen to stop further updates */
>> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>> + if (error >= 0) {
>> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>> + error = ext4_commit_super(sb, 1);
>> + }
>> return error;
>> }
>>
>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>> index e0ec3db..5e46333 100644
>> --- a/fs/jbd2/journal.c
>> +++ b/fs/jbd2/journal.c
>> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>> init_waitqueue_head(&journal->j_wait_checkpoint);
>> init_waitqueue_head(&journal->j_wait_commit);
>> init_waitqueue_head(&journal->j_wait_updates);
>> + init_waitqueue_head(&journal->j_wait_frozen);
>> mutex_init(&journal->j_barrier);
>> mutex_init(&journal->j_checkpoint_mutex);
>> spin_lock_init(&journal->j_revoke_lock);
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index 05fa77a..3283c77 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -171,6 +171,17 @@ repeat:
>> journal->j_barrier_count == 0);
>> goto repeat;
>> }
>> + /* dont let a new handle start when a journal is frozen.
>> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
>> + * the jflags indicate that the journal is frozen. So if the
>> + * j_barrier_count is 0, then check if this was made 0 by the freezing
>> + * process
>> + */
>> + if (journal->j_flags & JBD2_FROZEN) {
>> + read_unlock(&journal->j_state_lock);
>> + jbd2_check_frozen(journal);
>> + goto repeat;
>> + }
>>
>> if (!journal->j_running_transaction) {
>> read_unlock(&journal->j_state_lock);
>> @@ -489,6 +500,38 @@ int jbd2_journal_restart(handle_t *handle, int
>> nblocks)
>> }
>> EXPORT_SYMBOL(jbd2_journal_restart);
>>
>> +int jbd2_journal_freeze(journal_t *journal)
>> +{
>> + int error = 0;
>> + /* Now we set up the journal barrier. */
>> + jbd2_journal_lock_updates(journal);
>> +
>> + /*
>> + * Don't clear the needs_recovery flag if we failed to flush
>> + * the journal.
>> + */
>> + error = jbd2_journal_flush(journal);
>> + if (error >= 0) {
>> + write_lock(&journal->j_state_lock);
>> + journal->j_flags = journal->j_flags | JBD2_FROZEN;
>
> Better journal->j_flags |= JBD2_FROZEN;

I was wondering why this is actually better than the longer form. Is
there any technical reason, other than a coding-style preference?

>
>> + write_unlock(&journal->j_state_lock);
>> + }
>> + jbd2_journal_unlock_updates(journal);
>> + return error;
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_freeze);
>> +
>> +void jbd2_journal_thaw(journal_t * journal)
>> +{
>> + write_lock(&journal->j_state_lock);
>> + journal->j_flags = journal->j_flags &= ~JBD2_FROZEN;
>
> What? Maybe journal->j_flags &= ~JBD2_FROZEN;

This definitely is a typo and needs a change. Again, why do you 
recommend the shorter form?

>
>> + write_unlock(&journal->j_state_lock);
>> + smp_wmb();
>
> It'd be better to add a comment here for this barrier.
Ok!
>
>> + wake_up(&journal->j_wait_frozen);
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>> +
>> +
>
> Marco
Warm Regards,
Surbhi.



* Re: [PATCH v2] Adding support to freeze and unfreeze a journal
  2011-05-09  9:04                                                             ` Surbhi Palande
@ 2011-05-09  9:24                                                               ` Jan Kara
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Kara @ 2011-05-09  9:24 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: Marco Stornelli, adilger.kernel, jack, toshi.okajima, tytso,
	m.mizuma, linux-ext4, linux-fsdevel, sandeen

On Mon 09-05-11 12:04:45, Surbhi Palande wrote:
> On 05/08/2011 11:24 AM, Marco Stornelli wrote:
> >>+ error = jbd2_journal_flush(journal);
> >>+ if (error >= 0) {
> >>+ write_lock(&journal->j_state_lock);
> >>+ journal->j_flags = journal->j_flags | JBD2_FROZEN;
> >
> >Better journal->j_flags |= JBD2_FROZEN;
> 
> I was wondering why this is actually better than the longer form? Is
> there any technical reason other than preference of coding style
> here?
  It's just coding style, but that matters as well: with |= the reader doesn't
have to check whether the right-hand side is the same as the left-hand side.
So please use |=.
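
As a small illustration of the point above, here is a standalone userspace
sketch (not kernel code). DEMO_FROZEN and the demo_* helpers are made-up names
that only mirror the 0x080 bit value used in the patch. The compound forms
name the target once, so there is nothing to cross-check:

```c
#include <assert.h>

/* Hypothetical flag bit for illustration; same value as JBD2_FROZEN in the patch. */
#define DEMO_FROZEN 0x080UL

/* Set the flag. With |= the variable being updated appears exactly once. */
static unsigned long demo_set_frozen(unsigned long flags)
{
	flags |= DEMO_FROZEN;
	return flags;
}

/* Clear the flag. &= ~BIT leaves every other bit untouched. */
static unsigned long demo_clear_frozen(unsigned long flags)
{
	flags &= ~DEMO_FROZEN;
	return flags;
}

/* The short forms give the same result as the spelled-out assignments. */
static void demo_flags_selfcheck(void)
{
	unsigned long f = 0x041UL;

	assert(demo_set_frozen(f) == (f | DEMO_FROZEN));
	assert(demo_clear_frozen(f | DEMO_FROZEN) == f);
}
```

The doubled form from v2, flags = flags &= ~FLAG, is at best redundant; the
single compound assignment is what was meant.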

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH v2] Adding support to freeze and unfreeze a journal
  2011-05-07 20:04                                                         ` [PATCH v2] " Surbhi Palande
  2011-05-08  8:24                                                           ` Marco Stornelli
@ 2011-05-09  9:53                                                           ` Jan Kara
  2011-05-09 13:49                                                             ` Surbhi Palande
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-09  9:53 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: adilger.kernel, jack, toshi.okajima, tytso, m.mizuma, linux-ext4,
	linux-fsdevel, sandeen

On Sat 07-05-11 23:04:22, Surbhi Palande wrote:
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> +	write_lock(&journal->j_state_lock);
> +	journal->j_flags = journal->j_flags &= ~JBD2_FROZEN;
> +	write_unlock(&journal->j_state_lock);
> +	smp_wmb();
  Why is the smp_wmb() here? The write is inside a rw-lock so it cannot be
reordered. Also wake_up() is protected by queue->lock so I don't see the
need for a barrier.

> +	wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH v2] Adding support to freeze and unfreeze a journal
  2011-05-09  9:53                                                           ` Jan Kara
@ 2011-05-09 13:49                                                             ` Surbhi Palande
  2011-05-09 14:51                                                               ` [PATCH v3] " Surbhi Palande
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-09 13:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: adilger.kernel, toshi.okajima, marco.stornelli, tytso, m.mizuma,
	linux-ext4, linux-fsdevel, sandeen

On 05/09/2011 12:53 PM, Jan Kara wrote:
> On Sat 07-05-11 23:04:22, Surbhi Palande wrote:
>> +void jbd2_journal_thaw(journal_t * journal)
>> +{
>> +	write_lock(&journal->j_state_lock);
>> +	journal->j_flags = journal->j_flags &= ~JBD2_FROZEN;
>> +	write_unlock(&journal->j_state_lock);
>> +	smp_wmb();
>    Why is here the smp_wmb()? The write is inside a rw-lock so it cannot be
> reordered. Also wake_up() is protected by queue->lock so I don't see the
> need for a barrier.

Ok, thanks for letting me know. I was under the impression that a 
reorder was possible in case of SMP. I will rewrite the patch with this 
change and the one that Marco Stornelli suggested as well.
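
For reference, the freeze/thaw gate under discussion can be sketched in
userspace with a mutex and a condition variable. This is only an analogy under
stated assumptions, not the jbd2 code: struct demo_journal and the demo_*
functions are hypothetical, the mutex stands in for j_state_lock and the
condition variable for j_wait_frozen. The flag is written under the lock, and
the unlock already orders that write for anyone who takes the lock afterwards,
so no separate barrier appears before the wakeup. Note also that the waiter
sleeps until the flag is clear; likewise the kernel's wait_event(wq, cond)
returns once cond is true, so its condition has to express "not frozen".

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the journal; not the jbd2 API. */
struct demo_journal {
	pthread_mutex_t state_lock;	/* plays the role of j_state_lock */
	pthread_cond_t wait_frozen;	/* plays the role of j_wait_frozen */
	bool frozen;			/* plays the role of the JBD2_FROZEN bit */
	unsigned int handles;		/* number of handles started so far */
};

static void demo_journal_init(struct demo_journal *j)
{
	pthread_mutex_init(&j->state_lock, NULL);
	pthread_cond_init(&j->wait_frozen, NULL);
	j->frozen = false;
	j->handles = 0;
}

static void demo_journal_freeze(struct demo_journal *j)
{
	pthread_mutex_lock(&j->state_lock);
	j->frozen = true;
	pthread_mutex_unlock(&j->state_lock);
}

static void demo_journal_thaw(struct demo_journal *j)
{
	pthread_mutex_lock(&j->state_lock);
	j->frozen = false;
	/* The unlock publishes the write; no extra barrier is needed. */
	pthread_mutex_unlock(&j->state_lock);
	pthread_cond_broadcast(&j->wait_frozen);
}

/* Blocks while the journal is frozen, then starts a handle. */
static unsigned int demo_journal_start(struct demo_journal *j)
{
	unsigned int n;

	pthread_mutex_lock(&j->state_lock);
	while (j->frozen)	/* sleep until NOT frozen; recheck after wakeup */
		pthread_cond_wait(&j->wait_frozen, &j->state_lock);
	n = ++j->handles;
	pthread_mutex_unlock(&j->state_lock);
	return n;
}

static bool demo_journal_is_frozen(struct demo_journal *j)
{
	bool frozen;

	pthread_mutex_lock(&j->state_lock);
	frozen = j->frozen;
	pthread_mutex_unlock(&j->state_lock);
	return frozen;
}
```

A caller that hits demo_journal_start() while the journal is frozen simply
sleeps until another thread calls demo_journal_thaw().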

Thanks a lot!

Warm Regards,
Surbhi.

>
>> +	wake_up(&journal->j_wait_frozen);
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>
> 								Honza



* [PATCH v3] Adding support to freeze and unfreeze a journal
  2011-05-09 13:49                                                             ` Surbhi Palande
@ 2011-05-09 14:51                                                               ` Surbhi Palande
  2011-05-09 15:08                                                                 ` Jan Kara
  2011-05-09 15:23                                                                 ` [PATCH v3] " Eric Sandeen
  0 siblings, 2 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-09 14:51 UTC (permalink / raw)
  To: jack
  Cc: marco.stornelli, adilger.kernel, toshi.okajima, tytso, m.mizuma,
	sandeen, linux-ext4, linux-fsdevel

The journal should be frozen when the filesystem freezes. This means that until
the filesystem is thawed again, the journal should accept no new transactions.
When the filesystem thaws, it should in turn thaw the journal so that the
journal resumes accepting new transactions.
While the journal is frozen, clients calling jbd2_journal_start() sleep on a
wait queue. Thawing the journal wakes the sleeping clients, and journalling
proceeds normally.

Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
---
Changes since the last patch:
* Changed to the shorter forms of expressions eg: x |= y
* removed the unnecessary barrier

 fs/ext4/super.c       |   20 ++++++--------------
 fs/jbd2/journal.c     |    1 +
 fs/jbd2/transaction.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/jbd2.h  |   10 ++++++++++
 4 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..796aa4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
 
 	journal = EXT4_SB(sb)->s_journal;
 
-	/* Now we set up the journal barrier. */
-	jbd2_journal_lock_updates(journal);
-
+	error = jbd2_journal_freeze(journal);
 	/*
-	 * Don't clear the needs_recovery flag if we failed to flush
+	 * Don't clear the needs_recovery flag if we failed to freeze
 	 * the journal.
 	 */
-	error = jbd2_journal_flush(journal);
-	if (error < 0)
-		goto out;
-
-	/* Journal blocked and flushed, clear needs_recovery flag. */
-	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
-	error = ext4_commit_super(sb, 1);
-out:
-	/* we rely on s_frozen to stop further updates */
-	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+	if (error >= 0) {
+		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+		error = ext4_commit_super(sb, 1);
+	}
 	return error;
 }
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..5e46333 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
 	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_frozen);
 	mutex_init(&journal->j_barrier);
 	mutex_init(&journal->j_checkpoint_mutex);
 	spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..b040293 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -171,6 +171,17 @@ repeat:
 				journal->j_barrier_count == 0);
 		goto repeat;
 	}
+	/* dont let a new handle start when a journal is frozen.
+	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+	 * the jflags indicate that the journal is frozen. So if the
+	 * j_barrier_count is 0, then check if this was made 0 by the freezing
+	 * process
+	 */
+	if (journal->j_flags & JBD2_FROZEN) {
+		read_unlock(&journal->j_state_lock);
+		jbd2_check_frozen(journal);
+		goto repeat;
+	}
 
 	if (!journal->j_running_transaction) {
 		read_unlock(&journal->j_state_lock);
@@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
 }
 EXPORT_SYMBOL(jbd2_journal_restart);
 
+int jbd2_journal_freeze(journal_t *journal)
+{
+	int error = 0;
+	/* Now we set up the journal barrier. */
+	jbd2_journal_lock_updates(journal);
+
+	/*
+	 * Don't clear the needs_recovery flag if we failed to flush
+	 * the journal.
+	 */
+	error = jbd2_journal_flush(journal);
+	if (error >= 0) {
+		write_lock(&journal->j_state_lock);
+		journal->j_flags |= JBD2_FROZEN;
+		write_unlock(&journal->j_state_lock);
+	}
+	jbd2_journal_unlock_updates(journal);
+	return error;
+}
+EXPORT_SYMBOL(jbd2_journal_freeze);
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+	write_lock(&journal->j_state_lock);
+	journal->j_flags &= ~JBD2_FROZEN;
+	write_unlock(&journal->j_state_lock);
+	wake_up(&journal->j_wait_frozen);
+}
+EXPORT_SYMBOL(jbd2_journal_thaw);
+
+
 /**
  * void jbd2_journal_lock_updates () - establish a transaction barrier.
  * @journal:  Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..c7885b2 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
  * @j_wait_checkpoint:  Wait queue to trigger checkpointing
  * @j_wait_commit: Wait queue to trigger commit
  * @j_wait_updates: Wait queue to wait for updates to complete
+ * @j_wait_frozen: Wait queue to wait for journal to thaw
  * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
  * @j_head: Journal head - identifies the first unused block in the journal
  * @j_tail: Journal tail - identifies the oldest still-used block in the
@@ -835,6 +836,9 @@ struct journal_s
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for journal to thaw*/
+	wait_queue_head_t	j_wait_frozen;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct mutex		j_checkpoint_mutex;
 
@@ -1013,7 +1017,11 @@ struct journal_s
 #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
 						 * data write error in ordered
 						 * mode */
+#define JBD2_FROZEN	0x080   /* Journal thread is frozen as the filesystem is frozen */
+
 
+#define jbd2_check_frozen(journal)	\
+		wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN))
 /*
  * Function declarations for the journaling transaction and buffer
  * management
@@ -1121,6 +1129,8 @@ extern void	 jbd2_journal_invalidatepage(journal_t *,
 				struct page *, unsigned long);
 extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
 extern int	 jbd2_journal_stop(handle_t *);
+extern int	 jbd2_journal_freeze(journal_t *);
+extern void	 jbd2_journal_thaw(journal_t *);
 extern int	 jbd2_journal_flush (journal_t *);
 extern void	 jbd2_journal_lock_updates (journal_t *);
 extern void	 jbd2_journal_unlock_updates (journal_t *);
-- 
1.7.1



* Re: [PATCH v3] Adding support to freeze and unfreeze a journal
  2011-05-09 14:51                                                               ` [PATCH v3] " Surbhi Palande
@ 2011-05-09 15:08                                                                 ` Jan Kara
  2011-05-10 15:07                                                                   ` [PATCH] " Surbhi Palande
  2011-05-09 15:23                                                                 ` [PATCH v3] " Eric Sandeen
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Kara @ 2011-05-09 15:08 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: jack, marco.stornelli, adilger.kernel, toshi.okajima, tytso,
	m.mizuma, sandeen, linux-ext4, linux-fsdevel

On Mon 09-05-11 17:51:32, Surbhi Palande wrote:
> The journal should be frozen when a F.S freezes. What this means is that till
> the F.S is thawed again, no new transactions should be accepted by the
> journal. When the F.S thaws, in turn it should thaw the journal and this should
> allow the journal to resume accepting new transactions.
> While the F.S has frozen the journal, the clients of journal on calling
> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
> up the sleeping clients and journalling can progress normally.
  The patch looks fine. I'd just add to the changelog a sketch of the race
which can happen if we don't really freeze the journal and rely on the VFS
alone. You can add:
Acked-by: Jan Kara <jack@suse.cz>

									Honza
> 
> Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
> ---
> Changes since the last patch:
> * Changed to the shorter forms of expressions eg: x |= y
> * removed the unnecessary barrier
> 
>  fs/ext4/super.c       |   20 ++++++--------------
>  fs/jbd2/journal.c     |    1 +
>  fs/jbd2/transaction.c |   42 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/jbd2.h  |   10 ++++++++++
>  4 files changed, 59 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8553dfb..796aa4c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>  
>  	journal = EXT4_SB(sb)->s_journal;
>  
> -	/* Now we set up the journal barrier. */
> -	jbd2_journal_lock_updates(journal);
> -
> +	error = jbd2_journal_freeze(journal);
>  	/*
> -	 * Don't clear the needs_recovery flag if we failed to flush
> +	 * Don't clear the needs_recovery flag if we failed to freeze
>  	 * the journal.
>  	 */
> -	error = jbd2_journal_flush(journal);
> -	if (error < 0)
> -		goto out;
> -
> -	/* Journal blocked and flushed, clear needs_recovery flag. */
> -	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> -	error = ext4_commit_super(sb, 1);
> -out:
> -	/* we rely on s_frozen to stop further updates */
> -	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> +	if (error >= 0) {
> +		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> +		error = ext4_commit_super(sb, 1);
> +	}
>  	return error;
>  }
>  
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0ec3db..5e46333 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>  	init_waitqueue_head(&journal->j_wait_checkpoint);
>  	init_waitqueue_head(&journal->j_wait_commit);
>  	init_waitqueue_head(&journal->j_wait_updates);
> +	init_waitqueue_head(&journal->j_wait_frozen);
>  	mutex_init(&journal->j_barrier);
>  	mutex_init(&journal->j_checkpoint_mutex);
>  	spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 05fa77a..b040293 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -171,6 +171,17 @@ repeat:
>  				journal->j_barrier_count == 0);
>  		goto repeat;
>  	}
> +	/* dont let a new handle start when a journal is frozen.
> +	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> +	 * the jflags indicate that the journal is frozen. So if the
> +	 * j_barrier_count is 0, then check if this was made 0 by the freezing
> +	 * process
> +	 */
> +	if (journal->j_flags & JBD2_FROZEN) {
> +		read_unlock(&journal->j_state_lock);
> +		jbd2_check_frozen(journal);
> +		goto repeat;
> +	}
>  
>  	if (!journal->j_running_transaction) {
>  		read_unlock(&journal->j_state_lock);
> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
>  }
>  EXPORT_SYMBOL(jbd2_journal_restart);
>  
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> +	int error = 0;
> +	/* Now we set up the journal barrier. */
> +	jbd2_journal_lock_updates(journal);
> +
> +	/*
> +	 * Don't clear the needs_recovery flag if we failed to flush
> +	 * the journal.
> +	 */
> +	error = jbd2_journal_flush(journal);
> +	if (error >= 0) {
> +		write_lock(&journal->j_state_lock);
> +		journal->j_flags |= JBD2_FROZEN;
> +		write_unlock(&journal->j_state_lock);
> +	}
> +	jbd2_journal_unlock_updates(journal);
> +	return error;
> +}
> +EXPORT_SYMBOL(jbd2_journal_freeze);
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> +	write_lock(&journal->j_state_lock);
> +	journal->j_flags &= ~JBD2_FROZEN;
> +	write_unlock(&journal->j_state_lock);
> +	wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);
> +
> +
>  /**
>   * void jbd2_journal_lock_updates () - establish a transaction barrier.
>   * @journal:  Journal to establish a barrier on.
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index a32dcae..c7885b2 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>   * @j_wait_checkpoint:  Wait queue to trigger checkpointing
>   * @j_wait_commit: Wait queue to trigger commit
>   * @j_wait_updates: Wait queue to wait for updates to complete
> + * @j_wait_frozen: Wait queue to wait for journal to thaw
>   * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
>   * @j_head: Journal head - identifies the first unused block in the journal
>   * @j_tail: Journal tail - identifies the oldest still-used block in the
> @@ -835,6 +836,9 @@ struct journal_s
>  	/* Wait queue to wait for updates to complete */
>  	wait_queue_head_t	j_wait_updates;
>  
> +	/* Wait queue to wait for journal to thaw*/
> +	wait_queue_head_t	j_wait_frozen;
> +
>  	/* Semaphore for locking against concurrent checkpoints */
>  	struct mutex		j_checkpoint_mutex;
>  
> @@ -1013,7 +1017,11 @@ struct journal_s
>  #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
>  						 * data write error in ordered
>  						 * mode */
> +#define JBD2_FROZEN	0x080   /* Journal thread is frozen as the filesystem is frozen */
> +
>  
> +#define jbd2_check_frozen(journal)	\
> +		wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN))
>  /*
>   * Function declarations for the journaling transaction and buffer
>   * management
> @@ -1121,6 +1129,8 @@ extern void	 jbd2_journal_invalidatepage(journal_t *,
>  				struct page *, unsigned long);
>  extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
>  extern int	 jbd2_journal_stop(handle_t *);
> +extern int	 jbd2_journal_freeze(journal_t *);
> +extern void	 jbd2_journal_thaw(journal_t *);
>  extern int	 jbd2_journal_flush (journal_t *);
>  extern void	 jbd2_journal_lock_updates (journal_t *);
>  extern void	 jbd2_journal_unlock_updates (journal_t *);
> -- 
> 1.7.1
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH v3] Adding support to freeze and unfreeze a journal
  2011-05-09 14:51                                                               ` [PATCH v3] " Surbhi Palande
  2011-05-09 15:08                                                                 ` Jan Kara
@ 2011-05-09 15:23                                                                 ` Eric Sandeen
  2011-05-11  7:06                                                                   ` Surbhi Palande
  1 sibling, 1 reply; 121+ messages in thread
From: Eric Sandeen @ 2011-05-09 15:23 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: jack, marco.stornelli, adilger.kernel, toshi.okajima, tytso,
	m.mizuma, linux-ext4, linux-fsdevel

On 5/9/11 9:51 AM, Surbhi Palande wrote:
> The journal should be frozen when a F.S freezes. What this means is that till
> the F.S is thawed again, no new transactions should be accepted by the
> journal. When the F.S thaws, in turn it should thaw the journal and this should
> allow the journal to resume accepting new transactions.
> While the F.S has frozen the journal, the clients of journal on calling
> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
> up the sleeping clients and journalling can progress normally.

Can I ask how this was tested?

Ideally, anything you found useful for testing should be integrated
into the xfstests test suite so that we don't regress in the future.
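
As a starting point for such a test, a freeze/thaw cycle can be driven from
userspace with the FIFREEZE and FITHAW ioctls, which is what the fsfreeze
command from the original report uses. The sketch below is a minimal helper
under stated assumptions: the demo_* names are hypothetical, it is not how
the patch was tested (the thread does not say), and the ioctls need
CAP_SYS_ADMIN on a mounted filesystem. A real test would run writers
against the mount point between the two calls.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Issue a freeze or thaw request on the filesystem containing @mountpoint.
 * Returns 0 on success or a negative errno value on failure.
 */
static int demo_fs_ioctl(const char *mountpoint, unsigned long request)
{
	int ret = 0;
	int fd = open(mountpoint, O_RDONLY);

	if (fd < 0)
		return -errno;
	if (ioctl(fd, request, 0) < 0)
		ret = -errno;	/* save before close() can change errno */
	close(fd);
	return ret;
}

/* Freeze the filesystem; new writes block until it is thawed. */
static int demo_freeze(const char *mountpoint)
{
	return demo_fs_ioctl(mountpoint, FIFREEZE);
}

/* Thaw a previously frozen filesystem. */
static int demo_thaw(const char *mountpoint)
{
	return demo_fs_ioctl(mountpoint, FITHAW);
}
```

With the deadlock from the original report, demo_thaw() would be the call that
never returns, so a test harness would want a timeout around it.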

thanks,
-Eric

> Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
> ---
> Changes since the last patch:
> * Changed to the shorter forms of expressions eg: x |= y
> * removed the unnecessary barrier
> 
>  fs/ext4/super.c       |   20 ++++++--------------
>  fs/jbd2/journal.c     |    1 +
>  fs/jbd2/transaction.c |   42 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/jbd2.h  |   10 ++++++++++
>  4 files changed, 59 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8553dfb..796aa4c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>  
>  	journal = EXT4_SB(sb)->s_journal;
>  
> -	/* Now we set up the journal barrier. */
> -	jbd2_journal_lock_updates(journal);
> -
> +	error = jbd2_journal_freeze(journal);
>  	/*
> -	 * Don't clear the needs_recovery flag if we failed to flush
> +	 * Don't clear the needs_recovery flag if we failed to freeze
>  	 * the journal.
>  	 */
> -	error = jbd2_journal_flush(journal);
> -	if (error < 0)
> -		goto out;
> -
> -	/* Journal blocked and flushed, clear needs_recovery flag. */
> -	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> -	error = ext4_commit_super(sb, 1);
> -out:
> -	/* we rely on s_frozen to stop further updates */
> -	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> +	if (error >= 0) {
> +		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> +		error = ext4_commit_super(sb, 1);
> +	}
>  	return error;
>  }
>  
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0ec3db..5e46333 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>  	init_waitqueue_head(&journal->j_wait_checkpoint);
>  	init_waitqueue_head(&journal->j_wait_commit);
>  	init_waitqueue_head(&journal->j_wait_updates);
> +	init_waitqueue_head(&journal->j_wait_frozen);
>  	mutex_init(&journal->j_barrier);
>  	mutex_init(&journal->j_checkpoint_mutex);
>  	spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 05fa77a..b040293 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -171,6 +171,17 @@ repeat:
>  				journal->j_barrier_count == 0);
>  		goto repeat;
>  	}
> +	/* dont let a new handle start when a journal is frozen.
> +	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> +	 * the jflags indicate that the journal is frozen. So if the
> +	 * j_barrier_count is 0, then check if this was made 0 by the freezing
> +	 * process
> +	 */
> +	if (journal->j_flags & JBD2_FROZEN) {
> +		read_unlock(&journal->j_state_lock);
> +		jbd2_check_frozen(journal);
> +		goto repeat;
> +	}
>  
>  	if (!journal->j_running_transaction) {
>  		read_unlock(&journal->j_state_lock);
> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
>  }
>  EXPORT_SYMBOL(jbd2_journal_restart);
>  
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> +	int error = 0;
> +	/* Now we set up the journal barrier. */
> +	jbd2_journal_lock_updates(journal);
> +
> +	/*
> +	 * Don't clear the needs_recovery flag if we failed to flush
> +	 * the journal.
> +	 */
> +	error = jbd2_journal_flush(journal);
> +	if (error >= 0) {
> +		write_lock(&journal->j_state_lock);
> +		journal->j_flags |= JBD2_FROZEN;
> +		write_unlock(&journal->j_state_lock);
> +	}
> +	jbd2_journal_unlock_updates(journal);
> +	return error;
> +}
> +EXPORT_SYMBOL(jbd2_journal_freeze);
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> +	write_lock(&journal->j_state_lock);
> +	journal->j_flags &= ~JBD2_FROZEN;
> +	write_unlock(&journal->j_state_lock);
> +	wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);
> +
> +
>  /**
>   * void jbd2_journal_lock_updates () - establish a transaction barrier.
>   * @journal:  Journal to establish a barrier on.
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index a32dcae..c7885b2 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>   * @j_wait_checkpoint:  Wait queue to trigger checkpointing
>   * @j_wait_commit: Wait queue to trigger commit
>   * @j_wait_updates: Wait queue to wait for updates to complete
> + * @j_wait_frozen: Wait queue to wait for journal to thaw
>   * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
>   * @j_head: Journal head - identifies the first unused block in the journal
>   * @j_tail: Journal tail - identifies the oldest still-used block in the
> @@ -835,6 +836,9 @@ struct journal_s
>  	/* Wait queue to wait for updates to complete */
>  	wait_queue_head_t	j_wait_updates;
>  
> +	/* Wait queue to wait for journal to thaw*/
> +	wait_queue_head_t	j_wait_frozen;
> +
>  	/* Semaphore for locking against concurrent checkpoints */
>  	struct mutex		j_checkpoint_mutex;
>  
> @@ -1013,7 +1017,11 @@ struct journal_s
>  #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
>  						 * data write error in ordered
>  						 * mode */
> +#define JBD2_FROZEN	0x080   /* Journal thread is frozen as the filesystem is frozen */
> +
>  
> +#define jbd2_check_frozen(journal)	\
> +		wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN))
>  /*
>   * Function declarations for the journaling transaction and buffer
>   * management
> @@ -1121,6 +1129,8 @@ extern void	 jbd2_journal_invalidatepage(journal_t *,
>  				struct page *, unsigned long);
>  extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
>  extern int	 jbd2_journal_stop(handle_t *);
> +extern int	 jbd2_journal_freeze(journal_t *);
> +extern void	 jbd2_journal_thaw(journal_t *);
>  extern int	 jbd2_journal_flush (journal_t *);
>  extern void	 jbd2_journal_lock_updates (journal_t *);
>  extern void	 jbd2_journal_unlock_updates (journal_t *);



* [PATCH] Adding support to freeze and unfreeze a journal
  2011-05-09 15:08                                                                 ` Jan Kara
@ 2011-05-10 15:07                                                                   ` Surbhi Palande
  2011-05-10 21:07                                                                     ` Andreas Dilger
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-10 15:07 UTC (permalink / raw)
  To: jack
  Cc: marco.stornelli, adilger.kernel, toshi.okajima, tytso, m.mizuma,
	sandeen, linux-ext4, linux-fsdevel

The journal should be frozen when a F.S freezes. What this means is that till
the F.S is thawed again, no new transactions should be accepted by the
journal. When the F.S thaws, it should in turn thaw the journal and this should
allow the journal to resume accepting new transactions.
While the F.S has frozen the journal, the clients of journal on calling
jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
up the sleeping clients and journalling can progress normally.

An example of the race condition that can happen without this patch is as
follows:

Say the F.S is thawed when we begin. Let tx be the time at unit x

P1: Process doing an aio write
t1) ext4_file_write()
  t2) generic_file_aio_write()
    t3) __generic_file_aio_write()
      // F.S is not frozen, so we do not block in the next check.
      t4) vfs_check_frozen()
      t5) generic_write_checks()
----------------- Preempted ----------------

P2: Process that does fs freeze

t6) freeze_super()
  t7) sync_filesystem()
  t8) sync_blockdev()
  t9) sb->s_op->freeze_fs() (= ext4_freeze)
    t10) jbd2_journal_lock_updates()
    t11) jbd2_journal_flush()
    // Need to unlock the journal before returning to user space.
    t12) jbd2_journal_unlock_updates()
    // Journal is unlocked and so we can start accepting new transactions now.

// freezing process completes execution. Page cache is now clean and should
// remain clean while the F.S is frozen.
--------------------------------------------

P1: writing process gets the control back
t13) generic_file_buffered_write()
  t14) generic_perform_write()
    t15) a_ops->write_begin() (= ext4_write_begin)
      t16) ext4_journal_start()
	// New handle is started. We do not block here! Write continues
	// dirtying the page cache while the F.S is frozen!

Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
Acked-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/super.c       |   20 ++++++--------------
 fs/jbd2/journal.c     |    1 +
 fs/jbd2/transaction.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/jbd2.h  |   10 ++++++++++
 4 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..796aa4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
 
 	journal = EXT4_SB(sb)->s_journal;
 
-	/* Now we set up the journal barrier. */
-	jbd2_journal_lock_updates(journal);
-
+	error = jbd2_journal_freeze(journal);
 	/*
-	 * Don't clear the needs_recovery flag if we failed to flush
+	 * Don't clear the needs_recovery flag if we failed to freeze
 	 * the journal.
 	 */
-	error = jbd2_journal_flush(journal);
-	if (error < 0)
-		goto out;
-
-	/* Journal blocked and flushed, clear needs_recovery flag. */
-	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
-	error = ext4_commit_super(sb, 1);
-out:
-	/* we rely on s_frozen to stop further updates */
-	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+	if (error >= 0) {
+		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+		error = ext4_commit_super(sb, 1);
+	}
 	return error;
 }
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..5e46333 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
 	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_frozen);
 	mutex_init(&journal->j_barrier);
 	mutex_init(&journal->j_checkpoint_mutex);
 	spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..b040293 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -171,6 +171,17 @@ repeat:
 				journal->j_barrier_count == 0);
 		goto repeat;
 	}
+	/* dont let a new handle start when a journal is frozen.
+	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+	 * the jflags indicate that the journal is frozen. So if the
+	 * j_barrier_count is 0, then check if this was made 0 by the freezing
+	 * process
+	 */
+	if (journal->j_flags & JBD2_FROZEN) {
+		read_unlock(&journal->j_state_lock);
+		jbd2_check_frozen(journal);
+		goto repeat;
+	}
 
 	if (!journal->j_running_transaction) {
 		read_unlock(&journal->j_state_lock);
@@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
 }
 EXPORT_SYMBOL(jbd2_journal_restart);
 
+int jbd2_journal_freeze(journal_t *journal)
+{
+	int error = 0;
+	/* Now we set up the journal barrier. */
+	jbd2_journal_lock_updates(journal);
+
+	/*
+	 * Don't clear the needs_recovery flag if we failed to flush
+	 * the journal.
+	 */
+	error = jbd2_journal_flush(journal);
+	if (error >= 0) {
+		write_lock(&journal->j_state_lock);
+		journal->j_flags |= JBD2_FROZEN;
+		write_unlock(&journal->j_state_lock);
+	}
+	jbd2_journal_unlock_updates(journal);
+	return error;
+}
+EXPORT_SYMBOL(jbd2_journal_freeze);
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+	write_lock(&journal->j_state_lock);
+	journal->j_flags &= ~JBD2_FROZEN;
+	write_unlock(&journal->j_state_lock);
+	wake_up(&journal->j_wait_frozen);
+}
+EXPORT_SYMBOL(jbd2_journal_thaw);
+
+
 /**
  * void jbd2_journal_lock_updates () - establish a transaction barrier.
  * @journal:  Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..c7885b2 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
  * @j_wait_checkpoint:  Wait queue to trigger checkpointing
  * @j_wait_commit: Wait queue to trigger commit
  * @j_wait_updates: Wait queue to wait for updates to complete
+ * @j_wait_frozen: Wait queue to wait for journal to thaw
  * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
  * @j_head: Journal head - identifies the first unused block in the journal
  * @j_tail: Journal tail - identifies the oldest still-used block in the
@@ -835,6 +836,9 @@ struct journal_s
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for journal to thaw*/
+	wait_queue_head_t	j_wait_frozen;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct mutex		j_checkpoint_mutex;
 
@@ -1013,7 +1017,11 @@ struct journal_s
 #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
 						 * data write error in ordered
 						 * mode */
+#define JBD2_FROZEN	0x080   /* Journal thread is frozen as the filesystem is frozen */
+
 
+#define jbd2_check_frozen(journal)	\
+		wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN))
 /*
  * Function declarations for the journaling transaction and buffer
  * management
@@ -1121,6 +1129,8 @@ extern void	 jbd2_journal_invalidatepage(journal_t *,
 				struct page *, unsigned long);
 extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
 extern int	 jbd2_journal_stop(handle_t *);
+extern int	 jbd2_journal_freeze(journal_t *);
+extern void	 jbd2_journal_thaw(journal_t *);
 extern int	 jbd2_journal_flush (journal_t *);
 extern void	 jbd2_journal_lock_updates (journal_t *);
 extern void	 jbd2_journal_unlock_updates (journal_t *);
-- 
1.7.1



* Re: [PATCH] Adding support to freeze and unfreeze a journal
  2011-05-10 15:07                                                                   ` [PATCH] " Surbhi Palande
@ 2011-05-10 21:07                                                                     ` Andreas Dilger
  2011-05-11  7:46                                                                       ` Surbhi Palande
  0 siblings, 1 reply; 121+ messages in thread
From: Andreas Dilger @ 2011-05-10 21:07 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: jack, marco.stornelli, toshi.okajima, tytso, m.mizuma, sandeen,
	linux-ext4, linux-fsdevel

Mostly minor cleanups to the commit message and comments.

On May 10, 2011, at 09:07, Surbhi Palande wrote:
> The journal should be frozen when a F.S freezes.

s/F.S/filesystem/g

> What this means is that till

s/till/until/
> the F.S is thawed again, no new transactions should be accepted by the
> journal. When the F.S thaws, inturn it should thaw the journal and this should
> allow the journal to resume accepting new transactions.
> While the F.S has frozen the journal, the clients of journal on calling
> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
> up the sleeping clients and journalling can progress normally.
> 
> An example of the race condition that can happen without this patch is as
> follows:
> 
> Say the F.S is thawed when we begin. Let tx be the time at unit x
> 
> P1: Process doing an aio write
> t1) ext4_file_write()
> t2) generic_file_aio_write()
>   t3) __generic_file_aio_write()
>     // F.S is not frozen, so we do not block in the next check.
>     t4) vfs_check_frozen()
>     t5) generic_write_checks()
> ----------------- Prempted------------------
> 
> P2: Process that does fs freeze
> 
> t6) freeze_super()
> t7) sync_filesystem()
> t8) sync_blockdev()
> t9) sb->s_op->freeze_fs() (= ext4_freeze)
>   t10) jbd2_journal_lock_updates()
>   t11) jbd2_journal_flush()
>   // Need to unlock the journal before returning to user space.
>   t12) jbd2_journal_unlock_updates()
>   // Journal is unlocked and so we can start accepting new transactions now.
> 
> // freezing process completes execution. Page cache is now clean and should
> // remain clean till the F.S is frozen.
> --------------------------------------------
> 
> P1: writing process gets the control back
> t13) generic_file_buffered_write()
> t14) generic_perform_write()
>   t15) a_ops->write_begin() (= ext4_write_begin)
>     t16) ext4_journal_start()
> 	// New handle is started. We do not block here! Write continues
> 	// dirtying the page cache while the F.S is frozen!
> 
> Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
> Acked-by: Jan Kara <jack@suse.cz>
> ---
> fs/ext4/super.c       |   20 ++++++--------------
> fs/jbd2/journal.c     |    1 +
> fs/jbd2/transaction.c |   42 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/jbd2.h  |   10 ++++++++++
> 4 files changed, 59 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8553dfb..796aa4c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
> 
> 	journal = EXT4_SB(sb)->s_journal;
> 
> -	/* Now we set up the journal barrier. */
> -	jbd2_journal_lock_updates(journal);
> -
> +	error = jbd2_journal_freeze(journal);
> 	/*
> -	 * Don't clear the needs_recovery flag if we failed to flush
> +	 * Don't clear the needs_recovery flag if we failed to freeze
> 	 * the journal.
> 	 */
> -	error = jbd2_journal_flush(journal);
> -	if (error < 0)
> -		goto out;
> -
> -	/* Journal blocked and flushed, clear needs_recovery flag. */
> -	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> -	error = ext4_commit_super(sb, 1);
> -out:
> -	/* we rely on s_frozen to stop further updates */
> -	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> +	if (error >= 0) {
> +		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> +		error = ext4_commit_super(sb, 1);
> +	}
> 	return error;
> }
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0ec3db..5e46333 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
> 	init_waitqueue_head(&journal->j_wait_checkpoint);
> 	init_waitqueue_head(&journal->j_wait_commit);
> 	init_waitqueue_head(&journal->j_wait_updates);
> +	init_waitqueue_head(&journal->j_wait_frozen);
> 	mutex_init(&journal->j_barrier);
> 	mutex_init(&journal->j_checkpoint_mutex);
> 	spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 05fa77a..b040293 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -171,6 +171,17 @@ repeat:
> 				journal->j_barrier_count == 0);
> 		goto repeat;
> 	}
> +	/* dont let a new handle start when a journal is frozen.

s/dont/Don't/ or s/dont/Do not/

> +	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> +	 * the jflags indicate that the journal is frozen. So if the

s/jflags/j_flags/

> +	 * j_barrier_count is 0, then check if this was made 0 by the freezing
> +	 * process
> +	 */
> +	if (journal->j_flags & JBD2_FROZEN) {
> +		read_unlock(&journal->j_state_lock);
> +		jbd2_check_frozen(journal);
> +		goto repeat;
> +	}
> 
> 	if (!journal->j_running_transaction) {
> 		read_unlock(&journal->j_state_lock);
> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
> }
> EXPORT_SYMBOL(jbd2_journal_restart);
> 
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> +	int error = 0;
> +	/* Now we set up the journal barrier. */
> +	jbd2_journal_lock_updates(journal);
> +
> +	/*
> +	 * Don't clear the needs_recovery flag if we failed to flush
> +	 * the journal.
> +	 */
> +	error = jbd2_journal_flush(journal);
> +	if (error >= 0) {
> +		write_lock(&journal->j_state_lock);
> +		journal->j_flags |= JBD2_FROZEN;
> +		write_unlock(&journal->j_state_lock);
> +	}
> +	jbd2_journal_unlock_updates(journal);
> +	return error;
> +}
> +EXPORT_SYMBOL(jbd2_journal_freeze);
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> +	write_lock(&journal->j_state_lock);
> +	journal->j_flags &= ~JBD2_FROZEN;
> +	write_unlock(&journal->j_state_lock);
> +	wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);
> +
> +
> /**
> * void jbd2_journal_lock_updates () - establish a transaction barrier.
> * @journal:  Journal to establish a barrier on.
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index a32dcae..c7885b2 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
> * @j_wait_checkpoint:  Wait queue to trigger checkpointing
> * @j_wait_commit: Wait queue to trigger commit
> * @j_wait_updates: Wait queue to wait for updates to complete
> + * @j_wait_frozen: Wait queue to wait for journal to thaw
> * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
> * @j_head: Journal head - identifies the first unused block in the journal
> * @j_tail: Journal tail - identifies the oldest still-used block in the
> @@ -835,6 +836,9 @@ struct journal_s
> 	/* Wait queue to wait for updates to complete */
> 	wait_queue_head_t	j_wait_updates;
> 
> +	/* Wait queue to wait for journal to thaw*/
> +	wait_queue_head_t	j_wait_frozen;
> +
> 	/* Semaphore for locking against concurrent checkpoints */
> 	struct mutex		j_checkpoint_mutex;
> 
> @@ -1013,7 +1017,11 @@ struct journal_s
> #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
> 						 * data write error in ordered
> 						 * mode */
> +#define JBD2_FROZEN	0x080   /* Journal thread is frozen as the filesystem is frozen */

Need to wrap this to 80 columns:

+#define JBD2_FROZEN	0x080   /* Journal thread frozen along with filesystem */

> +#define jbd2_check_frozen(journal)	\
> +		wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN))

Having this macro, which is now only used in one place, isn't really clarifying the code because the name "check_frozen" doesn't really imply "wait until the journal is unfrozen".  It would be better to just put the wait_event() inline at the one callsite and remove this macro entirely.
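
A minimal userspace sketch of the "wait until the journal is unfrozen"
semantics, using POSIX threads instead of kernel wait queues. Every name in
it (freeze_gate, gate_start_handle() and so on) is hypothetical and not part
of jbd2. The wait loop sleeps while the frozen flag is set; the kernel's
wait_event() likewise sleeps until its condition becomes true, so an inlined
wait would presumably test for the flag being clear.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the journal's frozen state. */
struct freeze_gate {
	pthread_mutex_t lock;
	pthread_cond_t  thawed;		/* plays the role of j_wait_frozen */
	bool            frozen;		/* plays the role of JBD2_FROZEN */
	int             handles_started;
};

void gate_init(struct freeze_gate *g)
{
	int rc = pthread_mutex_init(&g->lock, NULL);

	assert(rc == 0);
	rc = pthread_cond_init(&g->thawed, NULL);
	assert(rc == 0);
	(void)rc;
	g->frozen = false;
	g->handles_started = 0;
}

/* Analogue of setting the frozen flag under the state lock. */
void gate_freeze(struct freeze_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->frozen = true;
	pthread_mutex_unlock(&g->lock);
}

/* Analogue of clearing the flag and waking every sleeper. */
void gate_thaw(struct freeze_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->frozen = false;
	pthread_mutex_unlock(&g->lock);
	pthread_cond_broadcast(&g->thawed);
}

/* Analogue of starting a handle: sleep while frozen, proceed once thawed. */
void gate_start_handle(struct freeze_gate *g)
{
	pthread_mutex_lock(&g->lock);
	while (g->frozen)
		pthread_cond_wait(&g->thawed, &g->lock);
	g->handles_started++;
	pthread_mutex_unlock(&g->lock);
}

int gate_handles_started(struct freeze_gate *g)
{
	int n;

	pthread_mutex_lock(&g->lock);
	n = g->handles_started;
	pthread_mutex_unlock(&g->lock);
	return n;
}

/* Thread entry point for a writer that tries to start one handle. */
void *writer_thread(void *arg)
{
	gate_start_handle(arg);
	return NULL;
}
```

A caller would freeze the gate, start a writer thread, and see no handle
start until gate_thaw() runs.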

> /*
> * Function declarations for the journaling transaction and buffer
> * management
> @@ -1121,6 +1129,8 @@ extern void	 jbd2_journal_invalidatepage(journal_t *,
> 				struct page *, unsigned long);
> extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
> extern int	 jbd2_journal_stop(handle_t *);
> +extern int	 jbd2_journal_freeze(journal_t *);
> +extern void	 jbd2_journal_thaw(journal_t *);
> extern int	 jbd2_journal_flush (journal_t *);
> extern void	 jbd2_journal_lock_updates (journal_t *);
> extern void	 jbd2_journal_unlock_updates (journal_t *);

Once these minor changes have been made you can add:

Reviewed-by: Andreas Dilger <adilger.kernel@dilger.ca>


Cheers, Andreas







* Re: [PATCH v3] Adding support to freeze and unfreeze a journal
  2011-05-09 15:23                                                                 ` [PATCH v3] " Eric Sandeen
@ 2011-05-11  7:06                                                                   ` Surbhi Palande
  2011-05-11  7:10                                                                     ` [PATCH] Attempt to sync the fsstress writes to a frozen F.S Surbhi Palande
  2011-05-11  9:05                                                                     ` [PATCH v3] Adding support to freeze and unfreeze a journal Andreas Dilger
  0 siblings, 2 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-11  7:06 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: jack, marco.stornelli, adilger.kernel, toshi.okajima, tytso,
	m.mizuma, linux-ext4, linux-fsdevel

Hi Eric,

On 05/09/2011 06:23 PM, Eric Sandeen wrote:
> On 5/9/11 9:51 AM, Surbhi Palande wrote:
>> The journal should be frozen when a F.S freezes. What this means is that till
>> the F.S is thawed again, no new transactions should be accepted by the
>> journal. When the F.S thaws, inturn it should thaw the journal and this should
>> allow the journal to resume accepting new transactions.
>> While the F.S has frozen the journal, the clients of journal on calling
>> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
>> up the sleeping clients and journalling can progress normally.
>
> Can I ask how this was tested?

Yes! I did the following on an ext4 fs mount:
1. fsfreeze -f $MNT
2. dd if=/dev/zero of=$MNT/file count=10 bs=1024 &
3. sync
4. fsfreeze -u $MNT

If the dd blocks on the start_handle, then the page cache is clean, sync
has nothing to write, and everything works fine. Otherwise this sequence
should create a deadlock.

I have attempted to create a patch for xfstests. I shall send it out as a
reply to this email soon!

Warm Regards,
Surbhi.



>
> Ideally anything you found useful for testing should probably be integrated
> into the xfstests test suite so that we don't regresss in the future.
>
> thanks,
> -Eric
>
>> Signed-off-by: Surbhi Palande<surbhi.palande@canonical.com>
>> ---
>> Changes since the last patch:
>> * Changed to the shorter forms of expressions eg: x |= y
>> * removed the unnecessary barrier
>>
>>   fs/ext4/super.c       |   20 ++++++--------------
>>   fs/jbd2/journal.c     |    1 +
>>   fs/jbd2/transaction.c |   42 ++++++++++++++++++++++++++++++++++++++++++
>>   include/linux/jbd2.h  |   10 ++++++++++
>>   4 files changed, 59 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 8553dfb..796aa4c 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>>
>>   	journal = EXT4_SB(sb)->s_journal;
>>
>> -	/* Now we set up the journal barrier. */
>> -	jbd2_journal_lock_updates(journal);
>> -
>> +	error = jbd2_journal_freeze(journal);
>>   	/*
>> -	 * Don't clear the needs_recovery flag if we failed to flush
>> +	 * Don't clear the needs_recovery flag if we failed to freeze
>>   	 * the journal.
>>   	 */
>> -	error = jbd2_journal_flush(journal);
>> -	if (error<  0)
>> -		goto out;
>> -
>> -	/* Journal blocked and flushed, clear needs_recovery flag. */
>> -	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>> -	error = ext4_commit_super(sb, 1);
>> -out:
>> -	/* we rely on s_frozen to stop further updates */
>> -	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>> +	if (error>= 0) {
>> +		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>> +		error = ext4_commit_super(sb, 1);
>> +	}
>>   	return error;
>>   }
>>
>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>> index e0ec3db..5e46333 100644
>> --- a/fs/jbd2/journal.c
>> +++ b/fs/jbd2/journal.c
>> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>>   	init_waitqueue_head(&journal->j_wait_checkpoint);
>>   	init_waitqueue_head(&journal->j_wait_commit);
>>   	init_waitqueue_head(&journal->j_wait_updates);
>> +	init_waitqueue_head(&journal->j_wait_frozen);
>>   	mutex_init(&journal->j_barrier);
>>   	mutex_init(&journal->j_checkpoint_mutex);
>>   	spin_lock_init(&journal->j_revoke_lock);
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index 05fa77a..b040293 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -171,6 +171,17 @@ repeat:
>>   				journal->j_barrier_count == 0);
>>   		goto repeat;
>>   	}
>> +	/* dont let a new handle start when a journal is frozen.
>> +	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
>> +	 * the jflags indicate that the journal is frozen. So if the
>> +	 * j_barrier_count is 0, then check if this was made 0 by the freezing
>> +	 * process
>> +	 */
>> +	if (journal->j_flags&  JBD2_FROZEN) {
>> +		read_unlock(&journal->j_state_lock);
>> +		jbd2_check_frozen(journal);
>> +		goto repeat;
>> +	}
>>
>>   	if (!journal->j_running_transaction) {
>>   		read_unlock(&journal->j_state_lock);
>> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
>>   }
>>   EXPORT_SYMBOL(jbd2_journal_restart);
>>
>> +int jbd2_journal_freeze(journal_t *journal)
>> +{
>> +	int error = 0;
>> +	/* Now we set up the journal barrier. */
>> +	jbd2_journal_lock_updates(journal);
>> +
>> +	/*
>> +	 * Don't clear the needs_recovery flag if we failed to flush
>> +	 * the journal.
>> +	 */
>> +	error = jbd2_journal_flush(journal);
>> +	if (error>= 0) {
>> +		write_lock(&journal->j_state_lock);
>> +		journal->j_flags |= JBD2_FROZEN;
>> +		write_unlock(&journal->j_state_lock);
>> +	}
>> +	jbd2_journal_unlock_updates(journal);
>> +	return error;
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_freeze);
>> +
>> +void jbd2_journal_thaw(journal_t * journal)
>> +{
>> +	write_lock(&journal->j_state_lock);
>> +	journal->j_flags&= ~JBD2_FROZEN;
>> +	write_unlock(&journal->j_state_lock);
>> +	wake_up(&journal->j_wait_frozen);
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>> +
>> +
>>   /**
>>    * void jbd2_journal_lock_updates () - establish a transaction barrier.
>>    * @journal:  Journal to establish a barrier on.
>> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
>> index a32dcae..c7885b2 100644
>> --- a/include/linux/jbd2.h
>> +++ b/include/linux/jbd2.h
>> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>>    * @j_wait_checkpoint:  Wait queue to trigger checkpointing
>>    * @j_wait_commit: Wait queue to trigger commit
>>    * @j_wait_updates: Wait queue to wait for updates to complete
>> + * @j_wait_frozen: Wait queue to wait for journal to thaw
>>    * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
>>    * @j_head: Journal head - identifies the first unused block in the journal
>>    * @j_tail: Journal tail - identifies the oldest still-used block in the
>> @@ -835,6 +836,9 @@ struct journal_s
>>   	/* Wait queue to wait for updates to complete */
>>   	wait_queue_head_t	j_wait_updates;
>>
>> +	/* Wait queue to wait for journal to thaw*/
>> +	wait_queue_head_t	j_wait_frozen;
>> +
>>   	/* Semaphore for locking against concurrent checkpoints */
>>   	struct mutex		j_checkpoint_mutex;
>>
>> @@ -1013,7 +1017,11 @@ struct journal_s
>>   #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
>>   						 * data write error in ordered
>>   						 * mode */
>> +#define JBD2_FROZEN	0x080   /* Journal thread is frozen as the filesystem is frozen */
>> +
>>
>> +#define jbd2_check_frozen(journal)	\
>> +		wait_event(journal->j_wait_frozen, (journal->j_flags&  JBD2_FROZEN))
>>   /*
>>    * Function declarations for the journaling transaction and buffer
>>    * management
>> @@ -1121,6 +1129,8 @@ extern void	 jbd2_journal_invalidatepage(journal_t *,
>>   				struct page *, unsigned long);
>>   extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
>>   extern int	 jbd2_journal_stop(handle_t *);
>> +extern int	 jbd2_journal_freeze(journal_t *);
>> +extern void	 jbd2_journal_thaw(journal_t *);
>>   extern int	 jbd2_journal_flush (journal_t *);
>>   extern void	 jbd2_journal_lock_updates (journal_t *);
>>   extern void	 jbd2_journal_unlock_updates (journal_t *);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* [PATCH] Attempt to sync the fsstress writes to a frozen F.S
  2011-05-11  7:06                                                                   ` Surbhi Palande
@ 2011-05-11  7:10                                                                     ` Surbhi Palande
  2011-05-12 14:22                                                                         ` Eric Sandeen
  2011-05-24 21:42                                                                       ` Ted Ts'o
  2011-05-11  9:05                                                                     ` [PATCH v3] Adding support to freeze and unfreeze a journal Andreas Dilger
  1 sibling, 2 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-11  7:10 UTC (permalink / raw)
  To: sandeen
  Cc: jack, marco.stornelli, adilger.kernel, toshi.okajima, tytso,
	m.mizuma, linux-ext4, linux-fsdevel

If an fsfreeze happens while the fsstress background writes are busy
dirtying the page cache, the background writes should stall. A sync should
then not have any data to sync to the filesystem. If it does have data to
sync, sync will cause a deadlock by holding the s_umount semaphore and
waiting in the wait queue for the filesystem to thaw, whereas the filesystem
can never thaw without getting the s_umount write semaphore.

Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
---
 068 |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/068 b/068
index 82c1a4e..b9ac58d 100755
--- a/068
+++ b/068
@@ -101,6 +101,11 @@ do
 	    tee -a $seq.full
 	sleep 2
 
+	# there should be nothing to sync at this point. This may hang in case
+	# of fsstress background writes dirtying the page cache while the F.S is frozen
+	sync &
+	sleep 2
+
 	echo "*** thawing  \$SCRATCH_MNT" | tee -a $seq.full
 	xfs_freeze -u "$SCRATCH_MNT" | tee -a $seq.full
 	[ $? != 0 ] && echo xfs_freeze -u "$SCRATCH_MNT" failed | \
-- 
1.7.1



* [PATCH] Adding support to freeze and unfreeze a journal
  2011-05-10 21:07                                                                     ` Andreas Dilger
@ 2011-05-11  7:46                                                                       ` Surbhi Palande
  0 siblings, 0 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-11  7:46 UTC (permalink / raw)
  To: adilger.kernel
  Cc: jack, marco.stornelli, toshi.okajima, tytso, m.mizuma, sandeen,
	linux-ext4, linux-fsdevel

The journal should be frozen when a filesystem freezes. What this means is
that until the filesystem is thawed again, no new transactions should be
accepted by the journal. When the filesystem thaws, it should in turn thaw
the journal, which allows the journal to resume accepting new transactions.
While the filesystem has the journal frozen, clients of the journal that
call jbd2_journal_start() will sleep on a wait queue. Thawing the journal
wakes up the sleeping clients and journalling can progress normally.

An example of the race condition that can happen without this patch is as
follows:

Say the filesystem is thawed when we begin. Let tx be the time at unit x

P1: Process doing an aio write
t1) ext4_file_write()
  t2) generic_file_aio_write()
    t3) __generic_file_aio_write()
      // filesystem is not frozen, so we do not block in the next check.
      t4) vfs_check_frozen()
      t5) generic_write_checks()
----------------- Preempted ------------------

P2: Process that does filesystem freeze

t6) freeze_super()
  t7) sync_filesystem()
  t8) sync_blockdev()
  t9) sb->s_op->freeze_fs() (= ext4_freeze)
    t10) jbd2_journal_lock_updates()
    t11) jbd2_journal_flush()
    // Need to unlock the journal before returning to user space.
    t12) jbd2_journal_unlock_updates()
    // Journal is unlocked and so we can start accepting new transactions now.

// freezing process completes execution. Page cache is now clean and should
// remain clean for as long as the filesystem is frozen.
--------------------------------------------

P1: writing process gets the control back
t13) generic_file_buffered_write()
  t14) generic_perform_write()
    t15) a_ops->write_begin() (= ext4_write_begin)
      t16) ext4_journal_start()
	// New handle is started. We do not block here! Write continues
	// dirtying the page cache while the filesystem is frozen!
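
The interleaving above can be replayed in a small userspace model. This is
only a sketch, not kernel code: struct fs_model and its helpers are
hypothetical stand-ins for sb->s_frozen, the JBD2_FROZEN flag and the checks
at t4 and t16, and the only assumption is that a frozen journal refuses to
hand out new handles.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state for the race; hypothetical names, not kernel structures. */
struct fs_model {
	bool sb_frozen;            /* stands in for sb->s_frozen */
	bool journal_frozen;       /* stands in for JBD2_FROZEN in j_flags */
	bool dirtied_while_frozen; /* set if a write slips through */
};

/* t4: the VFS-level check; true means the writer may proceed. */
static bool vfs_check_passes(const struct fs_model *fs)
{
	return !fs->sb_frozen;
}

/* t6..t12: freeze. with_patch selects whether the journal stays frozen
 * after the updates barrier is dropped again. */
static void freeze(struct fs_model *fs, bool with_patch)
{
	fs->sb_frozen = true;
	fs->journal_frozen = with_patch;
}

/* t16: true if a handle is granted, false if the caller would sleep. */
static bool journal_start(const struct fs_model *fs)
{
	return !fs->journal_frozen;
}

/* Replays P1/P2 in the order given above. Returns true if the page
 * cache ends up dirtied while the filesystem is frozen. */
static bool replay_race(bool with_patch)
{
	struct fs_model fs = { false, false, false };

	/* P1, t4: nothing is frozen yet, so the check passes. */
	assert(vfs_check_passes(&fs));

	/* P1 is preempted; P2 runs the whole freeze (t6..t12). */
	freeze(&fs, with_patch);

	/* P1 resumes at t13 and reaches the handle start at t16. */
	if (journal_start(&fs))
		fs.dirtied_while_frozen = fs.sb_frozen;

	return fs.dirtied_while_frozen;
}
```

In this model replay_race(false) reports a write that dirties the page cache
after the freeze completed, while replay_race(true) stops the writer at the
handle start.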

Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger.kernel@dilger.ca>
---
 fs/ext4/super.c       |   20 ++++++--------------
 fs/jbd2/journal.c     |    1 +
 fs/jbd2/transaction.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/jbd2.h  |    7 +++++++
 4 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..796aa4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
 
 	journal = EXT4_SB(sb)->s_journal;
 
-	/* Now we set up the journal barrier. */
-	jbd2_journal_lock_updates(journal);
-
+	error = jbd2_journal_freeze(journal);
 	/*
-	 * Don't clear the needs_recovery flag if we failed to flush
+	 * Don't clear the needs_recovery flag if we failed to freeze
 	 * the journal.
 	 */
-	error = jbd2_journal_flush(journal);
-	if (error < 0)
-		goto out;
-
-	/* Journal blocked and flushed, clear needs_recovery flag. */
-	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
-	error = ext4_commit_super(sb, 1);
-out:
-	/* we rely on s_frozen to stop further updates */
-	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+	if (error >= 0) {
+		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+		error = ext4_commit_super(sb, 1);
+	}
 	return error;
 }
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..5e46333 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
 	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_frozen);
 	mutex_init(&journal->j_barrier);
 	mutex_init(&journal->j_checkpoint_mutex);
 	spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..b111642 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -171,6 +171,17 @@ repeat:
 				journal->j_barrier_count == 0);
 		goto repeat;
 	}
+	/* Don't let a new handle start when a journal is frozen.
+	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+	 * the j_flags indicate that the journal is frozen. So if the
+	 * j_barrier_count is 0, then check if this was made 0 by the freezing
+	 * process
+	 */
+	if (journal->j_flags & JBD2_FROZEN) {
+		read_unlock(&journal->j_state_lock);
+		wait_event(journal->j_wait_frozen, !(journal->j_flags & JBD2_FROZEN));
+		goto repeat;
+	}
 
 	if (!journal->j_running_transaction) {
 		read_unlock(&journal->j_state_lock);
@@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
 }
 EXPORT_SYMBOL(jbd2_journal_restart);
 
+int jbd2_journal_freeze(journal_t *journal)
+{
+	int error = 0;
+	/* Now we set up the journal barrier. */
+	jbd2_journal_lock_updates(journal);
+
+	/*
+	 * Don't clear the needs_recovery flag if we failed to flush
+	 * the journal.
+	 */
+	error = jbd2_journal_flush(journal);
+	if (error >= 0) {
+		write_lock(&journal->j_state_lock);
+		journal->j_flags |= JBD2_FROZEN;
+		write_unlock(&journal->j_state_lock);
+	}
+	jbd2_journal_unlock_updates(journal);
+	return error;
+}
+EXPORT_SYMBOL(jbd2_journal_freeze);
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+	write_lock(&journal->j_state_lock);
+	journal->j_flags &= ~JBD2_FROZEN;
+	write_unlock(&journal->j_state_lock);
+	wake_up(&journal->j_wait_frozen);
+}
+EXPORT_SYMBOL(jbd2_journal_thaw);
+
+
 /**
  * void jbd2_journal_lock_updates () - establish a transaction barrier.
  * @journal:  Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..22b76de 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
  * @j_wait_checkpoint:  Wait queue to trigger checkpointing
  * @j_wait_commit: Wait queue to trigger commit
  * @j_wait_updates: Wait queue to wait for updates to complete
+ * @j_wait_frozen: Wait queue to wait for journal to thaw
  * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
  * @j_head: Journal head - identifies the first unused block in the journal
  * @j_tail: Journal tail - identifies the oldest still-used block in the
@@ -835,6 +836,9 @@ struct journal_s
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for journal to thaw*/
+	wait_queue_head_t	j_wait_frozen;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct mutex		j_checkpoint_mutex;
 
@@ -1013,6 +1017,7 @@ struct journal_s
 #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
 						 * data write error in ordered
 						 * mode */
+#define JBD2_FROZEN	0x080 /* Journal thread frozen along with filesystem */
 
 /*
  * Function declarations for the journaling transaction and buffer
@@ -1121,6 +1126,8 @@ extern void	 jbd2_journal_invalidatepage(journal_t *,
 				struct page *, unsigned long);
 extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
 extern int	 jbd2_journal_stop(handle_t *);
+extern int	 jbd2_journal_freeze(journal_t *);
+extern void	 jbd2_journal_thaw(journal_t *);
 extern int	 jbd2_journal_flush (journal_t *);
 extern void	 jbd2_journal_lock_updates (journal_t *);
 extern void	 jbd2_journal_unlock_updates (journal_t *);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH v3] Adding support to freeze and unfreeze a journal
  2011-05-11  7:06                                                                   ` Surbhi Palande
  2011-05-11  7:10                                                                     ` [PATCH] Attempt to sync the fsstress writes to a frozen F.S Surbhi Palande
@ 2011-05-11  9:05                                                                     ` Andreas Dilger
  2011-05-12  9:40                                                                       ` Surbhi Palande
  1 sibling, 1 reply; 121+ messages in thread
From: Andreas Dilger @ 2011-05-11  9:05 UTC (permalink / raw)
  To: surbhi.palande
  Cc: Eric Sandeen, jack, marco.stornelli, adilger.kernel,
	toshi.okajima, tytso, m.mizuma, linux-ext4, linux-fsdevel

On 2011-05-11, at 1:06 AM, Surbhi Palande <surbhi.palande@canonical.com> wrote:

> On 05/09/2011 06:23 PM, Eric Sandeen wrote:
>> On 5/9/11 9:51 AM, Surbhi Palande wrote:
>>> The journal should be frozen when a F.S freezes. What this means is that till
>>> the F.S is thawed again, no new transactions should be accepted by the
>>> journal. When the F.S thaws, inturn it should thaw the journal and this should
>>> allow the journal to resume accepting new transactions.
>>> While the F.S has frozen the journal, the clients of journal on calling
>>> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
>>> up the sleeping clients and journalling can progress normally.
>> 
>> Can I ask how this was tested?
> 
> Yes! I did the following on an ext4 fs mount:
> 1. fsfreeze -f $MNT
> 2. dd if=/dev/zero of=$MNT/file count=10 bs=1024 &
> 3. sync
> 4. fsfreeze -u $MNT
> 
> If the dd blocks on the start_handle, then the page cache is clean and sync should have nothing to write and everything will work fine. But otherwise this should sequence should create a deadlock.

Sorry to ask the obvious question, but presumably this test fails without your patch?  It isn't clear from your comment that this is the case. 

> I have attempted to create a patch for xfs-test. Shall send it out as a reply to this email soon!
> 
> Warm Regards,
> Surbhi.
> 
> 
> 
>> 
>> Ideally anything you found useful for testing should probably be integrated
>> into the xfstests test suite so that we don't regresss in the future.
>> 
>> thanks,
>> -Eric
>> 
>>> Signed-off-by: Surbhi Palande<surbhi.palande@canonical.com>
>>> ---
>>> Changes since the last patch:
>>> * Changed to the shorter forms of expressions eg: x |= y
>>> * removed the unnecessary barrier
>>> 
>>>  fs/ext4/super.c       |   20 ++++++--------------
>>>  fs/jbd2/journal.c     |    1 +
>>>  fs/jbd2/transaction.c |   42 ++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/jbd2.h  |   10 ++++++++++
>>>  4 files changed, 59 insertions(+), 14 deletions(-)
>>> 
>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>> index 8553dfb..796aa4c 100644
>>> --- a/fs/ext4/super.c
>>> +++ b/fs/ext4/super.c
>>> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>>> 
>>>      journal = EXT4_SB(sb)->s_journal;
>>> 
>>> -    /* Now we set up the journal barrier. */
>>> -    jbd2_journal_lock_updates(journal);
>>> -
>>> +    error = jbd2_journal_freeze(journal);
>>>      /*
>>> -     * Don't clear the needs_recovery flag if we failed to flush
>>> +     * Don't clear the needs_recovery flag if we failed to freeze
>>>       * the journal.
>>>       */
>>> -    error = jbd2_journal_flush(journal);
>>> -    if (error<  0)
>>> -        goto out;
>>> -
>>> -    /* Journal blocked and flushed, clear needs_recovery flag. */
>>> -    EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>>> -    error = ext4_commit_super(sb, 1);
>>> -out:
>>> -    /* we rely on s_frozen to stop further updates */
>>> -    jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>>> +    if (error>= 0) {
>>> +        EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>>> +        error = ext4_commit_super(sb, 1);
>>> +    }
>>>      return error;
>>>  }
>>> 
>>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>>> index e0ec3db..5e46333 100644
>>> --- a/fs/jbd2/journal.c
>>> +++ b/fs/jbd2/journal.c
>>> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>>>      init_waitqueue_head(&journal->j_wait_checkpoint);
>>>      init_waitqueue_head(&journal->j_wait_commit);
>>>      init_waitqueue_head(&journal->j_wait_updates);
>>> +    init_waitqueue_head(&journal->j_wait_frozen);
>>>      mutex_init(&journal->j_barrier);
>>>      mutex_init(&journal->j_checkpoint_mutex);
>>>      spin_lock_init(&journal->j_revoke_lock);
>>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>>> index 05fa77a..b040293 100644
>>> --- a/fs/jbd2/transaction.c
>>> +++ b/fs/jbd2/transaction.c
>>> @@ -171,6 +171,17 @@ repeat:
>>>                  journal->j_barrier_count == 0);
>>>          goto repeat;
>>>      }
>>> +    /* dont let a new handle start when a journal is frozen.
>>> +     * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
>>> +     * the jflags indicate that the journal is frozen. So if the
>>> +     * j_barrier_count is 0, then check if this was made 0 by the freezing
>>> +     * process
>>> +     */
>>> +    if (journal->j_flags&  JBD2_FROZEN) {
>>> +        read_unlock(&journal->j_state_lock);
>>> +        jbd2_check_frozen(journal);
>>> +        goto repeat;
>>> +    }
>>> 
>>>      if (!journal->j_running_transaction) {
>>>          read_unlock(&journal->j_state_lock);
>>> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
>>>  }
>>>  EXPORT_SYMBOL(jbd2_journal_restart);
>>> 
>>> +int jbd2_journal_freeze(journal_t *journal)
>>> +{
>>> +    int error = 0;
>>> +    /* Now we set up the journal barrier. */
>>> +    jbd2_journal_lock_updates(journal);
>>> +
>>> +    /*
>>> +     * Don't clear the needs_recovery flag if we failed to flush
>>> +     * the journal.
>>> +     */
>>> +    error = jbd2_journal_flush(journal);
>>> +    if (error>= 0) {
>>> +        write_lock(&journal->j_state_lock);
>>> +        journal->j_flags |= JBD2_FROZEN;
>>> +        write_unlock(&journal->j_state_lock);
>>> +    }
>>> +    jbd2_journal_unlock_updates(journal);
>>> +    return error;
>>> +}
>>> +EXPORT_SYMBOL(jbd2_journal_freeze);
>>> +
>>> +void jbd2_journal_thaw(journal_t * journal)
>>> +{
>>> +    write_lock(&journal->j_state_lock);
>>> +    journal->j_flags&= ~JBD2_FROZEN;
>>> +    write_unlock(&journal->j_state_lock);
>>> +    wake_up(&journal->j_wait_frozen);
>>> +}
>>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>>> +
>>> +
>>>  /**
>>>   * void jbd2_journal_lock_updates () - establish a transaction barrier.
>>>   * @journal:  Journal to establish a barrier on.
>>> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
>>> index a32dcae..c7885b2 100644
>>> --- a/include/linux/jbd2.h
>>> +++ b/include/linux/jbd2.h
>>> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>>>   * @j_wait_checkpoint:  Wait queue to trigger checkpointing
>>>   * @j_wait_commit: Wait queue to trigger commit
>>>   * @j_wait_updates: Wait queue to wait for updates to complete
>>> + * @j_wait_frozen: Wait queue to wait for journal to thaw
>>>   * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
>>>   * @j_head: Journal head - identifies the first unused block in the journal
>>>   * @j_tail: Journal tail - identifies the oldest still-used block in the
>>> @@ -835,6 +836,9 @@ struct journal_s
>>>      /* Wait queue to wait for updates to complete */
>>>      wait_queue_head_t    j_wait_updates;
>>> 
>>> +    /* Wait queue to wait for journal to thaw*/
>>> +    wait_queue_head_t    j_wait_frozen;
>>> +
>>>      /* Semaphore for locking against concurrent checkpoints */
>>>      struct mutex        j_checkpoint_mutex;
>>> 
>>> @@ -1013,7 +1017,11 @@ struct journal_s
>>>  #define JBD2_ABORT_ON_SYNCDATA_ERR    0x040    /* Abort the journal on file
>>>                           * data write error in ordered
>>>                           * mode */
>>> +#define JBD2_FROZEN    0x080   /* Journal thread is frozen as the filesystem is frozen */
>>> +
>>> 
>>> +#define jbd2_check_frozen(journal)    \
>>> +        wait_event(journal->j_wait_frozen, (journal->j_flags&  JBD2_FROZEN))
>>>  /*
>>>   * Function declarations for the journaling transaction and buffer
>>>   * management
>>> @@ -1121,6 +1129,8 @@ extern void     jbd2_journal_invalidatepage(journal_t *,
>>>                  struct page *, unsigned long);
>>>  extern int     jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
>>>  extern int     jbd2_journal_stop(handle_t *);
>>> +extern int     jbd2_journal_freeze(journal_t *);
>>> +extern void     jbd2_journal_thaw(journal_t *);
>>>  extern int     jbd2_journal_flush (journal_t *);
>>>  extern void     jbd2_journal_lock_updates (journal_t *);
>>>  extern void     jbd2_journal_unlock_updates (journal_t *);
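
A note on the quoted jbd2_check_frozen() macro: wait_event() returns as soon
as its condition is true and sleeps only while it is false. The macro passes
(j_flags & JBD2_FROZEN) as the condition, so a caller that finds the journal
frozen would not sleep; it would return and loop back to "repeat". The sketch
below models just that predicate logic in userspace. All names in it are
hypothetical, and only the bit value is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FROZEN 0x080u /* same bit value as JBD2_FROZEN in the patch */

/* Model of wait_event(wq, cond): it returns at once when cond is
 * already true and sleeps otherwise. */
static bool wait_event_would_sleep(bool cond)
{
	return !cond;
}

/* Condition as written in the quoted macro. */
static bool cond_as_posted(unsigned int flags)
{
	return (flags & MODEL_FROZEN) != 0;
}

/* Condition that matches the stated intent: wait until thawed. */
static bool cond_wait_for_thaw(unsigned int flags)
{
	return (flags & MODEL_FROZEN) == 0;
}

/* Self-check that doubles as a usage example. */
static void check_predicates(void)
{
	/* Frozen journal, posted condition: no sleep, caller spins. */
	assert(!wait_event_would_sleep(cond_as_posted(MODEL_FROZEN)));
	/* Frozen journal, negated condition: caller sleeps until thaw. */
	assert(wait_event_would_sleep(cond_wait_for_thaw(MODEL_FROZEN)));
	/* Thawed journal, negated condition: caller proceeds at once. */
	assert(!wait_event_would_sleep(cond_wait_for_thaw(0)));
}
```

Under that reading, negating the condition gives the behaviour the commit
message describes.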

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v3] Adding support to freeze and unfreeze a journal
  2011-05-11  9:05                                                                     ` [PATCH v3] Adding support to freeze and unfreeze a journal Andreas Dilger
@ 2011-05-12  9:40                                                                       ` Surbhi Palande
  0 siblings, 0 replies; 121+ messages in thread
From: Surbhi Palande @ 2011-05-12  9:40 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Eric Sandeen, jack, marco.stornelli, adilger.kernel,
	toshi.okajima, tytso, m.mizuma, linux-ext4, linux-fsdevel

On 05/11/2011 12:05 PM, Andreas Dilger wrote:
> On 2011-05-11, at 1:06 AM, Surbhi Palande<surbhi.palande@canonical.com>  wrote:
>
>> On 05/09/2011 06:23 PM, Eric Sandeen wrote:
>>> On 5/9/11 9:51 AM, Surbhi Palande wrote:
>>>> The journal should be frozen when a F.S freezes. What this means is that till
>>>> the F.S is thawed again, no new transactions should be accepted by the
>>>> journal. When the F.S thaws, inturn it should thaw the journal and this should
>>>> allow the journal to resume accepting new transactions.
>>>> While the F.S has frozen the journal, the clients of journal on calling
>>>> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
>>>> up the sleeping clients and journalling can progress normally.
>>>
>>> Can I ask how this was tested?
>>
>> Yes! I did the following on an ext4 fs mount:
>> 1. fsfreeze -f $MNT
>> 2. dd if=/dev/zero of=$MNT/file count=10 bs=1024&
>> 3. sync
>> 4. fsfreeze -u $MNT
>>
>> If the dd blocks on the start_handle, then the page cache is clean and sync should have nothing to write and everything will work fine. But otherwise this should sequence should create a deadlock.
>
> Sorry to ask the obvious question, but presumably this test fails without your patch?  It isn't clear from your comment that this is the case.

Actually, since this is a race, it's very difficult to reproduce
deterministically. The deadlock is apparently seen regularly when running
iozone on multipath, when a path comes back into service.

I imagined that this could be simulated by running a lot of I/O in the
background while freezing and unfreezing in parallel (multiple times).
Unfortunately it's not easy to hit the bug; it's never deterministic. I
really just smoke-tested my patch using this method:

{
	dd i/o in a loop (10000 times) & (background process)
	touch some file  & in the same loop (background process)
	
}

{
	freeze &, sync &, unfreeze &, sleep in a loop & (100 times)
        (background processes)
}

and saw no deadlock in this case. However, I have not reproduced the
deadlock with this script without the patch.

Sorry for not being clear before :(

Warm Regards,
Surbhi.

>
>> I have attempted to create a patch for xfs-test. Shall send it out as a reply to this email soon!
>>
>> Warm Regards,
>> Surbhi.
>>
>>
>>
>>>
>>> Ideally anything you found useful for testing should probably be integrated
>>> into the xfstests test suite so that we don't regresss in the future.
>>>
>>> thanks,
>>> -Eric
>>>
>>>> Signed-off-by: Surbhi Palande<surbhi.palande@canonical.com>
>>>> ---
>>>> Changes since the last patch:
>>>> * Changed to the shorter forms of expressions eg: x |= y
>>>> * removed the unnecessary barrier
>>>>
>>>>   fs/ext4/super.c       |   20 ++++++--------------
>>>>   fs/jbd2/journal.c     |    1 +
>>>>   fs/jbd2/transaction.c |   42 ++++++++++++++++++++++++++++++++++++++++++
>>>>   include/linux/jbd2.h  |   10 ++++++++++
>>>>   4 files changed, 59 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>> index 8553dfb..796aa4c 100644
>>>> --- a/fs/ext4/super.c
>>>> +++ b/fs/ext4/super.c
>>>> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>>>>
>>>>       journal = EXT4_SB(sb)->s_journal;
>>>>
>>>> -    /* Now we set up the journal barrier. */
>>>> -    jbd2_journal_lock_updates(journal);
>>>> -
>>>> +    error = jbd2_journal_freeze(journal);
>>>>       /*
>>>> -     * Don't clear the needs_recovery flag if we failed to flush
>>>> +     * Don't clear the needs_recovery flag if we failed to freeze
>>>>        * the journal.
>>>>        */
>>>> -    error = jbd2_journal_flush(journal);
>>>> -    if (error<   0)
>>>> -        goto out;
>>>> -
>>>> -    /* Journal blocked and flushed, clear needs_recovery flag. */
>>>> -    EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>>>> -    error = ext4_commit_super(sb, 1);
>>>> -out:
>>>> -    /* we rely on s_frozen to stop further updates */
>>>> -    jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>>>> +    if (error>= 0) {
>>>> +        EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>>>> +        error = ext4_commit_super(sb, 1);
>>>> +    }
>>>>       return error;
>>>>   }
>>>>
>>>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>>>> index e0ec3db..5e46333 100644
>>>> --- a/fs/jbd2/journal.c
>>>> +++ b/fs/jbd2/journal.c
>>>> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>>>>       init_waitqueue_head(&journal->j_wait_checkpoint);
>>>>       init_waitqueue_head(&journal->j_wait_commit);
>>>>       init_waitqueue_head(&journal->j_wait_updates);
>>>> +    init_waitqueue_head(&journal->j_wait_frozen);
>>>>       mutex_init(&journal->j_barrier);
>>>>       mutex_init(&journal->j_checkpoint_mutex);
>>>>       spin_lock_init(&journal->j_revoke_lock);
>>>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>>>> index 05fa77a..b040293 100644
>>>> --- a/fs/jbd2/transaction.c
>>>> +++ b/fs/jbd2/transaction.c
>>>> @@ -171,6 +171,17 @@ repeat:
>>>>                   journal->j_barrier_count == 0);
>>>>           goto repeat;
>>>>       }
>>>> +    /* dont let a new handle start when a journal is frozen.
>>>> +     * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
>>>> +     * the jflags indicate that the journal is frozen. So if the
>>>> +     * j_barrier_count is 0, then check if this was made 0 by the freezing
>>>> +     * process
>>>> +     */
>>>> +    if (journal->j_flags&   JBD2_FROZEN) {
>>>> +        read_unlock(&journal->j_state_lock);
>>>> +        jbd2_check_frozen(journal);
>>>> +        goto repeat;
>>>> +    }
>>>>
>>>>       if (!journal->j_running_transaction) {
>>>>           read_unlock(&journal->j_state_lock);
>>>> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
>>>>   }
>>>>   EXPORT_SYMBOL(jbd2_journal_restart);
>>>>
>>>> +int jbd2_journal_freeze(journal_t *journal)
>>>> +{
>>>> +    int error = 0;
>>>> +    /* Now we set up the journal barrier. */
>>>> +    jbd2_journal_lock_updates(journal);
>>>> +
>>>> +    /*
>>>> +     * Don't clear the needs_recovery flag if we failed to flush
>>>> +     * the journal.
>>>> +     */
>>>> +    error = jbd2_journal_flush(journal);
>>>> +    if (error>= 0) {
>>>> +        write_lock(&journal->j_state_lock);
>>>> +        journal->j_flags |= JBD2_FROZEN;
>>>> +        write_unlock(&journal->j_state_lock);
>>>> +    }
>>>> +    jbd2_journal_unlock_updates(journal);
>>>> +    return error;
>>>> +}
>>>> +EXPORT_SYMBOL(jbd2_journal_freeze);
>>>> +
>>>> +void jbd2_journal_thaw(journal_t * journal)
>>>> +{
>>>> +    write_lock(&journal->j_state_lock);
>>>> +    journal->j_flags&= ~JBD2_FROZEN;
>>>> +    write_unlock(&journal->j_state_lock);
>>>> +    wake_up(&journal->j_wait_frozen);
>>>> +}
>>>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>>>> +
>>>> +
>>>>   /**
>>>>    * void jbd2_journal_lock_updates () - establish a transaction barrier.
>>>>    * @journal:  Journal to establish a barrier on.
>>>> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
>>>> index a32dcae..c7885b2 100644
>>>> --- a/include/linux/jbd2.h
>>>> +++ b/include/linux/jbd2.h
>>>> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>>>>    * @j_wait_checkpoint:  Wait queue to trigger checkpointing
>>>>    * @j_wait_commit: Wait queue to trigger commit
>>>>    * @j_wait_updates: Wait queue to wait for updates to complete
>>>> + * @j_wait_frozen: Wait queue to wait for journal to thaw
>>>>    * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
>>>>    * @j_head: Journal head - identifies the first unused block in the journal
>>>>    * @j_tail: Journal tail - identifies the oldest still-used block in the
>>>> @@ -835,6 +836,9 @@ struct journal_s
>>>>       /* Wait queue to wait for updates to complete */
>>>>       wait_queue_head_t    j_wait_updates;
>>>>
>>>> +    /* Wait queue to wait for journal to thaw*/
>>>> +    wait_queue_head_t    j_wait_frozen;
>>>> +
>>>>       /* Semaphore for locking against concurrent checkpoints */
>>>>       struct mutex        j_checkpoint_mutex;
>>>>
>>>> @@ -1013,7 +1017,11 @@ struct journal_s
>>>>   #define JBD2_ABORT_ON_SYNCDATA_ERR    0x040    /* Abort the journal on file
>>>>                            * data write error in ordered
>>>>                            * mode */
>>>> +#define JBD2_FROZEN    0x080   /* Journal thread is frozen as the filesystem is frozen */
>>>> +
>>>>
>>>> +#define jbd2_check_frozen(journal)    \
>>>> +        wait_event(journal->j_wait_frozen, (journal->j_flags&   JBD2_FROZEN))
>>>>   /*
>>>>    * Function declarations for the journaling transaction and buffer
>>>>    * management
>>>> @@ -1121,6 +1129,8 @@ extern void     jbd2_journal_invalidatepage(journal_t *,
>>>>                   struct page *, unsigned long);
>>>>   extern int     jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
>>>>   extern int     jbd2_journal_stop(handle_t *);
>>>> +extern int     jbd2_journal_freeze(journal_t *);
>>>> +extern void     jbd2_journal_thaw(journal_t *);
>>>>   extern int     jbd2_journal_flush (journal_t *);
>>>>   extern void     jbd2_journal_lock_updates (journal_t *);
>>>>   extern void     jbd2_journal_unlock_updates (journal_t *);


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] Attempt to sync the fsstress writes to a frozen F.S
  2011-05-11  7:10                                                                     ` [PATCH] Attempt to sync the fsstress writes to a frozen F.S Surbhi Palande
@ 2011-05-12 14:22                                                                         ` Eric Sandeen
  2011-05-24 21:42                                                                       ` Ted Ts'o
  1 sibling, 0 replies; 121+ messages in thread
From: Eric Sandeen @ 2011-05-12 14:22 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: jack, marco.stornelli, adilger.kernel, toshi.okajima, tytso,
	m.mizuma, linux-ext4, linux-fsdevel, xfs-oss

On 5/11/11 2:10 AM, Surbhi Palande wrote:
> While the fsstress background writes are busy dirtying the page cache, if a
> fsfreeze happens then the background writes should stall. A sync should then
> not have any data to sync to the FS. If it does have any data to sync then
> sync will cause a deadlock by holding the s_umount write semaphore and waiting
> in the wait queue for the FS to thaw, whereas the F.S can never thaw without
> getting the s_umount write semaphore.
> 
> Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>

Seems ok to me.  In the future, when sending xfstests patches,
if you can add "xfstests" to the subject, and cc: the xfs list,
it'd be great.

I presume that this test does fail for you without your fixes?

I'll see if anyone on the xfs list has comments and if not, I can check this in.

Thanks,
-Eric

> ---
>  068 |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/068 b/068
> index 82c1a4e..b9ac58d 100755
> --- a/068
> +++ b/068
> @@ -101,6 +101,11 @@ do
>  	    tee -a $seq.full
>  	sleep 2
>  
> +	# there should be nothing to sync at this point. This may hang in case
> +	# of fsstress background writes dirtying the page cache while the F.S is frozen
> +	sync &
> +	sleep 2
> +
>  	echo "*** thawing  \$SCRATCH_MNT" | tee -a $seq.full
>  	xfs_freeze -u "$SCRATCH_MNT" | tee -a $seq.full
>  	[ $? != 0 ] && echo xfs_freeze -u "$SCRATCH_MNT" failed | \
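
The deadlock described in the quoted commit message is a two-task wait-for
cycle: sync holds s_umount and waits for the thaw, while the thaw waits for
s_umount. A minimal sketch under that description follows; TASK_SYNC,
TASK_THAW, has_cycle() and sync_then_thaw_deadlocks() are hypothetical names
that exist neither in the kernel nor in xfstests.

```c
#include <assert.h>
#include <stdbool.h>

enum { TASK_SYNC, TASK_THAW, NTASKS };
#define NO_TASK (-1)

/* waits_on[t] is the task that t is blocked on, or NO_TASK.
 * Returns true if following the edges from some task leads back to it. */
static bool has_cycle(const int waits_on[], int n)
{
	assert(n > 0);
	for (int start = 0; start < n; start++) {
		int cur = start;
		for (int steps = 0; steps < n && cur >= 0; steps++) {
			cur = waits_on[cur];
			if (cur == start)
				return true;
		}
	}
	return false;
}

/* Models "sync &" followed by the thaw in test 068. */
static bool sync_then_thaw_deadlocks(bool dirty_data_while_frozen)
{
	int waits_on[NTASKS];

	/* sync blocks on the frozen fs only if it finds data to write. */
	waits_on[TASK_SYNC] = dirty_data_while_frozen ? TASK_THAW : NO_TASK;
	/* The thaw needs s_umount, which sync holds while it runs. */
	waits_on[TASK_THAW] = TASK_SYNC;

	return has_cycle(waits_on, NTASKS);
}
```

With a clean page cache the thaw merely waits for sync to finish, so there is
no cycle. With data dirtied during the freeze, each task waits on the other,
which is the hang the added "sync &" is meant to expose.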


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] Attempt to sync the fsstress writes to a frozen F.S
  2011-05-11  7:10                                                                     ` [PATCH] Attempt to sync the fsstress writes to a frozen F.S Surbhi Palande
  2011-05-12 14:22                                                                         ` Eric Sandeen
@ 2011-05-24 21:42                                                                       ` Ted Ts'o
  2011-05-25 12:00                                                                         ` Surbhi Palande
  1 sibling, 1 reply; 121+ messages in thread
From: Ted Ts'o @ 2011-05-24 21:42 UTC (permalink / raw)
  To: Surbhi Palande
  Cc: sandeen, jack, marco.stornelli, adilger.kernel, toshi.okajima,
	m.mizuma, linux-ext4, linux-fsdevel

On Wed, May 11, 2011 at 10:10:41AM +0300, Surbhi Palande wrote:
> While the fsstress background writes are busy dirtying the page cache, if a
> fsfreeze happens then the background writes should stall. A sync should then
> not have any data to sync to the FS. If it does have any data to sync then
> sync will cause a deadlock by holding the s_umount write semaphore and waiting
> in the wait queue for the FS to thaw, whereas the F.S can never thaw without
> getting the s_umount write semaphore.
> 
> Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>

Hi Surbhi,

Have you tried out Jan Kara's patches?

[1/3] fs: Create __block_page_mkwrite() helper passing error values back
[2/3] vfs: Block mmapped writes while the fs is frozen
[3/3] ext4: Rewrite ext4_page_mkwrite() to return locked page

Do these patches fix the problem you've been trying to fix with your
patches?  I believe they should, but I would appreciate confirmation
that with these patches, you're no longer able to reproduce the
problem you've been concerned about.

Thanks, regards,

						- Ted

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] Attempt to sync the fsstress writes to a frozen F.S
  2011-05-24 21:42                                                                       ` Ted Ts'o
@ 2011-05-25 12:00                                                                         ` Surbhi Palande
  2011-05-25 12:12                                                                           ` Theodore Tso
  0 siblings, 1 reply; 121+ messages in thread
From: Surbhi Palande @ 2011-05-25 12:00 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: sandeen, jack, marco.stornelli, adilger.kernel, toshi.okajima,
	m.mizuma, linux-ext4, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3511 bytes --]

Hi Ted,


On 05/25/2011 12:42 AM, Ted Ts'o wrote:
> On Wed, May 11, 2011 at 10:10:41AM +0300, Surbhi Palande wrote:
>> While the fsstress background writes are busy dirtying the page cache, if a
>> fsfreeze happens then the background writes should stall. A sync should then
>> not have any data to sync to the FS. If it does have any data to sync then
>> sync will cause a deadlock by holding the s_umount write semaphore and waiting
>> in the wait queue for the FS to thaw, whereas the F.S can never thaw without
>> getting the s_umount write semaphore.
>>
>> Signed-off-by: Surbhi Palande<surbhi.palande@canonical.com>
>
> Hi Surbhi,
>
> Have you tried out Jan Kara's patches?
>
> [1/3] fs: Create __block_page_mkwrite() helper passing error values back
> [2/3] vfs: Block mmapped writes while the fs is frozen
> [3/3] ext4: Rewrite ext4_page_mkwrite() to return locked page

Yes! We have tried these patches and we still see the same 
deadlock/hang. The following is the reason for it:


// Let's assume the inode is clean and so are its pages.
P1: process that tries mmap write
t1) __do_fault()
   t2) ext4_page_mkwrite()
     t3) block_page_mkwrite()
       t4) vfs_check_frozen()
// filesystem is not frozen so control falls through.
       t5) __block_page_mkwrite()
         t6) set_page_dirty()
           t7) __set_page_dirty()
	    t8) radix_tree_tag_set(PAGECACHE_TAG_DIRTY)
// page is dirtied, but the inode is still clean.
---------------------- Pre-empted-----------------
P2: freeze process

t9) freeze_super()
   t10) sync_filesystem()
  // page cache now clean! no inode is dirty.
// however we have a dirty page belonging to a clean inode.
----------------------Freeze process finishes, filesystem frozen!----


P1: process that tries mmap write gets control.
t11) __set_page_dirty() // gets control back
     t12) __mark_inode_dirty()
    // inode is now dirty and it has a dirty page,
    // though in reality no write has occurred.
t13)   if (inode->i_sb->s_frozen != SB_UNFROZEN)
     // __block_page_mkwrite() gets control back
t14) unlock_page()
t15) __block_page_mkwrite() returns -EAGAIN
t16) block_page_mkwrite() returns VM_FAULT_RETRY

---------------------------
// now we see the original deadlock reported.
P3: sync a filesystem
t17) down_read(s_umount)
  t18) sync_filesystem()
   t19) sb->s_op->sync_fs() // =ext4_sync_fs()
    t20) vfs_check_frozen() // now blocks for thaw.
// so thaw cannot happen because the sync process sleeps while holding s_umount!

This deadlock can occur whenever the freeze happens after the 
vfs_check_frozen() but before the __mark_inode_dirty().
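
The window can be illustrated with a toy model. The sketch below is plain
bash, not kernel code: the variables s_frozen and dirty_inodes and all
function names are made up for illustration, and each function stands in
for one step of the timeline above. It only shows that a freeze landing
between the frozen check and the inode dirtying leaves a dirty inode on a
frozen filesystem.

```shell
#!/bin/bash
# Toy model of the interleaving; hypothetical names, NOT kernel code.
s_frozen=0        # 0 = unfrozen, 1 = frozen
dirty_inodes=0    # number of dirty inodes on the superblock

# P1, first half (t4): the frozen check passes only while unfrozen.
p1_check_frozen() { [ "$s_frozen" -eq 0 ]; }

# P2 (t9-t10): freeze syncs whatever is dirty right now, then freezes.
p2_freeze() { dirty_inodes=0; s_frozen=1; }

# P1, second half (t12): the inode is dirtied with no further check.
p1_mark_inode_dirty() { dirty_inodes=$((dirty_inodes + 1)); }

# run_interleaving race|safe
#   race: the freeze lands inside the window (after t4, before t12)
#   safe: the freeze completes before P1 performs its check
run_interleaving() {
    s_frozen=0
    dirty_inodes=0
    if [ "$1" = race ]; then
        p1_check_frozen && { p2_freeze; p1_mark_inode_dirty; }
    else
        p2_freeze
        p1_check_frozen && p1_mark_inode_dirty
    fi
    echo "frozen=$s_frozen dirty_inodes=$dirty_inodes"
}
```

Here `run_interleaving race` prints `frozen=1 dirty_inodes=1`, while
`run_interleaving safe` prints `frozen=1 dirty_inodes=0`.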

We see blocked sync processes every time we do the following:

1) Run iozone on multipath, and
2) Run the attached script, which I modified from the one Toshiyuki sent.
It reproduces the bug faster when executed together with iozone.
(Note that since this is a race, the script _may not_ always reproduce
it on its own.)


I also found one more missing piece in the "Add support to freeze and 
unfreeze journal" patch:
1) Call jbd2_journal_thaw() from ext4_unfreeze() to restart the 
transactions.

I shall send a patch for this as a reply to this email.

Thanks!

Warm Regards,
Surbhi.

>
> Do these patches fix the problem you've been trying to fix with your
> patches?  I believe they should, but I would appreciate confirmation
> that with these patches, you're no longer able to reproduce the
> problem you've been concerned about.
>
> Thanks, regards,
>
> 						- Ted


[-- Attachment #2: test.sh --]
[-- Type: application/x-sh, Size: 2746 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] Attempt to sync the fsstress writes to a frozen F.S
  2011-05-25 12:00                                                                         ` Surbhi Palande
@ 2011-05-25 12:12                                                                           ` Theodore Tso
  2011-05-27 16:28                                                                             ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Theodore Tso @ 2011-05-25 12:12 UTC (permalink / raw)
  To: surbhi.palande
  Cc: sandeen, jack, marco.stornelli, adilger.kernel, toshi.okajima,
	m.mizuma, linux-ext4, linux-fsdevel

Hi Surbhi,

Just as a request --- could you start a new thread (this one is getting so long it's hard to follow)?

And could you also include a reliable reproduction case?

Many thanks!!

-- Ted



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] Attempt to sync the fsstress writes to a frozen F.S
  2011-05-25 12:12                                                                           ` Theodore Tso
@ 2011-05-27 16:28                                                                             ` Jan Kara
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Kara @ 2011-05-27 16:28 UTC (permalink / raw)
  To: Theodore Tso
  Cc: surbhi.palande, sandeen, jack, marco.stornelli, adilger.kernel,
	toshi.okajima, m.mizuma, linux-ext4, linux-fsdevel

  Ted,

On Wed 25-05-11 08:12:15, Ted Tso wrote:
> Just as a request --- could you start a new thread (this one is getting
> so long it's hard to follow)?
> 
> And could you also include a reliable reproduction case?
  Just a quick note - this patch series was not really meant to fix the
deadlocks. They are meant to make freezing reliable in combination with
mmapped writes. As a side-effect, they make the deadlock Surbhi describes
less probable but I'm aware it's still there.

I plan to have another look at how the deadlock could be fixed (the first
attempt was rejected by Dave Chinner) but currently I'm busy with other
stuff...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-02-07 11:53 [BUG] ext4: cannot unfreeze a filesystem due to a deadlock Masayoshi MIZUMA
  2011-02-15 16:06 ` Jan Kara
@ 2011-12-09  1:56 ` Masayoshi MIZUMA
  2011-12-15 12:41   ` Masayoshi MIZUMA
  1 sibling, 1 reply; 121+ messages in thread
From: Masayoshi MIZUMA @ 2011-12-09  1:56 UTC (permalink / raw)
  To: Jan Kara, Andreas Dilger, Theodore Ts'o
  Cc: linux-ext4, linux-fsdevel, Christoph Hellwig, Toshiyuki Okajima


(2011/02/07 20:53), Masayoshi MIZUMA wrote:

> Hi,
> 
> When I checked the freeze feature for ext4 filesystem using fsfreeze command
> at 2.6.38-rc3, I got the following messages:

Hi,

I tested the freeze function with the test program below on 3.2.0-rc4;
I got the following messages and the test program hung.
I think this bug is still present in 3.2.0-rc4...

The test program:
-----------------------------------------------------------
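#!/bin/bash
# Reproducer: run fsstress in the background while freezing and thawing
# the filesystem in a loop.
# WARNING: mkfs destroys all data on $DEV_1.

DEV_1=/dev/sda5
MNT_1=/tmp/sda5
LOOP=500

# Create the mount point if it does not exist yet.
if [[ ! -d "$MNT_1" ]]
then
        mkdir -p "$MNT_1"
fi

mkfs -t ext4 "$DEV_1"
mount "$DEV_1" "$MNT_1"

# Generate filesystem load: 100 processes, 10000 operations each.
./fsstress -d "$MNT_1/tmp" -n 10000 -p 100 > /dev/null 2>&1 &
PID=$!

# Freeze and immediately thaw; the script hangs once the deadlock triggers.
for ((i=0; i<LOOP; i++))
do
        echo "LOOP: $i"
        fsfreeze -f "$MNT_1"
        fsfreeze -u "$MNT_1"
done

kill $PID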
-----------------------------------------------------------
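
Since the test program simply blocks once the deadlock triggers, a variant
that reports the hang can be handy. The helper below is only a sketch and
not part of the test I ran: run_with_deadline is a hypothetical name, and
it merely polls a background command with "kill -0". It deliberately does
not try to kill the command, because a task stuck in uninterruptible sleep
(state D in the traces) generally does not react to signals until it wakes
up.

```shell
#!/bin/bash
# run_with_deadline SECONDS CMD [ARGS...]
# Runs CMD in the background and polls it once per second.
# Returns CMD's exit status if it finishes in time, or 124 if it is
# still running after SECONDS (the command is left running).
run_with_deadline() {
    local deadline=$1
    shift
    "$@" &
    local pid=$!
    local waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$deadline" ]; then
            return 124      # still running: report it, do not wait on it
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid"             # reap the command and propagate its status
}
```

In the loop it could be used as
`run_with_deadline 120 fsfreeze -u "$MNT_1" || echo "thaw failed or still blocked"`.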

The messages I got when I ran the test program are below.
-------------------------------------------------------------
INFO: task flush-8:0:720 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:0       D 0000000100521461     0   720      2 0x00000000
 ffff8800b4c41a40 0000000000000046 0000000000000000 0000000000000000
 0000000000013440 ffff8800b4c41fd8 ffff8800b4c40010 0000000000013440
 ffff8800b4c41fd8 0000000000013440 ffffffff81a0d020 ffff8800b464d4e0
Call Trace:
 [<ffffffff81086b4e>] ? prepare_to_wait+0x5e/0x90
 [<ffffffff814ee3ff>] schedule+0x3f/0x60
 [<ffffffffa041e485>] ext4_journal_start_sb+0x145/0x1b0 [ext4]
 [<ffffffff81086820>] ? wake_up_bit+0x40/0x40
 [<ffffffffa0401bc5>] ? ext4_meta_trans_blocks+0xb5/0xc0 [ext4]
 [<ffffffffa0406c9d>] ext4_da_writepages+0x29d/0x620 [ext4]
 [<ffffffff81227a18>] ? blk_finish_plug+0x18/0x50
 [<ffffffff81112bb1>] do_writepages+0x21/0x40
 [<ffffffff8118e380>] writeback_single_inode+0x180/0x3b0
 [<ffffffff8118e971>] writeback_sb_inodes+0x1a1/0x260
 [<ffffffff8118ec6e>] wb_writeback+0xde/0x2b0
 [<ffffffff810739c6>] ? try_to_del_timer_sync+0x86/0xe0
 [<ffffffff8118eee6>] wb_do_writeback+0xa6/0x260
 [<ffffffff81072ef0>] ? lock_timer_base+0x70/0x70
 [<ffffffff8118f14a>] bdi_writeback_thread+0xaa/0x270
 [<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
 [<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
 [<ffffffff810861a6>] kthread+0x96/0xa0
 [<ffffffff814fa5b4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81086110>] ? kthread_worker_fn+0x1a0/0x1a0
 [<ffffffff814fa5b0>] ? gs_change+0x13/0x13

INFO: task fsstress:4376 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
fsstress        D ffff88009b52dda8     0  4376   4364 0x00000080
 ffff88009b52dcb8 0000000000000082 ffffffff810d7e09 ffff88009b52dcc0
 0000000000013440 ffff88009b52dfd8 ffff88009b52c010 0000000000013440
 ffff88009b52dfd8 0000000000013440 ffff88009b4d54e0 ffff8800a1481560
Call Trace:
 [<ffffffff810d7e09>] ? trace_clock_local+0x9/0x10
 [<ffffffff814ee3ff>] schedule+0x3f/0x60
 [<ffffffff814ee89d>] schedule_timeout+0x1fd/0x2e0
 [<ffffffff810e5e43>] ? trace_nowake_buffer_unlock_commit+0x43/0x60
 [<ffffffff810127e4>] ? __switch_to+0x194/0x320
 [<ffffffff8104d623>] ? ftrace_raw_event_sched_switch+0x103/0x110
 [<ffffffff814ee26d>] wait_for_common+0x11d/0x190
 [<ffffffff8105a970>] ? try_to_wake_up+0x2b0/0x2b0
 [<ffffffff814ee3bd>] wait_for_completion+0x1d/0x20
 [<ffffffff8118daef>] writeback_inodes_sb_nr+0x7f/0xa0
 [<ffffffff8118dbdf>] writeback_inodes_sb+0x5f/0x80
 [<ffffffff811938d0>] ? __sync_filesystem+0x90/0x90
 [<ffffffff8119388e>] __sync_filesystem+0x4e/0x90
 [<ffffffff811938ef>] sync_one_sb+0x1f/0x30
 [<ffffffff811695da>] iterate_supers+0x7a/0xd0
 [<ffffffff81193934>] sys_sync+0x34/0x70
 [<ffffffff814f8442>] system_call_fastpath+0x16/0x1b
-------------------------------------------------------------
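
To pick the blocked tasks out of a long log, a small filter may help. This
is a sketch that assumes the hung-task report format shown above ("INFO:
task NAME:PID blocked for more than N seconds."); list_hung_tasks is a
hypothetical name.

```shell
#!/bin/bash
# list_hung_tasks: read kernel log text on stdin and print "<task> <pid>"
# for every hung-task report. Any prefix before "INFO:" (syslog or dmesg
# timestamps) is ignored.
list_hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than [0-9][0-9]* seconds\..*/\1 \2/p'
}
```

For the two reports above, `list_hung_tasks` prints `flush-8:0 720` and
`fsstress 4376`; it can be fed from a saved log or from `dmesg`.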

The test program for xfstests is below.
-------------------------------------------------------------
#! /bin/bash
# FSQA Test No. 277
#
# Run fsstress and  freeze/unfreeze in parallel
#
#-----------------------------------------------------------------------
# Copyright (c) 2006 Silicon Graphics, Inc.  All Rights Reserved.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
#
#-----------------------------------------------------------------------
#
# creator
owner=m.mizuma@jp.fujitsu.com

seq=`basename $0`
echo "QA output created by $seq"

here=`pwd`
tmp=/tmp/$$
status=0        # success is the default!
trap "rm -f $tmp.*; exit \$status" 0 1 2 3 15

# get standard environment, filters and checks
. ./common.rc
. ./common.filter

_workout()
{
	echo ""
	echo "Run fsstress"
	echo ""
	num_iterations=500
	out=$SCRATCH_MNT/fsstress.$$
	args="-p100 -n10000 -d $out"
	echo "fsstress $args" >> $here/$seq.full
	$FSSTRESS_PROG $args > /dev/null 2>&1 &
	pid=$!
	echo "Run xfs_freeze in parallel"
	for ((i=0; i < num_iterations; i++))
	do
		xfs_freeze -f $SCRATCH_MNT | tee -a $seq.full
		xfs_freeze -u $SCRATCH_MNT | tee -a $seq.full
	done
	kill $pid 2> /dev/null
	wait $pid
}

# real QA test starts here
_supported_fs generic
_supported_os Linux
_need_to_be_root
_require_scratch

_scratch_mkfs >> $seq.full 2>&1
_scratch_mount

if ! _workout; then
	umount $SCRATCH_DEV 2>/dev/null
	exit
fi

if ! _scratch_unmount; then
	echo "failed to umount"
	status=1
	exit
fi
_check_scratch_fs
status=$?
exit
-------------------------------------------------------------

Thanks,
Masayoshi Mizuma

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-12-09  1:56 ` Masayoshi MIZUMA
@ 2011-12-15 12:41   ` Masayoshi MIZUMA
  2013-11-29  4:58     ` Yongqiang Yang
  0 siblings, 1 reply; 121+ messages in thread
From: Masayoshi MIZUMA @ 2011-12-15 12:41 UTC (permalink / raw)
  To: Jan Kara, Andreas Dilger, Theodore Ts'o
  Cc: linux-ext4, linux-fsdevel, Christoph Hellwig, Toshiyuki Okajima


(2011/12/09 10:56), Masayoshi MIZUMA wrote:

> 
> (2011/02/07 20:53), Masayoshi MIZUMA wrote:
> 
> > Hi,
> > 
> > When I checked the freeze feature for ext4 filesystem using fsfreeze command
> > at 2.6.38-rc3, I got the following messages:
> 
> Hi,
> 
> I tested the freeze function with the test program below on 3.2.0-rc4;
> I got the following messages and the test program hung.
> I think this bug is still present in 3.2.0-rc4...

I think the problem is as follows.
When ext4_page_mkwrite() races with freeze_super(), ext4_page_mkwrite()
can add an inode to the writeback list (bdi_writeback.b_dirty) even
though sb->s_frozen is already SB_FREEZE_WRITE or SB_FREEZE_TRANS.

      process A               |     process B
------------------------------+-----------------------------------------------
ext4_page_mkwrite()           |
=> vfs_check_frozen()         |
                              | freeze_super()
                              | sb->s_frozen = SB_FREEZE_WRITE
=>__block_page_mkwrite()      | => sync_filesystem()
  :                           |    # write inodes which are in the list.
  :                           | sb->s_frozen = SB_FREEZE_TRANS
  :                           |
  =>__mark_inode_dirty        |
    # add inode to the list.  |
------------------------------+-----------------------------------------------

As a result, if the "flush" kthread writes back the inode that was
added by ext4_page_mkwrite() while thaw_super() runs concurrently, the
deadlock happens.
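
The lock dependency can be mimicked in userspace. This is only an analogy,
assuming flock(1) from util-linux is available: the temporary lock file
stands in for sb->s_umount, the background holder for the "flush" kthread
that keeps the read side while waiting for a thaw, and the non-blocking
exclusive attempt for thaw_super(). The function name is hypothetical and
none of this is kernel code.

```shell
#!/bin/bash
# Userspace analogy of the s_umount dependency (hypothetical names).
demo_thaw_blocked() {
    local lock holder
    lock=$(mktemp)
    # "flush": take the lock shared, then sleep as if waiting for a thaw.
    ( flock -s 9; exec sleep 5 ) 9>"$lock" >/dev/null 2>&1 &
    holder=$!
    sleep 1     # give the holder time to acquire the shared lock
    # "thaw": needs the lock exclusively; -n turns blocking into failure.
    if flock -n -x "$lock" true; then
        echo "thaw acquired the lock"
    else
        echo "thaw would block"
    fi
    # Clean up the holder and the lock file.
    kill "$holder" 2>/dev/null || true
    wait "$holder" 2>/dev/null || true
    rm -f "$lock"
}
```

Calling `demo_thaw_blocked` should print `thaw would block`: as long as the
reader keeps the shared lock and itself waits for the thaw, neither side
can make progress.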

Thanks,
Masayoshi Mizuma

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2011-12-15 12:41   ` Masayoshi MIZUMA
@ 2013-11-29  4:58     ` Yongqiang Yang
  2013-11-29  8:00       ` Jan Kara
  0 siblings, 1 reply; 121+ messages in thread
From: Yongqiang Yang @ 2013-11-29  4:58 UTC (permalink / raw)
  To: Masayoshi MIZUMA
  Cc: Jan Kara, Andreas Dilger, Theodore Ts'o,
	Ext4 Developers List, Linux Filesystem Mailing List,
	Christoph Hellwig, Toshiyuki Okajima

How was the bug finally fixed? I cannot find the accepted patch.


Thanks,
Yongqiang.

On Thu, Dec 15, 2011 at 8:41 PM, Masayoshi MIZUMA
<m.mizuma@jp.fujitsu.com> wrote:
>
> (2011/12/09 10:56), Masayoshi MIZUMA wrote:
>
>>
>> (2011/02/07 20:53), Masayoshi MIZUMA wrote:
>>
>> > Hi,
>> >
>> > When I checked the freeze feature for an ext4 filesystem using the fsfreeze
>> > command on 2.6.38-rc3, I got the following messages:
>>
>> Hi,
>>
>> I checked the freeze function using the test program below on 3.2.0-rc4.
>> I got the following messages and the test program hung.
>> I think this bug is still present in 3.2.0-rc4...
>
> I think the problem is as follows.
> When ext4_page_mkwrite() races with freeze_super(), ext4_page_mkwrite()
> can add an inode to the list of inodes that need writeback
> (bdi_writeback.b_dirty) even though sb->s_frozen is already
> SB_FREEZE_WRITE or SB_FREEZE_TRANS.
>
>       process A               |     process B
> ------------------------------+-----------------------------------------------
> ext4_page_mkwrite()           |
> => vfs_check_frozen()         |
>                               | freeze_super()
>                               | sb->s_frozen = SB_FREEZE_WRITE
> =>__block_page_mkwrite()      | => sync_filesystem()
>   :                           |    # write inodes which are in the list.
>   :                           | sb->s_frozen = SB_FREEZE_TRANS
>   :                           |
>   =>__mark_inode_dirty        |
>     # add inode to the list.  |
> ------------------------------+-----------------------------------------------
>
> As a result, if the "flush" kthread writes back the inode that
> ext4_page_mkwrite() added while thaw_super() runs concurrently, the
> deadlock occurs.
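
The interleaving in the table can be replayed as a toy script. This is only a sketch: the shell variables s_frozen and dirty_list stand in for sb->s_frozen and bdi_writeback.b_dirty, the shell functions are hypothetical stand-ins named after the kernel functions in the table, and nothing here touches a real filesystem. It shows that a check made before the freeze starts does not stop the inode from being dirtied after sync_filesystem() has already run.

```sh
#!/bin/sh
# Toy replay of the race; all names are illustrative stand-ins.
SB_UNFROZEN=0
SB_FREEZE_WRITE=1
SB_FREEZE_TRANS=2
s_frozen=$SB_UNFROZEN
dirty_list=""
a_passed_check=no

vfs_check_frozen() {    # process A: passes only while nothing is frozen
    [ "$s_frozen" -lt "$SB_FREEZE_WRITE" ]
}
sync_filesystem() {     # process B: writes out whatever is on the list now
    dirty_list=""
}
mark_inode_dirty() {    # process A: no second check of s_frozen here
    dirty_list="${dirty_list:+$dirty_list }$1"
}

# Steps in the order shown in the table.
if vfs_check_frozen; then a_passed_check=yes; fi   # A: not frozen yet
s_frozen=$SB_FREEZE_WRITE                          # B: freeze_super() starts
sync_filesystem                                    # B: list is empty, nothing to write
s_frozen=$SB_FREEZE_TRANS                          # B: freeze completes
mark_inode_dirty "inode-A"                         # A: dirties an inode after the sync

echo "s_frozen=$s_frozen dirty_list=$dirty_list"
# prints: s_frozen=2 dirty_list=inode-A
```

The final state is a frozen filesystem with a dirty inode still queued for writeback, which is the precondition for the hang described in this thread.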
>
> Thanks,
> Masayoshi Mizuma
>
>>
>> The test program:
>> -----------------------------------------------------------
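>> #!/bin/bash
>> # Reproducer: run fsstress in the background while freezing and
>> # unfreezing the filesystem in a loop.
>> # WARNING: mkfs destroys all data on $DEV_1.
>>
>> DEV_1=/dev/sda5
>> MNT_1=/tmp/sda5
>> LOOP=500
>>
>> if [[ ! -d "$MNT_1" ]]
>> then
>>         mkdir -p "$MNT_1"
>> fi
>>
>> mkfs -t ext4 "$DEV_1"
>> mount "$DEV_1" "$MNT_1"
>>
>> # Generate write load: 100 processes, 10000 operations each.
>> ./fsstress -d "$MNT_1/tmp" -n 10000 -p 100 > /dev/null 2>&1 &
>> PID=$!
>>
>> for ((i=0; i<LOOP; i++))
>> do
>>         echo "LOOP: $i"
>>         fsfreeze -f "$MNT_1"
>>         fsfreeze -u "$MNT_1"
>> done
>>
>> kill "$PID"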
>> -----------------------------------------------------------
>>
>> The messages I got when I ran the test program are below.
>> -------------------------------------------------------------
>> INFO: task flush-8:0:720 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> flush-8:0       D 0000000100521461     0   720      2 0x00000000
>>  ffff8800b4c41a40 0000000000000046 0000000000000000 0000000000000000
>>  0000000000013440 ffff8800b4c41fd8 ffff8800b4c40010 0000000000013440
>>  ffff8800b4c41fd8 0000000000013440 ffffffff81a0d020 ffff8800b464d4e0
>> Call Trace:
>>  [<ffffffff81086b4e>] ? prepare_to_wait+0x5e/0x90
>>  [<ffffffff814ee3ff>] schedule+0x3f/0x60
>>  [<ffffffffa041e485>] ext4_journal_start_sb+0x145/0x1b0 [ext4]
>>  [<ffffffff81086820>] ? wake_up_bit+0x40/0x40
>>  [<ffffffffa0401bc5>] ? ext4_meta_trans_blocks+0xb5/0xc0 [ext4]
>>  [<ffffffffa0406c9d>] ext4_da_writepages+0x29d/0x620 [ext4]
>>  [<ffffffff81227a18>] ? blk_finish_plug+0x18/0x50
>>  [<ffffffff81112bb1>] do_writepages+0x21/0x40
>>  [<ffffffff8118e380>] writeback_single_inode+0x180/0x3b0
>>  [<ffffffff8118e971>] writeback_sb_inodes+0x1a1/0x260
>>  [<ffffffff8118ec6e>] wb_writeback+0xde/0x2b0
>>  [<ffffffff810739c6>] ? try_to_del_timer_sync+0x86/0xe0
>>  [<ffffffff8118eee6>] wb_do_writeback+0xa6/0x260
>>  [<ffffffff81072ef0>] ? lock_timer_base+0x70/0x70
>>  [<ffffffff8118f14a>] bdi_writeback_thread+0xaa/0x270
>>  [<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
>>  [<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
>>  [<ffffffff810861a6>] kthread+0x96/0xa0
>>  [<ffffffff814fa5b4>] kernel_thread_helper+0x4/0x10
>>  [<ffffffff81086110>] ? kthread_worker_fn+0x1a0/0x1a0
>>  [<ffffffff814fa5b0>] ? gs_change+0x13/0x13
>>
>> INFO: task fsstress:4376 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> fsstress        D ffff88009b52dda8     0  4376   4364 0x00000080
>>  ffff88009b52dcb8 0000000000000082 ffffffff810d7e09 ffff88009b52dcc0
>>  0000000000013440 ffff88009b52dfd8 ffff88009b52c010 0000000000013440
>>  ffff88009b52dfd8 0000000000013440 ffff88009b4d54e0 ffff8800a1481560
>> Call Trace:
>>  [<ffffffff810d7e09>] ? trace_clock_local+0x9/0x10
>>  [<ffffffff814ee3ff>] schedule+0x3f/0x60
>>  [<ffffffff814ee89d>] schedule_timeout+0x1fd/0x2e0
>>  [<ffffffff810e5e43>] ? trace_nowake_buffer_unlock_commit+0x43/0x60
>>  [<ffffffff810127e4>] ? __switch_to+0x194/0x320
>>  [<ffffffff8104d623>] ? ftrace_raw_event_sched_switch+0x103/0x110
>>  [<ffffffff814ee26d>] wait_for_common+0x11d/0x190
>>  [<ffffffff8105a970>] ? try_to_wake_up+0x2b0/0x2b0
>>  [<ffffffff814ee3bd>] wait_for_completion+0x1d/0x20
>>  [<ffffffff8118daef>] writeback_inodes_sb_nr+0x7f/0xa0
>>  [<ffffffff8118dbdf>] writeback_inodes_sb+0x5f/0x80
>>  [<ffffffff811938d0>] ? __sync_filesystem+0x90/0x90
>>  [<ffffffff8119388e>] __sync_filesystem+0x4e/0x90
>>  [<ffffffff811938ef>] sync_one_sb+0x1f/0x30
>>  [<ffffffff811695da>] iterate_supers+0x7a/0xd0
>>  [<ffffffff81193934>] sys_sync+0x34/0x70
>>  [<ffffffff814f8442>] system_call_fastpath+0x16/0x1b
>> -------------------------------------------------------------
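
When the reproducer runs for a long time, it can help to pull just the hung-task headers out of a saved kernel log. The following is a minimal sketch that assumes the report lines look exactly like the "INFO: task ... blocked for more than ... seconds." lines quoted above; hung_tasks is a hypothetical helper name, not an existing tool.

```sh
#!/bin/sh
# Print "task-name pid" for every hung-task report read from stdin.
# Any prefix (syslog timestamp, mail quoting) before "INFO:" is ignored.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than [0-9][0-9]* seconds\..*/\1 \2/p'
}

# Example with one line taken from the report above.
printf '%s\n' 'INFO: task flush-8:0:720 blocked for more than 120 seconds.' | hung_tasks
# prints: flush-8:0 720
```

The helper reads standard input, so it can be fed from a log file or from the output of dmesg.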
>>
>> The test program for xfstests is below.
>> -------------------------------------------------------------
>> #! /bin/bash
>> # FSQA Test No. 277
>> #
>> # Run fsstress and  freeze/unfreeze in parallel
>> #
>> #-----------------------------------------------------------------------
>> # Copyright (c) 2006 Silicon Graphics, Inc.  All Rights Reserved.
>> #
>> # This program is free software; you can redistribute it and/or
>> # modify it under the terms of the GNU General Public License as
>> # published by the Free Software Foundation.
>> #
>> # This program is distributed in the hope that it would be useful,
>> # but WITHOUT ANY WARRANTY; without even the implied warranty of
>> # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> # GNU General Public License for more details.
>> #
>> # You should have received a copy of the GNU General Public License
>> # along with this program; if not, write the Free Software Foundation,
>> # Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>> #
>> #-----------------------------------------------------------------------
>> #
>> # creator
>> owner=m.mizuma@jp.fujitsu.com
>>
>> seq=`basename $0`
>> echo "QA output created by $seq"
>>
>> here=`pwd`
>> tmp=/tmp/$$
>> status=0        # success is the default!
>> trap "rm -f $tmp.*; exit \$status" 0 1 2 3 15
>>
>> # get standard environment, filters and checks
>> . ./common.rc
>> . ./common.filter
>>
>> _workout()
>> {
>>       echo ""
>>       echo "Run fsstress"
>>       echo ""
>>       num_iterations=500
>>       out=$SCRATCH_MNT/fsstress.$$
>>       args="-p100 -n10000 -d $out"
>>       echo "fsstress $args" >> $here/$seq.full
>>       $FSSTRESS_PROG $args > /dev/null 2>&1 &
>>       pid=$!
>>       echo "Run xfs_freeze in parallel"
>>       for ((i=0; i < num_iterations; i++))
>>       do
>>               xfs_freeze -f $SCRATCH_MNT | tee -a $seq.full
>>               xfs_freeze -u $SCRATCH_MNT | tee -a $seq.full
>>       done
>>       kill $pid 2> /dev/null
>>       wait $pid
>> }
>>
>> # real QA test starts here
>> _supported_fs generic
>> _supported_os Linux
>> _need_to_be_root
>> _require_scratch
>>
>> _scratch_mkfs >> $seq.full 2>&1
>> _scratch_mount
>>
>> if ! _workout; then
>>       umount $SCRATCH_DEV 2>/dev/null
>>       exit
>> fi
>>
>> if ! _scratch_unmount; then
>>       echo "failed to umount"
>>       status=1
>>       exit
>> fi
>> _check_scratch_fs
>> status=$?
>> exit
>> -------------------------------------------------------------
>>
>> Thanks,
>> Masayoshi Mizuma
>>
>> >
>> > ---------------------------------------------------------------------
>> > Feb  7 15:05:09 RX300S6 kernel: INFO: task fsfreeze:2104 blocked for more than 120 seconds.
>> > Feb  7 15:05:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > Feb  7 15:05:09 RX300S6 kernel: fsfreeze        D ffff880076d5f040     0  2104   2018 0x00000000
>> > Feb  7 15:05:09 RX300S6 kernel: ffff88005a9f3d98 0000000000000086 ffff88005a9f3d38 ffffffff00000000
>> > Feb  7 15:05:09 RX300S6 kernel: 0000000000014d40 ffff880076d5eab0 ffff880076d5f040 ffff88005a9f3fd8
>> > Feb  7 15:05:09 RX300S6 kernel: ffff880076d5f048 0000000000014d40 ffff88005a9f2010 0000000000014d40
>> > Feb  7 15:05:09 RX300S6 kernel: Call Trace:
>> > Feb  7 15:05:09 RX300S6 kernel: [<ffffffff814aa5f5>] rwsem_down_failed_common+0xb5/0x140
>> > Feb  7 15:05:09 RX300S6 kernel: [<ffffffff814aa693>] rwsem_down_write_failed+0x13/0x20
>> > Feb  7 15:05:09 RX300S6 kernel: [<ffffffff8122f1a3>] call_rwsem_down_write_failed+0x13/0x20
>> > Feb  7 15:05:09 RX300S6 kernel: [<ffffffff814a9c12>] ? down_write+0x32/0x40
>> > Feb  7 15:05:09 RX300S6 kernel: [<ffffffff81155b48>] thaw_super+0x28/0xd0
>> > Feb  7 15:05:09 RX300S6 kernel: [<ffffffff81164338>] do_vfs_ioctl+0x368/0x560
>> > Feb  7 15:05:09 RX300S6 kernel: [<ffffffff81157c73>] ? sys_newfstat+0x33/0x40
>> > Feb  7 15:05:09 RX300S6 kernel: [<ffffffff811645d1>] sys_ioctl+0xa1/0xb0
>> > Feb  7 15:05:09 RX300S6 kernel: [<ffffffff8100bf82>] system_call_fastpath+0x16/0x1b
>> > ...
>> > Feb  7 15:07:09 RX300S6 kernel: INFO: task flush-8:0:1409 blocked for more than 120 seconds.
>> > Feb  7 15:07:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > Feb  7 15:07:09 RX300S6 kernel: flush-8:0       D ffff880037777a30     0  1409      2 0x00000000
>> > Feb  7 15:07:09 RX300S6 kernel: ffff880037c95a80 0000000000000046 ffff88007c8037a0 0000000000000000
>> > Feb  7 15:07:09 RX300S6 kernel: 0000000000014d40 ffff8800377774a0 ffff880037777a30 ffff880037c95fd8
>> > Feb  7 15:07:09 RX300S6 kernel: ffff880037777a38 0000000000014d40 ffff880037c94010 0000000000014d40
>> > Feb  7 15:07:09 RX300S6 kernel: Call Trace:
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffffa00abb85>] ext4_journal_start_sb+0x75/0x130 [ext4]
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81082fc0>] ? autoremove_wake_function+0x0/0x40
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffffa0097f0a>] ext4_da_writepages+0x27a/0x640 [ext4]
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81102c91>] do_writepages+0x21/0x40
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff811776b8>] writeback_single_inode+0x98/0x240
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81177cfe>] writeback_sb_inodes+0xce/0x170
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178709>] writeback_inodes_wb+0x99/0x160
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178a8b>] wb_writeback+0x2bb/0x430
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178e2c>] wb_do_writeback+0x22c/0x280
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178f32>] bdi_writeback_thread+0xb2/0x260
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff81082936>] kthread+0x96/0xa0
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc4>] kernel_thread_helper+0x4/0x10
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff810828a0>] ? kthread+0x0/0xa0
>> > Feb  7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc0>] ? kernel_thread_helper+0x0/0x10
>> > ---------------------------------------------------------------------
>> >
>> > I think the following deadlock problem happened:
>> >
>> >               [flush-8:0:1409]              |          [fsfreeze:2104]
>> > --------------------------------------------+--------------------------------
>> > writeback_inodes_wb                         |
>> >  pin_sb_for_writeback                       |
>> >    down_read_trylock(&sb->s_umount)         |
>> >  writeback_sb_inodes                        |thaw_super
>> >    writeback_single_inode                   | down_write(&sb->s_umount)
>> >      do_writepages                          |  # stop until flush-8:0 releases
>> >       ext4_da_writepages                    |  # read lock of sb->s_umount...
>> >        ext4_journal_start_sb                |
>> >         vfs_check_frozen                    |
>> >           wait_event((sb)->s_wait_unfrozen, |
>> >            ((sb)->s_frozen < (level)))      |
>> >             # stop until woken up by        |
>> >             # fsfreeze...                   |
>> > --------------------------------------------+--------------------------------
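
The table describes a circular wait, which can be written down as a tiny wait-for graph. This is a sketch for illustration only: blocked_on and has_cycle are hypothetical helper names, and the two edges are simply the two waits shown in the table.

```sh
#!/bin/sh
# blocked_on prints the task that must act before the given task can continue.
blocked_on() {
    case "$1" in
        flush)    echo fsfreeze ;;  # sleeps on s_wait_unfrozen until thaw_super() wakes it
        fsfreeze) echo flush ;;     # thaw_super() wants s_umount for write; flush holds it for read
        *)        echo "" ;;
    esac
}

# Follow the chain from a starting task; succeed if it leads back to the start.
has_cycle() {
    start=$1
    cur=$(blocked_on "$start")
    steps=0
    while [ -n "$cur" ] && [ "$steps" -lt 10 ]; do
        if [ "$cur" = "$start" ]; then
            return 0
        fi
        cur=$(blocked_on "$cur")
        steps=$((steps + 1))
    done
    return 1
}

if has_cycle flush; then
    echo "deadlock: flush -> $(blocked_on flush) -> flush"
fi
# prints: deadlock: flush -> fsfreeze -> flush
```

Neither task can make progress on its own: each one waits for an action that only the other can perform.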
>> >
>> > Could anyone check this problem?
>> >
>> > Thanks,
>> > Masayoshi Mizuma



-- 
Best Wishes
Yongqiang Yang


* Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock
  2013-11-29  4:58     ` Yongqiang Yang
@ 2013-11-29  8:00       ` Jan Kara
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Kara @ 2013-11-29  8:00 UTC (permalink / raw)
  To: Yongqiang Yang
  Cc: Masayoshi MIZUMA, Jan Kara, Andreas Dilger, Theodore Ts'o,
	Ext4 Developers List, Linux Filesystem Mailing List,
	Christoph Hellwig, Toshiyuki Okajima

On Fri 29-11-13 12:58:29, Yongqiang Yang wrote:
> How was the bug fixed in the end? I cannot find the accepted patch.
  It was fixed by rewriting the handling of freezing in the VFS. That was a
rather large patch series, but the core of the change is commit
5accdf82ba25cacefd6c1867f1704beb4d244cdd.
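
A toy sketch of one way such a rewrite can close the race, under the assumption (not stated in this thread) that write paths register themselves in a counter and the freezer waits for that counter to drain before it syncs. The names start_write, end_write and request_freeze are hypothetical and are not the kernel's API; the script only replays the earlier interleaving against this scheme.

```sh
#!/bin/sh
# Toy model only: a writer count plus a frozen flag, replayed step by step.
writers=0       # writers currently inside a write path
frozen=0        # 1 once a freeze has been requested
dirty_list=""

start_write() {         # refused if a freeze is pending, else registers the writer
    [ "$frozen" -eq 0 ] || return 1
    writers=$((writers + 1))
}
end_write() {
    writers=$((writers - 1))
}
request_freeze() {      # blocks new writers; succeeds only when none are in flight
    frozen=1
    [ "$writers" -eq 0 ]
}

# Same ordering as the race reported in this thread.
start_write                                              # A enters the write path
if request_freeze; then early=yes; else early=no; fi     # B has to wait for A
dirty_list="inode-A"                                     # A dirties its inode
end_write                                                # A leaves the write path
if request_freeze; then late=yes; else late=no; fi       # B can proceed now
if [ "$late" = yes ]; then dirty_list=""; fi             # the sync sees inode-A
if start_write; then after=admitted; else after=refused; fi

echo "early=$early late=$late after=$after dirty_list=[$dirty_list]"
# prints: early=no late=yes after=refused dirty_list=[]
```

In this model the freeze cannot complete while a writer is still in flight, so no inode is dirtied after the sync, and writers arriving later are turned away until the thaw.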

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


end of thread, other threads:[~2013-11-29  8:00 UTC | newest]

Thread overview: 121+ messages
2011-02-07 11:53 [BUG] ext4: cannot unfreeze a filesystem due to a deadlock Masayoshi MIZUMA
2011-02-15 16:06 ` Jan Kara
2011-02-15 17:03   ` Ted Ts'o
2011-02-15 17:29     ` Jan Kara
2011-02-15 18:04       ` Ted Ts'o
2011-02-15 19:11         ` Jan Kara
2011-02-15 23:17       ` Toshiyuki Okajima
2011-02-16 14:56         ` Jan Kara
2011-02-17  3:50           ` Toshiyuki Okajima
2011-02-17  5:13             ` Andreas Dilger
2011-02-17 10:41               ` Jan Kara
2011-02-17 10:45             ` Jan Kara
2011-03-28  8:06               ` [RFC][PATCH] " Toshiyuki Okajima
2011-03-30 14:12                 ` Jan Kara
2011-03-31  8:37                   ` Yongqiang Yang
2011-03-31  8:48                     ` Yongqiang Yang
2011-03-31 14:04                     ` Eric Sandeen
2011-03-31 14:36                       ` Yongqiang Yang
2011-03-31 15:25                         ` Eric Sandeen
2011-03-31 16:28                         ` Jan Kara
2011-03-31 12:03                   ` Toshiyuki Okajima
2011-04-05 10:25                     ` Toshiyuki Okajima
2011-04-05 22:54                       ` Jan Kara
2011-04-06  5:09                         ` Toshiyuki Okajima
2011-04-06  5:57                           ` Jan Kara
2011-04-06  7:40                             ` Toshiyuki Okajima
2011-04-06 17:46                               ` Jan Kara
2011-04-15 13:39                                 ` Toshiyuki Okajima
2011-04-15 17:13                                   ` Jan Kara
2011-04-15 17:17                                     ` Eric Sandeen
2011-04-15 17:37                                       ` Jan Kara
2011-04-18  9:05                                     ` Toshiyuki Okajima
2011-04-18 10:51                                       ` Jan Kara
2011-04-19  9:43                                         ` Toshiyuki Okajima
2011-04-22  6:58                                           ` Toshiyuki Okajima
2011-04-22 21:26                                             ` Peter M. Petrakis
2011-04-22 21:40                                               ` Jan Kara
2011-04-22 22:57                                                 ` Peter M. Petrakis
2011-04-22 22:10                                             ` Jan Kara
2011-04-25  6:28                                               ` Toshiyuki Okajima
2011-05-03  8:06                                                 ` Surbhi Palande
2011-05-03 11:01                                       ` Surbhi Palande
2011-05-03 13:08                                         ` (unknown), Surbhi Palande
2011-05-03 13:46                                           ` your mail Jan Kara
2011-05-03 13:56                                             ` Surbhi Palande
2011-05-03 15:26                                               ` Surbhi Palande
2011-05-03 15:36                                               ` Jan Kara
2011-05-03 15:43                                                 ` Surbhi Palande
2011-05-04 19:24                                                   ` Jan Kara
2011-05-06 15:20                                                     ` [RFC][PATCH] Do not accept a new handle when the F.S is frozen Surbhi Palande
2011-05-06 15:20                                                     ` [PATCH] Adding support to freeze and unfreeze a journal Surbhi Palande
2011-05-06 20:56                                                       ` Andreas Dilger
2011-05-07 20:04                                                         ` [PATCH v2] " Surbhi Palande
2011-05-08  8:24                                                           ` Marco Stornelli
2011-05-09  9:04                                                             ` Surbhi Palande
2011-05-09  9:24                                                               ` Jan Kara
2011-05-09  9:53                                                           ` Jan Kara
2011-05-09 13:49                                                             ` Surbhi Palande
2011-05-09 14:51                                                               ` [PATCH v3] " Surbhi Palande
2011-05-09 15:08                                                                 ` Jan Kara
2011-05-10 15:07                                                                   ` [PATCH] " Surbhi Palande
2011-05-10 21:07                                                                     ` Andreas Dilger
2011-05-11  7:46                                                                       ` Surbhi Palande
2011-05-09 15:23                                                                 ` [PATCH v3] " Eric Sandeen
2011-05-11  7:06                                                                   ` Surbhi Palande
2011-05-11  7:10                                                                     ` [PATCH] Attempt to sync the fsstress writes to a frozen F.S Surbhi Palande
2011-05-12 14:22                                                                       ` Eric Sandeen
2011-05-12 14:22                                                                         ` Eric Sandeen
2011-05-24 21:42                                                                       ` Ted Ts'o
2011-05-25 12:00                                                                         ` Surbhi Palande
2011-05-25 12:12                                                                           ` Theodore Tso
2011-05-27 16:28                                                                             ` Jan Kara
2011-05-11  9:05                                                                     ` [PATCH v3] Adding support to freeze and unfreeze a journal Andreas Dilger
2011-05-12  9:40                                                                       ` Surbhi Palande
2011-05-03 13:08                                         ` [PATCH] Prevent dirtying a page when ext4 F.S is frozen Surbhi Palande
2011-05-03 15:19                                         ` [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock Jan Kara
2011-05-04 12:09                                           ` Surbhi Palande
2011-05-04 19:19                                             ` Jan Kara
2011-05-04 21:34                                               ` Surbhi Palande
2011-05-04 22:48                                                 ` Jan Kara
2011-05-05  6:06                                                   ` Surbhi Palande
2011-05-05 11:18                                                     ` Jan Kara
2011-05-05 14:01                                                       ` Surbhi Palande
2011-03-31 23:40                 ` Dave Chinner
2011-03-31 23:53                   ` Eric Sandeen
2011-04-01 14:08                   ` Jan Kara
2011-04-06  5:40                     ` Dave Chinner
2011-04-06  6:18                       ` Jan Kara
2011-04-06 11:21                         ` Dave Chinner
2011-04-06 13:44                           ` Christoph Hellwig
2011-04-06 22:59                             ` Dave Chinner
2011-04-06 17:40                           ` Jan Kara
2011-04-06 22:54                             ` Dave Chinner
2011-04-08 21:33                               ` Jan Kara
2011-05-02  9:07                           ` Surbhi Palande
2011-05-02 10:56                             ` Jan Kara
2011-05-02 11:27                               ` Surbhi Palande
2011-05-02 12:06                                 ` Surbhi Palande
2011-05-02 12:20                                 ` Jan Kara
2011-05-02 12:30                                   ` Surbhi Palande
2011-05-02 13:16                                     ` Jan Kara
2011-05-02 13:22                                       ` Christoph Hellwig
2011-05-02 14:20                                         ` Jan Kara
2011-05-02 14:41                                           ` Christoph Hellwig
2011-05-02 16:23                                             ` Jan Kara
2011-05-02 16:38                                               ` Christoph Hellwig
2011-05-02 13:22                                       ` Surbhi Palande
2011-05-02 13:24                                         ` Christoph Hellwig
2011-05-02 13:27                                           ` Surbhi Palande
2011-05-02 14:26                                             ` Jan Kara
2011-05-02 14:04                                         ` Eric Sandeen
2011-05-03  7:27                                           ` Surbhi Palande
2011-05-03 20:14                                             ` Eric Sandeen
2011-05-04  8:26                                               ` Surbhi Palande
2011-05-04 14:30                                                 ` Eric Sandeen
2011-05-02 14:01                                     ` Eric Sandeen
2011-04-05 10:44                   ` Toshiyuki Okajima
2011-12-09  1:56 ` Masayoshi MIZUMA
2011-12-15 12:41   ` Masayoshi MIZUMA
2013-11-29  4:58     ` Yongqiang Yang
2013-11-29  8:00       ` Jan Kara
