linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] deadlock with suspend and quotas
@ 2011-11-25 20:25 Mikulas Patocka
  2011-11-28 15:04 ` Jan Kara
  2012-01-03  3:30 ` Al Viro
  0 siblings, 2 replies; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-25 20:25 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, Jan Kara; +Cc: dm-devel

This script causes a kernel deadlock:
#!/bin/sh
set -e
DEVICE=/dev/vg1/linear
lvchange -ay $DEVICE
mkfs.ext3 $DEVICE
mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
quotacheck -gu /mnt/test
umount /mnt/test
mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
quotaon /mnt/test
dmsetup suspend $DEVICE
setquota -u root 1 2 3 4 /mnt/test &
sleep 1
dmsetup resume $DEVICE

setquota acquired semaphore s_umount for read and then tried to perform
a transaction (and waits because the device is suspended).
dmsetup resume tries to acquire s_umount for write before resuming the device
(and waits for setquota).

Here are stacktraces:
setquota:
[   67.524456]  [<ffffffff810aa84e>] ? get_page_from_freelist+0x31e/0x790
[   67.524529]  [<ffffffffa0250265>] ? start_this_handle.isra.9+0x265/0x3b0 [jbd]
[   67.524604]  [<ffffffff8105bc00>] ? add_wait_queue+0x60/0x60
[   67.524675]  [<ffffffffa02505a1>] ? journal_start+0xc1/0x100 [jbd]
[   67.524742]  [<ffffffff810e62d6>] ? kmem_cache_alloc+0xf6/0x1b0
[   67.524808]  [<ffffffffa028018d>] ? ext3_acquire_dquot+0x3d/0x80 [ext3]
[   67.524872]  [<ffffffff81143749>] ? dqget+0x359/0x3b0
[   67.524916]  [<ffffffff81143ad4>] ? dquot_get_dqblk+0x14/0x1b0
[   67.524985]  [<ffffffff81147c34>] ? quota_getquota+0x24/0xd0
[   67.525048]  [<ffffffff810ff3db>] ? do_path_lookup+0x2b/0x90
[   67.525082]  [<ffffffff810ff8cd>] ? kern_path+0x1d/0x40
[   67.525134]  [<ffffffff81148311>] ? do_quotactl+0x421/0x540
[   67.525191]  [<ffffffff811073f0>] ? dput+0x20/0x230
[   67.525234]  [<ffffffff81148507>] ? sys_quotactl+0xd7/0x1a0
[   67.525304]  [<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b

dmsetup resume:
[   67.525887]  [<ffffffffa0238280>] ? dev_wait+0xc0/0xc0 [dm_mod]
[   67.525948]  [<ffffffff81309225>] ? rwsem_down_failed_common+0xc5/0x160
[   67.526013]  [<ffffffff81198a43>] ? call_rwsem_down_write_failed+0x13/0x20
[   67.526058]  [<ffffffff81308adc>] ? down_write+0x1c/0x1d
[   67.526103]  [<ffffffff810f3a91>] ? thaw_super+0x21/0xc0
[   67.526166]  [<ffffffff81124d4d>] ? thaw_bdev+0x6d/0x90
[   67.526223]  [<ffffffff8105583e>] ? queue_work+0x4e/0x60
[   67.526269]  [<ffffffffa0230e63>] ? unlock_fs+0x23/0x40 [dm_mod]
[   67.526341]  [<ffffffffa02336d0>] ? dm_resume+0xb0/0xd0 [dm_mod]
[   67.526388]  [<ffffffffa0238420>] ? dev_suspend+0x1a0/0x230 [dm_mod]
[   67.526441]  [<ffffffffa0238a59>] ? ctl_ioctl+0x159/0x2a0 [dm_mod]
[   67.526510]  [<ffffffff8116c4ee>] ? ipc_addid+0x4e/0xd0
[   67.526555]  [<ffffffffa0238bae>] ? dm_ctl_ioctl+0xe/0x20 [dm_mod]
[   67.526620]  [<ffffffff811025de>] ? do_vfs_ioctl+0x8e/0x4e0
[   67.526670]  [<ffffffff811073f0>] ? dput+0x20/0x230
[   67.526737]  [<ffffffff810f3112>] ? fput+0x162/0x220
[   67.526783]  [<ffffffff81102a79>] ? sys_ioctl+0x49/0x90
[   67.526838]  [<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b

The following patch fixes the deadlock. When the quota subsystem takes s_umount,
it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
the filesystem to resume and retry.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org

---
 fs/quota/quota.c   |   12 ++++++++++++
 include/linux/fs.h |    1 +
 2 files changed, 13 insertions(+)

Index: linux-3.1-fast/fs/quota/quota.c
===================================================================
--- linux-3.1-fast.orig/fs/quota/quota.c	2011-11-25 20:19:14.000000000 +0100
+++ linux-3.1-fast/fs/quota/quota.c	2011-11-25 21:12:32.000000000 +0100
@@ -310,7 +310,19 @@ static struct super_block *quotactl_bloc
 	putname(tmp);
 	if (IS_ERR(bdev))
 		return ERR_CAST(bdev);
+retry_super:
 	sb = get_super(bdev);
+	if (sb && sb->s_frozen != SB_UNFROZEN) {
+		/*
+		 * When resuming a frozen device, s_umount is taken for write.
+		 * To avoid deadlock, we drop s_umount if the filesystem
+		 * is frozen and wait for it to thaw.
+		 */
+		up_read(&sb->s_umount);
+		vfs_check_frozen(sb, SB_FREEZE_WRITE);
+		put_super(sb);
+		goto retry_super;
+	}
 	bdput(bdev);
 	if (!sb)
 		return ERR_PTR(-ENODEV);
Index: linux-3.1-fast/include/linux/fs.h
===================================================================
--- linux-3.1-fast.orig/include/linux/fs.h	2011-11-25 21:13:56.000000000 +0100
+++ linux-3.1-fast/include/linux/fs.h	2011-11-25 21:14:03.000000000 +0100
@@ -2496,6 +2496,7 @@ extern struct file_system_type *get_fs_t
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern struct super_block *user_get_super(dev_t);
+extern void put_super(struct super_block *sb);
 extern void drop_super(struct super_block *sb);
 extern void iterate_supers(void (*)(struct super_block *, void *), void *);
 extern void iterate_supers_type(struct file_system_type *,

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-25 20:25 [PATCH] deadlock with suspend and quotas Mikulas Patocka
@ 2011-11-28 15:04 ` Jan Kara
  2011-11-28 21:00   ` Valerie Aurora
  2012-01-03  3:30 ` Al Viro
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Kara @ 2011-11-28 15:04 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: linux-kernel, linux-fsdevel, Jan Kara, dm-devel, Valerie Aurora

  Hello,

On Fri 25-11-11 15:25:16, Mikulas Patocka wrote:
> This script causes a kernel deadlock:
> #!/bin/sh
> set -e
> DEVICE=/dev/vg1/linear
> lvchange -ay $DEVICE
> mkfs.ext3 $DEVICE
> mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
> quotacheck -gu /mnt/test
> umount /mnt/test
> mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
> quotaon /mnt/test
> dmsetup suspend $DEVICE
> setquota -u root 1 2 3 4 /mnt/test &
> sleep 1
> dmsetup resume $DEVICE
> 
> setquota acquired semaphore s_umount for read and then tried to perform
> a transaction (and waits because the device is suspended).
> dmsetup resume tries to acquire s_umount for write before resuming the device
> (and waits for setquota).
> 
> Here are stacktraces:
> setquota:
> [   67.524456]  [<ffffffff810aa84e>] ? get_page_from_freelist+0x31e/0x790
> [   67.524529]  [<ffffffffa0250265>] ? start_this_handle.isra.9+0x265/0x3b0 [jbd]
> [   67.524604]  [<ffffffff8105bc00>] ? add_wait_queue+0x60/0x60
> [   67.524675]  [<ffffffffa02505a1>] ? journal_start+0xc1/0x100 [jbd]
> [   67.524742]  [<ffffffff810e62d6>] ? kmem_cache_alloc+0xf6/0x1b0
> [   67.524808]  [<ffffffffa028018d>] ? ext3_acquire_dquot+0x3d/0x80 [ext3]
> [   67.524872]  [<ffffffff81143749>] ? dqget+0x359/0x3b0
> [   67.524916]  [<ffffffff81143ad4>] ? dquot_get_dqblk+0x14/0x1b0
> [   67.524985]  [<ffffffff81147c34>] ? quota_getquota+0x24/0xd0
> [   67.525048]  [<ffffffff810ff3db>] ? do_path_lookup+0x2b/0x90
> [   67.525082]  [<ffffffff810ff8cd>] ? kern_path+0x1d/0x40
> [   67.525134]  [<ffffffff81148311>] ? do_quotactl+0x421/0x540
> [   67.525191]  [<ffffffff811073f0>] ? dput+0x20/0x230
> [   67.525234]  [<ffffffff81148507>] ? sys_quotactl+0xd7/0x1a0
> [   67.525304]  [<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b
> 
> dmsetup resume:
> [   67.525887]  [<ffffffffa0238280>] ? dev_wait+0xc0/0xc0 [dm_mod]
> [   67.525948]  [<ffffffff81309225>] ? rwsem_down_failed_common+0xc5/0x160
> [   67.526013]  [<ffffffff81198a43>] ? call_rwsem_down_write_failed+0x13/0x20
> [   67.526058]  [<ffffffff81308adc>] ? down_write+0x1c/0x1d
> [   67.526103]  [<ffffffff810f3a91>] ? thaw_super+0x21/0xc0
> [   67.526166]  [<ffffffff81124d4d>] ? thaw_bdev+0x6d/0x90
> [   67.526223]  [<ffffffff8105583e>] ? queue_work+0x4e/0x60
> [   67.526269]  [<ffffffffa0230e63>] ? unlock_fs+0x23/0x40 [dm_mod]
> [   67.526341]  [<ffffffffa02336d0>] ? dm_resume+0xb0/0xd0 [dm_mod]
> [   67.526388]  [<ffffffffa0238420>] ? dev_suspend+0x1a0/0x230 [dm_mod]
> [   67.526441]  [<ffffffffa0238a59>] ? ctl_ioctl+0x159/0x2a0 [dm_mod]
> [   67.526510]  [<ffffffff8116c4ee>] ? ipc_addid+0x4e/0xd0
> [   67.526555]  [<ffffffffa0238bae>] ? dm_ctl_ioctl+0xe/0x20 [dm_mod]
> [   67.526620]  [<ffffffff811025de>] ? do_vfs_ioctl+0x8e/0x4e0
> [   67.526670]  [<ffffffff811073f0>] ? dput+0x20/0x230
> [   67.526737]  [<ffffffff810f3112>] ? fput+0x162/0x220
> [   67.526783]  [<ffffffff81102a79>] ? sys_ioctl+0x49/0x90
> [   67.526838]  [<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b
> 
> The following patch fixes the deadlock. When the quota subsystem takes s_umount,
> it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
> the filesystem to resume and retry.
  Thanks for the patch. I'm aware of the deadlock and Val Henson is working
on resolving these types of deadlocks more systematically. But since I
haven't heard from her for a while, I guess I'll merge your fix and she'll
update her series to reflect your change since those patches are going to
go in at earliest in the next merge window.
  To sum up: I've merged your patch.

								Honza
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> CC: stable@kernel.org
> 
> ---
>  fs/quota/quota.c   |   12 ++++++++++++
>  include/linux/fs.h |    1 +
>  2 files changed, 13 insertions(+)
> 
> Index: linux-3.1-fast/fs/quota/quota.c
> ===================================================================
> --- linux-3.1-fast.orig/fs/quota/quota.c	2011-11-25 20:19:14.000000000 +0100
> +++ linux-3.1-fast/fs/quota/quota.c	2011-11-25 21:12:32.000000000 +0100
> @@ -310,7 +310,19 @@ static struct super_block *quotactl_bloc
>  	putname(tmp);
>  	if (IS_ERR(bdev))
>  		return ERR_CAST(bdev);
> +retry_super:
>  	sb = get_super(bdev);
> +	if (sb && sb->s_frozen != SB_UNFROZEN) {
> +		/*
> +		 * When resuming a frozen device, s_umount is taken for write.
> +		 * To avoid deadlock, we drop s_umount if the filesystem
> +		 * is frozen and wait for it to thaw.
> +		 */
> +		up_read(&sb->s_umount);
> +		vfs_check_frozen(sb, SB_FREEZE_WRITE);
> +		put_super(sb);
> +		goto retry_super;
> +	}
>  	bdput(bdev);
>  	if (!sb)
>  		return ERR_PTR(-ENODEV);
> Index: linux-3.1-fast/include/linux/fs.h
> ===================================================================
> --- linux-3.1-fast.orig/include/linux/fs.h	2011-11-25 21:13:56.000000000 +0100
> +++ linux-3.1-fast/include/linux/fs.h	2011-11-25 21:14:03.000000000 +0100
> @@ -2496,6 +2496,7 @@ extern struct file_system_type *get_fs_t
>  extern struct super_block *get_super(struct block_device *);
>  extern struct super_block *get_active_super(struct block_device *bdev);
>  extern struct super_block *user_get_super(dev_t);
> +extern void put_super(struct super_block *sb);
>  extern void drop_super(struct super_block *sb);
>  extern void iterate_supers(void (*)(struct super_block *, void *), void *);
>  extern void iterate_supers_type(struct file_system_type *,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-28 15:04 ` Jan Kara
@ 2011-11-28 21:00   ` Valerie Aurora
  2011-11-28 21:14     ` Mikulas Patocka
  2011-11-29 20:00     ` Kamal Mostafa
  0 siblings, 2 replies; 33+ messages in thread
From: Valerie Aurora @ 2011-11-28 21:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mikulas Patocka, linux-kernel, linux-fsdevel, dm-devel,
	Christopher Chaltain

On Mon, Nov 28, 2011 at 8:04 AM, Jan Kara <jack@suse.cz> wrote:
>  Hello,
>
> On Fri 25-11-11 15:25:16, Mikulas Patocka wrote:
>> This script causes a kernel deadlock:
>> #!/bin/sh
>> set -e
>> DEVICE=/dev/vg1/linear
>> lvchange -ay $DEVICE
>> mkfs.ext3 $DEVICE
>> mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
>> quotacheck -gu /mnt/test
>> umount /mnt/test
>> mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
>> quotaon /mnt/test
>> dmsetup suspend $DEVICE
>> setquota -u root 1 2 3 4 /mnt/test &
>> sleep 1
>> dmsetup resume $DEVICE
>>
>> setquota acquired semaphore s_umount for read and then tried to perform
>> a transaction (and waits because the device is suspended).
>> dmsetup resume tries to acquire s_umount for write before resuming the device
>> (and waits for setquota).
>>
>> Here are stacktraces:
>> setquota:
>> [   67.524456]  [<ffffffff810aa84e>] ? get_page_from_freelist+0x31e/0x790
>> [   67.524529]  [<ffffffffa0250265>] ? start_this_handle.isra.9+0x265/0x3b0 [jbd]
>> [   67.524604]  [<ffffffff8105bc00>] ? add_wait_queue+0x60/0x60
>> [   67.524675]  [<ffffffffa02505a1>] ? journal_start+0xc1/0x100 [jbd]
>> [   67.524742]  [<ffffffff810e62d6>] ? kmem_cache_alloc+0xf6/0x1b0
>> [   67.524808]  [<ffffffffa028018d>] ? ext3_acquire_dquot+0x3d/0x80 [ext3]
>> [   67.524872]  [<ffffffff81143749>] ? dqget+0x359/0x3b0
>> [   67.524916]  [<ffffffff81143ad4>] ? dquot_get_dqblk+0x14/0x1b0
>> [   67.524985]  [<ffffffff81147c34>] ? quota_getquota+0x24/0xd0
>> [   67.525048]  [<ffffffff810ff3db>] ? do_path_lookup+0x2b/0x90
>> [   67.525082]  [<ffffffff810ff8cd>] ? kern_path+0x1d/0x40
>> [   67.525134]  [<ffffffff81148311>] ? do_quotactl+0x421/0x540
>> [   67.525191]  [<ffffffff811073f0>] ? dput+0x20/0x230
>> [   67.525234]  [<ffffffff81148507>] ? sys_quotactl+0xd7/0x1a0
>> [   67.525304]  [<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b
>>
>> dmsetup resume:
>> [   67.525887]  [<ffffffffa0238280>] ? dev_wait+0xc0/0xc0 [dm_mod]
>> [   67.525948]  [<ffffffff81309225>] ? rwsem_down_failed_common+0xc5/0x160
>> [   67.526013]  [<ffffffff81198a43>] ? call_rwsem_down_write_failed+0x13/0x20
>> [   67.526058]  [<ffffffff81308adc>] ? down_write+0x1c/0x1d
>> [   67.526103]  [<ffffffff810f3a91>] ? thaw_super+0x21/0xc0
>> [   67.526166]  [<ffffffff81124d4d>] ? thaw_bdev+0x6d/0x90
>> [   67.526223]  [<ffffffff8105583e>] ? queue_work+0x4e/0x60
>> [   67.526269]  [<ffffffffa0230e63>] ? unlock_fs+0x23/0x40 [dm_mod]
>> [   67.526341]  [<ffffffffa02336d0>] ? dm_resume+0xb0/0xd0 [dm_mod]
>> [   67.526388]  [<ffffffffa0238420>] ? dev_suspend+0x1a0/0x230 [dm_mod]
>> [   67.526441]  [<ffffffffa0238a59>] ? ctl_ioctl+0x159/0x2a0 [dm_mod]
>> [   67.526510]  [<ffffffff8116c4ee>] ? ipc_addid+0x4e/0xd0
>> [   67.526555]  [<ffffffffa0238bae>] ? dm_ctl_ioctl+0xe/0x20 [dm_mod]
>> [   67.526620]  [<ffffffff811025de>] ? do_vfs_ioctl+0x8e/0x4e0
>> [   67.526670]  [<ffffffff811073f0>] ? dput+0x20/0x230
>> [   67.526737]  [<ffffffff810f3112>] ? fput+0x162/0x220
>> [   67.526783]  [<ffffffff81102a79>] ? sys_ioctl+0x49/0x90
>> [   67.526838]  [<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b
>>
>> The following patch fixes the deadlock. When the quota subsystem takes s_umount,
>> it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
>> the filesystem to resume and retry.
>  Thanks for the patch. I'm aware of the deadlock and Val Henson is working
> on resolving these types of deadlocks more systematically. But since I
> haven't heard from her for a while, I guess I'll merge your fix and she'll
> update her series to reflect your change since those patches are going to
> go in at earliest in the next merge window.
>  To sum up: I've merged your patch.

FYI, I no longer have time to consult in addition to my full-time job.
 Canonical is taking over this patch set.

-VAL

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-28 21:00   ` Valerie Aurora
@ 2011-11-28 21:14     ` Mikulas Patocka
  2011-11-28 23:32       ` Mikulas Patocka
  2011-11-29 20:00     ` Kamal Mostafa
  1 sibling, 1 reply; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-28 21:14 UTC (permalink / raw)
  To: Valerie Aurora
  Cc: Jan Kara, linux-kernel, linux-fsdevel, dm-devel, Christopher Chaltain

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1077 bytes --]

> >> The following patch fixes the deadlock. When the quota subsystem takes s_umount,
> >> it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
> >> the filesystem to resume and retry.
> >  Thanks for the patch. I'm aware of the deadlock and Val Henson is working
> > on resolving these types of deadlocks more systematically. But since I
> > haven't heard from her for a while, I guess I'll merge your fix and she'll
> > update her series to reflect your change since those patches are going to
> > go in at earliest in the next merge window.
> >  To sum up: I've merged your patch.
> 
> FYI, I no longer have time to consult in addition to my full-time job.
>  Canonical is taking over this patch set.
> 
> -VAL

Hi

Where can I get that patch set?

We are experiencing other similar deadlocks on RHEL-6, caused by sync or 
background writeback (these code paths take s_umount and wait trying to do 
I/O), but I wasn't able to reproduce these deadlocks on upstream kernel? 
Are there other known deadlock possibilities?

Mikulas

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-28 21:14     ` Mikulas Patocka
@ 2011-11-28 23:32       ` Mikulas Patocka
  2011-11-29 10:19         ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-28 23:32 UTC (permalink / raw)
  To: Valerie Aurora
  Cc: Jan Kara, linux-kernel, linux-fsdevel, dm-devel,
	Christopher Chaltain, esandeen

> Hi
> 
> Where can I get that patch set?
> 
> We are experiencing other similar deadlocks on RHEL-6, caused by sync or 
> background writeback (these code paths take s_umount and wait trying to do 
> I/O), but I wasn't able to reproduce these deadlocks on upstream kernel? 
> Are there other known deadlock possibilities?
> 
> Mikulas

I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write 
deadlock" (I couldn't find the next two parts of the patch in the 
archives). And the patch looks wrong:

- down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not 
held when the filesystem is frozen and it is taken for write when thawing. 
Consequently, any task can succeed with down_read_trylock(&sb->s_umount) 
on a frozen filesystem and if this tasks attempts to do an I/O that is 
waiting for thaw, it may still deadlock.

- skipping sync on frozen filesystem violates sync semantics. 
Applications, such as databases, assume that when sync finishes, data were 
written to stable storage. If we skip sync when the filesystem is frozen, 
we can cause data corruption in these applications (if the system crashes 
after we skipped a sync).

- I'm not sure what userspace quota subsystem will do if we start 
returning -EBUSY spuriously.

There is another thing --- I wasn't able to reproduce these sync-related 
deadlocks at all. Did anyone succeeded in reproducing them? Are there any 
reported deadlocks? When freezing the ext3 or ext4 filesystem, the kernel 
prevents creating new dirty data, syncs all data, and freezes the 
filesystem. Consequently, the sync function never finds any dirty data and 
so it doesn't block (sync doesn't writeback ATIME change, I don't know 
why).

I made this patch that fixes the quota issue and possible sync issues. It 
introduces a function down_read_s_umount(sb); --- this function takes 
s_umount lock for read, but it waits if the filesystem is suspended.

There are three more potentially unsafe users: fs/cachefiles/interface.c, 
fs/nilfs2/ioctl.c, fs/ubifs/budget.c - I didn't fix them because I can't 
test them.

Mikulas

---

Fix a s_umount deadlock

The lock s_umount is taken for write and dropped when we freeze and resume
a block device.

Consequently, if some code holds s_umount and performs an I/O, a deadlock may
happen if the device is suspended.

Unfortunatelly, it seems that developers are not aware of this restriction
that I/O must not be peformed with s_umount held.

These deadlocks were observed:
* lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
  ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
  call_rwsem_down_write_failed (the process is trying to resume frozen device, 
  but it is waiting trying to acquire s_umount)
* quota: sys_quotactl -> do_quotactl -> vfs_get_dqblk -> dqget ->
  ext4_acquire_dquot -> ext4_journal_start_sb -> jbd2_journal_start ->
  start_this_handle (the process is holding s_umount and trying to perform
  a transaction, waiting for the device being unfrozen)

* iozone process: sys_sync -> sync_filesystems -> __sync_filesystem ->
  writeback_inodes_sb -> writeback_inodes_sb_nr -> wait_for_completion
  (the process is holding s_umount for read and trying to perform some I/O)
* flush-253:3: kthread -> bdi_start_fn -> bdi_writeback_task ->
  wb_do_writeback -> wb_writeback -> writeback_sb_inodes ->
  writeback_single_inode -> do_writepages -> ext4_da_writepages ->
  ext4_journal_start_sb -> jbd2_journal_start -> start_this_handle
  (holding s_umount for read too, acquired in writeback_inodes_wb,
  and trying to perform I/O)
* lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
  ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
  call_rwsem_down_write_failed (just like in a previous case: trying to
  resume frozen device, waiting on s_umount)

This patch fixes these observed code paths:
* introduce a function to safely take s_unlock - down_read_s_umount. It takes
  the lock, check if a filesystem is frozen, if it is, drops the lock and
  waits for unfreeze.
* get_super function has a parameter, if the parameter is true, it waits for
  the device to unfreeze (it fixes quota deadlock and a possible deadlock in
  fsync_bdev and __invalidate_device)
* the same for iterate_supers (it fixes the sync deadlock and a possible
  deadlock in drop_caches_sysctl_handler)
* grab_super_passive fails if the device is suspended (fixes the inode writeback
  deadlock and a possible deadlock in prune_super)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org

---
 fs/block_dev.c           |    6 +++---
 fs/buffer.c              |    2 +-
 fs/drop_caches.c         |    2 +-
 fs/fs-writeback.c        |    8 ++++++++
 fs/quota/quota.c         |    4 ++--
 fs/super.c               |   29 ++++++++++++++++++++---------
 fs/sync.c                |    4 ++--
 include/linux/fs.h       |   28 +++++++++++++++++++++++++---
 security/selinux/hooks.c |    2 +-
 9 files changed, 63 insertions(+), 22 deletions(-)

Index: linux-3.2-rc3-fast/fs/quota/quota.c
===================================================================
--- linux-3.2-rc3-fast.orig/fs/quota/quota.c	2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/quota/quota.c	2011-11-25 05:51:36.000000000 +0100
@@ -59,7 +59,7 @@ static int quota_sync_all(int type)
 		return -EINVAL;
 	ret = security_quotactl(Q_SYNC, type, 0, NULL);
 	if (!ret)
-		iterate_supers(quota_sync_one, &type);
+		iterate_supers(quota_sync_one, &type, true);
 	return ret;
 }
 
@@ -310,7 +310,7 @@ static struct super_block *quotactl_bloc
 	putname(tmp);
 	if (IS_ERR(bdev))
 		return ERR_CAST(bdev);
-	sb = get_super(bdev);
+	sb = get_super(bdev, true);
 	bdput(bdev);
 	if (!sb)
 		return ERR_PTR(-ENODEV);
Index: linux-3.2-rc3-fast/include/linux/fs.h
===================================================================
--- linux-3.2-rc3-fast.orig/include/linux/fs.h	2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/include/linux/fs.h	2011-11-25 05:56:03.000000000 +0100
@@ -10,6 +10,7 @@
 #include <linux/ioctl.h>
 #include <linux/blk_types.h>
 #include <linux/types.h>
+#include <linux/sched.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -1502,6 +1503,26 @@ enum {
 	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
 
 /*
+ * Take s_umount and make sure that the filesystem is not frozen.
+ * This function must be used if we intend to perform any I/O while
+ * holding s_umount.
+ *
+ * s_umount is taken for write when resuming a frozen filesystem,
+ * thus if someone calls down_read(&s->s_umount) and performs any I/O,
+ * it may deadlock.
+ */
+static inline void down_read_s_umount(struct super_block *sb)
+{
+retry:
+	down_read(&sb->s_umount);
+	if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
+		up_write(&sb->s_umount);
+		vfs_check_frozen(sb, SB_FREEZE_WRITE);
+		goto retry;
+	}
+}
+
+/*
  * until VFS tracks user namespaces for inodes, just make all files
  * belong to init_user_ns
  */
@@ -2528,13 +2549,14 @@ extern int generic_block_fiemap(struct i
 extern void get_filesystem(struct file_system_type *fs);
 extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
-extern struct super_block *get_super(struct block_device *);
+extern struct super_block *get_super(struct block_device *, bool);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern struct super_block *user_get_super(dev_t);
 extern void drop_super(struct super_block *sb);
-extern void iterate_supers(void (*)(struct super_block *, void *), void *);
+extern void iterate_supers(void (*)(struct super_block *, void *), void *, bool);
 extern void iterate_supers_type(struct file_system_type *,
-			        void (*)(struct super_block *, void *), void *);
+			        void (*)(struct super_block *, void *), void *,
+				bool);
 
 extern int dcache_dir_open(struct inode *, struct file *);
 extern int dcache_dir_close(struct inode *, struct file *);
Index: linux-3.2-rc3-fast/fs/buffer.c
===================================================================
--- linux-3.2-rc3-fast.orig/fs/buffer.c	2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/buffer.c	2011-11-25 05:51:36.000000000 +0100
@@ -571,7 +571,7 @@ static void do_thaw_one(struct super_blo
 
 static void do_thaw_all(struct work_struct *work)
 {
-	iterate_supers(do_thaw_one, NULL);
+	iterate_supers(do_thaw_one, NULL, false);
 	kfree(work);
 	printk(KERN_WARNING "Emergency Thaw complete\n");
 }
Index: linux-3.2-rc3-fast/fs/super.c
===================================================================
--- linux-3.2-rc3-fast.orig/fs/super.c	2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/super.c	2011-11-29 00:16:01.000000000 +0100
@@ -337,7 +337,7 @@ bool grab_super_passive(struct super_blo
 	spin_unlock(&sb_lock);
 
 	if (down_read_trylock(&sb->s_umount)) {
-		if (sb->s_root)
+		if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
 			return true;
 		up_read(&sb->s_umount);
 	}
@@ -503,7 +503,7 @@ void sync_supers(void)
 			sb->s_count++;
 			spin_unlock(&sb_lock);
 
-			down_read(&sb->s_umount);
+			down_read_s_umount(sb);
 			if (sb->s_root && sb->s_dirt)
 				sb->s_op->write_super(sb);
 			up_read(&sb->s_umount);
@@ -527,7 +527,8 @@ void sync_supers(void)
  *	Scans the superblock list and calls given function, passing it
  *	locked superblock and given argument.
  */
-void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
+void iterate_supers(void (*f)(struct super_block *, void *), void *arg,
+		    bool wait_for_unfreeze)
 {
 	struct super_block *sb, *p = NULL;
 
@@ -538,7 +539,10 @@ void iterate_supers(void (*f)(struct sup
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 
-		down_read(&sb->s_umount);
+		if (!wait_for_unfreeze)
+			down_read(&sb->s_umount);
+		else
+			down_read_s_umount(sb);
 		if (sb->s_root)
 			f(sb, arg);
 		up_read(&sb->s_umount);
@@ -563,7 +567,8 @@ void iterate_supers(void (*f)(struct sup
  *	locked superblock and given argument.
  */
 void iterate_supers_type(struct file_system_type *type,
-	void (*f)(struct super_block *, void *), void *arg)
+	void (*f)(struct super_block *, void *), void *arg,
+	bool wait_for_unfreeze)
 {
 	struct super_block *sb, *p = NULL;
 
@@ -572,7 +577,10 @@ void iterate_supers_type(struct file_sys
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 
-		down_read(&sb->s_umount);
+		if (!wait_for_unfreeze)
+			down_read(&sb->s_umount);
+		else
+			down_read_s_umount(sb);
 		if (sb->s_root)
 			f(sb, arg);
 		up_read(&sb->s_umount);
@@ -597,7 +605,7 @@ EXPORT_SYMBOL(iterate_supers_type);
  *	mounted on the device given. %NULL is returned if no match is found.
  */
 
-struct super_block *get_super(struct block_device *bdev)
+struct super_block *get_super(struct block_device *bdev, bool wait_for_unfreeze)
 {
 	struct super_block *sb;
 
@@ -612,7 +620,10 @@ rescan:
 		if (sb->s_bdev == bdev) {
 			sb->s_count++;
 			spin_unlock(&sb_lock);
-			down_read(&sb->s_umount);
+			if (wait_for_unfreeze)
+				down_read_s_umount(sb);
+			else
+				down_read(&sb->s_umount);
 			/* still alive? */
 			if (sb->s_root)
 				return sb;
@@ -672,7 +683,7 @@ rescan:
 		if (sb->s_dev ==  dev) {
 			sb->s_count++;
 			spin_unlock(&sb_lock);
-			down_read(&sb->s_umount);
+			down_read_s_umount(sb);
 			/* still alive? */
 			if (sb->s_root)
 				return sb;
Index: linux-3.2-rc3-fast/fs/sync.c
===================================================================
--- linux-3.2-rc3-fast.orig/fs/sync.c	2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/sync.c	2011-11-25 05:51:36.000000000 +0100
@@ -89,7 +89,7 @@ static void sync_one_sb(struct super_blo
  */
 static void sync_filesystems(int wait)
 {
-	iterate_supers(sync_one_sb, &wait);
+	iterate_supers(sync_one_sb, &wait, true);
 }
 
 /*
@@ -144,7 +144,7 @@ SYSCALL_DEFINE1(syncfs, int, fd)
 		return -EBADF;
 	sb = file->f_dentry->d_sb;
 
-	down_read(&sb->s_umount);
+	down_read_s_umount(sb);
 	ret = sync_filesystem(sb);
 	up_read(&sb->s_umount);
 
Index: linux-3.2-rc3-fast/fs/block_dev.c
===================================================================
--- linux-3.2-rc3-fast.orig/fs/block_dev.c	2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/block_dev.c	2011-11-25 05:51:36.000000000 +0100
@@ -222,7 +222,7 @@ EXPORT_SYMBOL(sync_blockdev);
  */
 int fsync_bdev(struct block_device *bdev)
 {
-	struct super_block *sb = get_super(bdev);
+	struct super_block *sb = get_super(bdev, true);
 	if (sb) {
 		int res = sync_filesystem(sb);
 		drop_super(sb);
@@ -256,7 +256,7 @@ struct super_block *freeze_bdev(struct b
 		 * to freeze_bdev grab an active reference and only the last
 		 * thaw_bdev drops it.
 		 */
-		sb = get_super(bdev);
+		sb = get_super(bdev, false);
 		drop_super(sb);
 		mutex_unlock(&bdev->bd_fsfreeze_mutex);
 		return sb;
@@ -1658,7 +1658,7 @@ EXPORT_SYMBOL(lookup_bdev);
 
 int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 {
-	struct super_block *sb = get_super(bdev);
+	struct super_block *sb = get_super(bdev, true);
 	int res = 0;
 
 	if (sb) {
Index: linux-3.2-rc3-fast/fs/drop_caches.c
===================================================================
--- linux-3.2-rc3-fast.orig/fs/drop_caches.c	2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/drop_caches.c	2011-11-25 05:51:36.000000000 +0100
@@ -59,7 +59,7 @@ int drop_caches_sysctl_handler(ctl_table
 		return ret;
 	if (write) {
 		if (sysctl_drop_caches & 1)
-			iterate_supers(drop_pagecache_sb, NULL);
+			iterate_supers(drop_pagecache_sb, NULL, true);
 		if (sysctl_drop_caches & 2)
 			drop_slab();
 	}
Index: linux-3.2-rc3-fast/security/selinux/hooks.c
===================================================================
--- linux-3.2-rc3-fast.orig/security/selinux/hooks.c	2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/security/selinux/hooks.c	2011-11-25 05:51:36.000000000 +0100
@@ -5693,7 +5693,7 @@ void selinux_complete_init(void)
 
 	/* Set up any superblocks initialized prior to the policy load. */
 	printk(KERN_DEBUG "SELinux:  Setting up existing superblocks.\n");
-	iterate_supers(delayed_superblock_init, NULL);
+	iterate_supers(delayed_superblock_init, NULL, true);
 }
 
 /* SELinux requires early initialization in order to label
Index: linux-3.2-rc3-fast/fs/fs-writeback.c
===================================================================
--- linux-3.2-rc3-fast.orig/fs/fs-writeback.c	2011-11-29 00:09:30.000000000 +0100
+++ linux-3.2-rc3-fast/fs/fs-writeback.c	2011-11-29 00:12:49.000000000 +0100
@@ -1273,6 +1273,10 @@ int writeback_inodes_sb_if_idle(struct s
 {
 	if (!writeback_in_progress(sb->s_bdi)) {
 		down_read(&sb->s_umount);
+		if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
+			up_read(&sb->s_umount);
+			return 0;
+		}
 		writeback_inodes_sb(sb, reason);
 		up_read(&sb->s_umount);
 		return 1;
@@ -1295,6 +1299,10 @@ int writeback_inodes_sb_nr_if_idle(struc
 {
 	if (!writeback_in_progress(sb->s_bdi)) {
 		down_read(&sb->s_umount);
+		if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
+			up_read(&sb->s_umount);
+			return 0;
+		}
 		writeback_inodes_sb_nr(sb, nr, reason);
 		up_read(&sb->s_umount);
 		return 1;

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-28 23:32       ` Mikulas Patocka
@ 2011-11-29 10:19         ` Jan Kara
  2011-11-29 10:21           ` Jan Kara
                             ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Jan Kara @ 2011-11-29 10:19 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Valerie Aurora, Jan Kara, linux-kernel, linux-fsdevel, dm-devel,
	Christopher Chaltain, esandeen

  Hi,

On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > Hi
> > 
> > Where can I get that patch set?
> > 
> > We are experiencing other similar deadlocks on RHEL-6, caused by sync or 
> > background writeback (these code paths take s_umount and wait trying to do 
> > I/O), but I wasn't able to reproduce these deadlocks on upstream kernel? 
> > Are there other known deadlock possibilities?
> 
> I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write 
> deadlock" (I couldn't find the next two parts of the patch in the 
> archives). And the patch looks wrong:
  Yes, that seems to be the series. I generally agree with you that the
last iteration still had some problems and some changes were requested.
That's why it's not merged yet after all...

> - down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not 
> held when the filesystem is frozen and it is taken for write when thawing. 
> Consequently, any task can succeed with down_read_trylock(&sb->s_umount) 
> on a frozen filesystem and if this tasks attempts to do an I/O that is 
> waiting for thaw, it may still deadlock.
  Agreed.

> - skipping sync on frozen filesystem violates sync semantics. 
> Applications, such as databases, assume that when sync finishes, data were 
> written to stable storage. If we skip sync when the filesystem is frozen, 
> we can cause data corruption in these applications (if the system crashes 
> after we skipped a sync).
  Here I don't agree. Filesystem must guarantee there are no dirty data on
a frozen filesystem. Ext4 and XFS do this, ext3 would need proper
page_mkwrite() implementation for this but that's the problem of ext3, not
freezing code in general. If there are no dirty data, sync code (and also
flusher thread) is free to return without doing anything.

That being said, it is hard to implement freeze handling in page_mkwrite()
in such a way that there would be no dirty pages (but we know there cannot
be dirty data in such pages). Currently we mark the page dirty during page
fault and wait for frozen filesystem only after that so that we are
guaranteed that either freezing code will wait for page fault to finish and
will write the page or page fault code notices that freezing is in progress
and blocks (see fs/buffer.c:block_page_mkwrite()).

So I believe the consensus was that we should not block sync or flusher
thread on frozen filesystem. Firstly, it's kind of ugly from user
perspective (you cannot sync filesystems on your system while one
filesystem is frozen???), secondly, in case of flusher thread it has some
serious implications if there are more filesystems on the same device - you
would effectively stop any writeback to the device possibly hanging the
whole system due to dirty limit being exceeded. So at least in these two
cases we should just ignore frozen filesystem.

> - I'm not sure what userspace quota subsystem will do if we start 
> returning -EBUSY spuriously.
  Quota tools will complain to the user which would be fine I think. But
blocking is fine as well. In this particular case I don't care much but it
should be consistent with what happens to sync. So probably Q_SYNC command
should ignore frozen filesystem, Q_SETQUOTA or Q_SETINFO should block.

> There is another thing --- I wasn't able to reproduce these sync-related 
> deadlocks at all. Did anyone succeeded in reproducing them? Are there any 
> reported deadlocks? When freezing the ext3 or ext4 filesystem, the kernel 
> prevents creating new dirty data, syncs all data, and freezes the 
> filesystem. Consequently, the sync function never finds any dirty data and 
> so it doesn't block (sync doesn't writeback ATIME change, I don't know 
> why).
  See above why sync can actually spot some dirty inodes/page (although
there is not any dirty data). Surbhi (added to CC) from Canonical could
actually trigger these races and consequent deadlocks (and I belive some of
their customers as well). Also some RH customers were hitting these
deadlocks (Eric Sandeen was handling those bugs AFAIK) but those were made
happy by my changes to block_page_mkwrite() which made the race window much
narrower.

> I made this patch that fixes the quota issue and possible sync issues. It 
> introduces a function down_read_s_umount(sb); --- this function takes 
> s_umount lock for read, but it waits if the filesystem is suspended.
> 
> There are three more potentially unsafe users: fs/cachefiles/interface.c, 
> fs/nilfs2/ioctl.c, fs/ubifs/budget.c - I didn't fix them because I can't 
> test them.
> 
> Mikulas
> 
> ---
> 
> Fix a s_umount deadlock
> 
> The lock s_umount is taken for write and dropped when we freeze and resume
> a block device.
> 
> Consequently, if some code holds s_umount and performs an I/O, a deadlock may
> happen if the device is suspended.
> 
> Unfortunatelly, it seems that developers are not aware of this restriction
> that I/O must not be peformed with s_umount held.
> 
> These deadlocks were observed:
> * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
>   ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
>   call_rwsem_down_write_failed (the process is trying to resume frozen device, 
>   but it is waiting trying to acquire s_umount)
> * quota: sys_quotactl -> do_quotactl -> vfs_get_dqblk -> dqget ->
>   ext4_acquire_dquot -> ext4_journal_start_sb -> jbd2_journal_start ->
>   start_this_handle (the process is holding s_umount and trying to perform
>   a transaction, waiting for the device being unfrozen)
> 
> * iozone process: sys_sync -> sync_filesystems -> __sync_filesystem ->
>   writeback_inodes_sb -> writeback_inodes_sb_nr -> wait_for_completion
>   (the process is holding s_umount for read and trying to perform some I/O)
> * flush-253:3: kthread -> bdi_start_fn -> bdi_writeback_task ->
>   wb_do_writeback -> wb_writeback -> writeback_sb_inodes ->
>   writeback_single_inode -> do_writepages -> ext4_da_writepages ->
>   ext4_journal_start_sb -> jbd2_journal_start -> start_this_handle
>   (holding s_umount for read too, acquired in writeback_inodes_wb,
>   and trying to perform I/O)
> * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
>   ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
>   call_rwsem_down_write_failed (just like in a previous case: trying to
>   resume frozen device, waiting on s_umount)
> 
> This patch fixes these observed code paths:
> * introduce a function to safely take s_unlock - down_read_s_umount. It takes
>   the lock, check if a filesystem is frozen, if it is, drops the lock and
>   waits for unfreeze.
> * get_super function has a parameter, if the parameter is true, it waits for
>   the device to unfreeze (it fixes quota deadlock and a possible deadlock in
>   fsync_bdev and __invalidate_device)
> * the same for iterate_supers (it fixes the sync deadlock and a possible
>   deadlock in drop_caches_sysctl_handler)
> * grab_super_passive fails if the device is suspended (fixes the inode writeback
>   deadlock and a possible deadlock in prune_super)
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> CC: stable@kernel.org
> 
> ---
>  fs/block_dev.c           |    6 +++---
>  fs/buffer.c              |    2 +-
>  fs/drop_caches.c         |    2 +-
>  fs/fs-writeback.c        |    8 ++++++++
>  fs/quota/quota.c         |    4 ++--
>  fs/super.c               |   29 ++++++++++++++++++++---------
>  fs/sync.c                |    4 ++--
>  include/linux/fs.h       |   28 +++++++++++++++++++++++++---
>  security/selinux/hooks.c |    2 +-
>  9 files changed, 63 insertions(+), 22 deletions(-)
> 
> Index: linux-3.2-rc3-fast/fs/quota/quota.c
> ===================================================================
> --- linux-3.2-rc3-fast.orig/fs/quota/quota.c	2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/quota/quota.c	2011-11-25 05:51:36.000000000 +0100
> @@ -59,7 +59,7 @@ static int quota_sync_all(int type)
>  		return -EINVAL;
>  	ret = security_quotactl(Q_SYNC, type, 0, NULL);
>  	if (!ret)
> -		iterate_supers(quota_sync_one, &type);
> +		iterate_supers(quota_sync_one, &type, true);
>  	return ret;
>  }
>  
> @@ -310,7 +310,7 @@ static struct super_block *quotactl_bloc
>  	putname(tmp);
>  	if (IS_ERR(bdev))
>  		return ERR_CAST(bdev);
> -	sb = get_super(bdev);
> +	sb = get_super(bdev, true);
>  	bdput(bdev);
>  	if (!sb)
>  		return ERR_PTR(-ENODEV);
> Index: linux-3.2-rc3-fast/include/linux/fs.h
> ===================================================================
> --- linux-3.2-rc3-fast.orig/include/linux/fs.h	2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/include/linux/fs.h	2011-11-25 05:56:03.000000000 +0100
> @@ -10,6 +10,7 @@
>  #include <linux/ioctl.h>
>  #include <linux/blk_types.h>
>  #include <linux/types.h>
> +#include <linux/sched.h>
>  
>  /*
>   * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> @@ -1502,6 +1503,26 @@ enum {
>  	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
>  
>  /*
> + * Take s_umount and make sure that the filesystem is not frozen.
> + * This function must be used if we intend to perform any I/O while
> + * holding s_umount.
> + *
> + * s_umount is taken for write when resuming a frozen filesystem,
> + * thus if someone calls down_read(&s->s_umount) and performs any I/O,
> + * it may deadlock.
> + */
> +static inline void down_read_s_umount(struct super_block *sb)
> +{
> +retry:
> +	down_read(&sb->s_umount);
> +	if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> +		up_write(&sb->s_umount);
> +		vfs_check_frozen(sb, SB_FREEZE_WRITE);
> +		goto retry;
> +	}
> +}
> +
> +/*
>   * until VFS tracks user namespaces for inodes, just make all files
>   * belong to init_user_ns
>   */
> @@ -2528,13 +2549,14 @@ extern int generic_block_fiemap(struct i
>  extern void get_filesystem(struct file_system_type *fs);
>  extern void put_filesystem(struct file_system_type *fs);
>  extern struct file_system_type *get_fs_type(const char *name);
> -extern struct super_block *get_super(struct block_device *);
> +extern struct super_block *get_super(struct block_device *, bool);
>  extern struct super_block *get_active_super(struct block_device *bdev);
>  extern struct super_block *user_get_super(dev_t);
>  extern void drop_super(struct super_block *sb);
> -extern void iterate_supers(void (*)(struct super_block *, void *), void *);
> +extern void iterate_supers(void (*)(struct super_block *, void *), void *, bool);
>  extern void iterate_supers_type(struct file_system_type *,
> -			        void (*)(struct super_block *, void *), void *);
> +			        void (*)(struct super_block *, void *), void *,
> +				bool);
>  
>  extern int dcache_dir_open(struct inode *, struct file *);
>  extern int dcache_dir_close(struct inode *, struct file *);
> Index: linux-3.2-rc3-fast/fs/buffer.c
> ===================================================================
> --- linux-3.2-rc3-fast.orig/fs/buffer.c	2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/buffer.c	2011-11-25 05:51:36.000000000 +0100
> @@ -571,7 +571,7 @@ static void do_thaw_one(struct super_blo
>  
>  static void do_thaw_all(struct work_struct *work)
>  {
> -	iterate_supers(do_thaw_one, NULL);
> +	iterate_supers(do_thaw_one, NULL, false);
>  	kfree(work);
>  	printk(KERN_WARNING "Emergency Thaw complete\n");
>  }
> Index: linux-3.2-rc3-fast/fs/super.c
> ===================================================================
> --- linux-3.2-rc3-fast.orig/fs/super.c	2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/super.c	2011-11-29 00:16:01.000000000 +0100
> @@ -337,7 +337,7 @@ bool grab_super_passive(struct super_blo
>  	spin_unlock(&sb_lock);
>  
>  	if (down_read_trylock(&sb->s_umount)) {
> -		if (sb->s_root)
> +		if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
>  			return true;
>  		up_read(&sb->s_umount);
>  	}
> @@ -503,7 +503,7 @@ void sync_supers(void)
>  			sb->s_count++;
>  			spin_unlock(&sb_lock);
>  
> -			down_read(&sb->s_umount);
> +			down_read_s_umount(sb);
>  			if (sb->s_root && sb->s_dirt)
>  				sb->s_op->write_super(sb);
>  			up_read(&sb->s_umount);
> @@ -527,7 +527,8 @@ void sync_supers(void)
>   *	Scans the superblock list and calls given function, passing it
>   *	locked superblock and given argument.
>   */
> -void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
> +void iterate_supers(void (*f)(struct super_block *, void *), void *arg,
> +		    bool wait_for_unfreeze)
>  {
>  	struct super_block *sb, *p = NULL;
>  
> @@ -538,7 +539,10 @@ void iterate_supers(void (*f)(struct sup
>  		sb->s_count++;
>  		spin_unlock(&sb_lock);
>  
> -		down_read(&sb->s_umount);
> +		if (!wait_for_unfreeze)
> +			down_read(&sb->s_umount);
> +		else
> +			down_read_s_umount(sb);
>  		if (sb->s_root)
>  			f(sb, arg);
>  		up_read(&sb->s_umount);
> @@ -563,7 +567,8 @@ void iterate_supers(void (*f)(struct sup
>   *	locked superblock and given argument.
>   */
>  void iterate_supers_type(struct file_system_type *type,
> -	void (*f)(struct super_block *, void *), void *arg)
> +	void (*f)(struct super_block *, void *), void *arg,
> +	bool wait_for_unfreeze)
>  {
>  	struct super_block *sb, *p = NULL;
>  
> @@ -572,7 +577,10 @@ void iterate_supers_type(struct file_sys
>  		sb->s_count++;
>  		spin_unlock(&sb_lock);
>  
> -		down_read(&sb->s_umount);
> +		if (!wait_for_unfreeze)
> +			down_read(&sb->s_umount);
> +		else
> +			down_read_s_umount(sb);
>  		if (sb->s_root)
>  			f(sb, arg);
>  		up_read(&sb->s_umount);
> @@ -597,7 +605,7 @@ EXPORT_SYMBOL(iterate_supers_type);
>   *	mounted on the device given. %NULL is returned if no match is found.
>   */
>  
> -struct super_block *get_super(struct block_device *bdev)
> +struct super_block *get_super(struct block_device *bdev, bool wait_for_unfreeze)
>  {
>  	struct super_block *sb;
>  
> @@ -612,7 +620,10 @@ rescan:
>  		if (sb->s_bdev == bdev) {
>  			sb->s_count++;
>  			spin_unlock(&sb_lock);
> -			down_read(&sb->s_umount);
> +			if (wait_for_unfreeze)
> +				down_read_s_umount(sb);
> +			else
> +				down_read(&sb->s_umount);
>  			/* still alive? */
>  			if (sb->s_root)
>  				return sb;
> @@ -672,7 +683,7 @@ rescan:
>  		if (sb->s_dev ==  dev) {
>  			sb->s_count++;
>  			spin_unlock(&sb_lock);
> -			down_read(&sb->s_umount);
> +			down_read_s_umount(sb);
>  			/* still alive? */
>  			if (sb->s_root)
>  				return sb;
> Index: linux-3.2-rc3-fast/fs/sync.c
> ===================================================================
> --- linux-3.2-rc3-fast.orig/fs/sync.c	2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/sync.c	2011-11-25 05:51:36.000000000 +0100
> @@ -89,7 +89,7 @@ static void sync_one_sb(struct super_blo
>   */
>  static void sync_filesystems(int wait)
>  {
> -	iterate_supers(sync_one_sb, &wait);
> +	iterate_supers(sync_one_sb, &wait, true);
>  }
>  
>  /*
> @@ -144,7 +144,7 @@ SYSCALL_DEFINE1(syncfs, int, fd)
>  		return -EBADF;
>  	sb = file->f_dentry->d_sb;
>  
> -	down_read(&sb->s_umount);
> +	down_read_s_umount(sb);
>  	ret = sync_filesystem(sb);
>  	up_read(&sb->s_umount);
>  
> Index: linux-3.2-rc3-fast/fs/block_dev.c
> ===================================================================
> --- linux-3.2-rc3-fast.orig/fs/block_dev.c	2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/block_dev.c	2011-11-25 05:51:36.000000000 +0100
> @@ -222,7 +222,7 @@ EXPORT_SYMBOL(sync_blockdev);
>   */
>  int fsync_bdev(struct block_device *bdev)
>  {
> -	struct super_block *sb = get_super(bdev);
> +	struct super_block *sb = get_super(bdev, true);
>  	if (sb) {
>  		int res = sync_filesystem(sb);
>  		drop_super(sb);
> @@ -256,7 +256,7 @@ struct super_block *freeze_bdev(struct b
>  		 * to freeze_bdev grab an active reference and only the last
>  		 * thaw_bdev drops it.
>  		 */
> -		sb = get_super(bdev);
> +		sb = get_super(bdev, false);
>  		drop_super(sb);
>  		mutex_unlock(&bdev->bd_fsfreeze_mutex);
>  		return sb;
> @@ -1658,7 +1658,7 @@ EXPORT_SYMBOL(lookup_bdev);
>  
>  int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>  {
> -	struct super_block *sb = get_super(bdev);
> +	struct super_block *sb = get_super(bdev, true);
>  	int res = 0;
>  
>  	if (sb) {
> Index: linux-3.2-rc3-fast/fs/drop_caches.c
> ===================================================================
> --- linux-3.2-rc3-fast.orig/fs/drop_caches.c	2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/drop_caches.c	2011-11-25 05:51:36.000000000 +0100
> @@ -59,7 +59,7 @@ int drop_caches_sysctl_handler(ctl_table
>  		return ret;
>  	if (write) {
>  		if (sysctl_drop_caches & 1)
> -			iterate_supers(drop_pagecache_sb, NULL);
> +			iterate_supers(drop_pagecache_sb, NULL, true);
>  		if (sysctl_drop_caches & 2)
>  			drop_slab();
>  	}
> Index: linux-3.2-rc3-fast/security/selinux/hooks.c
> ===================================================================
> --- linux-3.2-rc3-fast.orig/security/selinux/hooks.c	2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/security/selinux/hooks.c	2011-11-25 05:51:36.000000000 +0100
> @@ -5693,7 +5693,7 @@ void selinux_complete_init(void)
>  
>  	/* Set up any superblocks initialized prior to the policy load. */
>  	printk(KERN_DEBUG "SELinux:  Setting up existing superblocks.\n");
> -	iterate_supers(delayed_superblock_init, NULL);
> +	iterate_supers(delayed_superblock_init, NULL, true);
>  }
>  
>  /* SELinux requires early initialization in order to label
> Index: linux-3.2-rc3-fast/fs/fs-writeback.c
> ===================================================================
> --- linux-3.2-rc3-fast.orig/fs/fs-writeback.c	2011-11-29 00:09:30.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/fs-writeback.c	2011-11-29 00:12:49.000000000 +0100
> @@ -1273,6 +1273,10 @@ int writeback_inodes_sb_if_idle(struct s
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
>  		down_read(&sb->s_umount);
> +		if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> +			up_read(&sb->s_umount);
> +			return 0;
> +		}
>  		writeback_inodes_sb(sb, reason);
>  		up_read(&sb->s_umount);
>  		return 1;
> @@ -1295,6 +1299,10 @@ int writeback_inodes_sb_nr_if_idle(struc
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
>  		down_read(&sb->s_umount);
> +		if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> +			up_read(&sb->s_umount);
> +			return 0;
> +		}
>  		writeback_inodes_sb_nr(sb, nr, reason);
>  		up_read(&sb->s_umount);
>  		return 1;
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-29 10:19         ` Jan Kara
@ 2011-11-29 10:21           ` Jan Kara
  2011-11-29 11:06             ` Mikulas Patocka
  2011-11-30 13:33           ` Alasdair G Kergon
  2011-11-30 14:09           ` Alasdair G Kergon
  2 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2011-11-29 10:21 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Valerie Aurora, Jan Kara, linux-kernel, linux-fsdevel, dm-devel,
	Christopher Chaltain, esandeen, Surbhi Palande

  Sorry for replying once more - forgot to add those CCs - so please reply
to this email...

On Tue 29-11-11 11:19:01, Jan Kara wrote:
>   Hi,
> 
> On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > > Hi
> > > 
> > > Where can I get that patch set?
> > > 
> > > We are experiencing other similar deadlocks on RHEL-6, caused by sync or 
> > > background writeback (these code paths take s_umount and wait trying to do 
> > > I/O), but I wasn't able to reproduce these deadlocks on upstream kernel? 
> > > Are there other known deadlock possibilities?
> > 
> > I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write 
> > deadlock" (I couldn't find the next two parts of the patch in the 
> > archives). And the patch looks wrong:
>   Yes, that seems to be the series. I generally agree with you that the
> last iteration still had some problems and some changes were requested.
> That's why it's not merged yet after all...
> 
> > - down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not 
> > held when the filesystem is frozen and it is taken for write when thawing. 
> > Consequently, any task can succeed with down_read_trylock(&sb->s_umount) 
> > on a frozen filesystem and if this tasks attempts to do an I/O that is 
> > waiting for thaw, it may still deadlock.
>   Agreed.
> 
> > - skipping sync on frozen filesystem violates sync semantics. 
> > Applications, such as databases, assume that when sync finishes, data were 
> > written to stable storage. If we skip sync when the filesystem is frozen, 
> > we can cause data corruption in these applications (if the system crashes 
> > after we skipped a sync).
>   Here I don't agree. Filesystem must guarantee there are no dirty data on
> a frozen filesystem. Ext4 and XFS do this, ext3 would need proper
> page_mkwrite() implementation for this but that's the problem of ext3, not
> freezing code in general. If there are no dirty data, sync code (and also
> flusher thread) is free to return without doing anything.
> 
> That being said, it is hard to implement freeze handling in page_mkwrite()
> in such a way that there would be no dirty pages (but we know there cannot
> be dirty data in such pages). Currently we mark the page dirty during page
> fault and wait for frozen filesystem only after that so that we are
> guaranteed that either freezing code will wait for page fault to finish and
> will write the page or page fault code notices that freezing is in progress
> and blocks (see fs/buffer.c:block_page_mkwrite()).
> 
> So I believe the consensus was that we should not block sync or flusher
> thread on frozen filesystem. Firstly, it's kind of ugly from user
> perspective (you cannot sync filesystems on your system while one
> filesystem is frozen???), secondly, in case of flusher thread it has some
> serious implications if there are more filesystems on the same device - you
> would effectively stop any writeback to the device possibly hanging the
> whole system due to dirty limit being exceeded. So at least in these two
> cases we should just ignore frozen filesystem.
> 
> > - I'm not sure what userspace quota subsystem will do if we start 
> > returning -EBUSY spuriously.
>   Quota tools will complain to the user which would be fine I think. But
> blocking is fine as well. In this particular case I don't care much but it
> should be consistent with what happens to sync. So probably Q_SYNC command
> should ignore frozen filesystem, Q_SETQUOTA or Q_SETINFO should block.
> 
> > There is another thing --- I wasn't able to reproduce these sync-related 
> > deadlocks at all. Did anyone succeeded in reproducing them? Are there any 
> > reported deadlocks? When freezing the ext3 or ext4 filesystem, the kernel 
> > prevents creating new dirty data, syncs all data, and freezes the 
> > filesystem. Consequently, the sync function never finds any dirty data and 
> > so it doesn't block (sync doesn't writeback ATIME change, I don't know 
> > why).
>   See above why sync can actually spot some dirty inodes/page (although
> there is not any dirty data). Surbhi (added to CC) from Canonical could
> actually trigger these races and consequent deadlocks (and I belive some of
> their customers as well). Also some RH customers were hitting these
> deadlocks (Eric Sandeen was handling those bugs AFAIK) but those were made
> happy by my changes to block_page_mkwrite() which made the race window much
> narrower.
> 
> > I made this patch that fixes the quota issue and possible sync issues. It 
> > introduces a function down_read_s_umount(sb); --- this function takes 
> > s_umount lock for read, but it waits if the filesystem is suspended.
> > 
> > There are three more potentially unsafe users: fs/cachefiles/interface.c, 
> > fs/nilfs2/ioctl.c, fs/ubifs/budget.c - I didn't fix them because I can't 
> > test them.
> > 
> > Mikulas
> > 
> > ---
> > 
> > Fix a s_umount deadlock
> > 
> > The lock s_umount is taken for write and dropped when we freeze and resume
> > a block device.
> > 
> > Consequently, if some code holds s_umount and performs an I/O, a deadlock may
> > happen if the device is suspended.
> > 
> > Unfortunatelly, it seems that developers are not aware of this restriction
> > that I/O must not be peformed with s_umount held.
> > 
> > These deadlocks were observed:
> > * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
> >   ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
> >   call_rwsem_down_write_failed (the process is trying to resume frozen device, 
> >   but it is waiting trying to acquire s_umount)
> > * quota: sys_quotactl -> do_quotactl -> vfs_get_dqblk -> dqget ->
> >   ext4_acquire_dquot -> ext4_journal_start_sb -> jbd2_journal_start ->
> >   start_this_handle (the process is holding s_umount and trying to perform
> >   a transaction, waiting for the device being unfrozen)
> > 
> > * iozone process: sys_sync -> sync_filesystems -> __sync_filesystem ->
> >   writeback_inodes_sb -> writeback_inodes_sb_nr -> wait_for_completion
> >   (the process is holding s_umount for read and trying to perform some I/O)
> > * flush-253:3: kthread -> bdi_start_fn -> bdi_writeback_task ->
> >   wb_do_writeback -> wb_writeback -> writeback_sb_inodes ->
> >   writeback_single_inode -> do_writepages -> ext4_da_writepages ->
> >   ext4_journal_start_sb -> jbd2_journal_start -> start_this_handle
> >   (holding s_umount for read too, acquired in writeback_inodes_wb,
> >   and trying to perform I/O)
> > * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
> >   ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
> >   call_rwsem_down_write_failed (just like in a previous case: trying to
> >   resume frozen device, waiting on s_umount)
> > 
> > This patch fixes these observed code paths:
> > * introduce a function to safely take s_unlock - down_read_s_umount. It takes
> >   the lock, check if a filesystem is frozen, if it is, drops the lock and
> >   waits for unfreeze.
> > * get_super function has a parameter, if the parameter is true, it waits for
> >   the device to unfreeze (it fixes quota deadlock and a possible deadlock in
> >   fsync_bdev and __invalidate_device)
> > * the same for iterate_supers (it fixes the sync deadlock and a possible
> >   deadlock in drop_caches_sysctl_handler)
> > * grab_super_passive fails if the device is suspended (fixes the inode writeback
> >   deadlock and a possible deadlock in prune_super)
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > CC: stable@kernel.org
> > 
> > ---
> >  fs/block_dev.c           |    6 +++---
> >  fs/buffer.c              |    2 +-
> >  fs/drop_caches.c         |    2 +-
> >  fs/fs-writeback.c        |    8 ++++++++
> >  fs/quota/quota.c         |    4 ++--
> >  fs/super.c               |   29 ++++++++++++++++++++---------
> >  fs/sync.c                |    4 ++--
> >  include/linux/fs.h       |   28 +++++++++++++++++++++++++---
> >  security/selinux/hooks.c |    2 +-
> >  9 files changed, 63 insertions(+), 22 deletions(-)
> > 
> > Index: linux-3.2-rc3-fast/fs/quota/quota.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/quota/quota.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/quota/quota.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -59,7 +59,7 @@ static int quota_sync_all(int type)
> >  		return -EINVAL;
> >  	ret = security_quotactl(Q_SYNC, type, 0, NULL);
> >  	if (!ret)
> > -		iterate_supers(quota_sync_one, &type);
> > +		iterate_supers(quota_sync_one, &type, true);
> >  	return ret;
> >  }
> >  
> > @@ -310,7 +310,7 @@ static struct super_block *quotactl_bloc
> >  	putname(tmp);
> >  	if (IS_ERR(bdev))
> >  		return ERR_CAST(bdev);
> > -	sb = get_super(bdev);
> > +	sb = get_super(bdev, true);
> >  	bdput(bdev);
> >  	if (!sb)
> >  		return ERR_PTR(-ENODEV);
> > Index: linux-3.2-rc3-fast/include/linux/fs.h
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/include/linux/fs.h	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/include/linux/fs.h	2011-11-25 05:56:03.000000000 +0100
> > @@ -10,6 +10,7 @@
> >  #include <linux/ioctl.h>
> >  #include <linux/blk_types.h>
> >  #include <linux/types.h>
> > +#include <linux/sched.h>
> >  
> >  /*
> >   * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> > @@ -1502,6 +1503,26 @@ enum {
> >  	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
> >  
> >  /*
> > + * Take s_umount and make sure that the filesystem is not frozen.
> > + * This function must be used if we intend to perform any I/O while
> > + * holding s_umount.
> > + *
> > + * s_umount is taken for write when resuming a frozen filesystem,
> > + * thus if someone calls down_read(&s->s_umount) and performs any I/O,
> > + * it may deadlock.
> > + */
> > +static inline void down_read_s_umount(struct super_block *sb)
> > +{
> > +retry:
> > +	down_read(&sb->s_umount);
> > +	if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > +		up_write(&sb->s_umount);
> > +		vfs_check_frozen(sb, SB_FREEZE_WRITE);
> > +		goto retry;
> > +	}
> > +}
> > +
> > +/*
> >   * until VFS tracks user namespaces for inodes, just make all files
> >   * belong to init_user_ns
> >   */
> > @@ -2528,13 +2549,14 @@ extern int generic_block_fiemap(struct i
> >  extern void get_filesystem(struct file_system_type *fs);
> >  extern void put_filesystem(struct file_system_type *fs);
> >  extern struct file_system_type *get_fs_type(const char *name);
> > -extern struct super_block *get_super(struct block_device *);
> > +extern struct super_block *get_super(struct block_device *, bool);
> >  extern struct super_block *get_active_super(struct block_device *bdev);
> >  extern struct super_block *user_get_super(dev_t);
> >  extern void drop_super(struct super_block *sb);
> > -extern void iterate_supers(void (*)(struct super_block *, void *), void *);
> > +extern void iterate_supers(void (*)(struct super_block *, void *), void *, bool);
> >  extern void iterate_supers_type(struct file_system_type *,
> > -			        void (*)(struct super_block *, void *), void *);
> > +			        void (*)(struct super_block *, void *), void *,
> > +				bool);
> >  
> >  extern int dcache_dir_open(struct inode *, struct file *);
> >  extern int dcache_dir_close(struct inode *, struct file *);
> > Index: linux-3.2-rc3-fast/fs/buffer.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/buffer.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/buffer.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -571,7 +571,7 @@ static void do_thaw_one(struct super_blo
> >  
> >  static void do_thaw_all(struct work_struct *work)
> >  {
> > -	iterate_supers(do_thaw_one, NULL);
> > +	iterate_supers(do_thaw_one, NULL, false);
> >  	kfree(work);
> >  	printk(KERN_WARNING "Emergency Thaw complete\n");
> >  }
> > Index: linux-3.2-rc3-fast/fs/super.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/super.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/super.c	2011-11-29 00:16:01.000000000 +0100
> > @@ -337,7 +337,7 @@ bool grab_super_passive(struct super_blo
> >  	spin_unlock(&sb_lock);
> >  
> >  	if (down_read_trylock(&sb->s_umount)) {
> > -		if (sb->s_root)
> > +		if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
> >  			return true;
> >  		up_read(&sb->s_umount);
> >  	}
> > @@ -503,7 +503,7 @@ void sync_supers(void)
> >  			sb->s_count++;
> >  			spin_unlock(&sb_lock);
> >  
> > -			down_read(&sb->s_umount);
> > +			down_read_s_umount(sb);
> >  			if (sb->s_root && sb->s_dirt)
> >  				sb->s_op->write_super(sb);
> >  			up_read(&sb->s_umount);
> > @@ -527,7 +527,8 @@ void sync_supers(void)
> >   *	Scans the superblock list and calls given function, passing it
> >   *	locked superblock and given argument.
> >   */
> > -void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
> > +void iterate_supers(void (*f)(struct super_block *, void *), void *arg,
> > +		    bool wait_for_unfreeze)
> >  {
> >  	struct super_block *sb, *p = NULL;
> >  
> > @@ -538,7 +539,10 @@ void iterate_supers(void (*f)(struct sup
> >  		sb->s_count++;
> >  		spin_unlock(&sb_lock);
> >  
> > -		down_read(&sb->s_umount);
> > +		if (!wait_for_unfreeze)
> > +			down_read(&sb->s_umount);
> > +		else
> > +			down_read_s_umount(sb);
> >  		if (sb->s_root)
> >  			f(sb, arg);
> >  		up_read(&sb->s_umount);
> > @@ -563,7 +567,8 @@ void iterate_supers(void (*f)(struct sup
> >   *	locked superblock and given argument.
> >   */
> >  void iterate_supers_type(struct file_system_type *type,
> > -	void (*f)(struct super_block *, void *), void *arg)
> > +	void (*f)(struct super_block *, void *), void *arg,
> > +	bool wait_for_unfreeze)
> >  {
> >  	struct super_block *sb, *p = NULL;
> >  
> > @@ -572,7 +577,10 @@ void iterate_supers_type(struct file_sys
> >  		sb->s_count++;
> >  		spin_unlock(&sb_lock);
> >  
> > -		down_read(&sb->s_umount);
> > +		if (!wait_for_unfreeze)
> > +			down_read(&sb->s_umount);
> > +		else
> > +			down_read_s_umount(sb);
> >  		if (sb->s_root)
> >  			f(sb, arg);
> >  		up_read(&sb->s_umount);
> > @@ -597,7 +605,7 @@ EXPORT_SYMBOL(iterate_supers_type);
> >   *	mounted on the device given. %NULL is returned if no match is found.
> >   */
> >  
> > -struct super_block *get_super(struct block_device *bdev)
> > +struct super_block *get_super(struct block_device *bdev, bool wait_for_unfreeze)
> >  {
> >  	struct super_block *sb;
> >  
> > @@ -612,7 +620,10 @@ rescan:
> >  		if (sb->s_bdev == bdev) {
> >  			sb->s_count++;
> >  			spin_unlock(&sb_lock);
> > -			down_read(&sb->s_umount);
> > +			if (wait_for_unfreeze)
> > +				down_read_s_umount(sb);
> > +			else
> > +				down_read(&sb->s_umount);
> >  			/* still alive? */
> >  			if (sb->s_root)
> >  				return sb;
> > @@ -672,7 +683,7 @@ rescan:
> >  		if (sb->s_dev ==  dev) {
> >  			sb->s_count++;
> >  			spin_unlock(&sb_lock);
> > -			down_read(&sb->s_umount);
> > +			down_read_s_umount(sb);
> >  			/* still alive? */
> >  			if (sb->s_root)
> >  				return sb;
> > Index: linux-3.2-rc3-fast/fs/sync.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/sync.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/sync.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -89,7 +89,7 @@ static void sync_one_sb(struct super_blo
> >   */
> >  static void sync_filesystems(int wait)
> >  {
> > -	iterate_supers(sync_one_sb, &wait);
> > +	iterate_supers(sync_one_sb, &wait, true);
> >  }
> >  
> >  /*
> > @@ -144,7 +144,7 @@ SYSCALL_DEFINE1(syncfs, int, fd)
> >  		return -EBADF;
> >  	sb = file->f_dentry->d_sb;
> >  
> > -	down_read(&sb->s_umount);
> > +	down_read_s_umount(sb);
> >  	ret = sync_filesystem(sb);
> >  	up_read(&sb->s_umount);
> >  
> > Index: linux-3.2-rc3-fast/fs/block_dev.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/block_dev.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/block_dev.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -222,7 +222,7 @@ EXPORT_SYMBOL(sync_blockdev);
> >   */
> >  int fsync_bdev(struct block_device *bdev)
> >  {
> > -	struct super_block *sb = get_super(bdev);
> > +	struct super_block *sb = get_super(bdev, true);
> >  	if (sb) {
> >  		int res = sync_filesystem(sb);
> >  		drop_super(sb);
> > @@ -256,7 +256,7 @@ struct super_block *freeze_bdev(struct b
> >  		 * to freeze_bdev grab an active reference and only the last
> >  		 * thaw_bdev drops it.
> >  		 */
> > -		sb = get_super(bdev);
> > +		sb = get_super(bdev, false);
> >  		drop_super(sb);
> >  		mutex_unlock(&bdev->bd_fsfreeze_mutex);
> >  		return sb;
> > @@ -1658,7 +1658,7 @@ EXPORT_SYMBOL(lookup_bdev);
> >  
> >  int __invalidate_device(struct block_device *bdev, bool kill_dirty)
> >  {
> > -	struct super_block *sb = get_super(bdev);
> > +	struct super_block *sb = get_super(bdev, true);
> >  	int res = 0;
> >  
> >  	if (sb) {
> > Index: linux-3.2-rc3-fast/fs/drop_caches.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/drop_caches.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/drop_caches.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -59,7 +59,7 @@ int drop_caches_sysctl_handler(ctl_table
> >  		return ret;
> >  	if (write) {
> >  		if (sysctl_drop_caches & 1)
> > -			iterate_supers(drop_pagecache_sb, NULL);
> > +			iterate_supers(drop_pagecache_sb, NULL, true);
> >  		if (sysctl_drop_caches & 2)
> >  			drop_slab();
> >  	}
> > Index: linux-3.2-rc3-fast/security/selinux/hooks.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/security/selinux/hooks.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/security/selinux/hooks.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -5693,7 +5693,7 @@ void selinux_complete_init(void)
> >  
> >  	/* Set up any superblocks initialized prior to the policy load. */
> >  	printk(KERN_DEBUG "SELinux:  Setting up existing superblocks.\n");
> > -	iterate_supers(delayed_superblock_init, NULL);
> > +	iterate_supers(delayed_superblock_init, NULL, true);
> >  }
> >  
> >  /* SELinux requires early initialization in order to label
> > Index: linux-3.2-rc3-fast/fs/fs-writeback.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/fs-writeback.c	2011-11-29 00:09:30.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/fs-writeback.c	2011-11-29 00:12:49.000000000 +0100
> > @@ -1273,6 +1273,10 @@ int writeback_inodes_sb_if_idle(struct s
> >  {
> >  	if (!writeback_in_progress(sb->s_bdi)) {
> >  		down_read(&sb->s_umount);
> > +		if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > +			up_read(&sb->s_umount);
> > +			return 0;
> > +		}
> >  		writeback_inodes_sb(sb, reason);
> >  		up_read(&sb->s_umount);
> >  		return 1;
> > @@ -1295,6 +1299,10 @@ int writeback_inodes_sb_nr_if_idle(struc
> >  {
> >  	if (!writeback_in_progress(sb->s_bdi)) {
> >  		down_read(&sb->s_umount);
> > +		if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > +			up_read(&sb->s_umount);
> > +			return 0;
> > +		}
> >  		writeback_inodes_sb_nr(sb, nr, reason);
> >  		up_read(&sb->s_umount);
> >  		return 1;
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-29 10:21           ` Jan Kara
@ 2011-11-29 11:06             ` Mikulas Patocka
  2011-11-29 11:11               ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-29 11:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Valerie Aurora, linux-kernel, linux-fsdevel, dm-devel,
	Christopher Chaltain, esandeen, Surbhi Palande



On Tue 29-11-11 11:19:01, Jan Kara wrote:
>   Hi,
> 
> On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > > Hi
> > > 
> > > Where can I get that patch set?
> > > 
> > > We are experiencing other similar deadlocks on RHEL-6, caused by sync or 
> > > background writeback (these code paths take s_umount and wait trying to do 
> > > I/O), but I wasn't able to reproduce these deadlocks on upstream kernel? 
> > > Are there other known deadlock possibilities?
> > 
> > I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write 
> > deadlock" (I couldn't find the next two parts of the patch in the 
> > archives). And the patch looks wrong:
>   Yes, that seems to be the series. I generally agree with you that the
> last iteration still had some problems and some changes were requested.
> That's why it's not merged yet after all...
> 
> > - down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not 
> > held when the filesystem is frozen and it is taken for write when thawing. 
> > Consequently, any task can succeed with down_read_trylock(&sb->s_umount) 
> > on a frozen filesystem and if this tasks attempts to do an I/O that is 
> > waiting for thaw, it may still deadlock.
>   Agreed.
> 
> > - skipping sync on frozen filesystem violates sync semantics. 
> > Applications, such as databases, assume that when sync finishes, data were 
> > written to stable storage. If we skip sync when the filesystem is frozen, 
> > we can cause data corruption in these applications (if the system crashes 
> > after we skipped a sync).
>   Here I don't agree. Filesystem must guarantee there are no dirty data on
> a frozen filesystem.

This is technically impossible to achieve on ext2, fat or other 
non-transactional filesystems. These filesystems have no locks around code 
paths that set data or inodes dirty. And you still need working sync for 
ext2. So the best thing to do in sync is to wait until the filesystem is 
unfrozen.

> Ext4 and XFS do this, ext3 would need proper
> page_mkwrite() implementation for this but that's the problem of ext3, not
> freezing code in general. If there are no dirty data, sync code (and also
> flusher thread) is free to return without doing anything.
> 
> That being said, it is hard to implement freeze handling in page_mkwrite()
> in such a way that there would be no dirty pages (but we know there cannot
> be dirty data in such pages). Currently we mark the page dirty during page
> fault and wait for frozen filesystem only after that so that we are
> guaranteed that either freezing code will wait for page fault to finish and
> will write the page or page fault code notices that freezing is in progress
> and blocks (see fs/buffer.c:block_page_mkwrite()).
> 
> So I believe the consensus was that we should not block sync or flusher
> thread on frozen filesystem. Firstly, it's kind of ugly from user
> perspective (you cannot sync filesystems on your system while one
> filesystem is frozen???), secondly, in case of flusher thread it has some
> serious implications if there are more filesystems on the same device - you
> would effectively stop any writeback to the device possibly hanging the
> whole system due to dirty limit being exceeded. So at least in these two
> cases we should just ignore frozen filesystem.

For background writes I agree, that they should skip frozen filesystems. 
But for synchronous writes (sync) must wait, at least on non-transactional 
filesystems.

> > - I'm not sure what userspace quota subsystem will do if we start 
> > returning -EBUSY spuriously.
>   Quota tools will complain to the user which would be fine I think. But
> blocking is fine as well. In this particular case I don't care much but it
> should be consistent with what happens to sync. So probably Q_SYNC command
> should ignore frozen filesystem, Q_SETQUOTA or Q_SETINFO should block.
> 
> > There is another thing --- I wasn't able to reproduce these sync-related 
> > deadlocks at all. Did anyone succeeded in reproducing them? Are there any 
> > reported deadlocks? When freezing the ext3 or ext4 filesystem, the kernel 
> > prevents creating new dirty data, syncs all data, and freezes the 
> > filesystem. Consequently, the sync function never finds any dirty data and 
> > so it doesn't block (sync doesn't writeback ATIME change, I don't know 
> > why).
>   See above why sync can actually spot some dirty inodes/page (although
> there is not any dirty data). Surbhi (added to CC) from Canonical could
> actually trigger these races and consequent deadlocks (and I belive some of
> their customers as well). Also some RH customers were hitting these
> deadlocks (Eric Sandeen was handling those bugs AFAIK) but those were made
> happy by my changes to block_page_mkwrite() which made the race window much
> narrower.

Mikulas

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-29 11:06             ` Mikulas Patocka
@ 2011-11-29 11:11               ` Jan Kara
  2011-11-29 12:54                 ` Mikulas Patocka
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2011-11-29 11:11 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jan Kara, Valerie Aurora, linux-kernel, linux-fsdevel, dm-devel,
	Christopher Chaltain, esandeen, Surbhi Palande

On Tue 29-11-11 06:06:21, Mikulas Patocka wrote:
> On Tue 29-11-11 11:19:01, Jan Kara wrote:
> >   Hi,
> > 
> > On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > > > Hi
> > > > 
> > > > Where can I get that patch set?
> > > > 
> > > > We are experiencing other similar deadlocks on RHEL-6, caused by sync or 
> > > > background writeback (these code paths take s_umount and wait trying to do 
> > > > I/O), but I wasn't able to reproduce these deadlocks on upstream kernel? 
> > > > Are there other known deadlock possibilities?
> > > 
> > > I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write 
> > > deadlock" (I couldn't find the next two parts of the patch in the 
> > > archives). And the patch looks wrong:
> >   Yes, that seems to be the series. I generally agree with you that the
> > last iteration still had some problems and some changes were requested.
> > That's why it's not merged yet after all...
> > 
> > > - down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not 
> > > held when the filesystem is frozen and it is taken for write when thawing. 
> > > Consequently, any task can succeed with down_read_trylock(&sb->s_umount) 
> > > on a frozen filesystem and if this tasks attempts to do an I/O that is 
> > > waiting for thaw, it may still deadlock.
> >   Agreed.
> > 
> > > - skipping sync on frozen filesystem violates sync semantics. 
> > > Applications, such as databases, assume that when sync finishes, data were 
> > > written to stable storage. If we skip sync when the filesystem is frozen, 
> > > we can cause data corruption in these applications (if the system crashes 
> > > after we skipped a sync).
> >   Here I don't agree. Filesystem must guarantee there are no dirty data on
> > a frozen filesystem.
> 
> This is technically impossible to achieve on ext2, fat or other 
> non-transactional filesystems. These filesystems have no locks around code 
> paths that set data or inodes dirty. And you still need working sync for 
> ext2. So the best thing to do in sync is to wait until the filesystem is 
> unfrozen.
  Then suspend is effectively unsupported on the filesystem and should
return EOPNOTSUPP? At least that's what I'd expect...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-29 11:11               ` Jan Kara
@ 2011-11-29 12:54                 ` Mikulas Patocka
  2011-11-29 13:09                   ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-29 12:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Valerie Aurora, linux-kernel, linux-fsdevel, dm-devel,
	Christopher Chaltain, esandeen, Surbhi Palande

> > This is technically impossible to achieve on ext2, fat or other 
> > non-transactional filesystems. These filesystems have no locks around code 
> > paths that set data or inodes dirty. And you still need working sync for 
> > ext2. So the best thing to do in sync is to wait until the filesystem is 
> > unfrozen.
>   Then suspend is effectively unsupported on the filesystem and should
> return EOPNOTSUPP? At least that's what I'd expect...
> 
> 									Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

LVM uses suspend every time it changes layout of the logical volume. For 
example when it converts to/from mirrored format, extends/shrinks the 
volume, moves the volume to a different disk, takes a snapshots, merges a 
snapshot back, on mirror or multipath failover.

For most of these actions (except taking a snapshot) it is irrelevant if 
there are dirty data in the filesystem cache or not while it is suspended. 
So there is no point in banning suspend on ext2. If you banned it, you 
couldn't use ext2 on LVM at all.

Mikulas

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-29 12:54                 ` Mikulas Patocka
@ 2011-11-29 13:09                   ` Jan Kara
  2011-11-29 13:18                     ` [dm-devel] " Alasdair G Kergon
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2011-11-29 13:09 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jan Kara, Valerie Aurora, linux-kernel, linux-fsdevel, dm-devel,
	Christopher Chaltain, esandeen, Surbhi Palande

On Tue 29-11-11 07:54:28, Mikulas Patocka wrote:
> > > This is technically impossible to achieve on ext2, fat or other 
> > > non-transactional filesystems. These filesystems have no locks around code 
> > > paths that set data or inodes dirty. And you still need working sync for 
> > > ext2. So the best thing to do in sync is to wait until the filesystem is 
> > > unfrozen.
> >   Then suspend is effectively unsupported on the filesystem and should
> > return EOPNOTSUPP? At least that's what I'd expect...
> 
> LVM uses suspend every time it changes layout of the logical volume. For 
> example when it converts to/from mirrored format, extends/shrinks the 
> volume, moves the volume to a different disk, takes a snapshots, merges a 
> snapshot back, on mirror or multipath failover.
> 
> For most of these actions (except taking a snapshot) it is irrelevant if 
> there are dirty data in the filesystem cache or not while it is suspended. 
  Hmm, then why do these operations suspend the filesystem if they
apparently don't need it? Sorry for my ignorance, I never seriously worked
with LVM code...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-29 13:09                   ` Jan Kara
@ 2011-11-29 13:18                     ` Alasdair G Kergon
  2011-11-29 13:32                       ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Alasdair G Kergon @ 2011-11-29 13:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mikulas Patocka, esandeen, Surbhi Palande, linux-kernel,
	dm-devel, linux-fsdevel, Christopher Chaltain, Valerie Aurora

On Tue, Nov 29, 2011 at 02:09:13PM +0100, Jan Kara wrote:
>   Hmm, then why do these operations suspend the filesystem if they
> apparently don't need it? Sorry for my ignorance, I never seriously worked
> with LVM code...

They don't suspend it if they don't need to.

dm-ioctl.h:
/*
 * Set this to avoid attempting to freeze any filesystem when suspending.
 */
#define DM_SKIP_LOCKFS_FLAG     (1 << 10) /* In */

Alasdair
 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-29 13:18                     ` [dm-devel] " Alasdair G Kergon
@ 2011-11-29 13:32                       ` Jan Kara
  2011-11-29 16:33                         ` Eric Sandeen
  2011-11-30  6:52                         ` Mikulas Patocka
  0 siblings, 2 replies; 33+ messages in thread
From: Jan Kara @ 2011-11-29 13:32 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Jan Kara, Mikulas Patocka, esandeen, Surbhi Palande,
	linux-kernel, dm-devel, linux-fsdevel, Christopher Chaltain,
	Valerie Aurora

On Tue 29-11-11 13:18:11, Alasdair G Kergon wrote:
> On Tue, Nov 29, 2011 at 02:09:13PM +0100, Jan Kara wrote:
> >   Hmm, then why do these operations suspend the filesystem if they
> > apparently don't need it? Sorry for my ignorance, I never seriously worked
> > with LVM code...
> 
> They don't suspend it if they don't need to.
> 
> dm-ioctl.h:
> /*
>  * Set this to avoid attempting to freeze any filesystem when suspending.
>  */
> #define DM_SKIP_LOCKFS_FLAG     (1 << 10) /* In */
  Thanks. I was now checking in detail and indeed FIFREEZE fails if
->freeze_fs is not set. And only xfs, ext3, ext4, reiserfs, jfs, nilfs2,
and gfs2 provide this function. So I was correct in assuming that when
filesystem supports FIFREEZE it must make sure no modifications happen to
the filesystem. So I believe that my original plan for sync to skip frozen
filesystem is correct.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-29 13:32                       ` Jan Kara
@ 2011-11-29 16:33                         ` Eric Sandeen
  2011-11-30  6:52                         ` Mikulas Patocka
  1 sibling, 0 replies; 33+ messages in thread
From: Eric Sandeen @ 2011-11-29 16:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alasdair G Kergon, Mikulas Patocka, Surbhi Palande, linux-kernel,
	dm-devel, linux-fsdevel, Christopher Chaltain, Valerie Aurora

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/29/11 7:32 AM, Jan Kara wrote:
> On Tue 29-11-11 13:18:11, Alasdair G Kergon wrote:
>> On Tue, Nov 29, 2011 at 02:09:13PM +0100, Jan Kara wrote:
>>>   Hmm, then why do these operations suspend the filesystem if they
>>> apparently don't need it? Sorry for my ignorance, I never seriously worked
>>> with LVM code...
>>
>> They don't suspend it if they don't need to.
>>
>> dm-ioctl.h:
>> /*
>>  * Set this to avoid attempting to freeze any filesystem when suspending.
>>  */
>> #define DM_SKIP_LOCKFS_FLAG     (1 << 10) /* In */
>   Thanks. I was now checking in detail and indeed FIFREEZE fails if
> ->freeze_fs is not set. And only xfs, ext3, ext4, reiserfs, jfs, nilfs2,
> and gfs2 provide this function. So I was correct in assuming that when
> filesystem supports FIFREEZE it must make sure no modifications happen to
> the filesystem. So I believe that my original plan for sync to skip frozen
> filesystem is correct.

Jan, FWIW I agree with this as well.

- -Eric
 
> 								Honza

-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJO1QlpAAoJECCuFpLhPd7g3wEP/1C7T280LLZHubBMolF1dEp0
qJPQAcMUCcwMn+ewMZF5Pg/Li5WxveLNydzUWURrmnXolHddkvmoUESeJt6ugraT
i0bgb3ICRKpYi/1GEsW8UZQMXWIjdg5qyEvagDou/oy7qPXeND+JTvSYdqoti0Fn
pKv9ZUuU2arPrdDqCHP0lmiU1mF85/h/liB09GdFll3nLvSahJK5N9Cr6d0jG4VL
9faJSKI4EgVvyDPWx+L4mA5aIbPveQY0LbzpS+QSGswkd4ylVvLCV76oYj4jEH7R
hbv8OjgWTEKV8wgqmFFF/HXwzmJqzfT0rFCgbe4ZCJdnNujRe2h/wg5Gpkbj3lWc
htwUZfKKg9Vm9PheqMVxbRYEbWXB+g2Seiv/5fHNNG3WeFuPo15ftcu3Kjj7z/gS
zY1PnkHJj7/fF9tyZyTGb/AXicJyLJN7gWvf8pWNETXd4PfuDHgD70iQfWbZdKfe
0a+G4AGhXJ9zrjImM1vnsq09AMTQApL+Lm+VxdPfUbFBZd9pSC+xbBDKvOl8qy3A
Gllbjtbwi/3hV2S8t4vSLLowRTZqkLC+VHphnyLlwVeHAUsqZXQ58ZrFNx2NaYdL
UGkEty8zHGL08Yglwfd2iSlzDb/0OmCurXSZcAAXH47AAzPxZgN80KFbwc2PsGG7
DC9hxrb1bGZojdDTsuku
=UWkJ
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-28 21:00   ` Valerie Aurora
  2011-11-28 21:14     ` Mikulas Patocka
@ 2011-11-29 20:00     ` Kamal Mostafa
  1 sibling, 0 replies; 33+ messages in thread
From: Kamal Mostafa @ 2011-11-29 20:00 UTC (permalink / raw)
  To: Valerie Aurora
  Cc: Jan Kara, Mikulas Patocka, linux-kernel, linux-fsdevel, dm-devel,
	Christopher Chaltain, Surbhi Palande

[-- Attachment #1: Type: text/plain, Size: 594 bytes --]

> On Mon, Nov 28, 2011 at 8:04 AM, Jan Kara <jack@suse.cz> wrote:
> >  Thanks for the patch. I'm aware of the deadlock and Val Henson is working
> > on resolving these types of deadlocks more systematically. [...]

On Mon, 2011-11-28 at 14:00 -0700, Valerie Aurora wrote:
> FYI, I no longer have time to consult in addition to my full-time job.
>  Canonical is taking over this patch set.

We've been testing the patch set Val refers to (patches developed by her
and by Surbhi Palande) within Canonical.  I will resubmit them on their
behalf within the next day or two.

 -Kamal


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-29 13:32                       ` Jan Kara
  2011-11-29 16:33                         ` Eric Sandeen
@ 2011-11-30  6:52                         ` Mikulas Patocka
  2011-11-30 11:16                           ` Jan Kara
  1 sibling, 1 reply; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-30  6:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alasdair G Kergon, esandeen, Surbhi Palande, linux-kernel,
	dm-devel, linux-fsdevel, Christopher Chaltain, Valerie Aurora

> > dm-ioctl.h:
> > /*
> >  * Set this to avoid attempting to freeze any filesystem when suspending.
> >  */
> > #define DM_SKIP_LOCKFS_FLAG     (1 << 10) /* In */
>   Thanks. I was now checking in detail and indeed FIFREEZE fails if
> ->freeze_fs is not set. And only xfs, ext3, ext4, reiserfs, jfs, nilfs2,
> and gfs2 provide this function. So I was correct in assuming that when
> filesystem supports FIFREEZE it must make sure no modifications happen to
> the filesystem. So I believe that my original plan for sync to skip frozen
> filesystem is correct.
> 
> 								Honza

LVM doesn't suspend with FIFREEZE, it calls freeze_bdev directly from 
drivers/md/dm.c (and it works for all filesystems, including ext2). So if 
you skip sync of frozen filesystems, you introduce a data corruption if 
someone takes a snapshot of ext2.

Mikulas

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30  6:52                         ` Mikulas Patocka
@ 2011-11-30 11:16                           ` Jan Kara
  2011-11-30 12:14                             ` Mikulas Patocka
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2011-11-30 11:16 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jan Kara, Alasdair G Kergon, esandeen, Surbhi Palande,
	linux-kernel, dm-devel, linux-fsdevel, Christopher Chaltain,
	Valerie Aurora

On Wed 30-11-11 01:52:22, Mikulas Patocka wrote:
> > > dm-ioctl.h:
> > > /*
> > >  * Set this to avoid attempting to freeze any filesystem when suspending.
> > >  */
> > > #define DM_SKIP_LOCKFS_FLAG     (1 << 10) /* In */
> >   Thanks. I was now checking in detail and indeed FIFREEZE fails if
> > ->freeze_fs is not set. And only xfs, ext3, ext4, reiserfs, jfs, nilfs2,
> > and gfs2 provide this function. So I was correct in assuming that when
> > filesystem supports FIFREEZE it must make sure no modifications happen to
> > the filesystem. So I believe that my original plan for sync to skip frozen
> > filesystem is correct.
> > 
> > 								Honza
> 
> LVM doesn't suspend with FIFREEZE, it calls freeze_bdev directly from 
> drivers/md/dm.c (and it works for all filesystems, including ext2).
  Ah, I see. Sorry I missed this. But then I can only reiterate that
drivers/md/dm.c is IMHO broken. Either it cares about filesystem being
really frozen - and then it should refuse the operation for e.g. ext2
because it cannot be frozen - or it does not care about filesystem being
frozen and then there's no point in calling freeze_super(). Possibly, you
might still want to e.g. try snapshotting even if freeze_super() would
return EOPNOTSUPP but that should be handled inside dm, not by errorneously
marking filesystem as frozen when it is not. Or am I still missing
something?

> So if you skip sync of frozen filesystems, you introduce a data
> corruption if someone takes a snapshot of ext2.
  Yes, because ext2 cannot really be frozen, it is (errorneously) marked
as such but it is not frozen...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 11:16                           ` Jan Kara
@ 2011-11-30 12:14                             ` Mikulas Patocka
  2011-11-30 13:05                               ` Alasdair G Kergon
  0 siblings, 1 reply; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-30 12:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alasdair G Kergon, esandeen, Surbhi Palande, linux-kernel,
	dm-devel, linux-fsdevel, Christopher Chaltain, Valerie Aurora



On Wed, 30 Nov 2011, Jan Kara wrote:

> On Wed 30-11-11 01:52:22, Mikulas Patocka wrote:
> > > > dm-ioctl.h:
> > > > /*
> > > >  * Set this to avoid attempting to freeze any filesystem when suspending.
> > > >  */
> > > > #define DM_SKIP_LOCKFS_FLAG     (1 << 10) /* In */
> > >   Thanks. I was now checking in detail and indeed FIFREEZE fails if
> > > ->freeze_fs is not set. And only xfs, ext3, ext4, reiserfs, jfs, nilfs2,
> > > and gfs2 provide this function. So I was correct in assuming that when
> > > filesystem supports FIFREEZE it must make sure no modifications happen to
> > > the filesystem. So I believe that my original plan for sync to skip frozen
> > > filesystem is correct.
> > > 
> > > 								Honza
> > 
> > LVM doesn't suspend with FIFREEZE, it calls freeze_bdev directly from 
> > drivers/md/dm.c (and it works for all filesystems, including ext2).
>   Ah, I see. Sorry I missed this. But then I can only reiterate that
> drivers/md/dm.c is IMHO broken. Either it cares about filesystem being
> really frozen - and then it should refuse the operation for e.g. ext2
> because it cannot be frozen - or it does not care about filesystem being
> frozen and then there's no point in calling freeze_super(). Possibly, you
> might still want to e.g. try snapshotting even if freeze_super() would
> return EOPNOTSUPP but that should be handled inside dm, not by errorneously
> marking filesystem as frozen when it is not. Or am I still missing
> something?
> 
> > So if you skip sync of frozen filesystems, you introduce a data
> > corruption if someone takes a snapshot of ext2.
>   Yes, because ext2 cannot really be frozen, it is (errorneously) marked
> as such but it is not frozen...
> 
> 								Honza

The semantics of freeze is like "do the best you can". You can't freeze 
ext2 to a clean state (even if you managed to block all code paths that 
create dirty data, it still can't be cleaned because you can't get rid of 
inodes that are open and deleted).

On non-journaled filesystems, freeze_bdev does a sync of the filesystem 
and prevents some code paths (such as __generic_file_aio_write) from 
creating more dirty data with "vfs_check_frozen" --- that vfs_check_frozen 
doesn't guarantee anything, but it reduces a probability of corruption.

So at the end, all data that were dirty before taking a snapshot are 
guaranteed to be flushed to disk. And if someone writes to the ext2 
filesystem while taking a snapshot, it may create inconsistencies that 
must be fixed by running fsck on the snapshot.

Mikulas

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 12:14                             ` Mikulas Patocka
@ 2011-11-30 13:05                               ` Alasdair G Kergon
  2011-11-30 16:53                                 ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Alasdair G Kergon @ 2011-11-30 13:05 UTC (permalink / raw)
  To: Mikulas Patocka, Jan Kara, Alasdair G Kergon, esandeen,
	Surbhi Palande, linux-kernel, dm-devel, linux-fsdevel,
	Christopher Chaltain, Valerie Aurora

On Wed, Nov 30, 2011 at 07:14:18AM -0500, Mikulas Patocka wrote:
> On Wed, 30 Nov 2011, Jan Kara wrote:
> > > So if you skip sync of frozen filesystems, you introduce a data
> > > corruption if someone takes a snapshot of ext2.
> >   Yes, because ext2 cannot really be frozen, it is (errorneously) marked
> > as such but it is not frozen...

This is just getting into semantics.  AFAIK (and it was before my involvement)
LVM used the term 'lockfs' for this operation when it was introduced to ext2.
It later got renamed in-kernel to 'frozen' to bring it into line with newer
filesystems.  But userspace and the interface still retain the original
'lockfs' name.

There is no further I/O sent to the filesystem during the 'lockfs' operation:
LVM uses dm to block that.

Alasdair


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-29 10:19         ` Jan Kara
  2011-11-29 10:21           ` Jan Kara
@ 2011-11-30 13:33           ` Alasdair G Kergon
  2011-11-30 13:48             ` Alasdair G Kergon
  2011-11-30 16:34             ` Mikulas Patocka
  2011-11-30 14:09           ` Alasdair G Kergon
  2 siblings, 2 replies; 33+ messages in thread
From: Alasdair G Kergon @ 2011-11-30 13:33 UTC (permalink / raw)
  To: Jan Kara, Mikulas Patocka
  Cc: esandeen, linux-kernel, dm-devel, linux-fsdevel,
	Christopher Chaltain, Valerie Aurora

On Tue, Nov 29, 2011 at 11:19:01AM +0100, Jan Kara wrote:
> On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > - skipping sync on frozen filesystem violates sync semantics. 
> > Applications, such as databases, assume that when sync finishes, data were 
> > written to stable storage. If we skip sync when the filesystem is frozen, 
> > we can cause data corruption in these applications (if the system crashes 
> > after we skipped a sync).

>   Here I don't agree. Filesystem must guarantee there are no dirty data on
> a frozen filesystem. Ext4 and XFS do this, ext3 would need proper
> page_mkwrite() implementation for this but that's the problem of ext3, not
> freezing code in general. If there are no dirty data, sync code (and also
> flusher thread) is free to return without doing anything.
 
Consider, during a 'create a snapshot' operation:
   I/O flow:  application -> filesystem -> LV -> disk

dm lockfs is issued by LVM.
  When this returns, the filesystem should be locked i.e. not issue any
  further I/O to the LV.  (But if it did happen to issue I/O, it
  wouldn't be a problem, as it would just get queued by dm and have no
  impact on the snapshot creation operation.)

The application is still running and might still be issuing writes to
the filesystem and might itself issue 'sync'.  But a 'sync' would only
be meaningful for already-completed writes and the lockfs process should
have already seen that they have hit disk.  So a sync issued while a
device is locked can always be skipped.  Have I missed something in this
reasoning, Mikulas?

Alasdair

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 13:33           ` Alasdair G Kergon
@ 2011-11-30 13:48             ` Alasdair G Kergon
  2011-11-30 16:36               ` Mikulas Patocka
  2011-11-30 16:34             ` Mikulas Patocka
  1 sibling, 1 reply; 33+ messages in thread
From: Alasdair G Kergon @ 2011-11-30 13:48 UTC (permalink / raw)
  To: Jan Kara, Mikulas Patocka
  Cc: esandeen, linux-kernel, dm-devel, linux-fsdevel,
	Christopher Chaltain, Valerie Aurora

On Wed, Nov 30, 2011 at 01:33:32PM +0000, Alasdair G Kergon wrote:
> Have I missed something in this
> reasoning, Mikulas?
 
Ah - counter-example: mmap?
So sync might indeed be required to block on the filesystem iff there
are newly-dirtied pages.

Alasdair


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-29 10:19         ` Jan Kara
  2011-11-29 10:21           ` Jan Kara
  2011-11-30 13:33           ` Alasdair G Kergon
@ 2011-11-30 14:09           ` Alasdair G Kergon
  2011-11-30 16:53             ` Mikulas Patocka
  2011-11-30 17:03             ` Mikulas Patocka
  2 siblings, 2 replies; 33+ messages in thread
From: Alasdair G Kergon @ 2011-11-30 14:09 UTC (permalink / raw)
  To: Jan Kara, Mikulas Patocka
  Cc: esandeen, linux-kernel, dm-devel, linux-fsdevel,
	Christopher Chaltain, Valerie Aurora

On Tue, Nov 29, 2011 at 11:19:01AM +0100, Jan Kara wrote:
> So I believe the consensus was that we should not block sync or flusher

Consensus where?

> thread on frozen filesystem. Firstly, it's kind of ugly from user
> perspective (you cannot sync filesystems on your system while one
> filesystem is frozen???), secondly, in case of flusher thread it has some
> serious implications if there are more filesystems on the same device - you
> would effectively stop any writeback to the device possibly hanging the
> whole system due to dirty limit being exceeded. So at least in these two
> cases we should just ignore frozen filesystem.

The sync only needs to block on a particular fs if there is data to flush.

A sync that originated in a way that can only be independent of any
application that is changing the fs may skip that fs if it is frozen.

It's the user's responsibility only to freeze filesystems for very brief
periods of time if they are still being changed.

?
 
Alasdair


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 13:33           ` Alasdair G Kergon
  2011-11-30 13:48             ` Alasdair G Kergon
@ 2011-11-30 16:34             ` Mikulas Patocka
  2011-12-01  0:34               ` Jan Kara
  1 sibling, 1 reply; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-30 16:34 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Jan Kara, esandeen, linux-kernel, dm-devel, linux-fsdevel,
	Christopher Chaltain, Valerie Aurora

On Wed, 30 Nov 2011, Alasdair G Kergon wrote:

> On Tue, Nov 29, 2011 at 11:19:01AM +0100, Jan Kara wrote:
> > On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > > - skipping sync on frozen filesystem violates sync semantics. 
> > > Applications, such as databases, assume that when sync finishes, data were 
> > > written to stable storage. If we skip sync when the filesystem is frozen, 
> > > we can cause data corruption in these applications (if the system crashes 
> > > after we skipped a sync).
> 
> >   Here I don't agree. Filesystem must guarantee there are no dirty data on
> > a frozen filesystem. Ext4 and XFS do this, ext3 would need proper
> > page_mkwrite() implementation for this but that's the problem of ext3, not
> > freezing code in general. If there are no dirty data, sync code (and also
> > flusher thread) is free to return without doing anything.
>  
> Consider, during a 'create a snapshot' operation:
>    I/O flow:  application -> filesystem -> LV -> disk
> 
> dm lockfs is issued by LVM.
>   When this returns, the filesystem should be locked i.e. not issue any
>   further I/O to the LV.  (But if it did happen to issue I/O, it
>   wouldn't be a problem, as it would just get queued by dm and have no
>   impact on the snapshot creation operation.)
> 
> The application is still running and might still be issuing writes to
> the filesystem and might itself issue 'sync'.  But a 'sync' would only
> be meaningful for already-completed writes and the lockfs process should
> have already seen that they have hit disk.  So a sync issued while a
> device is locked can always be skipped.  Have I missed something in this
> reasoning, Mikulas?
> 
> Alasdair

You can't skip sync.

The problem is this (assume that you have non-journaled filesystem):

- A process issues a write() call. The write call goes to 
__generic_file_aio_write, suppose that the process goes immediatelly after 
"vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);" and then is rescheduled.

- You suspend the filesystem

- A process that issued write() is scheduled to run, note that it already 
passed "vfs_check_frozen", so it goes on even on syspended filesystem. 
This process creates a dirty page in the page cache.

- The process eventually returns to userspace from the write() syscall, 
the filesystem is still suspended. write() doesn't guarantee that the data 
hit disk, so there is no problem so far.

- The applications calls sync. Now, if you skip sync on the suspended 
filesystem, you violate sync semantics: when a process calls write() and 
sync(), it can assume that data are written to stable storage.

So, in order to keep sync working, you must wait for the filesystem to 
thaw and then write the dirty data.

Mikulas

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 13:48             ` Alasdair G Kergon
@ 2011-11-30 16:36               ` Mikulas Patocka
  0 siblings, 0 replies; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-30 16:36 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Jan Kara, esandeen, linux-kernel, dm-devel, linux-fsdevel,
	Christopher Chaltain, Valerie Aurora

On Wed, 30 Nov 2011, Alasdair G Kergon wrote:

> On Wed, Nov 30, 2011 at 01:33:32PM +0000, Alasdair G Kergon wrote:
> > Have I missed something in this
> > reasoning, Mikulas?
>  
> Ah - counter-example: mmap?
> So sync might indeed be required to block on the filesystem iff there
> are newly-dirtied pages.
> 
> Alasdair

On non-journaled filesystem, both write and mmap can create dirty pages 
when suspended.

On ext4 I actually tested mmap, but it blocks any writes to mmaped pages 
while suspended.

Mikulas

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 13:05                               ` Alasdair G Kergon
@ 2011-11-30 16:53                                 ` Jan Kara
  2011-11-30 17:09                                   ` Mikulas Patocka
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2011-11-30 16:53 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Mikulas Patocka, Jan Kara, esandeen, Surbhi Palande,
	linux-kernel, dm-devel, linux-fsdevel, Christopher Chaltain,
	Valerie Aurora

On Wed 30-11-11 13:05:14, Alasdair G Kergon wrote:
> On Wed, Nov 30, 2011 at 07:14:18AM -0500, Mikulas Patocka wrote:
> > On Wed, 30 Nov 2011, Jan Kara wrote:
> > > > So if you skip sync of frozen filesystems, you introduce a data
> > > > corruption if someone takes a snapshot of ext2.
> > >   Yes, because ext2 cannot really be frozen, it is (errorneously) marked
> > > as such but it is not frozen...
> 
> This is just getting into semantics.  AFAIK (and it was before my involvement)
> LVM used the term 'lockfs' for this operation when it was introduced to ext2.
> It later got renamed in-kernel to 'frozen' to bring it into line with newer
> filesystems.  But userspace and the interface still retain the original
> 'lockfs' name.
> 
> There is no further I/O sent to the filesystem during the 'lockfs' operation:
> LVM uses dm to block that.
  OK, so can we (at least in this discussion) discussion distinguish two
things?
a) Filesystems is frozen/locked - means filesystem is in a consistent state
  and disallows new dirty data to be created until fs is thawed/unlocked.
b) Device is frozen/locked - device does not process incoming writes, they
  are held in the queue until the device is thawed/unlocked.

  They are two different things and we seem to conflate them in the
discussion. In particular you can freeze a device under any filesystem
while you cannot freeze every filesystem. Freezing the device is enough for
LVM operations (e.g. snapshot) but if filesystem is not frozen, you have
to run fsck / journal replay to make result usable. Do we agree here?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 14:09           ` Alasdair G Kergon
@ 2011-11-30 16:53             ` Mikulas Patocka
  2011-12-01  0:03               ` Jan Kara
  2011-11-30 17:03             ` Mikulas Patocka
  1 sibling, 1 reply; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-30 16:53 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Jan Kara, esandeen, linux-kernel, dm-devel, linux-fsdevel,
	Christopher Chaltain, Valerie Aurora



On Wed, 30 Nov 2011, Alasdair G Kergon wrote:

> On Tue, Nov 29, 2011 at 11:19:01AM +0100, Jan Kara wrote:
> > So I believe the consensus was that we should not block sync or flusher

Well, I think that not blocking sync actually doesn't help at all.

Suppose at first that you have a perfectly-barriered filesystem --- that 
is filesystem, that contains barriers around all code paths that could 
possibly create dirty data. In this case it is impossible to have dirty 
data while the filesystem is suspended. --- In this case you can call 
sync on suspened filesystem as much as you like, sync never finds ady 
dirty data, consequently it never tries to write anything and it can't 
deadlock. So skipping sync has no effect.

Suppose as a second case that you have imperfectly-barriered filesystem 
--- that means there exists a code path that creates dirty data while the 
filesystem is suspended. In this case if you skip sync, you are violating 
sync semantics, because the application can create dirty data while 
suspended, call sync while still suspended and assume that the dirty data 
was written.

So --- in case 1 skipping sync has no effect and in case 2 you trade one 
bug for another --- you avoid deadlock and you introduce violations of 
sync semantics.

Mikulas

> Consensus where?
> 
> > thread on frozen filesystem. Firstly, it's kind of ugly from user
> > perspective (you cannot sync filesystems on your system while one
> > filesystem is frozen???), secondly, in case of flusher thread it has some
> > serious implications if there are more filesystems on the same device - you
> > would effectively stop any writeback to the device possibly hanging the
> > whole system due to dirty limit being exceeded. So at least in these two
> > cases we should just ignore frozen filesystem.
> 
> The sync only needs to block on a particular fs if there is data to flush.
> 
> A sync that originated in a way that can only be independent of any
> application that is changing the fs may skip that fs if it is frozen.
> 
> It's the user's responsibility only to freeze filesystems for very brief
> periods of time if they are still being changed.
> 
> ?
>  
> Alasdair
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 14:09           ` Alasdair G Kergon
  2011-11-30 16:53             ` Mikulas Patocka
@ 2011-11-30 17:03             ` Mikulas Patocka
  1 sibling, 0 replies; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-30 17:03 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Jan Kara, esandeen, linux-kernel, dm-devel, linux-fsdevel,
	Christopher Chaltain, Valerie Aurora

> The sync only needs to block on a particular fs if there is data to flush.
> 
> A sync that originated in a way that can only be independent of any
> application that is changing the fs may skip that fs if it is frozen.
> 
> It's the user's responsibility only to freeze filesystems for very brief
> periods of time if they are still being changed.

The problem is that the process that is freezing filesystem (lvm) is 
totally independent on the process that is writing data to the filesystem 
and syncing it.

For example, you have a transactional database running in userspace and 
that database is writing and syncing and you take a snapshot at the same 
time. --- so you cannot make these two processes cooperate in any way. You 
can't tell to the database server "please don't write now, I am suspending 
to make a snapshot". The database server still writes and still assumes 
that sync() commits to permanent storage, regardless on what is going on 
in lvm.

Mikulas

> ?
>  
> Alasdair

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 16:53                                 ` Jan Kara
@ 2011-11-30 17:09                                   ` Mikulas Patocka
  0 siblings, 0 replies; 33+ messages in thread
From: Mikulas Patocka @ 2011-11-30 17:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alasdair G Kergon, esandeen, Surbhi Palande, linux-kernel,
	dm-devel, linux-fsdevel, Christopher Chaltain, Valerie Aurora



On Wed, 30 Nov 2011, Jan Kara wrote:

> On Wed 30-11-11 13:05:14, Alasdair G Kergon wrote:
> > On Wed, Nov 30, 2011 at 07:14:18AM -0500, Mikulas Patocka wrote:
> > > On Wed, 30 Nov 2011, Jan Kara wrote:
> > > > > So if you skip sync of frozen filesystems, you introduce a data
> > > > > corruption if someone takes a snapshot of ext2.
> > > >   Yes, because ext2 cannot really be frozen, it is (errorneously) marked
> > > > as such but it is not frozen...
> > 
> > This is just getting into semantics.  AFAIK (and it was before my involvement)
> > LVM used the term 'lockfs' for this operation when it was introduced to ext2.
> > It later got renamed in-kernel to 'frozen' to bring it into line with newer
> > filesystems.  But userspace and the interface still retain the original
> > 'lockfs' name.
> > 
> > There is no further I/O sent to the filesystem during the 'lockfs' operation:
> > LVM uses dm to block that.
>   OK, so can we (at least in this discussion) discussion distinguish two
> things?
> a) Filesystems is frozen/locked - means filesystem is in a consistent state
>   and disallows new dirty data to be created until fs is thawed/unlocked.

Agreed. Note that if you observed any sync-related deadlocks when 
suspended, it means that the filesystem has some code path that allows 
creating dirty data on frozen filesystem.

This was observed on ext4 on RHEL-6 ... and maybe on upstream too. (I 
couldn't reproduce it on upstream, but maybe other people who started 
these sync-related patches could?)

> b) Device is frozen/locked - device does not process incoming writes, they
>   are held in the queue until the device is thawed/unlocked.
> 
>   They are two different things and we seem to conflate them in the
> discussion. In particular you can freeze a device under any filesystem
> while you cannot freeze every filesystem. Freezing the device is enough for
> LVM operations (e.g. snapshot) but if filesystem is not frozen, you have
> to run fsck / journal replay to make result usable. Do we agree here?
> 
> 								Honza

True. You have to run fsck on non-journaled filesystems when taking a 
snapshot.

Mikulas

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 16:53             ` Mikulas Patocka
@ 2011-12-01  0:03               ` Jan Kara
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Kara @ 2011-12-01  0:03 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alasdair G Kergon, Jan Kara, esandeen, linux-kernel, dm-devel,
	linux-fsdevel, Christopher Chaltain, Valerie Aurora

On Wed 30-11-11 11:53:40, Mikulas Patocka wrote:
> 
> 
> On Wed, 30 Nov 2011, Alasdair G Kergon wrote:
> 
> > On Tue, Nov 29, 2011 at 11:19:01AM +0100, Jan Kara wrote:
> > > So I believe the consensus was that we should not block sync or flusher
> 
> Well, I think that not blocking sync actually doesn't help at all.
> 
> Suppose at first that you have a perfectly-barriered filesystem --- that 
> is filesystem, that contains barriers around all code paths that could 
> possibly create dirty data. In this case it is impossible to have dirty 
> data while the filesystem is suspended. --- In this case you can call 
> sync on suspened filesystem as much as you like, sync never finds ady 
> dirty data, consequently it never tries to write anything and it can't 
> deadlock. So skipping sync has no effect.
> 
> Suppose as a second case that you have imperfectly-barriered filesystem 
> --- that means there exists a code path that creates dirty data while the 
> filesystem is suspended. In this case if you skip sync, you are violating 
> sync semantics, because the application can create dirty data while 
> suspended, call sync while still suspended and assume that the dirty data 
> was written.
  Except that currently we are in situation c) with e.g. ext4 and xfs. We
have perfectly-barriered filesystem *but* there are dirty bits set in this
filesystem although we are certain there are no dirty data. This
inconsistency between dirty bits and fact whether a page contains dirty
data is due to way how page faults are handled. I've already tried to
explain this in this thread in https://lkml.org/lkml/2011/11/29/109 but
apparently I failed. I can try to explain it once more but in fact I don't
think this technical detail makes a difference in this discussion. So in our
situation c) skipping sync makes difference and does not break guarantees
sync should have.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] [PATCH] deadlock with suspend and quotas
  2011-11-30 16:34             ` Mikulas Patocka
@ 2011-12-01  0:34               ` Jan Kara
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Kara @ 2011-12-01  0:34 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alasdair G Kergon, Jan Kara, esandeen, linux-kernel, dm-devel,
	linux-fsdevel, Christopher Chaltain, Valerie Aurora

On Wed 30-11-11 11:34:23, Mikulas Patocka wrote:
> On Wed, 30 Nov 2011, Alasdair G Kergon wrote:
> > On Tue, Nov 29, 2011 at 11:19:01AM +0100, Jan Kara wrote:
> > > On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > > > - skipping sync on frozen filesystem violates sync semantics. 
> > > > Applications, such as databases, assume that when sync finishes, data were 
> > > > written to stable storage. If we skip sync when the filesystem is frozen, 
> > > > we can cause data corruption in these applications (if the system crashes 
> > > > after we skipped a sync).
> > 
> > >   Here I don't agree. Filesystem must guarantee there are no dirty data on
> > > a frozen filesystem. Ext4 and XFS do this, ext3 would need proper
> > > page_mkwrite() implementation for this but that's the problem of ext3, not
> > > freezing code in general. If there are no dirty data, sync code (and also
> > > flusher thread) is free to return without doing anything.
> >  
> > Consider, during a 'create a snapshot' operation:
> >    I/O flow:  application -> filesystem -> LV -> disk
> > 
> > dm lockfs is issued by LVM.
> >   When this returns, the filesystem should be locked i.e. not issue any
> >   further I/O to the LV.  (But if it did happen to issue I/O, it
> >   wouldn't be a problem, as it would just get queued by dm and have no
> >   impact on the snapshot creation operation.)
> > 
> > The application is still running and might still be issuing writes to
> > the filesystem and might itself issue 'sync'.  But a 'sync' would only
> > be meaningful for already-completed writes and the lockfs process should
> > have already seen that they have hit disk.  So a sync issued while a
> > device is locked can always be skipped.  Have I missed something in this
> > reasoning, Mikulas?
> > 
> > Alasdair
> 
> You can't skip sync.
> 
> The problem is this (assume that you have non-journaled filesystem):
> 
> - A process issues a write() call. The write call goes to 
> __generic_file_aio_write, suppose that the process goes immediatelly after 
> "vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);" and then is rescheduled.
> 
> - You suspend the filesystem
> 
> - A process that issued write() is scheduled to run, note that it already 
> passed "vfs_check_frozen", so it goes on even on syspended filesystem. 
> This process creates a dirty page in the page cache.
> 
> - The process eventually returns to userspace from the write() syscall, 
> the filesystem is still suspended. write() doesn't guarantee that the data 
> hit disk, so there is no problem so far.
> 
> - The applications calls sync. Now, if you skip sync on the suspended 
> filesystem, you violate sync semantics: when a process calls write() and 
> sync(), it can assume that data are written to stable storage.
> 
> So, in order to keep sync working, you must wait for the filesystem to 
> thaw and then write the dirty data.
  So now when we have establied a difference betweeen freezing a device and
freezing a filesystem, I think the answer is that sync can skip frozen
filesystems but cannot skip unfrozen filesystems on frozen devices. Agreed?
We'd need to change the freezing code to distinguish these two cases but
that's not hard...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2011-11-25 20:25 [PATCH] deadlock with suspend and quotas Mikulas Patocka
  2011-11-28 15:04 ` Jan Kara
@ 2012-01-03  3:30 ` Al Viro
  2012-01-03 18:22   ` Mikulas Patocka
  1 sibling, 1 reply; 33+ messages in thread
From: Al Viro @ 2012-01-03  3:30 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-kernel, linux-fsdevel, Jan Kara, dm-devel

On Fri, Nov 25, 2011 at 03:25:16PM -0500, Mikulas Patocka wrote:

> The following patch fixes the deadlock. When the quota subsystem takes s_umount,
> it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
> the filesystem to resume and retry.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> CC: stable@kernel.org

So basically you want a variant of get_super() that would get you a
superblock for this bdev, locked and unfrozen?  Fair enough, but
	* that should be a proper helper in super.c, rather than
open-coded in fs/quota/quota.c, of all places
	* what about other existing callers get_super() and its friends?

and while we are at it, why in damnation name is it exported?  The only
caller outside of core VFS is under #if 0...

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2012-01-03  3:30 ` Al Viro
@ 2012-01-03 18:22   ` Mikulas Patocka
  2012-01-03 18:35     ` Mikulas Patocka
  0 siblings, 1 reply; 33+ messages in thread
From: Mikulas Patocka @ 2012-01-03 18:22 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, Jan Kara, dm-devel



On Tue, 3 Jan 2012, Al Viro wrote:

> On Fri, Nov 25, 2011 at 03:25:16PM -0500, Mikulas Patocka wrote:
> 
> > The following patch fixes the deadlock. When the quota subsystem takes s_umount,
> > it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
> > the filesystem to resume and retry.
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > CC: stable@kernel.org
> 
> So basically you want a variant of get_super() that would get you a
> superblock for this bdev, locked and unfrozen?  Fair enough, but
> 	* that should be a proper helper in super.c, rather than
> open-coded in fs/quota/quota.c, of all places
> 	* what about other existing callers get_super() and its friends?

Hi

You can look at that functionality here: I've already made a patch that 
has it: 
https://www.redhat.com/archives/dm-devel/2011-November/msg00151.html

Basically, the patch introduces a function down_read_s_umount, which takes 
s_umount for read, but makes sure that the filesystem is not frozen. Then, 
it adds a boolean parameter to get_super() which tells get_super() that it 
should wait for unfreeze (i.e. call "down_read_s_umount" isntead of 
"down_read(&sb->s_umount)").

Mikulas

> and while we are at it, why in damnation name is it exported?  The only
> caller outside of core VFS is under #if 0...
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] deadlock with suspend and quotas
  2012-01-03 18:22   ` Mikulas Patocka
@ 2012-01-03 18:35     ` Mikulas Patocka
  0 siblings, 0 replies; 33+ messages in thread
From: Mikulas Patocka @ 2012-01-03 18:35 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, Jan Kara, dm-devel



On Tue, 3 Jan 2012, Mikulas Patocka wrote:

> 
> 
> On Tue, 3 Jan 2012, Al Viro wrote:
> 
> > On Fri, Nov 25, 2011 at 03:25:16PM -0500, Mikulas Patocka wrote:
> > 
> > > The following patch fixes the deadlock. When the quota subsystem takes s_umount,
> > > it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
> > > the filesystem to resume and retry.
> > > 
> > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > > CC: stable@kernel.org
> > 
> > So basically you want a variant of get_super() that would get you a
> > superblock for this bdev, locked and unfrozen?  Fair enough, but
> > 	* that should be a proper helper in super.c, rather than
> > open-coded in fs/quota/quota.c, of all places
> > 	* what about other existing callers get_super() and its friends?
> 
> Hi
> 
> You can look at that functionality here: I've already made a patch that 
> has it: 
> https://www.redhat.com/archives/dm-devel/2011-November/msg00151.html

Sorry, I accidentally linked to the message you were responding to.

The point is that there are some users of get_super that need to get 
superblock even if it's frozen and other users of get_super that must not 
get a frozen superblock. So I made a bool parameter to distinguish these 
cases. Do you have some other ideas?

Mikulas

> Basically, the patch introduces a function down_read_s_umount, which takes 
> s_umount for read, but makes sure that the filesystem is not frozen. Then, 
> it adds a boolean parameter to get_super() which tells get_super() that it 
> should wait for unfreeze (i.e. call "down_read_s_umount" isntead of 
> "down_read(&sb->s_umount)").
> 
> Mikulas
> 
> > and while we are at it, why in damnation name is it exported?  The only
> > caller outside of core VFS is under #if 0...
> > 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2012-01-03 18:36 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-25 20:25 [PATCH] deadlock with suspend and quotas Mikulas Patocka
2011-11-28 15:04 ` Jan Kara
2011-11-28 21:00   ` Valerie Aurora
2011-11-28 21:14     ` Mikulas Patocka
2011-11-28 23:32       ` Mikulas Patocka
2011-11-29 10:19         ` Jan Kara
2011-11-29 10:21           ` Jan Kara
2011-11-29 11:06             ` Mikulas Patocka
2011-11-29 11:11               ` Jan Kara
2011-11-29 12:54                 ` Mikulas Patocka
2011-11-29 13:09                   ` Jan Kara
2011-11-29 13:18                     ` [dm-devel] " Alasdair G Kergon
2011-11-29 13:32                       ` Jan Kara
2011-11-29 16:33                         ` Eric Sandeen
2011-11-30  6:52                         ` Mikulas Patocka
2011-11-30 11:16                           ` Jan Kara
2011-11-30 12:14                             ` Mikulas Patocka
2011-11-30 13:05                               ` Alasdair G Kergon
2011-11-30 16:53                                 ` Jan Kara
2011-11-30 17:09                                   ` Mikulas Patocka
2011-11-30 13:33           ` Alasdair G Kergon
2011-11-30 13:48             ` Alasdair G Kergon
2011-11-30 16:36               ` Mikulas Patocka
2011-11-30 16:34             ` Mikulas Patocka
2011-12-01  0:34               ` Jan Kara
2011-11-30 14:09           ` Alasdair G Kergon
2011-11-30 16:53             ` Mikulas Patocka
2011-12-01  0:03               ` Jan Kara
2011-11-30 17:03             ` Mikulas Patocka
2011-11-29 20:00     ` Kamal Mostafa
2012-01-03  3:30 ` Al Viro
2012-01-03 18:22   ` Mikulas Patocka
2012-01-03 18:35     ` Mikulas Patocka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).