* [PATCH] btrfs: fix mount failure caused by race with umount
From: Boris Burkov @ 2020-07-10  0:44 UTC
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, kernel-team

It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will see that the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget, which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, so mount's sget reliably
allocates a new one, the mount path dutifully fills it out and
increments total_rw_bytes a second time, and the mount fails because we
see double the expected bytes.
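
For context, the opened-counter short-circuit at the heart of this looks
roughly like the following (a simplified sketch of btrfs_open_devices()
from fs/btrfs/volumes.c in kernels of that era; details may differ):

int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
		       fmode_t flags, void *holder)
{
	int ret;

	mutex_lock(&uuid_mutex);
	if (fs_devices->opened) {
		/*
		 * A racing mount: the slow umount has not dropped its
		 * reference yet, so just bump the counter and skip
		 * open_fs_devices(), which is where total_rw_bytes
		 * would have been reset to 0.
		 */
		fs_devices->opened++;
		ret = 0;
	} else {
		list_sort(NULL, &fs_devices->devices, devid_cmp);
		ret = open_fs_devices(fs_devices, flags, holder);
	}
	mutex_unlock(&uuid_mutex);

	return ret;
}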

Here is the sequence laid out in greater detail:

CPU0                                                    CPU1
down_write sb->s_umount
btrfs_kill_super
  kill_anon_super(sb)
    generic_shutdown_super(sb);
      shrink_dcache_for_umount(sb);
      sync_filesystem(sb);
      evict_inodes(sb); // SLOW

                                              btrfs_mount_root
                                                btrfs_scan_one_device
                                                fs_devices = device->fs_devices
                                                fs_info->fs_devices = fs_devices
                                                // fs_devices->opened makes this a no-op
                                                btrfs_open_devices(fs_devices, mode, fs_type)
                                                s = sget(fs_type, test, set, flags, fs_info);
                                                  find sb in s_instances
                                                  grab_super(sb);
                                                    down_write(&s->s_umount); // blocks

      sop->put_super(sb)
        // sb->fs_devices->opened == 2; no-op
      spin_lock(&sb_lock);
      hlist_del_init(&sb->s_instances);
      spin_unlock(&sb_lock);
      up_write(&sb->s_umount);
                                                    return 0;
                                                  retry lookup
                                                  don't find sb in s_instances (deleted by CPU0)
                                                  s = alloc_super
                                                  return s;
                                                btrfs_fill_super(s, fs_devices, data)
                                                  open_ctree // fs_devices total_rw_bytes improperly set!
                                                    btrfs_read_chunk_tree
                                                      read_one_dev // increment total_rw_bytes again!!
                                                      super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
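
The blocking and the retry on CPU1 come from grab_super() and the sget()
lookup loop in fs/super.c; an abridged sketch from that period (not
verbatim source, details may differ):

static int grab_super(struct super_block *s) __releases(sb_lock)
{
	s->s_count++;
	spin_unlock(&sb_lock);
	down_write(&s->s_umount);	/* blocks behind the slow umount */
	if ((s->s_flags & SB_BORN) && atomic_inc_not_zero(&s->s_active)) {
		put_super(s);
		return 1;
	}
	/* umount already deactivated this sb: drop it, caller retries */
	up_write(&s->s_umount);
	put_super(s);
	return 0;
}

/* ... and in sget(), the lookup loop: */
retry:
	spin_lock(&sb_lock);
	hlist_for_each_entry(old, &type->fs_supers, s_instances) {
		if (!test(old, data))
			continue;
		if (!grab_super(old))
			goto retry;	/* sb vanished; look it up again */
		...
		return old;
	}
	/* nothing matched (CPU0 unhashed it): allocate a fresh one */
	s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
	goto retry;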

To fix this, we observe that if we have already filled the device, the
state bit BTRFS_DEV_STATE_IN_FS_METADATA will be set on it, and we can
use that to avoid filling it a second time for no reason and,
critically, avoid double counting in total_rw_bytes. One gotcha is that
read_one_chunk also sets this bit, which happens before read_one_dev (in
read_sys_array), so we must remove that setting of the bit as well, for
the state bit to truly correspond to the device struct being filled from
disk.
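
Concretely, before this patch the tail of read_one_dev() did the fill
and the accounting unconditionally; a sketch reconstructed from the diff
context below (not the verbatim source):

	fill_device_from_item(leaf, dev_item, device);
	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
	   !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
		/*
		 * On the racing second mount, total_rw_bytes was never
		 * reset to 0, so this addition double counts and
		 * open_ctree later sees
		 * super_total_bytes < fs_devices->total_rw_bytes.
		 */
		device->fs_devices->total_rw_bytes += device->total_bytes;
		atomic64_add(device->total_bytes - device->bytes_used,
				&fs_info->free_chunk_space);
	}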

To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.

for i in $(seq 0 500)
do
  dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo

does the trick for me.

A final note is that this fix actually breaks the fstest btrfs/163, but
having investigated it, I believe that is due to a subtle flaw in how
btrfs replace works when used on a seed device. The replace target device
never gets a correct dev_item with the sprout fsid written out. This
causes several problems, but for the sake of btrfs/163, read_one_chunk
marking the device with IN_FS_METADATA was wallpapering over it, which
this patch breaks. I will be sending a subsequent fix for the seed replace
issue which will also fix btrfs/163.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/volumes.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c7a3d4d730a3..1d9bd1bbf893 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6633,9 +6633,6 @@ static int read_one_chunk(struct btrfs_key *key, struct extent_buffer *leaf,
 			}
 			btrfs_report_missing_device(fs_info, devid, uuid, false);
 		}
-		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
-				&(map->stripes[i].dev->dev_state));
-
 	}
 
 	write_lock(&map_tree->lock);
@@ -6815,6 +6812,15 @@ static int read_one_dev(struct extent_buffer *leaf,
 			return -EINVAL;
 	}
 
+	/*
+	 * It is possible for mount and umount to race in such a way that
+	 * we execute this code path, but the device is still in metadata.
+	 * If so, we don't need to call fill_device_from_item again and we
+	 * especially don't want to spuriously increment total_rw_bytes.
+	 */
+	if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state)) {
+		return 0;
+	}
 	fill_device_from_item(leaf, dev_item, device);
 	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
-- 
2.24.1



* Re: [PATCH] btrfs: fix mount failure caused by race with umount
From: Josef Bacik @ 2020-07-10  1:23 UTC
  To: Boris Burkov, Chris Mason, David Sterba
  Cc: linux-btrfs, linux-kernel, kernel-team

On 7/9/20 8:44 PM, Boris Burkov wrote:
> [...]
> +	/*
> +	 * It is possible for mount and umount to race in such a way that
> +	 * we execute this code path, but the device is still in metadata.
> +	 * If so, we don't need to call fill_device_from_item again and we
> +	 * especially don't want to spuriously increment total_rw_bytes.
> +	 */
> +	if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state)) {
> +		return 0;
> +	}

Let's kill the set_bit below, and change this to

if (test_and_set_bit())

Also, you don't need {} for single-line if statements.  Thanks,

Josef
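
test_and_set_bit() atomically sets the bit and returns its old value, so
the suggestion folds the separate test_bit() check and the later
set_bit() into one step; the resulting guard in read_one_dev() would
read roughly:

	if (test_and_set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state))
		return 0;	/* already filled by the racing first mount */
	fill_device_from_item(leaf, dev_item, device);

The v2 patch below adopts exactly this form.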


* [PATCH v2] btrfs: fix mount failure caused by race with umount
From: Boris Burkov @ 2020-07-10 17:23 UTC
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, kernel-team

It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will see that the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget, which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, so mount's sget reliably
allocates a new one, the mount path dutifully fills it out and
increments total_rw_bytes a second time, and the mount fails because we
see double the expected bytes.

Here is the sequence laid out in greater detail:

CPU0                                                    CPU1
down_write sb->s_umount
btrfs_kill_super
  kill_anon_super(sb)
    generic_shutdown_super(sb);
      shrink_dcache_for_umount(sb);
      sync_filesystem(sb);
      evict_inodes(sb); // SLOW

                                              btrfs_mount_root
                                                btrfs_scan_one_device
                                                fs_devices = device->fs_devices
                                                fs_info->fs_devices = fs_devices
                                                // fs_devices->opened makes this a no-op
                                                btrfs_open_devices(fs_devices, mode, fs_type)
                                                s = sget(fs_type, test, set, flags, fs_info);
                                                  find sb in s_instances
                                                  grab_super(sb);
                                                    down_write(&s->s_umount); // blocks

      sop->put_super(sb)
        // sb->fs_devices->opened == 2; no-op
      spin_lock(&sb_lock);
      hlist_del_init(&sb->s_instances);
      spin_unlock(&sb_lock);
      up_write(&sb->s_umount);
                                                    return 0;
                                                  retry lookup
                                                  don't find sb in s_instances (deleted by CPU0)
                                                  s = alloc_super
                                                  return s;
                                                btrfs_fill_super(s, fs_devices, data)
                                                  open_ctree // fs_devices total_rw_bytes improperly set!
                                                    btrfs_read_chunk_tree
                                                      read_one_dev // increment total_rw_bytes again!!
                                                      super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!

To fix this, we observe that if we have already filled the device, the
state bit BTRFS_DEV_STATE_IN_FS_METADATA will be set on it, and we can
use that to avoid filling it a second time for no reason and,
critically, avoid double counting in total_rw_bytes. One gotcha is that
read_one_chunk also sets this bit, which happens before read_one_dev (in
read_sys_array), so we must remove that setting of the bit as well, for
the state bit to truly correspond to the device struct being filled from
disk.

To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.

for i in $(seq 0 500)
do
  dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo

does the trick for me.

A final note is that this fix actually breaks the fstest btrfs/163, but
having investigated it, I believe that is due to a subtle flaw in how
btrfs replace works when used on a seed device. The replace target device
never gets a correct dev_item with the sprout fsid written out. This
causes several problems, but for the sake of btrfs/163, read_one_chunk
marking the device with IN_FS_METADATA was wallpapering over it, which
this patch breaks. I will be sending a subsequent fix for the seed replace
issue which will also fix btrfs/163.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/volumes.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c7a3d4d730a3..865fab39ef43 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6633,9 +6633,6 @@ static int read_one_chunk(struct btrfs_key *key, struct extent_buffer *leaf,
 			}
 			btrfs_report_missing_device(fs_info, devid, uuid, false);
 		}
-		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
-				&(map->stripes[i].dev->dev_state));
-
 	}
 
 	write_lock(&map_tree->lock);
@@ -6815,8 +6812,15 @@ static int read_one_dev(struct extent_buffer *leaf,
 			return -EINVAL;
 	}
 
+	/*
+	 * It is possible for mount and umount to race in such a way that
+	 * we execute this code path, but the device is still in metadata.
+	 * If so, we don't need to call fill_device_from_item again and we
+	 * especially don't want to spuriously increment total_rw_bytes.
+	 */
+	if (test_and_set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state))
+		return 0;
 	fill_device_from_item(leaf, dev_item, device);
-	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	   !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
 		device->fs_devices->total_rw_bytes += device->total_bytes;
-- 
2.24.1



* Re: [PATCH v2] btrfs: fix mount failure caused by race with umount
From: Josef Bacik @ 2020-07-10 17:51 UTC
  To: Boris Burkov, Chris Mason, David Sterba
  Cc: linux-btrfs, linux-kernel, kernel-team

On 7/10/20 1:23 PM, Boris Burkov wrote:
> [...]
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v2] btrfs: fix mount failure caused by race with umount
From: David Sterba @ 2020-07-16 18:51 UTC
  To: Boris Burkov
  Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs,
	linux-kernel, kernel-team

On Fri, Jul 10, 2020 at 10:23:04AM -0700, Boris Burkov wrote:
> Here is the sequence laid out in greater detail:
> [...]

It seems weird that umount and mount can be interleaved in such a way,
but with the VFS locks and structures it's valid, so the device state
managed by btrfs slipped through.

With the suggested fix, the BTRFS_DEV_STATE_IN_FS_METADATA bit and the
synchronization of the device-related data become quite important. The
semantics seem quite subtle and inconsistent with the other uses of
set_bit/clear_bit and with total_rw_bytes.

I'm thinking about unconditional setting of IN_FS_METADATA as it is now,
but recalculating total_rw_bytes outside of read_one_dev, in
btrfs_read_chunk_tree. There it should not matter whether the bit was
set by the unmounted or the mounted filesystem, as long as the locking
rules for updating fs_devices hold. For that we have uuid_mutex and
fs_devices::device_list_mutex; this is used elsewhere, so fixing it with
existing mechanisms is IMHO a better way than relying on subtle,
undocumented semantics of the state bit.


* [PATCH v3] btrfs: fix mount failure caused by race with umount
From: Boris Burkov @ 2020-07-16 20:29 UTC
  To: Chris Mason, Josef Bacik, David Sterba; +Cc: linux-btrfs, kernel-team

It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will see that the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget, which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, so mount's sget reliably
allocates a new one, the mount path dutifully fills it out and
increments total_rw_bytes a second time, and the mount fails because we
see double the expected bytes.

Here is the sequence laid out in greater detail:

CPU0                                                    CPU1
down_write sb->s_umount
btrfs_kill_super
  kill_anon_super(sb)
    generic_shutdown_super(sb);
      shrink_dcache_for_umount(sb);
      sync_filesystem(sb);
      evict_inodes(sb); // SLOW

                                              btrfs_mount_root
                                                btrfs_scan_one_device
                                                fs_devices = device->fs_devices
                                                fs_info->fs_devices = fs_devices
                                                // fs_devices->opened makes this a no-op
                                                btrfs_open_devices(fs_devices, mode, fs_type)
                                                s = sget(fs_type, test, set, flags, fs_info);
                                                  find sb in s_instances
                                                  grab_super(sb);
                                                    down_write(&s->s_umount); // blocks

      sop->put_super(sb)
        // sb->fs_devices->opened == 2; no-op
      spin_lock(&sb_lock);
      hlist_del_init(&sb->s_instances);
      spin_unlock(&sb_lock);
      up_write(&sb->s_umount);
                                                    return 0;
                                                  retry lookup
                                                  don't find sb in s_instances (deleted by CPU0)
                                                  s = alloc_super
                                                  return s;
                                                btrfs_fill_super(s, fs_devices, data)
                                                  open_ctree // fs_devices total_rw_bytes improperly set!
                                                    btrfs_read_chunk_tree
                                                      read_one_dev // increment total_rw_bytes again!!
                                                      super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!

To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
before the calls to read_one_dev, while holding the sb umount semaphore
and the uuid mutex.

To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.

for i in $(seq 0 500)
do
  dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo

does the trick for me.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/volumes.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c7a3d4d730a3..26b9bcb00c2b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7035,6 +7035,14 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 	mutex_lock(&uuid_mutex);
 	mutex_lock(&fs_info->chunk_mutex);
 
+	/*
+	 * It is possible for mount and umount to race in such a way that
+	 * we execute this code path, but open_fs_devices failed to clear
+	 * total_rw_bytes. We certainly want it cleared before reading the
+	 * device items, so clear it here.
+	 */
+	fs_info->fs_devices->total_rw_bytes = 0;
+
 	/*
 	 * Read all device items, and then all the chunk items. All
 	 * device items are found before any chunk item (their object id
-- 
2.24.1



* Re: [PATCH v3] btrfs: fix mount failure caused by race with umount
From: David Sterba @ 2020-07-20 16:32 UTC
  To: Boris Burkov
  Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs, kernel-team

On Thu, Jul 16, 2020 at 01:29:46PM -0700, Boris Burkov wrote:
> [...]
> Signed-off-by: Boris Burkov <boris@bur.io>

Added to misc-next, thanks.

> ---

For patch iterations, please put a short description of the changes
under the "---" marker. That text does not get applied to the patch and
is intended to help reviewers see only what's new.
