All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: Boris Burkov <boris@bur.io>, Chris Mason <clm@fb.com>,
	David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
Subject: Re: [PATCH] btrfs: fix mount failure caused by race with umount
Date: Thu, 9 Jul 2020 21:23:17 -0400	[thread overview]
Message-ID: <e2857658-230e-48e6-d6cf-587be4a8a0bc@toxicpanda.com> (raw)
In-Reply-To: <20200710004408.1246282-1-boris@bur.io>

On 7/9/20 8:44 PM, Boris Burkov wrote:
> It is possible to cause a btrfs mount to fail by racing it with a slow
> umount. The crux of the sequence is generic_shutdown_super not yet
> calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
> If that occurs, btrfs_open_devices will decide the opened counter is
> non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
> 0. From here, mount will call sget which will result in grab_super
> trying to take the super block umount semaphore. That semaphore will be
> held by the slow umount, so mount will block. Before up-ing the
> semaphore, umount will delete the super block, resulting in mount's sget
> reliably allocating a new one, which causes the mount path to dutifully
> fill it out, and increment total_rw_bytes a second time, which causes
> the mount to fail, as we see double the expected bytes.
> 
> Here is the sequence laid out in greater detail:
> 
> CPU0                                                    CPU1
> down_write sb->s_umount
> btrfs_kill_super
>    kill_anon_super(sb)
>      generic_shutdown_super(sb);
>        shrink_dcache_for_umount(sb);
>        sync_filesystem(sb);
>        evict_inodes(sb); // SLOW
> 
>                                                btrfs_mount_root
>                                                  btrfs_scan_one_device
>                                                  fs_devices = device->fs_devices
>                                                  fs_info->fs_devices = fs_devices
>                                                  // fs_devices-opened makes this a no-op
>                                                  btrfs_open_devices(fs_devices, mode, fs_type)
>                                                  s = sget(fs_type, test, set, flags, fs_info);
>                                                    find sb in s_instances
>                                                    grab_super(sb);
>                                                      down_write(&s->s_umount); // blocks
> 
>        sop->put_super(sb)
>          // sb->fs_devices->opened == 2; no-op
>        spin_lock(&sb_lock);
>        hlist_del_init(&sb->s_instances);
>        spin_unlock(&sb_lock);
>        up_write(&sb->s_umount);
>                                                      return 0;
>                                                    retry lookup
>                                                    don't find sb in s_instances (deleted by CPU0)
>                                                    s = alloc_super
>                                                    return s;
>                                                  btrfs_fill_super(s, fs_devices, data)
>                                                    open_ctree // fs_devices total_rw_bytes improperly set!
>                                                      btrfs_read_chunk_tree
>                                                        read_one_dev // increment total_rw_bytes again!!
>                                                        super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
> 
> To fix this, we observe that if we have already filled the device, the
> state bit BTRFS_DEV_STATE_IN_FS_METADATA will be set on it, and we can
> use that to avoid filling it a second time for no reason and,
> critically, avoid double counting in total_rw_bytes. One gotcha is that
> read_one_chunk also sets this bit, which happens before read_one_dev (in
> read_sys_array), so we must remove that setting of the bit as well, for
> the state bit to truly correspond to the device struct being filled from
> disk.
> 
> To reproduce, it is sufficient to dirty a decent number of inodes, then
> quickly umount and mount.
> 
> for i in $(seq 0 500)
> do
>    dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
> done
> umount /mnt/foo&
> mount /mnt/foo
> 
> does the trick for me.
> 
> A final note is that this fix actually breaks the fstest btrfs/163, but
> having investigated it, I believe that is due to a subtle flaw in how
> btrfs replace works when used on a seed device. The replace target device
> never gets a correct dev_item with the sprout fsid written out. This
> causes several problems, but for the sake of btrfs/163, read_one_chunk
> marking the device with IN_FS_METADATA was wallpapering over it, which
> this patch breaks. I will be sending a subsequent fix for the seed replace
> issue which will also fix btrfs/163.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>   fs/btrfs/volumes.c | 12 +++++++++---
>   1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c7a3d4d730a3..1d9bd1bbf893 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6633,9 +6633,6 @@ static int read_one_chunk(struct btrfs_key *key, struct extent_buffer *leaf,
>   			}
>   			btrfs_report_missing_device(fs_info, devid, uuid, false);
>   		}
> -		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
> -				&(map->stripes[i].dev->dev_state));
> -
>   	}
>   
>   	write_lock(&map_tree->lock);
> @@ -6815,6 +6812,15 @@ static int read_one_dev(struct extent_buffer *leaf,
>   			return -EINVAL;
>   	}
>   
> +	/*
> +	 * It is possible for mount and umount to race in such a way that
> +	 * we execute this code path, but the device is still in metadata.
> +	 * If so, we don't need to call fill_device_from_item again and we
> +	 * especially don't want to spuriously increment total_rw_bytes.
> +	 */
> +	if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state)) {
> +		return 0;
> +	}

Lets kill the set_bit below, and changes this to

if (test_and_set_bit())

also you don't need {} for single line if statements.  Thanks,

Josef

  reply	other threads:[~2020-07-10  1:23 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-10  0:44 [PATCH] btrfs: fix mount failure caused by race with umount Boris Burkov
2020-07-10  1:23 ` Josef Bacik [this message]
2020-07-10 17:23   ` [PATCH v2] " Boris Burkov
2020-07-10 17:51     ` Josef Bacik
2020-07-16 18:51     ` David Sterba
2020-07-16 20:29       ` [PATCH v3] " Boris Burkov
2020-07-20 16:32         ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e2857658-230e-48e6-d6cf-587be4a8a0bc@toxicpanda.com \
    --to=josef@toxicpanda.com \
    --cc=boris@bur.io \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.