From: Anand Jain <anand.jain@oracle.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com,
David Sterba <dsterba@suse.com>
Subject: Re: [PATCH v2 3/7] btrfs: do not read super look for a device path
Date: Tue, 28 Sep 2021 19:50:13 +0800 [thread overview]
Message-ID: <afcc1d56-8c94-f55c-4359-494ae3ca17e5@oracle.com> (raw)
In-Reply-To: <ffe45a8d-7ff9-ddda-2c5a-b0e1aae1441a@toxicpanda.com>
On 27/09/2021 23:32, Josef Bacik wrote:
> On 8/24/21 10:00 PM, Anand Jain wrote:
>> On 28/07/2021 05:01, Josef Bacik wrote:
>>> For device removal and replace we call btrfs_find_device_by_devspec,
>>> which if we give it a device path and nothing else will call
>>> btrfs_find_device_by_path, which opens the block device and reads the
>>> super block and then looks up our device based on that.
>>>
>>> However this is completely unnecessary because we have the path stored
>>> in our device on our fsdevices. All we need to do if we're given a path
>>> is look through the fs_devices on our file system and use that device if
>>> we find it, reading the super block is just silly.
>>
>> The device path as stored in our fs_devices can differ from the path
>> provided by the user for the same device (for example, dm, lvm).
>>
>> btrfs-progs sanitize the device path but, others (for example, an ioctl
>> test case) might not. And the path lookup would fail.
>>
>> Also, btrfs dev scan <path> can update the device path anytime, even
>> after it is mounted. Fixing that failed the subsequent subvolume mounts
>> (if I remember correctly).
>>
>
> This is a good point, that's kind of a big deal from a UX perspective.
>
>>> This fixes the case where we end up with our sb write "lock" getting the
>>> dependency of the block device ->open_mutex, which resulted in the
>>> following lockdep splat
>>
>> Can we do..
>>
>> btrfs_exclop_start()
>> ::
>> find device part (read sb)
>> ::
>> mnt_want_write_file()?
>>
>>
>
> I looked into this, but we'd have to re-order the exclop_start to above
> the mnt_want_write_file() part everywhere to be consistent, and this is
> mostly OK except for balance. Balance the exclop is tied to the
> lifetime of the balance ctl, which can exist past the task running
> balance because we could pause the balance.
>
> Could we get around this? Sure, but in my head exclop == lock. This
> means we have something akin to
>
> exclop_start
> mnt_want_write_file()
>
> pause balance
> mnt_drop_write()
>
> resume balance
>
> exclop_start magic stuff in balance to resume without doing the exclop
> mnt_want_write_file()
> <do balance>
> exclop_finish
> mnt_drop_write()
>
> If we're OK with this then we can definitely do that.
This is getting complex. IMO.
> The other option is simply to make userspace do the superblock read and
> use the devid thing for us. Then we just eat the UX problem for older
> tools where you want to do btrfs rm device /dev/mapper/whatever and we
> have the pathname as /dev/dm-#.
> Both options are unattractive in their own way.
I agree.
> I think the first
> option is only annoying to us, and maintains the UX expectations. But I
> want more than me to make this decision, so if you and Dave are OK with
> that I'll go with re-ordering exclop+mnt_want_write_file(), and then put
> the device lookup between the two of them for device removal. Thanks,
There is a 3rd option.
Here, the root of the problem is about reading superblock after
mnt_drop_write().
So to avoid this, can we read the sb -> devid before mnt_drop_write()?
and use the devid later on?
But when we read the sb we have neither mnt_drop_write() nor
exclop_start..().
If the devid we read is stale, at a later stage the btrfs_rm_device()
will still verify it.
I experimented this option, the diff is here [1]. This change needs a
lot of clean up and did not copy the same logic to btrfs_ioctl_rm_dev()
or tried to merge it. This diff passed the -g volume test cases with no
new regressions. But I wasn't able to reproduce the original issue for
which we wrote this patch.
[1]
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9eb0c1eb568e..e9c6bd05abf9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3164,15 +3164,13 @@ static long btrfs_ioctl_rm_dev_v2(struct file
*file, void __user *arg)
struct block_device *bdev = NULL;
fmode_t mode;
int ret;
+ bool cancel_or_missing = false;
bool cancel = false;
+ u64 devid;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
- ret = mnt_want_write_file(file);
- if (ret)
- return ret;
-
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args)) {
ret = PTR_ERR(vol_args);
@@ -3184,9 +3182,32 @@ static long btrfs_ioctl_rm_dev_v2(struct file
*file, void __user *arg)
goto out;
}
vol_args->name[BTRFS_SUBVOL_NAME_MAX] = '\0';
- if (!(vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID) &&
- strcmp("cancel", vol_args->name) == 0)
- cancel = true;
+ if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
+ devid = vol_args->devid;
+ else {
+ if (strcmp("cancel", vol_args->name) == 0) {
+ cancel_or_missing = true;
+ cancel = true;
+ } else if (!strcmp("missing", vol_args->name))
+ cancel_or_missing = true;
+ else {
+ struct block_device *bdev;
+ struct btrfs_super_block *disk_super;
+
+ ret = btrfs_get_bdev_and_sb(vol_args->name,
FMODE_READ,
+ fs_info->bdev_holder, 0,
+ &bdev, &disk_super);
+ if (ret)
+ goto out;
+ devid =
btrfs_stack_device_id(&disk_super->dev_item);
+ btrfs_release_disk_super(disk_super);
+ blkdev_put(bdev, FMODE_READ);
+ }
+ }
+
+ ret = mnt_want_write_file(file);
+ if (ret)
+ goto out;
ret = exclop_start_or_cancel_reloc(fs_info,
BTRFS_EXCLOP_DEV_REMOVE,
cancel);
@@ -3194,10 +3215,10 @@ static long btrfs_ioctl_rm_dev_v2(struct file
*file, void __user *arg)
goto out;
/* Exclusive operation is now claimed */
- if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
- ret = btrfs_rm_device(fs_info, NULL, vol_args->devid,
&bdev, &mode);
- else
+ if (cancel_or_missing)
ret = btrfs_rm_device(fs_info, vol_args->name, 0,
&bdev, &mode);
+ else
+ ret = btrfs_rm_device(fs_info, NULL, devid, &bdev, &mode);
btrfs_exclop_finish(fs_info);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6ade80bae3a5..85ae7294cea2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -493,7 +493,7 @@ static struct btrfs_fs_devices
*find_fsid_with_metadata_uuid(
}
-static int
+int
btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void
*holder,
int flush, struct block_device **bdev,
struct btrfs_super_block **disk_super)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index c7ac43d8a7e8..fa1d1faa70d4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -502,6 +502,9 @@ void btrfs_close_devices(struct btrfs_fs_devices
*fs_devices);
void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices);
void btrfs_assign_next_active_device(struct btrfs_device *device,
struct btrfs_device *this_dev);
+int btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void
*holder,
+ int flush, struct block_device **bdev,
+ struct btrfs_super_block **disk_super);
struct btrfs_device *btrfs_find_device_by_devspec(struct btrfs_fs_info
*fs_info,
u64 devid,
const char *devpath);
Thanks, Anand
next prev parent reply other threads:[~2021-09-28 11:50 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-27 21:01 [PATCH v2 0/7] Josef Bacik
2021-07-27 21:01 ` [PATCH v2 1/7] btrfs: do not call close_fs_devices in btrfs_rm_device Josef Bacik
2021-09-01 8:13 ` Anand Jain
2021-07-27 21:01 ` [PATCH v2 2/7] btrfs: do not take the uuid_mutex " Josef Bacik
2021-09-01 12:01 ` Anand Jain
2021-09-01 17:08 ` David Sterba
2021-09-01 17:10 ` Josef Bacik
2021-09-01 19:49 ` Anand Jain
2021-09-02 12:58 ` David Sterba
2021-09-02 14:10 ` Josef Bacik
2021-09-17 14:33 ` David Sterba
2021-09-20 7:45 ` Anand Jain
2021-09-20 8:26 ` David Sterba
2021-09-20 9:41 ` Anand Jain
2021-09-23 4:33 ` Anand Jain
2021-09-21 11:59 ` Filipe Manana
2021-09-21 12:17 ` Filipe Manana
2021-09-22 15:33 ` Filipe Manana
2021-09-23 4:15 ` Anand Jain
2021-09-23 3:58 ` [PATCH] btrfs: drop lockdep assert in close_fs_devices() Anand Jain
2021-09-23 4:04 ` Anand Jain
2021-07-27 21:01 ` [PATCH v2 3/7] btrfs: do not read super look for a device path Josef Bacik
2021-08-25 2:00 ` Anand Jain
2021-09-27 15:32 ` Josef Bacik
2021-09-28 11:50 ` Anand Jain [this message]
2021-07-27 21:01 ` [PATCH v2 4/7] btrfs: update the bdev time directly when closing Josef Bacik
2021-08-25 0:35 ` Anand Jain
2021-09-02 12:16 ` David Sterba
2021-07-27 21:01 ` [PATCH v2 5/7] btrfs: delay blkdev_put until after the device remove Josef Bacik
2021-08-25 1:00 ` Anand Jain
2021-09-02 12:16 ` David Sterba
2021-07-27 21:01 ` [PATCH v2 6/7] btrfs: unify common code for the v1 and v2 versions of " Josef Bacik
2021-08-25 1:19 ` Anand Jain
2021-09-01 14:05 ` Nikolay Borisov
2021-07-27 21:01 ` [PATCH v2 7/7] btrfs: do not take the device_list_mutex in clone_fs_devices Josef Bacik
2021-08-24 22:08 ` Anand Jain
2021-09-01 13:35 ` Nikolay Borisov
2021-09-02 12:59 ` David Sterba
2021-09-17 15:06 ` [PATCH v2 0/7] David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=afcc1d56-8c94-f55c-4359-494ae3ca17e5@oracle.com \
--to=anand.jain@oracle.com \
--cc=dsterba@suse.com \
--cc=josef@toxicpanda.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).