linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Anand Jain <anand.jain@oracle.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com,
	David Sterba <dsterba@suse.com>
Subject: Re: [PATCH v2 3/7] btrfs: do not read super look for a device path
Date: Tue, 28 Sep 2021 19:50:13 +0800	[thread overview]
Message-ID: <afcc1d56-8c94-f55c-4359-494ae3ca17e5@oracle.com> (raw)
In-Reply-To: <ffe45a8d-7ff9-ddda-2c5a-b0e1aae1441a@toxicpanda.com>

On 27/09/2021 23:32, Josef Bacik wrote:
> On 8/24/21 10:00 PM, Anand Jain wrote:
>> On 28/07/2021 05:01, Josef Bacik wrote:
>>> For device removal and replace we call btrfs_find_device_by_devspec,
>>> which if we give it a device path and nothing else will call
>>> btrfs_find_device_by_path, which opens the block device and reads the
>>> super block and then looks up our device based on that.
>>>
>>> However this is completely unnecessary because we have the path stored
>>> in our device on our fsdevices.  All we need to do if we're given a path
>>> is look through the fs_devices on our file system and use that device if
>>> we find it, reading the super block is just silly.
>>
>> The device path as stored in our fs_devices can differ from the path
>> provided by the user for the same device (for example, dm, lvm).
>>
>> btrfs-progs sanitize the device path but, others (for example, an ioctl
>> test case) might not. And the path lookup would fail.
>>
>> Also, btrfs dev scan <path> can update the device path anytime, even
>> after it is mounted. Fixing that failed the subsequent subvolume mounts
>> (if I remember correctly).
>>
> 
> This is a good point, that's kind of a big deal from a UX perspective.
> 
>>> This fixes the case where we end up with our sb write "lock" getting the
>>> dependency of the block device ->open_mutex, which resulted in the
>>> following lockdep splat
>>
>> Can we do..
>>
>> btrfs_exclop_start()
>>   ::
>> find device part (read sb)
>>   ::
>> mnt_want_write_file()?
>>
>>
> 


> I looked into this, but we'd have to re-order the exclop_start to above 
> the mnt_want_write_file() part everywhere to be consistent, and this is 
> mostly OK except for balance.  Balance the exclop is tied to the 
> lifetime of the balance ctl, which can exist past the task running 
> balance because we could pause the balance.
> 
> Could we get around this?  Sure, but in my head exclop == lock.  This 
> means we have something akin to
> 
> exclop_start
> mnt_want_write_file()
> 
> pause balance
> mnt_drop_write()
> 
> resume balance
> 
> exclop_start magic stuff in balance to resume without doing the exclop
> mnt_want_write_file()
> <do balance>
> exclop_finish
> mnt_drop_write()
> 
> If we're OK with this then we can definitely do that.

This is getting complex. IMO.

> The other option is simply to make userspace do the superblock read and 
> use the devid thing for us.  Then we just eat the UX problem for older 
> tools where you want to do btrfs rm device /dev/mapper/whatever and we 
> have the pathname as /dev/dm-#.


> Both options are unattractive in their own way. 

I agree.

> I think the first 
> option is only annoying to us, and maintains the UX expectations.  But I 
> want more than me to make this decision, so if you and Dave are OK with 
> that I'll go with re-ordering exclop+mnt_want_write_file(), and then put 
> the device lookup between the two of them for device removal.  Thanks,


There is a 3rd option.

Here, the root of the problem is about reading superblock after 
mnt_drop_write().

So to avoid this, can we read the sb -> devid before mnt_drop_write()? 
and use the devid later on?

But when we read the sb we have neither mnt_drop_write() nor 
exclop_start..().

If the devid we read is stale, at a later stage the btrfs_rm_device() 
will still verify it.

I experimented this option, the diff is here [1]. This change needs a 
lot of clean up and did not copy the same logic to btrfs_ioctl_rm_dev() 
or tried to merge it. This diff passed the -g volume test cases with no 
new regressions. But I wasn't able to reproduce the original issue for 
which we wrote this patch.



[1]

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9eb0c1eb568e..e9c6bd05abf9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3164,15 +3164,13 @@ static long btrfs_ioctl_rm_dev_v2(struct file 
*file, void __user *arg)
         struct block_device *bdev = NULL;
         fmode_t mode;
         int ret;
+       bool cancel_or_missing = false;
         bool cancel = false;
+       u64 devid;

         if (!capable(CAP_SYS_ADMIN))
                 return -EPERM;

-       ret = mnt_want_write_file(file);
-       if (ret)
-               return ret;
-
         vol_args = memdup_user(arg, sizeof(*vol_args));
         if (IS_ERR(vol_args)) {
                 ret = PTR_ERR(vol_args);
@@ -3184,9 +3182,32 @@ static long btrfs_ioctl_rm_dev_v2(struct file 
*file, void __user *arg)
                 goto out;
         }
         vol_args->name[BTRFS_SUBVOL_NAME_MAX] = '\0';
-       if (!(vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID) &&
-           strcmp("cancel", vol_args->name) == 0)
-               cancel = true;
+       if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
+               devid = vol_args->devid;
+       else {
+               if (strcmp("cancel", vol_args->name) == 0) {
+                       cancel_or_missing = true;
+                       cancel = true;
+               } else if (!strcmp("missing", vol_args->name))
+                       cancel_or_missing = true;
+               else {
+                       struct block_device *bdev;
+                       struct btrfs_super_block *disk_super;
+
+                       ret = btrfs_get_bdev_and_sb(vol_args->name, 
FMODE_READ,
+                                                   fs_info->bdev_holder, 0,
+                                                   &bdev, &disk_super);
+                       if (ret)
+                               goto out;
+                       devid = 
btrfs_stack_device_id(&disk_super->dev_item);
+                       btrfs_release_disk_super(disk_super);
+                       blkdev_put(bdev, FMODE_READ);
+               }
+       }
+
+       ret = mnt_want_write_file(file);
+       if (ret)
+               goto out;

         ret = exclop_start_or_cancel_reloc(fs_info, 
BTRFS_EXCLOP_DEV_REMOVE,
                                            cancel);
@@ -3194,10 +3215,10 @@ static long btrfs_ioctl_rm_dev_v2(struct file 
*file, void __user *arg)
                 goto out;
         /* Exclusive operation is now claimed */

-       if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
-               ret = btrfs_rm_device(fs_info, NULL, vol_args->devid, 
&bdev, &mode);
-       else
+       if (cancel_or_missing)
                 ret = btrfs_rm_device(fs_info, vol_args->name, 0, 
&bdev, &mode);
+       else
+               ret = btrfs_rm_device(fs_info, NULL, devid, &bdev, &mode);

         btrfs_exclop_finish(fs_info);

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6ade80bae3a5..85ae7294cea2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -493,7 +493,7 @@ static struct btrfs_fs_devices 
*find_fsid_with_metadata_uuid(
  }


-static int
+int
  btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void 
*holder,
                       int flush, struct block_device **bdev,
                       struct btrfs_super_block **disk_super)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index c7ac43d8a7e8..fa1d1faa70d4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -502,6 +502,9 @@ void btrfs_close_devices(struct btrfs_fs_devices 
*fs_devices);
  void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices);
  void btrfs_assign_next_active_device(struct btrfs_device *device,
                                      struct btrfs_device *this_dev);
+int btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void 
*holder,
+                         int flush, struct block_device **bdev,
+                         struct btrfs_super_block **disk_super);
  struct btrfs_device *btrfs_find_device_by_devspec(struct btrfs_fs_info 
*fs_info,
                                                   u64 devid,
                                                   const char *devpath);



Thanks, Anand



  reply	other threads:[~2021-09-28 11:50 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-27 21:01 [PATCH v2 0/7] Josef Bacik
2021-07-27 21:01 ` [PATCH v2 1/7] btrfs: do not call close_fs_devices in btrfs_rm_device Josef Bacik
2021-09-01  8:13   ` Anand Jain
2021-07-27 21:01 ` [PATCH v2 2/7] btrfs: do not take the uuid_mutex " Josef Bacik
2021-09-01 12:01   ` Anand Jain
2021-09-01 17:08     ` David Sterba
2021-09-01 17:10     ` Josef Bacik
2021-09-01 19:49       ` Anand Jain
2021-09-02 12:58   ` David Sterba
2021-09-02 14:10     ` Josef Bacik
2021-09-17 14:33       ` David Sterba
2021-09-20  7:45   ` Anand Jain
2021-09-20  8:26     ` David Sterba
2021-09-20  9:41       ` Anand Jain
2021-09-23  4:33         ` Anand Jain
2021-09-21 11:59   ` Filipe Manana
2021-09-21 12:17     ` Filipe Manana
2021-09-22 15:33       ` Filipe Manana
2021-09-23  4:15         ` Anand Jain
2021-09-23  3:58   ` [PATCH] btrfs: drop lockdep assert in close_fs_devices() Anand Jain
2021-09-23  4:04     ` Anand Jain
2021-07-27 21:01 ` [PATCH v2 3/7] btrfs: do not read super look for a device path Josef Bacik
2021-08-25  2:00   ` Anand Jain
2021-09-27 15:32     ` Josef Bacik
2021-09-28 11:50       ` Anand Jain [this message]
2021-07-27 21:01 ` [PATCH v2 4/7] btrfs: update the bdev time directly when closing Josef Bacik
2021-08-25  0:35   ` Anand Jain
2021-09-02 12:16   ` David Sterba
2021-07-27 21:01 ` [PATCH v2 5/7] btrfs: delay blkdev_put until after the device remove Josef Bacik
2021-08-25  1:00   ` Anand Jain
2021-09-02 12:16   ` David Sterba
2021-07-27 21:01 ` [PATCH v2 6/7] btrfs: unify common code for the v1 and v2 versions of " Josef Bacik
2021-08-25  1:19   ` Anand Jain
2021-09-01 14:05   ` Nikolay Borisov
2021-07-27 21:01 ` [PATCH v2 7/7] btrfs: do not take the device_list_mutex in clone_fs_devices Josef Bacik
2021-08-24 22:08   ` Anand Jain
2021-09-01 13:35   ` Nikolay Borisov
2021-09-02 12:59   ` David Sterba
2021-09-17 15:06 ` [PATCH v2 0/7] David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afcc1d56-8c94-f55c-4359-494ae3ca17e5@oracle.com \
    --to=anand.jain@oracle.com \
    --cc=dsterba@suse.com \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).