All of lore.kernel.org
 help / color / mirror / Atom feed
From: Anand Jain <anand.jain@oracle.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com,
	David Sterba <dsterba@suse.com>
Subject: Re: [PATCH v2 3/7] btrfs: do not read super look for a device path
Date: Tue, 28 Sep 2021 19:50:13 +0800	[thread overview]
Message-ID: <afcc1d56-8c94-f55c-4359-494ae3ca17e5@oracle.com> (raw)
In-Reply-To: <ffe45a8d-7ff9-ddda-2c5a-b0e1aae1441a@toxicpanda.com>

On 27/09/2021 23:32, Josef Bacik wrote:
> On 8/24/21 10:00 PM, Anand Jain wrote:
>> On 28/07/2021 05:01, Josef Bacik wrote:
>>> For device removal and replace we call btrfs_find_device_by_devspec,
>>> which if we give it a device path and nothing else will call
>>> btrfs_find_device_by_path, which opens the block device and reads the
>>> super block and then looks up our device based on that.
>>>
>>> However this is completely unnecessary because we have the path stored
>>> in our device on our fsdevices.  All we need to do if we're given a path
>>> is look through the fs_devices on our file system and use that device if
>>> we find it, reading the super block is just silly.
>>
>> The device path as stored in our fs_devices can differ from the path
>> provided by the user for the same device (for example, dm, lvm).
>>
>> btrfs-progs sanitize the device path but, others (for example, an ioctl
>> test case) might not. And the path lookup would fail.
>>
>> Also, btrfs dev scan <path> can update the device path anytime, even
>> after it is mounted. Fixing that failed the subsequent subvolume mounts
>> (if I remember correctly).
>>
> 
> This is a good point, that's kind of a big deal from a UX perspective.
> 
>>> This fixes the case where we end up with our sb write "lock" getting the
>>> dependency of the block device ->open_mutex, which resulted in the
>>> following lockdep splat
>>
>> Can we do..
>>
>> btrfs_exclop_start()
>>   ::
>> find device part (read sb)
>>   ::
>> mnt_want_write_file()?
>>
>>
> 


> I looked into this, but we'd have to re-order the exclop_start to above 
> the mnt_want_write_file() part everywhere to be consistent, and this is 
> mostly OK except for balance.  Balance the exclop is tied to the 
> lifetime of the balance ctl, which can exist past the task running 
> balance because we could pause the balance.
> 
> Could we get around this?  Sure, but in my head exclop == lock.  This 
> means we have something akin to
> 
> exclop_start
> mnt_want_write_file()
> 
> pause balance
> mnt_drop_write()
> 
> resume balance
> 
> exclop_start magic stuff in balance to resume without doing the exclop
> mnt_want_write_file()
> <do balance>
> exclop_finish
> mnt_drop_write()
> 
> If we're OK with this then we can definitely do that.

This is getting complex. IMO.

> The other option is simply to make userspace do the superblock read and 
> use the devid thing for us.  Then we just eat the UX problem for older 
> tools where you want to do btrfs rm device /dev/mapper/whatever and we 
> have the pathname as /dev/dm-#.


> Both options are unattractive in their own way. 

I agree.

> I think the first 
> option is only annoying to us, and maintains the UX expectations.  But I 
> want more than me to make this decision, so if you and Dave are OK with 
> that I'll go with re-ordering exclop+mnt_want_write_file(), and then put 
> the device lookup between the two of them for device removal.  Thanks,


There is a 3rd option.

Here, the root of the problem is about reading superblock after 
mnt_drop_write().

So to avoid this, can we read the sb -> devid before mnt_drop_write()? 
and use the devid later on?

But when we read the sb we have neither mnt_drop_write() nor 
exclop_start..().

If the devid we read is stale, at a later stage the btrfs_rm_device() 
will still verify it.

I experimented this option, the diff is here [1]. This change needs a 
lot of clean up and did not copy the same logic to btrfs_ioctl_rm_dev() 
or tried to merge it. This diff passed the -g volume test cases with no 
new regressions. But I wasn't able to reproduce the original issue for 
which we wrote this patch.



[1]

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9eb0c1eb568e..e9c6bd05abf9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3164,15 +3164,13 @@ static long btrfs_ioctl_rm_dev_v2(struct file 
*file, void __user *arg)
         struct block_device *bdev = NULL;
         fmode_t mode;
         int ret;
+       bool cancel_or_missing = false;
         bool cancel = false;
+       u64 devid;

         if (!capable(CAP_SYS_ADMIN))
                 return -EPERM;

-       ret = mnt_want_write_file(file);
-       if (ret)
-               return ret;
-
         vol_args = memdup_user(arg, sizeof(*vol_args));
         if (IS_ERR(vol_args)) {
                 ret = PTR_ERR(vol_args);
@@ -3184,9 +3182,32 @@ static long btrfs_ioctl_rm_dev_v2(struct file 
*file, void __user *arg)
                 goto out;
         }
         vol_args->name[BTRFS_SUBVOL_NAME_MAX] = '\0';
-       if (!(vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID) &&
-           strcmp("cancel", vol_args->name) == 0)
-               cancel = true;
+       if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
+               devid = vol_args->devid;
+       else {
+               if (strcmp("cancel", vol_args->name) == 0) {
+                       cancel_or_missing = true;
+                       cancel = true;
+               } else if (!strcmp("missing", vol_args->name))
+                       cancel_or_missing = true;
+               else {
+                       struct block_device *bdev;
+                       struct btrfs_super_block *disk_super;
+
+                       ret = btrfs_get_bdev_and_sb(vol_args->name, 
FMODE_READ,
+                                                   fs_info->bdev_holder, 0,
+                                                   &bdev, &disk_super);
+                       if (ret)
+                               goto out;
+                       devid = 
btrfs_stack_device_id(&disk_super->dev_item);
+                       btrfs_release_disk_super(disk_super);
+                       blkdev_put(bdev, FMODE_READ);
+               }
+       }
+
+       ret = mnt_want_write_file(file);
+       if (ret)
+               goto out;

         ret = exclop_start_or_cancel_reloc(fs_info, 
BTRFS_EXCLOP_DEV_REMOVE,
                                            cancel);
@@ -3194,10 +3215,10 @@ static long btrfs_ioctl_rm_dev_v2(struct file 
*file, void __user *arg)
                 goto out;
         /* Exclusive operation is now claimed */

-       if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
-               ret = btrfs_rm_device(fs_info, NULL, vol_args->devid, 
&bdev, &mode);
-       else
+       if (cancel_or_missing)
                 ret = btrfs_rm_device(fs_info, vol_args->name, 0, 
&bdev, &mode);
+       else
+               ret = btrfs_rm_device(fs_info, NULL, devid, &bdev, &mode);

         btrfs_exclop_finish(fs_info);

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6ade80bae3a5..85ae7294cea2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -493,7 +493,7 @@ static struct btrfs_fs_devices 
*find_fsid_with_metadata_uuid(
  }


-static int
+int
  btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void 
*holder,
                       int flush, struct block_device **bdev,
                       struct btrfs_super_block **disk_super)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index c7ac43d8a7e8..fa1d1faa70d4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -502,6 +502,9 @@ void btrfs_close_devices(struct btrfs_fs_devices 
*fs_devices);
  void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices);
  void btrfs_assign_next_active_device(struct btrfs_device *device,
                                      struct btrfs_device *this_dev);
+int btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void 
*holder,
+                         int flush, struct block_device **bdev,
+                         struct btrfs_super_block **disk_super);
  struct btrfs_device *btrfs_find_device_by_devspec(struct btrfs_fs_info 
*fs_info,
                                                   u64 devid,
                                                   const char *devpath);



Thanks, Anand



  reply	other threads:[~2021-09-28 11:50 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-27 21:01 [PATCH v2 0/7] Josef Bacik
2021-07-27 21:01 ` [PATCH v2 1/7] btrfs: do not call close_fs_devices in btrfs_rm_device Josef Bacik
2021-09-01  8:13   ` Anand Jain
2021-07-27 21:01 ` [PATCH v2 2/7] btrfs: do not take the uuid_mutex " Josef Bacik
2021-09-01 12:01   ` Anand Jain
2021-09-01 17:08     ` David Sterba
2021-09-01 17:10     ` Josef Bacik
2021-09-01 19:49       ` Anand Jain
2021-09-02 12:58   ` David Sterba
2021-09-02 14:10     ` Josef Bacik
2021-09-17 14:33       ` David Sterba
2021-09-20  7:45   ` Anand Jain
2021-09-20  8:26     ` David Sterba
2021-09-20  9:41       ` Anand Jain
2021-09-23  4:33         ` Anand Jain
2021-09-21 11:59   ` Filipe Manana
2021-09-21 12:17     ` Filipe Manana
2021-09-22 15:33       ` Filipe Manana
2021-09-23  4:15         ` Anand Jain
2021-09-23  3:58   ` [PATCH] btrfs: drop lockdep assert in close_fs_devices() Anand Jain
2021-09-23  4:04     ` Anand Jain
2021-07-27 21:01 ` [PATCH v2 3/7] btrfs: do not read super look for a device path Josef Bacik
2021-08-25  2:00   ` Anand Jain
2021-09-27 15:32     ` Josef Bacik
2021-09-28 11:50       ` Anand Jain [this message]
2021-07-27 21:01 ` [PATCH v2 4/7] btrfs: update the bdev time directly when closing Josef Bacik
2021-08-25  0:35   ` Anand Jain
2021-09-02 12:16   ` David Sterba
2021-07-27 21:01 ` [PATCH v2 5/7] btrfs: delay blkdev_put until after the device remove Josef Bacik
2021-08-25  1:00   ` Anand Jain
2021-09-02 12:16   ` David Sterba
2021-07-27 21:01 ` [PATCH v2 6/7] btrfs: unify common code for the v1 and v2 versions of " Josef Bacik
2021-08-25  1:19   ` Anand Jain
2021-09-01 14:05   ` Nikolay Borisov
2021-07-27 21:01 ` [PATCH v2 7/7] btrfs: do not take the device_list_mutex in clone_fs_devices Josef Bacik
2021-08-24 22:08   ` Anand Jain
2021-09-01 13:35   ` Nikolay Borisov
2021-09-02 12:59   ` David Sterba
2021-09-17 15:06 ` [PATCH v2 0/7] David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afcc1d56-8c94-f55c-4359-494ae3ca17e5@oracle.com \
    --to=anand.jain@oracle.com \
    --cc=dsterba@suse.com \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.