linux-bcache.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Coly Li <colyli@suse.de>
To: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Cc: Eric Wheeler <bcache@lists.ewheeler.net>,
	Kent Overstreet <kent.overstreet@gmail.com>,
	linux-bcache@vger.kernel.org
Subject: Re: [RFC] Live resize of backing device
Date: Wed, 18 Jan 2023 00:18:38 +0800	[thread overview]
Message-ID: <B7718488-B00D-4F72-86CA-0FF335AD633F@suse.de> (raw)
In-Reply-To: <CAHykVA4WfYysOcKnQETkUyUjx_tFypFCWYG1RidRMVNqObGmRg@mail.gmail.com>



> 2023年1月12日 00:01,Andrea Tomassetti <andrea.tomassetti-opensource@devo.com> 写道:
> 
> Hi Coly,
> thank you for taking the time to reply, I really hope you are feeling
> better now.

Thanks. But the recovery is really slow, and my response cannot be in time. I was told it might be better after 1 month, hope it is true.

> 
> On Fri, Dec 30, 2022 at 11:41 AM Coly Li <colyli@suse.de> wrote:
>> 
>> On 9/8/22 4:32 PM, Andrea Tomassetti wrote:
>>> From 59787372cf21af0b79e895578ae05b6586dfeb09 Mon Sep 17 00:00:00 2001
>>> From: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
>>> Date: Thu, 8 Sep 2022 09:47:55 +0200
>>> Subject: [PATCH] bcache: Add support for live resize of backing devices
>>> 
>>> Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
>> 
>> Hi Andrea,
>> 
>> I am just recovered from Omicron, and able to reply email. Let me place
>> my comments inline.
>> 
>> 
>>> ---
>>> Hi Coly,
>>> Here is the first version of the patch. There are some points I noted
>>> down
>>> that I would like to discuss with you:
>>> - I found it pretty convenient to hook the call of the new added
>>> function
>>>   inside the `register_bcache`. In fact, every time (at least from my
>>>   understandings) a disk changes size, it will trigger a new probe and,
>>>   thus, `register_bcache` will be triggered. The only inconvenient
>>>   is that, in case of success, the function will output
>> 
>> The resize should be triggered manually, and not to do it automatically.
>> 
>> You may create a sysfs file under the cached device's directory, name it
>> as "extend_size" or something else you think better.
>> 
>> Then the sysadmin may extend the cached device size explicitly on a
>> predictable time.
>> 
>>> `error: capacity changed` even if it's not really an error.
>>> - I'm using `kvrealloc`, introduced in kernel version 5.15, to resize
>>>   `stripe_sectors_dirty` and `full_dirty_stripes`. It shouldn't be a
>>>   problem, right?
>>> - There is some reused code between this new function and
>>>   `bcache_device_init`. Maybe I can move `const size_t max_stripes` to
>>>   a broader scope or make it a macro, what do you think?
>>> 
>>> Thank you very much,
>>> Andrea
>>> 
>>> drivers/md/bcache/super.c | 75 ++++++++++++++++++++++++++++++++++++++-
>>> 1 file changed, 74 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>>> index ba3909bb6bea..9a77caf2a18f 100644
>>> --- a/drivers/md/bcache/super.c
>>> +++ b/drivers/md/bcache/super.c
>>> @@ -2443,6 +2443,76 @@ static bool bch_is_open(dev_t dev)
>>>     return bch_is_open_cache(dev) || bch_is_open_backing(dev);
>>> }
>>> 
>>> +static bool bch_update_capacity(dev_t dev)
>>> +{
>>> +    const size_t max_stripes = min_t(size_t, INT_MAX,
>>> +                     SIZE_MAX / sizeof(atomic_t));
>>> +
>>> +    uint64_t n, n_old;
>>> +    int nr_stripes_old;
>>> +    bool res = false;
>>> +
>>> +    struct bcache_device *d;
>>> +    struct cache_set *c, *tc;
>>> +    struct cached_dev *dcp, *t, *dc = NULL;
>>> +
>>> +    uint64_t parent_nr_sectors;
>>> +
>>> +    list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
>>> +        list_for_each_entry_safe(dcp, t, &c->cached_devs, list)
>>> +            if (dcp->bdev->bd_dev == dev) {
>>> +                dc = dcp;
>>> +                goto dc_found;
>>> +            }
>>> +
>>> +dc_found:
>>> +    if (!dc)
>>> +        return false;
>>> +
>>> +    parent_nr_sectors = bdev_nr_sectors(dc->bdev) - dc->sb.data_offset;
>>> +
>>> +    if (parent_nr_sectors == bdev_nr_sectors(dc->disk.disk->part0))
>>> +        return false;
>>> +
>> 
>> The above code only handles whole disk using as cached device. If a
>> partition of a hard drive is used as cache device, and there are other
>> data after this partition, such condition should be handled as well. So
>> far I am fine to only extend size when using the whole hard drive as
>> cached device, but more code is necessary to check and only permits
>> size-extend for such condition.
> I don't understand if there's a misalignment here so let me be more
> clear: this patch is intended to support the live resize of *backing
> devices*, is this what you mean with "cached device"?

Yes, backing device is cached device.


> When I was working on the patch I didn't consider the possibility of
> live resizing the cache devices, but it should be trivial.
> So, as far as I understand, a partition cannot be used as a backing
> device, correct? The whole disk should be used as a backing device,
> that's why I'm checking and that's why this check should be correct.
> Am I wrong?

Yes a partition can be used as a backing (cached) device.
That is to say, if a file system is format on top of the cached device, this file system cannot be directly accessed via the partition. It has to be accessed via the bcache device e.g. /dev/bcache0.


> 
>> 
>>> +    if (!set_capacity_and_notify(dc->disk.disk, parent_nr_sectors))
>>> +        return false;
>> 
>> The above code should be done when all rested are set.
>> 
>> 
>>> +
>>> +    d = &dc->disk;
>>> +
>>> +    /* Force cached device sectors re-calc */
>>> +    calc_cached_dev_sectors(d->c);
>> 
>> Here c->cached_dev_sectors might be changed, if any of the following
>> steps fails, it should be restored to previous value.
>> 
>> 
>>> +
>>> +    /* Block writeback thread */
>>> +    down_write(&dc->writeback_lock);
>>> +    nr_stripes_old = d->nr_stripes;
>>> +    n = DIV_ROUND_UP_ULL(parent_nr_sectors, d->stripe_size);
>>> +    if (!n || n > max_stripes) {
>>> +        pr_err("nr_stripes too large or invalid: %llu (start sector
>>> beyond end of disk?)\n",
>>> +            n);
>>> +        goto unblock_and_exit;
>>> +    }
>>> +    d->nr_stripes = n;
>>> +
>>> +    n = d->nr_stripes * sizeof(atomic_t);
>>> +    n_old = nr_stripes_old * sizeof(atomic_t);
>>> +    d->stripe_sectors_dirty = kvrealloc(d->stripe_sectors_dirty, n_old,
>>> +        n, GFP_KERNEL);
>>> +    if (!d->stripe_sectors_dirty)
>>> +        goto unblock_and_exit;
>>> +
>>> +    n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
>>> +    n_old = BITS_TO_LONGS(nr_stripes_old) * sizeof(unsigned long);
>>> +    d->full_dirty_stripes = kvrealloc(d->full_dirty_stripes, n_old,
>>> n, GFP_KERNEL);
>>> +    if (!d->full_dirty_stripes)
>>> +        goto unblock_and_exit;
>> 
>> If kvrealloc() fails and NULL set to d->full_dirty_sripes, the previous
>> array should be restored.
>> 
>>> +
>>> +    res = true;
>>> +
>>> +unblock_and_exit:
>>> +    up_write(&dc->writeback_lock);
>>> +    return res;
>>> +}
>>> +
>> 
>> I didn't test the patch, from the first glance, I feel the failure
>> handling should restore all previous values, otherwise the cache device
>> may be in a non-consistent state when extend size fails.
>> 
> Completely agree with you on this point. I changed it locally and, as
> soon as we agree on the other points I'll send a newer version of the
> patch.
>> 
>>> struct async_reg_args {
>>>     struct delayed_work reg_work;
>>>     char *path;
>>> @@ -2569,7 +2639,10 @@ static ssize_t register_bcache(struct kobject
>>> *k, struct kobj_attribute *attr,
>>>             mutex_lock(&bch_register_lock);
>>>             if (lookup_bdev(strim(path), &dev) == 0 &&
>>>                 bch_is_open(dev))
>>> -                err = "device already registered";
>>> +                if (bch_update_capacity(dev))
>>> +                    err = "capacity changed";
>>> +                else
>>> +                    err = "device already registered";
>> 
>> 
>> As I said, it should be a separated write-only sysfile under the cache
>> device's directory.
> Can I ask why you don't like the automatic resize way? Why should the
> resize be manual?

Most of system administrators don’t like such silently automatic things. They want to extend the size explicitly, especially when there is other dependences in their configurations.


> Someone needs to trigger the block device resize, so shouldn't that be enough?

Yes, an explicit write operation on a sysfs file to trigger the resize. Then we can permit people to do the size extend explicit and avoid to change current behavior.

Thanks.

Coly Li

> 
> Andrea
>> 
>> 
>>> else
>>>                 err = "device busy";
>>>             mutex_unlock(&bch_register_lock);
>>> --
>>> 2.37.3
>>> 
>> 


  parent reply	other threads:[~2023-01-17 16:18 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-03 10:05 [RFC] Live resize of backing device Andrea Tomassetti
2022-08-04 14:32 ` Coly Li
2022-08-05 19:38 ` Eric Wheeler
2022-09-06 13:22   ` Andrea Tomassetti
2022-09-08  8:32     ` Andrea Tomassetti
2022-09-19 11:42       ` Andrea Tomassetti
2022-09-19 12:16         ` Coly Li
2022-12-09  8:57           ` Andrea Tomassetti
2022-12-09  9:36             ` Coly Li
2022-12-30 10:40       ` Coly Li
2023-01-11 16:01         ` Andrea Tomassetti
2023-01-17 13:08           ` Error messages with kernel 6.1.[56] Pierre Juhen
2023-01-17 16:08             ` Coly Li
2023-01-17 16:18           ` Coly Li [this message]
2023-01-25 10:07             ` [RFC] Live resize of backing device Andrea Tomassetti
2023-01-25 17:59               ` Coly Li
2023-01-27 12:44                 ` Andrea Tomassetti
2023-01-27 22:40                   ` Eric Wheeler
2023-01-31 10:20                     ` Andrea Tomassetti
2023-02-02 17:18                   ` Coly Li
2023-02-02 20:48                     ` Eric Wheeler
2023-02-03  2:41                       ` Coly Li
2023-02-19  9:39                   ` Coly Li
2023-02-20  8:27                     ` mingzhe
2023-02-20 12:29                       ` Coly Li
2023-02-22  8:42                         ` Andrea Tomassetti
2023-02-27 22:08                       ` Eric Wheeler
2023-02-28  2:46                         ` mingzhe
2023-01-27  2:53               ` [RFC] Live resize of bcache " Eric Wheeler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=B7718488-B00D-4F72-86CA-0FF335AD633F@suse.de \
    --to=colyli@suse.de \
    --cc=andrea.tomassetti-opensource@devo.com \
    --cc=bcache@lists.ewheeler.net \
    --cc=kent.overstreet@gmail.com \
    --cc=linux-bcache@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).