From: Eric Wheeler <bcache@lists.ewheeler.net>
To: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Cc: Coly Li <colyli@suse.de>,
Kent Overstreet <kent.overstreet@gmail.com>,
linux-bcache@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: [RFC] Live resize of bcache backing device
Date: Thu, 26 Jan 2023 18:53:41 -0800 (PST) [thread overview]
Message-ID: <9e68cc3-b13f-68e0-61c-59d5d044064@ewheeler.net> (raw)
In-Reply-To: <CAHykVA7_e1r9x2PfiDe8czH2WRaWtNxTJWcNmdyxJTSVGCxDHA@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 9032 bytes --]
On Wed, 25 Jan 2023, Andrea Tomassetti wrote:
> On Tue, Jan 17, 2023 at 5:18 PM Coly Li <colyli@suse.de> wrote:
> > > 2023年1月12日 00:01,Andrea Tomassetti <andrea.tomassetti-opensource@devo.com> 写道:
> > > On Fri, Dec 30, 2022 at 11:41 AM Coly Li <colyli@suse.de> wrote:
> > >> On 9/8/22 4:32 PM, Andrea Tomassetti wrote:
> > >>> From 59787372cf21af0b79e895578ae05b6586dfeb09 Mon Sep 17 00:00:00 2001
> > >>> From: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
> > >>> Date: Thu, 8 Sep 2022 09:47:55 +0200
> > >>> Subject: [PATCH] bcache: Add support for live resize of backing devices
> > >>>
> > >>> Here is the first version of the patch. There are some points I noted
> > >>> down
> > >>> that I would like to discuss with you:
> > >>> - I found it pretty convenient to hook the call of the new added
> > >>> function
> > >>> inside the `register_bcache`. In fact, every time (at least from my
> > >>> understandings) a disk changes size, it will trigger a new probe and,
> > >>> thus, `register_bcache` will be triggered. The only inconvenient
> > >>> is that, in case of success, the function will output
> > >>
> > >> The resize should be triggered manually, and not to do it automatically.
> > >>
> > >> You may create a sysfs file under the cached device's directory, name it
> > >> as "extend_size" or something else you think better.
This is the way it works for other blockdevs:
echo 1 > /sys/class/block/sdX/device/rescan
... but of course bcache doesn't have a "device" folder. Maybe a sysfs
flag named "rescan" or "resize" would be appropriate:
echo 1 > /sys/block/bcache0/bcache/resize
> > >> Then the sysadmin may extend the cached device size explicitly on a
> > >> predictable time.
> > >>>
> > >>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> > >>> index ba3909bb6bea..9a77caf2a18f 100644
> > >>> --- a/drivers/md/bcache/super.c
> > >>> +++ b/drivers/md/bcache/super.c
> > >>> @@ -2443,6 +2443,76 @@ static bool bch_is_open(dev_t dev)
> > >>> return bch_is_open_cache(dev) || bch_is_open_backing(dev);
> > >>> }
> > >>>
> > >>> +static bool bch_update_capacity(dev_t dev)
> > >>> +{
> > >>> + const size_t max_stripes = min_t(size_t, INT_MAX,
> > >>> + SIZE_MAX / sizeof(atomic_t));
> > >>> +
> > >>> + uint64_t n, n_old;
> > >>> + int nr_stripes_old;
> > >>> + bool res = false;
> > >>> +
> > >>> + struct bcache_device *d;
> > >>> + struct cache_set *c, *tc;
> > >>> + struct cached_dev *dcp, *t, *dc = NULL;
> > >>> +
> > >>> + uint64_t parent_nr_sectors;
> > >>> +
> > >>> + list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
> > >>> + list_for_each_entry_safe(dcp, t, &c->cached_devs, list)
> > >>> + if (dcp->bdev->bd_dev == dev) {
> > >>> + dc = dcp;
> > >>> + goto dc_found;
> > >>> + }
> > >>> +
> > >>> +dc_found:
> > >>> + if (!dc)
> > >>> + return false;
> > >>> +
> > >>> + parent_nr_sectors = bdev_nr_sectors(dc->bdev) - dc->sb.data_offset;
> > >>> +
> > >>> + if (parent_nr_sectors == bdev_nr_sectors(dc->disk.disk->part0))
> > >>> + return false;
> > >>> +
> > >>
> > >> The above code only handles whole disk using as cached device. If a
> > >> partition of a hard drive is used as cache device, and there are other
> > >> data after this partition, such condition should be handled as well. So
> > >> far I am fine to only extend size when using the whole hard drive as
> > >> cached device, but more code is necessary to check and only permits
> > >> size-extend for such condition.
Is it even possible to hot-resize a partition that is in use? I usually
get an error if a filesystem from sdb is mounted (or, presumably, if a
partition is in use as a backing device by bcache):
# blockdev --rereadpt /dev/sdb
blockdev: ioctl error on BLKRRPART: Device or resource busy
(I've always wanted a feature in Linux's partition code to support
hot-grow for non-overlapping partition changes that keep the original
starting sector, like when you grow the last partition, but I
digress...)
> > > I don't understand if there's a misalignment here so let me be more
> > > clear: this patch is intended to support the live resize of *backing
> > > devices*, is this what you mean with "cached device"?
> >
> > Yes, backing device is cached device.
> >
> >
> > > When I was working on the patch I didn't consider the possibility of
> > > live resizing the cache devices, but it should be trivial.
> > > So, as far as I understand, a partition cannot be used as a backing
> > > device, correct? The whole disk should be used as a backing device,
> > > that's why I'm checking and that's why this check should be correct.
> > > Am I wrong?
> >
> > Yes a partition can be used as a backing (cached) device. That is to
> > say, if a file system is format on top of the cached device, this file
> > system cannot be directly accessed via the partition. It has to be
> > accessed via the bcache device e.g. /dev/bcache0.
> > >> As I said, it should be a separated write-only sysfile under the cache
> > >> device's directory.
>
> > > Can I ask why you don't like the automatic resize way? Why should the
> > > resize be manual?
> >
> > Most of system administrators don’t like such silently automatic
> > things. They want to extend the size explicitly, especially when there
> > is other dependences in their configurations.
+1, I'm one of those administrators...
> What I was trying to say is that, in order to resize a block device, a
> manual command should be executed. So, this is already a "non-silent"
> automatic thing.
In some cases the "manual command" to which you refer may take weeks:
Lets say you have a big hardware RAID5 and add a disk. After the resize
completes the RAID controller reports a new disk size to the OS:
kernel: sd 2:0:0:1: [sdb] 503296888 512-byte logical blocks: (257 GB/239 GiB)
kernel: sdb: detected capacity change from 150317101056 to 257688006656
Are you suggesting that if `sdb` is the backing device to `bcache0` that
`bcache0` should be resized immediately when `sdb` detects a capacity
change?
I would recommend against it: while I can't control when the RAID
reconstruction finishes (since it takes 2 weeks for us to grow our
arrays), I would like to control when `bcache0` grows in case there is a
bug that triggers an IO hang (or worse). At least then we can grow
`bcache0` in a maintenance window in case there is a "surprise".
IMHO, this should be controlable by the admin and should be a _manual_
procedure (ie, echo 1 > resize)
> Moreover, if the block device has a FS on it, the FS needs to be
> manually grown with some special utilities, e.g. xfs_growfs.
I've never seen a block dev resize call userspace commands from the kernel
(like xfs_growfs or resize2fs). Does that exist? Again, to avoid
surprises, IMHO this should be a udev thing if someone wants it.
> So, again, another non-silent automatic step. Don't you agree? For
> example, to resize a qcow device attached to a VM I'm manually doing a
> `virsh blockresize`. As soon as I issue that command, the virtio_blk
> driver inside the VM detects the disk size change and calls the
> `set_capacity_and_notify` function.
Yes, that is expected and desired:
> Why then should bcache behave differently?
Because a VM presents itself as hardware. When you `virsh blockresize`
from _outside_ of the VM, then the result is effectively the same as a
RAID reconstruction completing and notifying the OS as in my example
above.
However, bcache _should_ call `set_capacity_and_notify` when
echo 1 > /sys/fs/bcache0/bcache/resize
happens, but I'm not convinced that it is a good idea to auto-resize
`bcache0` when the backing device `sdb` grows.
> If you're concerned that this can somehow break the
> behaviour-compatibility with older versions of the driver, can we
> protect this automatic discovery with an optional parameter? Will this
> be an option you will take into account?
I like the idea of supporting it as a feature for those who might want it,
but please default it off for the sanity of admins who like more control.
Maybe /sys/fs/bcache0/bcache/resize can be tri-state sysfs attribute:
echo 0 > /sys/fs/bcache0/bcache/resize # default
echo 1 > /sys/fs/bcache0/bcache/resize # one-shot, resets to 0
echo 2 > /sys/fs/bcache0/bcache/resize # auto-resize on bdev change
-Eric
>
> Thank you very much,
> Andrea
> >
> > > Someone needs to trigger the block device resize, so shouldn't that be enough?
> >
> > Yes, an explicit write operation on a sysfs file to trigger the resize. Then we can permit people to do the size extend explicit and avoid to change current behavior.
> >
> > Thanks.
> >
> > Coly Li
> >
> > >
> > > Andrea
> > >>
> > >>
> > >>> else
> > >>> err = "device busy";
> > >>> mutex_unlock(&bch_register_lock);
> > >>> --
> > >>> 2.37.3
> > >>>
> > >>
> >
>
prev parent reply other threads:[~2023-01-27 3:30 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-03 10:05 [RFC] Live resize of backing device Andrea Tomassetti
2022-08-04 14:32 ` Coly Li
2022-08-05 19:38 ` Eric Wheeler
2022-09-06 13:22 ` Andrea Tomassetti
2022-09-08 8:32 ` Andrea Tomassetti
2022-09-19 11:42 ` Andrea Tomassetti
2022-09-19 12:16 ` Coly Li
2022-12-09 8:57 ` Andrea Tomassetti
2022-12-09 9:36 ` Coly Li
2022-12-30 10:40 ` Coly Li
2023-01-11 16:01 ` Andrea Tomassetti
2023-01-17 13:08 ` Error messages with kernel 6.1.[56] Pierre Juhen
2023-01-17 16:08 ` Coly Li
2023-01-17 16:18 ` [RFC] Live resize of backing device Coly Li
2023-01-25 10:07 ` Andrea Tomassetti
2023-01-25 17:59 ` Coly Li
2023-01-27 12:44 ` Andrea Tomassetti
2023-01-27 22:40 ` Eric Wheeler
2023-01-31 10:20 ` Andrea Tomassetti
2023-02-02 17:18 ` Coly Li
2023-02-02 20:48 ` Eric Wheeler
2023-02-03 2:41 ` Coly Li
2023-02-19 9:39 ` Coly Li
2023-02-20 8:27 ` mingzhe
2023-02-20 12:29 ` Coly Li
2023-02-22 8:42 ` Andrea Tomassetti
2023-02-27 22:08 ` Eric Wheeler
2023-02-28 2:46 ` mingzhe
2023-01-27 2:53 ` Eric Wheeler [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9e68cc3-b13f-68e0-61c-59d5d044064@ewheeler.net \
--to=bcache@lists.ewheeler.net \
--cc=andrea.tomassetti-opensource@devo.com \
--cc=colyli@suse.de \
--cc=kent.overstreet@gmail.com \
--cc=linux-bcache@vger.kernel.org \
--cc=linux-block@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).