From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 67F51C47255 for ; Mon, 11 May 2020 11:00:11 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3AD892082E for ; Mon, 11 May 2020 11:00:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729553AbgEKLAK (ORCPT ); Mon, 11 May 2020 07:00:10 -0400 Received: from mx2.suse.de ([195.135.220.15]:33568 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728209AbgEKLAK (ORCPT ); Mon, 11 May 2020 07:00:10 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 2A8ABACC2; Mon, 11 May 2020 11:00:07 +0000 (UTC) To: Damien Le Moal , "linux-bcache@vger.kernel.org" Cc: "linux-block@vger.kernel.org" , Hannes Reinecke , Johannes Thumshirn References: <20200510165232.67500-1-colyli@suse.de> From: Coly Li Autocrypt: addr=colyli@suse.de; keydata= mQINBFYX6S8BEAC9VSamb2aiMTQREFXK4K/W7nGnAinca7MRuFUD4JqWMJ9FakNRd/E0v30F qvZ2YWpidPjaIxHwu3u9tmLKqS+2vnP0k7PRHXBYbtZEMpy3kCzseNfdrNqwJ54A430BHf2S GMVRVENiScsnh4SnaYjFVvB8SrlhTsgVEXEBBma5Ktgq9YSoy5miatWmZvHLFTQgFMabCz/P j5/xzykrF6yHo0rHZtwzQzF8rriOplAFCECp/t05+OeHHxjSqSI0P/G79Ll+AJYLRRm9til/ K6yz/1hX5xMToIkYrshDJDrUc8DjEpISQQPhG19PzaUf3vFpmnSVYprcWfJWsa2wZyyjRFkf J51S82WfclafNC6N7eRXedpRpG6udUAYOA1YdtlyQRZa84EJvMzW96iSL1Gf+ZGtRuM3k49H 1wiWOjlANiJYSIWyzJjxAd/7Xtiy/s3PRKL9u9y25ftMLFa1IljiDG+mdY7LyAGfvdtIkanr iBpX4gWXd7lNQFLDJMfShfu+CTMCdRzCAQ9hIHPmBeZDJxKq721CyBiGAhRxDN+TYiaG/UWT 7IB7LL4zJrIe/xQ8HhRO+2NvT89o0LxEFKBGg39yjTMIrjbl2ZxY488+56UV4FclubrG+t16 r2KrandM7P5RjR+cuHhkKseim50Qsw0B+Eu33Hjry7YCihmGswARAQABtBhDb2x5IExpIDxj b2x5bGlAc3VzZS5kZT6JAlYEEwEIAEACGyMHCwkIBwMCAQYVCAIJCgsEFgIDAQIeAQIXgBYh BOo+RS/0+Uhgjej60Mc5B5Nrffj8BQJcR84dBQkY++fuAAoJEMc5B5Nrffj8ixcP/3KAKg1X EcoW4u/0z+Ton5rCyb/NpAww8MuRjNW82UBUac7yCi1y3OW7NtLjuBLw5SaVG5AArb7IF3U0 qTOobqfl5XHsT0o5wFHZaKUrnHb6y7V3SplsJWfkP3JmOooJsQB3z3K96ZTkFelsNb0ZaBRu gV+LA4MomhQ+D3BCDR1it1OX/tpvm2uaDF6s/8uFtcDEM9eQeqATN/QAJ49nvU/I8zDSY9rc 0x9mP0x+gH4RccbnoPu/rUG6Fm1ZpLrbb6NpaYBBJ/V1BC4lIOjnd24bsoQrQmnJn9dSr60X 1MY60XDszIyzRw7vbJcUn6ZzPNFDxFFT9diIb+wBp+DD8ZlD/hnVpl4f921ZbvfOSsXAJrKB 1hGY17FPwelp1sPcK2mDT+pfHEMV+OQdZzD2OCKtza/5IYismJJm3oVUYMogb5vDNAw9X2aP XgwUuG+FDEFPamFMUwIfzYHcePfqf0mMsaeSgtA/xTxzx/0MLjUJHl46Bc0uKDhv7QUyGz0j Ywgr2mHTvG+NWQ/mDeHNGkcnsnp3IY7koDHnN2xMFXzY4bn9m8ctqKo2roqjCzoxD/njoAhf KBzdybLHATqJG/yiZSbCxDA1n/J4FzPyZ0rNHUAJ/QndmmVspE9syFpFCKigvvyrzm016+k+ FJ59Q6RG4MSy/+J565Xj+DNY3/dCuQINBFYX6S8BEADZP+2cl4DRFaSaBms08W8/smc5T2CO YhAoygZn71rB7Djml2ZdvrLRjR8Qbn0Q/2L2gGUVc63pJnbrjlXSx2LfAFE0SlfYIJ11aFdF 9w7RvqWByQjDJor3Z0fWvPExplNgMvxpD0U0QrVT5dIGTx9hadejCl/ug09Lr6MPQn+a4+qs aRWwgCSHaIuDkH3zI1MJXiqXXFKUzJ/Fyx6R72rqiMPHH2nfwmMu6wOXAXb7+sXjZz5Po9GJ g2OcEc+rpUtKUJGyeQsnCDxUcqJXZDBi/GnhPCcraQuqiQ7EGWuJfjk51vaI/rW4bZkA9yEP B9rBYngbz7cQymUsfxuTT8OSlhxjP3l4ZIZFKIhDaQeZMj8pumBfEVUyiF6KVSfgfNQ/5PpM R4/pmGbRqrAAElhrRPbKQnCkGWDr8zG+AjN1KF6rHaFgAIO7TtZ+F28jq4reLkur0N5tQFww wFwxzROdeLHuZjL7eEtcnNnzSkXHczLkV4kQ3+vr/7Gm65mQfnVpg6JpwpVrbDYQeOFlxZ8+ GERY5Dag4KgKa/4cSZX2x/5+KkQx9wHwackw5gDCvAdZ+Q81nm6tRxEYBBiVDQZYqO73stgT ZyrkxykUbQIy8PI+g7XMDCMnPiDncQqgf96KR3cvw4wN8QrgA6xRo8xOc2C3X7jTMQUytCz9 0MyV1QARAQABiQI8BBgBCAAmAhsMFiEE6j5FL/T5SGCN6PrQxzkHk2t9+PwFAlxHziAFCRj7 5/EACgkQxzkHk2t9+PxgfA//cH5R1DvpJPwraTAl24SUcG9EWe+NXyqveApe05nk15zEuxxd e4zFEjo+xYZilSveLqYHrm/amvQhsQ6JLU+8N60DZHVcXbw1Eb8CEjM5oXdbcJpXh1/1BEwl 4phsQMkxOTns51bGDhTQkv4lsZKvNByB9NiiMkT43EOx14rjkhHw3rnqoI7ogu8OO7XWfKcL CbchjJ8t3c2XK1MUe056yPpNAT2XPNF2EEBPG2Y2F4vLgEbPv1EtpGUS1+JvmK3APxjXUl5z 6xrxCQDWM5AAtGfM/IswVjbZYSJYyH4BQKrShzMb0rWUjkpXvvjsjt8rEXpZEYJgX9jvCoxt oqjCKiVLpwje9WkEe9O9VxljmPvxAhVqJjX62S+TGp93iD+mvpCoHo3+CcvyRcilz+Ko8lfO hS9tYT0HDUiDLvpUyH1AR2xW9RGDevGfwGTpF0K6cLouqyZNdhlmNciX48tFUGjakRFsxRmX K0Jx4CEZubakJe+894sX6pvNFiI7qUUdB882i5GR3v9ijVPhaMr8oGuJ3kvwBIA8lvRBGVGn 9xvzkQ8Prpbqh30I4NMp8MjFdkwCN6znBKPHdjNTwE5PRZH0S9J0o67IEIvHfH0eAWAsgpTz +jwc7VKH7vkvgscUhq/v1/PEWCAqh9UHy7R/jiUxwzw/288OpgO+i+2l11Y= Subject: Re: [RFC PATCH v2] bcache: export zoned information for backing device Message-ID: <3672c550-5453-5de5-5c5c-34ab8dafb017@suse.de> Date: Mon, 11 May 2020 19:00:00 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:68.0) Gecko/20100101 Thunderbird/68.7.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org On 2020/5/11 17:06, Damien Le Moal wrote: > On 2020/05/11 1:52, Coly Li wrote: >> This is a very basic zoned device support. With this patch, bcache >> device is able to, >> - Export zoned device attribution via sysfs >> - Response report zones request, e.g. by command 'blkzone report' >> But the bcache device is still NOT able to, >> - Response any zoned device management request or IOCTL command >> >> Here are the testings I have done, >> - read /sys/block/bcache0/queue/zoned, content is 'host-managed' >> - read /sys/block/bcache0/queue/nr_zones, content is number of zones >> including all zone types. >> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size >> in sectors. >> - run 'blkzone report /dev/bcache0', all zones information displayed. >> - run 'blkzone reset /dev/bcache0', operation is rejected with error >> information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed: >> Operation not supported" > > Weird. If report zones works, reset should also, modulo the zone size problem > for the first zone. You may get EINVAL, but not ENOTTY. > This is on purpose. Because reset the underlying zones layout may corrupt the bcache mapping between cache device and backing device. So such management commands are rejected. >> - Sequential writes by dd, I can see some zones' write pointer 'wptr' >> values updated. >> >> All of these are very basic testings, if you have better testing >> tools or cases, please offer me hint. >> >> Thanks in advance for your review and comments. >> >> Signed-off-by: Coly Li >> CC: Hannes Reinecke >> CC: Damien Le Moal >> CC: Johannes Thumshirn >> --- >> drivers/md/bcache/bcache.h | 10 +++++ >> drivers/md/bcache/request.c | 74 +++++++++++++++++++++++++++++++++++++ >> drivers/md/bcache/super.c | 51 +++++++++++++++++++++---- >> 3 files changed, 128 insertions(+), 7 deletions(-) >> >> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h >> index 74a9849ea164..0d298b48707f 100644 >> --- a/drivers/md/bcache/bcache.h >> +++ b/drivers/md/bcache/bcache.h >> @@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1); >> struct search; >> struct btree; >> struct keybuf; >> +struct bch_report_zones_args; >> >> struct keybuf_key { >> struct rb_node node; >> @@ -277,6 +278,8 @@ struct bcache_device { >> struct bio *bio, unsigned int sectors); >> int (*ioctl)(struct bcache_device *d, fmode_t mode, >> unsigned int cmd, unsigned long arg); >> + int (*report_zones)(struct bch_report_zones_args *args, >> + unsigned int nr_zones); >> }; >> >> struct io { >> @@ -748,6 +751,13 @@ struct bbio { >> struct bio bio; >> }; >> >> +struct bch_report_zones_args { >> + struct bcache_device *bcache_device; >> + sector_t sector; >> + void *orig_data; >> + report_zones_cb orig_cb; >> +}; >> + >> #define BTREE_PRIO USHRT_MAX >> #define INITIAL_PRIO 32768U >> >> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c >> index 71a90fbec314..bd50204caac7 100644 >> --- a/drivers/md/bcache/request.c >> +++ b/drivers/md/bcache/request.c >> @@ -1190,6 +1190,19 @@ blk_qc_t cached_dev_make_request(struct request_queue *q, struct bio *bio) >> } >> } >> >> + /* >> + * zone management request may change the data layout and content >> + * implicitly on backing device, which introduces unacceptible > > s/unacceptible/unacceptable > >> + * inconsistency between the backing device and the cache device. >> + * Therefore all zone management related request will be denied here. >> + */ >> + if (op_is_zone_mgmt(bio->bi_opf)) { >> + pr_err_ratelimited("zoned device management request denied."); >> + bio->bi_status = BLK_STS_NOTSUPP; >> + bio_endio(bio); >> + return BLK_QC_T_NONE; > > OK. That explains the operation not supported for "blkzone reset". I do not > think this is good. With this, the drive will be writable only exactly once, > without the possibility for the user to reset any zone and rewrite them. All > zone compliant file systems will fail (f2fs, zonefs, btrfs coming). > > At the very least REQ_OP_ZONE_RESET should be allowed and trigger an > invalidation on the cache device of all blocks of the zone that are cached. > Copied. Then I should find a method to invalid the cached data on SSD in a proper way. > Note: the zone management operations will need to be remapped like report zones, > but in reverse: the BIO start sector must be increase by the zone size. > Thanks for the hint :-) >> + } >> + >> generic_start_io_acct(q, >> bio_op(bio), >> bio_sectors(bio), >> @@ -1233,6 +1246,24 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode, >> if (dc->io_disable) >> return -EIO; >> >> + /* >> + * zone management ioctl commands may change the data layout >> + * and content implicitly on backing device, which introduces >> + * unacceptible inconsistency between the backing device and >> + * the cache device. Therefore all zone management related >> + * ioctl commands will be denied here. >> + */ >> + switch (cmd) { >> + case BLKRESETZONE: >> + case BLKOPENZONE: >> + case BLKCLOSEZONE: >> + case BLKFINISHZONE: >> + return -EOPNOTSUPP; > > Same comment as above. Of note is that only BLKRESETZONE will result in cache > inconsistency if not taken care of with an invalidation of the cached blocks of > the zone on the cache device. Open and close operations have no effect on data. > Finish zone will artificially increase the zone write pointer to the end of the > zone to make it full but without actually writing any data. So there is no need > I think to change anything on the cache device in that case. > Copied. I will handle this. One thing not cleared to me is, what is the purpose of BLKOPENZONE and BLKCLOSEZONE, is it used to update writer pointer only ? Or it is also used to set a specific zone to be offline ? >> + default: >> + /* Other commands fall through*/ >> + break; >> + } >> + >> return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg); >> } >> >> @@ -1261,6 +1292,48 @@ static int cached_dev_congested(void *data, int bits) >> return ret; >> } >> >> +static int cached_dev_report_zones_cb(struct blk_zone *zone, >> + unsigned int idx, >> + void *data) >> +{ >> + struct bch_report_zones_args *args = data; >> + struct bcache_device *d = args->bcache_device; >> + struct cached_dev *dc = container_of(d, struct cached_dev, disk); >> + unsigned long data_offset = dc->sb.data_offset; >> + >> + if (zone->start >= data_offset) { >> + zone->start -= data_offset; >> + zone->wp -= data_offset; > > Since the zone that contains the super block has to be ignored, the remapping of > the zone start can be done unconditionally. For the write pointer remapping, you > need to handle several cases: conventional zones, full zones and read-only zones > do not have a valid write pointer value, so no updating. You also need to skip > offline zones. > Copied, I will fix here in next version. >> + } else { >> + /* >> + * Normally the first zone is conventional zone, >> + * we don't need to worry about how to maintain >> + * the write pointer. >> + * But the zone->len is special, because the >> + * sector 0 on bcache device is 8KB offset on >> + * backing device indeed. >> + */ >> + zone->len -= data_offset; > > This should not be necessary if the first zone containing the super block is > skipped entirely. > Yes, it will be. >> + } >> + >> + return args->orig_cb(zone, idx, args->orig_data); >> +} >> + >> +static int cached_dev_report_zones(struct bch_report_zones_args *args, >> + unsigned int nr_zones) >> +{ >> + struct bcache_device *d = args->bcache_device; >> + struct cached_dev *dc = container_of(d, struct cached_dev, disk); >> + sector_t sector = args->sector + dc->sb.data_offset; >> + >> + /* sector is real LBA of backing device */ >> + return blkdev_report_zones(dc->bdev, >> + sector, >> + nr_zones, >> + cached_dev_report_zones_cb, >> + args); >> +} >> + >> void bch_cached_dev_request_init(struct cached_dev *dc) >> { >> struct gendisk *g = dc->disk.disk; >> @@ -1268,6 +1341,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc) >> g->queue->backing_dev_info->congested_fn = cached_dev_congested; >> dc->disk.cache_miss = cached_dev_cache_miss; >> dc->disk.ioctl = cached_dev_ioctl; >> + dc->disk.report_zones = cached_dev_report_zones; >> } >> >> /* Flash backed devices */ >> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c >> index d98354fa28e3..b7d496040cee 100644 >> --- a/drivers/md/bcache/super.c >> +++ b/drivers/md/bcache/super.c >> @@ -679,10 +679,31 @@ static int ioctl_dev(struct block_device *b, fmode_t mode, >> return d->ioctl(d, mode, cmd, arg); >> } >> >> +static int report_zones_dev(struct gendisk *disk, >> + sector_t sector, >> + unsigned int nr_zones, >> + report_zones_cb cb, >> + void *data) >> +{ >> + struct bcache_device *d = disk->private_data; >> + struct bch_report_zones_args args = { >> + .bcache_device = d, >> + .sector = sector, >> + .orig_data = data, >> + .orig_cb = cb, >> + }; >> + >> + if (d->report_zones) >> + return d->report_zones(&args, nr_zones); > > It looks to me like this is not necessary. This function could just be > cached_dev_report_zones() and you can drop the dc->disk.report_zones method. > Yes, I will fix it. >> + >> + return -EOPNOTSUPP; >> +} >> + >> static const struct block_device_operations bcache_ops = { >> .open = open_dev, >> .release = release_dev, >> .ioctl = ioctl_dev, >> + .report_zones = report_zones_dev, >> .owner = THIS_MODULE, >> }; >> >> @@ -815,8 +836,12 @@ static void bcache_device_free(struct bcache_device *d) >> closure_debug_destroy(&d->cl); >> } >> >> -static int bcache_device_init(struct bcache_device *d, unsigned int block_size, >> - sector_t sectors, make_request_fn make_request_fn) >> +static int bcache_device_init(struct cached_dev *dc, >> + struct bcache_device *d, >> + unsigned int block_size, >> + sector_t sectors, >> + make_request_fn make_request_fn) >> + >> { >> struct request_queue *q; >> const size_t max_stripes = min_t(size_t, INT_MAX, >> @@ -886,6 +911,17 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size, >> blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, d->disk->queue); >> blk_queue_flag_set(QUEUE_FLAG_DISCARD, d->disk->queue); >> >> + /* >> + * If this is for backing device registration, and it is an >> + * zoned device (e.g. host-managed S.M.R. hard drive), set >> + * up zoned device attribution properly for sysfs interface. >> + */ >> + if (dc && bdev_is_zoned(dc->bdev)) { >> + q->limits.zoned = bdev_zoned_model(dc->bdev); >> + q->nr_zones = blkdev_nr_zones(dc->bdev->bd_disk); >> + q->limits.chunk_sectors = bdev_zone_sectors(dc->bdev); > > You need to call blk_revalidate_disk_zones() here: > > blk_revalidate_disk_zones(dc->bdev->bd_disk); > > but call it *after* setting the bc device capacity to > > get_capacity(dc->bdev->bd_disk) - bdev_zone_sectors(dc->bdev); > > Which I think is in fact the sectors argument to this function ? > > With that information blk_revalidate_disk_zones() will check the zones and set > q->nr_zones. > > This is something I didn't notice. Yes I should revalidate teh zones. Will do it in next version. >> + } >> + >> blk_queue_write_cache(q, true, true); >> >> return 0; >> @@ -1337,9 +1373,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size) >> dc->partial_stripes_expensive = >> q->limits.raid_partial_stripes_expensive; >> >> - ret = bcache_device_init(&dc->disk, block_size, >> - dc->bdev->bd_part->nr_sects - dc->sb.data_offset, >> - cached_dev_make_request); >> + ret = bcache_device_init(dc, &dc->disk, block_size, >> + dc->bdev->bd_part->nr_sects - dc->sb.data_offset, > > dc->sb.data_offset should be the device zone size (chunk sectors) to skip the > entire first zone and preserve the "all zones have the same size" constraint. > Sure, and the bcache-tools should be fixed to recognize zoned device as well. >> + cached_dev_make_request); >> if (ret) >> return ret; >> >> @@ -1451,8 +1487,9 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u) >> >> kobject_init(&d->kobj, &bch_flash_dev_ktype); >> >> - if (bcache_device_init(d, block_bytes(c), u->sectors, >> - flash_dev_make_request)) >> + if (bcache_device_init(NULL, d, block_bytes(c), >> + u->sectors, >> + flash_dev_make_request)) >> goto err; >> >> bcache_device_attach(d, c, u - c->uuids); >> > > Thanks for your review. I will post v2 soon. Coly Li