Linux-Block Archive on lore.kernel.org
* [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device
@ 2020-05-22 12:18 Coly Li
  2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Coly Li @ 2020-05-22 12:18 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Hi folks,

With this series, bcache can support a zoned device (e.g. a host
managed SMR hard drive) as the backing device. Currently writeback
mode is not supported yet; it is on the to-do list (it requires an
on-disk super block format change).

The first patch makes bcache export the zoned device information to
upper layer code, for example for formatting zonefs on top of the
bcache device. By default, zone 0 of the zoned device is fully
reserved for the bcache super block, therefore the reported number of
zones is 1 less than the exact number of zones of the physical SMR
hard drive.
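
(As a concrete example from the tests in patch 1: the 14TB SMR drive
used there has 52156 zones; with zone 0 reserved, the bcache device on
top of it reports 52155.)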

The second patch handles zone management commands for bcache. These
zone management commands are wrapped as zone management bios. For
REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL zone management bios,
before forwarding the bio to the backing device, all cached data
covered by the resetting zone(s) must be invalidated to keep data
consistency. The rest of the zone management bios just have their
bi_sector shifted by data_offset and are simply forwarded to the zoned
backing device.

The third patch makes sure that after the bcache device starts, the
cache mode cannot be changed to writeback via the sysfs interface.
bcache-tools is modified to notify users and convert writeback mode to
the default writethrough mode when making a bcache device.

There is one thing not addressed by this series: re-writing the
bcache super block after a REQ_OP_ZONE_RESET_ALL command. Devices with
all sequential zones may appear quite soon, but it is OK to make
bcache support such all-seq-zones devices a bit later.

Now a bcache device created with a zoned SMR drive can pass these
test cases:
- read /sys/block/bcache0/queue/zoned, content is 'host-managed'
- read /sys/block/bcache0/queue/nr_zones, content is number of zones
  excluding zone 0 of the backing device (reserved for bcache super
  block).
- read /sys/block/bcache0/queue/chunk_sectors, content is zone size
  in sectors.
- run 'blkzone report /dev/bcache0', all zones information displayed.
- run 'blkzone reset -o <zone LBA> -c <zones number> /dev/bcache0',
  conventional zones will reject the command, sequential zones covered
  by the command range will reset their write pointer to the start LBA
  of their zones. If <zone LBA> is 0 and <zones number> covers all zones,
  REQ_OP_ZONE_RESET_ALL command will be received and handled by bcache
  device properly.
- zonefs can be created on top of the bcache device, with/without cache
  device attached. All sequential direct writes and random reads work well
  and zone reset by 'truncate -s 0 <zone file>' works too.
- Writeback cache mode is not supported yet.

Now all previous code review comments are addressed in this RFC
version. Please don't hesitate to offer your opinion on it.

Thanks in advance for your help.

Coly Li
---
Changelog:
v4: another improved version without any other generic block change.
v3: an improved version depends on other generic block layer changes.
v2: the first RFC version for comments and review.
v1: the initial version posted just for information.


Coly Li (3):
  bcache: export bcache zone information for zoned backing device
  bcache: handle zone management bios for bcache device
  bcache: reject writeback cache mode for zoned backing device

 drivers/md/bcache/bcache.h  |  10 +++
 drivers/md/bcache/request.c | 168 +++++++++++++++++++++++++++++++++++-
 drivers/md/bcache/super.c   |  98 ++++++++++++++++++++-
 drivers/md/bcache/sysfs.c   |   5 ++
 4 files changed, 279 insertions(+), 2 deletions(-)

-- 
2.25.0



* [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
  2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
@ 2020-05-22 12:18 ` Coly Li
  2020-05-25  1:10   ` Damien Le Moal
  2020-05-22 12:18 ` [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device Coly Li
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Coly Li @ 2020-05-22 12:18 UTC (permalink / raw)
  To: linux-bcache
  Cc: linux-block, Coly Li, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn

When using a zoned device, e.g. an SMR hard drive, as the backing
device, having bcache export the zoned device information makes it
possible for upper layer code to accelerate hot READ I/O requests.

This patch adds the report_zones method for a bcache device which has
a zoned device as its backing device. Now such a bcache device can be
treated as a zoned device; configured to writethrough, writearound or
none mode, zonefs can be formatted on top of it.

Here is some simple performance data for read requests via zonefs on top of
bcache. The cache device of bcache is a 4TB NVMe SSD, the backing device
is a 14TB host managed SMR hard drive. The formatted zonefs has 52155
zone files, 523 of them are for convential zones (1 zone is reserved
for bcache super block and not reported), 51632 of them are for
sequential zones.

The first run, reading the first 4KB from all the zone files with 50
processes, takes 5 minutes 55 seconds. The second run takes only 12
seconds because all the read requests hit the cache device.

A 29x speedup is as expected for the ideal case when all READ I/Os
hit the NVMe cache device.

Besides providing the report_zones method of the bcache gendisk
structure, this patch also initializes the following zoned device
attributes for the bcache device,
- zones number: the total zones number minus reserved zone(s) for bcache
  super block.
- zone size: same size as reported from the underlying zoned device
- zone mode: same mode as reported from the underlying zoned device
Currently blk_revalidate_disk_zones() does not accept non-mq drivers, so
all the above attributes are initialized mannally in bcache code.

This patch provides the report_zones method only. Handling all zone
management commands will be addressed in following patches.
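
(For reference, a minimal sketch of how an upper layer consumes the
report_zones interface that this patch wires up; count_seq_zones() and
count_seq_cb() are hypothetical helpers, not part of bcache:)

static int count_seq_cb(struct blk_zone *zone, unsigned int idx,
			void *data)
{
	unsigned int *nr_seq = data;

	/* called once per reported zone, after the LBA conversion */
	if (zone->type == BLK_ZONE_TYPE_SEQWRITE_REQ)
		(*nr_seq)++;

	return 0;
}

static int count_seq_zones(struct block_device *bdev, unsigned int *nr_seq)
{
	*nr_seq = 0;

	/* returns the number of zones reported, or a negative errno */
	return blkdev_report_zones(bdev, 0, BLK_ALL_ZONES,
				   count_seq_cb, nr_seq);
}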

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
Changelog:
v4: the version without any generic block layer change.
v3: the version depends on other generic block layer changes.
v2: an improved version for comments.
v1: initial version.
 drivers/md/bcache/bcache.h  | 10 ++++
 drivers/md/bcache/request.c | 69 ++++++++++++++++++++++++++
 drivers/md/bcache/super.c   | 96 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 174 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 74a9849ea164..0d298b48707f 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
 struct search;
 struct btree;
 struct keybuf;
+struct bch_report_zones_args;
 
 struct keybuf_key {
 	struct rb_node		node;
@@ -277,6 +278,8 @@ struct bcache_device {
 			  struct bio *bio, unsigned int sectors);
 	int (*ioctl)(struct bcache_device *d, fmode_t mode,
 		     unsigned int cmd, unsigned long arg);
+	int (*report_zones)(struct bch_report_zones_args *args,
+			    unsigned int nr_zones);
 };
 
 struct io {
@@ -748,6 +751,13 @@ struct bbio {
 	struct bio		bio;
 };
 
+struct bch_report_zones_args {
+	struct bcache_device *bcache_device;
+	sector_t sector;
+	void *orig_data;
+	report_zones_cb orig_cb;
+};
+
 #define BTREE_PRIO		USHRT_MAX
 #define INITIAL_PRIO		32768U
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 71a90fbec314..34f63da2338d 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1233,6 +1233,30 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
 	if (dc->io_disable)
 		return -EIO;
 
+	/*
+	 * All zoned device ioctl commands are handled in
+	 * other code paths,
+	 * - BLKREPORTZONE: by report_zones method of bcache_ops.
+	 * - BLKRESETZONE/BLKOPENZONE/BLKCLOSEZONE/BLKFINISHZONE: all handled
+	 *   by bio code path.
+	 * - BLKGETZONESZ/BLKGETNRZONES: directly handled inside generic block
+	 *   ioctl handler blkdev_common_ioctl().
+	 */
+	switch (cmd) {
+	case BLKREPORTZONE:
+	case BLKRESETZONE:
+	case BLKGETZONESZ:
+	case BLKGETNRZONES:
+	case BLKOPENZONE:
+	case BLKCLOSEZONE:
+	case BLKFINISHZONE:
+		pr_warn("Zoned device ioctl cmd should not be here.\n");
+		return -EOPNOTSUPP;
+	default:
+		/* Other commands  */
+		break;
+	}
+
 	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
 }
 
@@ -1261,6 +1285,50 @@ static int cached_dev_congested(void *data, int bits)
 	return ret;
 }
 
+/*
+ * The callback routine to parse a specific zone from all reporting
+ * zones. args->orig_cb() is the upper layer report zones callback,
+ * which should be called after the LBA conversion.
+ * Notice: all zones after zone 0 will be reported, including the
+ * offlined zones; how to handle the different types of zones is
+ * fully decided by the upper layer who calls for reporting zones of
+ * the bcache device.
+ */
+static int cached_dev_report_zones_cb(struct blk_zone *zone,
+				      unsigned int idx,
+				      void *data)
+{
+	struct bch_report_zones_args *args = data;
+	struct bcache_device *d = args->bcache_device;
+	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+	unsigned long data_offset = dc->sb.data_offset;
+
+	/* Zone 0 should not be reported */
+	BUG_ON(zone->start < data_offset);
+
+	/* convert back to LBA of the bcache device */
+	zone->start -= data_offset;
+	zone->wp -= data_offset;
+
+	return args->orig_cb(zone, idx, args->orig_data);
+}
+
+static int cached_dev_report_zones(struct bch_report_zones_args *args,
+				   unsigned int nr_zones)
+{
+	struct bcache_device *d = args->bcache_device;
+	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+	/* skip zone 0 which is fully occupied by bcache super block */
+	sector_t sector = args->sector + dc->sb.data_offset;
+
+	/* sector is real LBA of backing device */
+	return blkdev_report_zones(dc->bdev,
+				   sector,
+				   nr_zones,
+				   cached_dev_report_zones_cb,
+				   args);
+}
+
 void bch_cached_dev_request_init(struct cached_dev *dc)
 {
 	struct gendisk *g = dc->disk.disk;
@@ -1268,6 +1336,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
 	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
 	dc->disk.cache_miss			= cached_dev_cache_miss;
 	dc->disk.ioctl				= cached_dev_ioctl;
+	dc->disk.report_zones			= cached_dev_report_zones;
 }
 
 /* Flash backed devices */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index d98354fa28e3..d5da7ad5157d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -679,10 +679,36 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
 	return d->ioctl(d, mode, cmd, arg);
 }
 
+static int report_zones_dev(struct gendisk *disk,
+			    sector_t sector,
+			    unsigned int nr_zones,
+			    report_zones_cb cb,
+			    void *data)
+{
+	struct bcache_device *d = disk->private_data;
+	struct bch_report_zones_args args = {
+		.bcache_device = d,
+		.sector = sector,
+		.orig_data = data,
+		.orig_cb = cb,
+	};
+
+	/*
+	 * only bcache device with backing device has
+	 * report_zones method, flash device does not.
+	 */
+	if (d->report_zones)
+		return d->report_zones(&args, nr_zones);
+
+	/* flash dev does not have report_zones method */
+	return -EOPNOTSUPP;
+}
+
 static const struct block_device_operations bcache_ops = {
 	.open		= open_dev,
 	.release	= release_dev,
 	.ioctl		= ioctl_dev,
+	.report_zones	= report_zones_dev,
 	.owner		= THIS_MODULE,
 };
 
@@ -817,6 +843,7 @@ static void bcache_device_free(struct bcache_device *d)
 
 static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
 			      sector_t sectors, make_request_fn make_request_fn)
+
 {
 	struct request_queue *q;
 	const size_t max_stripes = min_t(size_t, INT_MAX,
@@ -1307,6 +1334,67 @@ static void cached_dev_flush(struct closure *cl)
 	continue_at(cl, cached_dev_free, system_wq);
 }
 
+static inline int cached_dev_data_offset_check(struct cached_dev *dc)
+{
+	if (!bdev_is_zoned(dc->bdev))
+		return 0;
+
+	/*
+	 * If the backing hard drive is zoned device, sb.data_offset
+	 * should be aligned to zone size, which is automatically
+	 * handled by 'bcache' util of bcache-tools. If the data_offset
+	 * is not aligned to zone size, it means the bcache-tools is
+	 * outdated.
+	 */
+	if (dc->sb.data_offset & (bdev_zone_sectors(dc->bdev) - 1)) {
+		pr_err("data_offset %llu is not aligned to zone size %llu, please update bcache-tools and re-make the zoned backing device.\n",
+			dc->sb.data_offset, bdev_zone_sectors(dc->bdev));
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * Initialize zone information for the bcache device, this function
+ * assumes the bcache device has a cached device (dc != NULL), and
+ * the cached device is zoned device (bdev_is_zoned(dc->bdev) == true).
+ *
+ * The following zone information of the bcache device will be set,
+ * - zoned mode, same as the mode of zoned backing device
+ * - zone size in sectors, same as the zoned backing device
+ * - zones number, it is zones number of zoned backing device minus the
+ *   reserved zones for bcache super blocks.
+ */
+static int bch_cached_dev_zone_init(struct cached_dev *dc)
+{
+	struct request_queue *d_q, *b_q;
+	enum blk_zoned_model mode;
+
+	if (!bdev_is_zoned(dc->bdev))
+		return 0;
+
+	/* queue of bcache device */
+	d_q = dc->disk.disk->queue;
+	/* queue of backing device */
+	b_q = bdev_get_queue(dc->bdev);
+
+	mode = blk_queue_zoned_model(b_q);
+	if (mode != BLK_ZONED_NONE) {
+		d_q->limits.zoned = mode;
+		blk_queue_chunk_sectors(d_q, b_q->limits.chunk_sectors);
+		/*
+		 * (dc->sb.data_offset / q->limits.chunk_sectors) is the
+		 * zones number reserved for bcache super block. By default
+		 * it is set to 1 by bcache-tools.
+		 */
+		d_q->nr_zones = b_q->nr_zones -
+			(dc->sb.data_offset / d_q->limits.chunk_sectors);
+	}
+
+	return 0;
+}
+
 static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 {
 	int ret;
@@ -1333,6 +1421,10 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 
 	dc->disk.stripe_size = q->limits.io_opt >> 9;
 
+	ret = cached_dev_data_offset_check(dc);
+	if (ret)
+		return ret;
+
 	if (dc->disk.stripe_size)
 		dc->partial_stripes_expensive =
 			q->limits.raid_partial_stripes_expensive;
@@ -1355,7 +1447,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 
 	bch_cached_dev_request_init(dc);
 	bch_cached_dev_writeback_init(dc);
-	return 0;
+	ret = bch_cached_dev_zone_init(dc);
+
+	return ret;
 }
 
 /* Cached device - bcache superblock */
-- 
2.25.0



* [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device
  2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
  2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
@ 2020-05-22 12:18 ` Coly Li
  2020-05-25  1:24   ` Damien Le Moal
  2020-05-22 12:18 ` [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device Coly Li
  2020-05-25  5:25 ` [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Damien Le Moal
  3 siblings, 1 reply; 10+ messages in thread
From: Coly Li @ 2020-05-22 12:18 UTC (permalink / raw)
  To: linux-bcache
  Cc: linux-block, Coly Li, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn

Fortunately the zoned device management interface is designed to use
bio operations, so the bcache code just needs a few small changes to
handle the zone management bios.

The bcache driver needs to handle 5 zone management bios for now; all of
them are handled by cached_dev_nodata() since their bi_size values
are 0.
- REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_FINISH:
    The LBA conversion is already done in cached_dev_make_request(),
  so simply sending such zone management bios to the backing device is
  enough.
- REQ_OP_ZONE_RESET, REQ_OP_ZONE_RESET_ALL:
    If there is cached data (clean or dirty) on cache device covered
  by the LBA range of resetting zone(s), such cached data must be
  invalidated from the cache device before forwarding the zone reset
  bios to the backing device. Otherwise data inconsistency or further
  corruption may happen from the view of the bcache device.
    The difference between REQ_OP_ZONE_RESET_ALL and REQ_OP_ZONE_RESET
  is: when the zone management bio is to reset all zones, the number of
  zones reported by the bcache device (s->d->disk->queue->nr_zones) is
  passed into bch_data_invalidate_zones() as the 'size_t nr_zones'
  parameter; when resetting a single zone, nr_zones is just 1. (A
  worked example of the key splitting follows this list.)
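
(As a worked example of the key splitting in bch_data_invalidate_zones():
KEY_SIZE_BITS is 16 in bcache, so one key covers at most 1U << 15 = 32768
sectors (16 MiB), and a key without pointers occupies two uint64_t keylist
slots. Assuming 256 MiB zones, invalidating one 524288-sector zone needs
16 keys, i.e. 32 keylist slots.)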

By supporting zone managememnt bios with this patch, now a bcache device
can be formatted by zonefs, and the zones can be reset by truncate -s 0
on the zone files under seq/ directory. Supporting REQ_OP_ZONE_RESET_ALL
makes the whole disk zones reset much faster. In my testing on a 14TB
zoned SMR hard drive, resetting the 51632 seq zones one by one with
REQ_OP_ZONE_RESET bios takes 4m34s, while sending a single
REQ_OP_ZONE_RESET_ALL bio takes 12s, which is about 22x faster.

REQ_OP_ZONE_RESET_ALL is supported by bcache only when the backing
device supports it. So QUEUE_FLAG_ZONE_RESETALL is set on the bcache
queue only when the backing device's queue has this flag, which can be
checked by calling blk_queue_zone_resetall() on the backing device's
request queue.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
Changelog:
v2: an improved version without any generic block layer change.
v1: initial version depends on other generic block layer changes.

 drivers/md/bcache/request.c | 99 ++++++++++++++++++++++++++++++++++++-
 drivers/md/bcache/super.c   |  2 +
 2 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 34f63da2338d..700b8ab5dec9 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1052,18 +1052,115 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
 	continue_at(cl, cached_dev_write_complete, NULL);
 }
 
+/*
+ * Invalidate the LBA range on cache device which is covered by
+ * the resetting zones.
+ */
+static int bch_data_invalidate_zones(struct closure *cl,
+				      size_t zone_sector,
+				      size_t nr_zones)
+{
+	struct search *s = container_of(cl, struct search, cl);
+	struct data_insert_op *iop = &s->iop;
+	int max_keys, free_keys;
+	size_t start = zone_sector;
+	int ret;
+
+	max_keys = (block_bytes(iop->c) - sizeof(struct jset)) /
+		   sizeof(uint64_t);
+	bch_keylist_init(&iop->insert_keys);
+	ret = bch_keylist_realloc(&iop->insert_keys, max_keys, iop->c);
+	if (ret < 0)
+		return -ENOMEM;
+
+	while (nr_zones-- > 0) {
+		atomic_t *journal_ref = NULL;
+		size_t size = s->d->disk->queue->limits.chunk_sectors;
+more_keys:
+		bch_keylist_reset(&iop->insert_keys);
+		free_keys = max_keys;
+
+		while (size > 0 && free_keys >= 2) {
+			size_t sectors = min_t(size_t, size,
+					       1U << (KEY_SIZE_BITS - 1));
+
+			bch_keylist_add(&iop->insert_keys,
+					&KEY(iop->inode, start, sectors));
+			start += sectors;
+			size -= sectors;
+			free_keys -= 2;
+		}
+
+		/* Insert invalidate keys into btree */
+		journal_ref = bch_journal(iop->c, &iop->insert_keys, NULL);
+		if (!journal_ref) {
+			ret = -EIO;
+			break;
+		}
+
+		ret = bch_btree_insert(iop->c,
+				&iop->insert_keys, journal_ref, NULL);
+		atomic_dec_bug(journal_ref);
+		if (ret < 0)
+			break;
+
+		if (size)
+			goto more_keys;
+	}
+
+	bch_keylist_free(&iop->insert_keys);
+
+	return ret;
+}
+
 static void cached_dev_nodata(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
 	struct bio *bio = &s->bio.bio;
+	int nr_zones = 1;
 
 	if (s->iop.flush_journal)
 		bch_journal_meta(s->iop.c, cl);
 
-	/* If it's a flush, we send the flush to the backing device too */
+	if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL ||
+	    bio_op(bio) == REQ_OP_ZONE_RESET) {
+		int err = 0;
+		/*
+		 * If this is REQ_OP_ZONE_RESET_ALL bio, cached data
+		 * covered by all zones should be invalidate from the
+		 * cache device.
+		 */
+		if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL)
+			nr_zones = s->d->disk->queue->nr_zones;
+
+		err = bch_data_invalidate_zones(
+			cl, bio->bi_iter.bi_sector, nr_zones);
+
+		if (err < 0) {
+			s->iop.status = errno_to_blk_status(err);
+			/* debug, should be removed before post patch */
+			bio->bi_status = BLK_STS_TIMEOUT;
+			/* set by bio_cnt_set() in do_bio_hook() */
+			bio_put(bio);
+			/*
+			 * Invalidate cached data fails, don't send
+			 * the zone reset bio to backing device and
+			 * return failure. Otherwise potential data
+			 * corruption on bcache device may happen.
+			 */
+			goto continue_bio_complete;
+		}
+
+	}
+
+	/*
+	 * For flush or zone management bios, of course
+	 * they should be sent to the backing device too.
+	 */
 	bio->bi_end_io = backing_request_endio;
 	closure_bio_submit(s->iop.c, bio, cl);
 
+continue_bio_complete:
 	continue_at(cl, cached_dev_bio_complete, NULL);
 }
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index d5da7ad5157d..70c31950ec1b 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1390,6 +1390,8 @@ static int bch_cached_dev_zone_init(struct cached_dev *dc)
 		 */
 		d_q->nr_zones = b_q->nr_zones -
 			(dc->sb.data_offset / d_q->limits.chunk_sectors);
+		if (blk_queue_zone_resetall(b_q))
+			blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, d_q);
 	}
 
 	return 0;
-- 
2.25.0



* [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device
  2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
  2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
  2020-05-22 12:18 ` [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device Coly Li
@ 2020-05-22 12:18 ` Coly Li
  2020-05-25  1:26   ` Damien Le Moal
  2020-05-25  5:25 ` [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Damien Le Moal
  3 siblings, 1 reply; 10+ messages in thread
From: Coly Li @ 2020-05-22 12:18 UTC (permalink / raw)
  To: linux-bcache
  Cc: linux-block, Coly Li, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn

Currently we don't support writeback mode for a zoned device as the
backing device, so reject it via the sysfs interface.

This rejection will be removed once writeback cache mode support for
zoned devices is done.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/md/bcache/sysfs.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 323276994aab..41bdbc42a17d 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -359,6 +359,11 @@ STORE(__cached_dev)
 		if (v < 0)
 			return v;
 
+		if (bdev_is_zoned(dc->bdev) &&
+		    (unsigned int) v == CACHE_MODE_WRITEBACK) {
+			pr_err("writeback mode is not supported for zoned backing device.\n");
+			return -ENOTSUPP;
+		}
 		if ((unsigned int) v != BDEV_CACHE_MODE(&dc->sb)) {
 			SET_BDEV_CACHE_MODE(&dc->sb, v);
 			bch_write_bdev_super(dc, NULL);
-- 
2.25.0



* Re: [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
  2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
@ 2020-05-25  1:10   ` Damien Le Moal
  2020-06-01 12:34     ` Coly Li
  0 siblings, 1 reply; 10+ messages in thread
From: Damien Le Moal @ 2020-05-25  1:10 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, Hannes Reinecke, Johannes Thumshirn

On 2020/05/22 21:18, Coly Li wrote:
> When using a zoned device, e.g. an SMR hard drive, as the backing
> device, having bcache export the zoned device information makes it
> possible for upper layer code to accelerate hot READ I/O requests.
> 
> This patch adds the report_zones method for a bcache device which has
> a zoned device as its backing device. Now such a bcache device can be
> treated as a zoned device; configured to writethrough, writearound or
> none mode, zonefs can be formatted on top of it.
> 
> Here is some simple performance data for read requests via zonefs on top of
> bcache. The cache device of bcache is a 4TB NVMe SSD, the backing device
> is a 14TB host managed SMR hard drive. The formatted zonefs has 52155
> zone files, 523 of them are for convential zones (1 zone is reserved

s/convential/conventional

> for bcache super block and not reported), 51632 of them are for
> sequential zones.
> 
> The first run, reading the first 4KB from all the zone files with 50
> processes, takes 5 minutes 55 seconds. The second run takes only 12
> seconds because all the read requests hit the cache device.

Did you write anything first to the bcache device ? Otherwise, all zonefs files
will be empty and there is not going to be any file access... Question though:
when writing to a bcache device with writethrough mode, does the data go to the
SSD cache too ? Or is it written only to the backend device ?

> 
> A 29x speedup is as expected for the ideal case when all READ I/Os
> hit the NVMe cache device.
> 
> Besides providing the report_zones method of the bcache gendisk
> structure, this patch also initializes the following zoned device
> attributes for the bcache device,
> - zones number: the total zones number minus reserved zone(s) for bcache

s/zones number/number of zones

>   super block.
> - zone size: same size as reported from the underlying zoned device
> - zone mode: same mode as reported from the underlying zoned device

s/zone mode/zoned model

> Currently blk_revalidate_disk_zones() does not accept non-mq drivers, so
> all the above attributes are initialized mannally in bcache code.

s/mannally/manually

> 
> This patch provides the report_zones method only. Handling all zone
> management commands will be addressed in following patches.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> Changelog:
> v4: the version without any generic block layer change.
> v3: the version depends on other generic block layer changes.
> v2: an improved version for comments.
> v1: initial version.
>  drivers/md/bcache/bcache.h  | 10 ++++
>  drivers/md/bcache/request.c | 69 ++++++++++++++++++++++++++
>  drivers/md/bcache/super.c   | 96 ++++++++++++++++++++++++++++++++++++-
>  3 files changed, 174 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 74a9849ea164..0d298b48707f 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
>  struct search;
>  struct btree;
>  struct keybuf;
> +struct bch_report_zones_args;
>  
>  struct keybuf_key {
>  	struct rb_node		node;
> @@ -277,6 +278,8 @@ struct bcache_device {
>  			  struct bio *bio, unsigned int sectors);
>  	int (*ioctl)(struct bcache_device *d, fmode_t mode,
>  		     unsigned int cmd, unsigned long arg);
> +	int (*report_zones)(struct bch_report_zones_args *args,
> +			    unsigned int nr_zones);
>  };
>  
>  struct io {
> @@ -748,6 +751,13 @@ struct bbio {
>  	struct bio		bio;
>  };
>  
> +struct bch_report_zones_args {
> +	struct bcache_device *bcache_device;
> +	sector_t sector;
> +	void *orig_data;
> +	report_zones_cb orig_cb;
> +};
> +
>  #define BTREE_PRIO		USHRT_MAX
>  #define INITIAL_PRIO		32768U
>  
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 71a90fbec314..34f63da2338d 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1233,6 +1233,30 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
>  	if (dc->io_disable)
>  		return -EIO;
>  
> +	/*
> +	 * All zoned device ioctl commands are handled in
> +	 * other code paths,
> +	 * - BLKREPORTZONE: by report_zones method of bcache_ops.
> +	 * - BLKRESETZONE/BLKOPENZONE/BLKCLOSEZONE/BLKFINISHZONE: all handled
> +	 *   by bio code path.
> +	 * - BLKGETZONESZ/BLKGETNRZONES: directly handled inside generic block
> +	 *   ioctl handler blkdev_common_ioctl().
> +	 */
> +	switch (cmd) {
> +	case BLKREPORTZONE:
> +	case BLKRESETZONE:
> +	case BLKGETZONESZ:
> +	case BLKGETNRZONES:
> +	case BLKOPENZONE:
> +	case BLKCLOSEZONE:
> +	case BLKFINISHZONE:
> +		pr_warn("Zoned device ioctl cmd should not be here.\n");
> +		return -EOPNOTSUPP;
> +	default:
> +		/* Other commands  */
> +		break;
> +	}
> +
>  	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
>  }
>  
> @@ -1261,6 +1285,50 @@ static int cached_dev_congested(void *data, int bits)
>  	return ret;
>  }
>  
> +/*
> + * The callback routine to parse a specific zone from all reporting
> + * zones. args->orig_cb() is the upper layer report zones callback,
> + * which should be called after the LBA conversion.
> + * Notice: all zones after zone 0 will be reported, including the
> + * offlined zones; how to handle the different types of zones is
> + * fully decided by the upper layer who calls for reporting zones of
> + * the bcache device.
> + */
> +static int cached_dev_report_zones_cb(struct blk_zone *zone,
> +				      unsigned int idx,
> +				      void *data)

I do not think you need the line break for the last argument.

> +{
> +	struct bch_report_zones_args *args = data;
> +	struct bcache_device *d = args->bcache_device;
> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +	unsigned long data_offset = dc->sb.data_offset;
> +
> +	/* Zone 0 should not be reported */
> +	BUG_ON(zone->start < data_offset);

Wouldn't a WARN_ON_ONCE and return -EIO be better here ?

> +
> +	/* convert back to LBA of the bcache device */
> +	zone->start -= data_offset;
> +	zone->wp -= data_offset;

This has to be done depending on the zone type and zone condition: zone->wp is
"invalid" for conventional zones, and sequential zones that are full, read-only
or offline. So you need something like this:

	/* Remap LBA to the bcache device */
	zone->start -= data_offset;
	switch(zone->cond) {
	case BLK_ZONE_COND_NOT_WP:
	case BLK_ZONE_COND_READONLY:
	case BLK_ZONE_COND_FULL:
	case BLK_ZONE_COND_OFFLINE:
		break;
	case BLK_ZONE_COND_EMPTY:
		zone->wp = zone->start;
		break;
	default:
		zone->wp -= data_offset;
		break;
	}

	return args->orig_cb(zone, idx, args->orig_data);

> +
> +	return args->orig_cb(zone, idx, args->orig_data);
> +}
> +
> +static int cached_dev_report_zones(struct bch_report_zones_args *args,
> +				   unsigned int nr_zones)
> +{
> +	struct bcache_device *d = args->bcache_device;
> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +	/* skip zone 0 which is fully occupied by bcache super block */
> +	sector_t sector = args->sector + dc->sb.data_offset;
> +
> +	/* sector is real LBA of backing device */
> +	return blkdev_report_zones(dc->bdev,
> +				   sector,
> +				   nr_zones,
> +				   cached_dev_report_zones_cb,
> +				   args);

You could have multiple arguments on a couple of lines only here...

> +}
> +
>  void bch_cached_dev_request_init(struct cached_dev *dc)
>  {
>  	struct gendisk *g = dc->disk.disk;
> @@ -1268,6 +1336,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
>  	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
>  	dc->disk.cache_miss			= cached_dev_cache_miss;
>  	dc->disk.ioctl				= cached_dev_ioctl;
> +	dc->disk.report_zones			= cached_dev_report_zones;

Why set this method unconditionally ? Should it be set only for a zoned bcache
device ? E.g.:
	
	if (bdev_is_zoned(bcache bdev))
		dc->disk.report_zones = cached_dev_report_zones;

>  }
>  
>  /* Flash backed devices */
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index d98354fa28e3..d5da7ad5157d 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -679,10 +679,36 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
>  	return d->ioctl(d, mode, cmd, arg);
>  }
>  
> +static int report_zones_dev(struct gendisk *disk,
> +			    sector_t sector,
> +			    unsigned int nr_zones,
> +			    report_zones_cb cb,
> +			    void *data)
> +{
> +	struct bcache_device *d = disk->private_data;
> +	struct bch_report_zones_args args = {
> +		.bcache_device = d,
> +		.sector = sector,
> +		.orig_data = data,
> +		.orig_cb = cb,
> +	};
> +
> +	/*
> +	 * only bcache device with backing device has
> +	 * report_zones method, flash device does not.
> +	 */
> +	if (d->report_zones)
> +		return d->report_zones(&args, nr_zones);
> +
> +	/* flash dev does not have report_zones method */

This comment is confusing. Report zones is called against the bcache device, not
against its components... In any case, if the bcache device is not zoned, the
report_zones method will never be called by the block layer. So you probably
should just check that on entry:

	if (WARN_ON_ONCE(!blk_queue_is_zoned(disk->queue)))
		return -EOPNOTSUPP;

	return d->report_zones(&args, nr_zones);

> +	return -EOPNOTSUPP;
> +}
> +
>  static const struct block_device_operations bcache_ops = {
>  	.open		= open_dev,
>  	.release	= release_dev,
>  	.ioctl		= ioctl_dev,
> +	.report_zones	= report_zones_dev,
>  	.owner		= THIS_MODULE,
>  };

Same here. It may be better to set the report zones method only for a zoned
bcache dev. So you will need an additional block_device_operations struct for
that type.

static const struct block_device_operations bcache_zoned_ops = {
 	.open		= open_dev,
 	.release	= release_dev,
 	.ioctl		= ioctl_dev,
	.report_zones	= report_zones_dev,
 	.owner		= THIS_MODULE,
};
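
(The choice between the two ops structures could then be made where the
gendisk is set up; a sketch, assuming a spot where dc->bdev is already
known:)

	d->disk->fops = bdev_is_zoned(dc->bdev) ?
			&bcache_zoned_ops : &bcache_ops;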

>  
> @@ -817,6 +843,7 @@ static void bcache_device_free(struct bcache_device *d)
>  
>  static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>  			      sector_t sectors, make_request_fn make_request_fn)
> +
>  {
>  	struct request_queue *q;
>  	const size_t max_stripes = min_t(size_t, INT_MAX,
> @@ -1307,6 +1334,67 @@ static void cached_dev_flush(struct closure *cl)
>  	continue_at(cl, cached_dev_free, system_wq);
>  }
>  
> +static inline int cached_dev_data_offset_check(struct cached_dev *dc)
> +{
> +	if (!bdev_is_zoned(dc->bdev))
> +		return 0;
> +
> +	/*
> +	 * If the backing hard drive is zoned device, sb.data_offset
> +	 * should be aligned to zone size, which is automatically
> +	 * handled by 'bcache' util of bcache-tools. If the data_offset
> +	 * is not aligned to zone size, it means the bcache-tools is
> +	 * outdated.
> +	 */
> +	if (dc->sb.data_offset & (bdev_zone_sectors(dc->bdev) - 1)) {
> +		pr_err("data_offset %llu is not aligned to zone size %llu, please update bcache-tools and re-make the zoned backing device.\n",

Long line... Maybe split the pr_err in 2 calls ?

> +			dc->sb.data_offset, bdev_zone_sectors(dc->bdev));
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Initialize zone information for the bcache device, this function
> + * assumes the bcache device has a cached device (dc != NULL), and
> + * the cached device is zoned device (bdev_is_zoned(dc->bdev) == true).
> + *
> + * The following zone information of the bcache device will be set,
> + * - zoned mode, same as the mode of zoned backing device
> + * - zone size in sectors, same as the zoned backing device
> + * - zones number, it is zones number of zoned backing device minus the
> + *   reserved zones for bcache super blocks.
> + */
> +static int bch_cached_dev_zone_init(struct cached_dev *dc)
> +{
> +	struct request_queue *d_q, *b_q;
> +	enum blk_zoned_model mode;

To be clear, maybe call this variable "model" ?

> +
> +	if (!bdev_is_zoned(dc->bdev))
> +		return 0;
> +
> +	/* queue of bcache device */
> +	d_q = dc->disk.disk->queue;
> +	/* queue of backing device */
> +	b_q = bdev_get_queue(dc->bdev);
> +
> +	mode = blk_queue_zoned_model(b_q);
> +	if (mode != BLK_ZONED_NONE) {
> +		d_q->limits.zoned = mode;
> +		blk_queue_chunk_sectors(d_q, b_q->limits.chunk_sectors);
> +		/*
> +		 * (dc->sb.data_offset / q->limits.chunk_sectors) is the
> +		 * zones number reserved for bcache super block. By default
> +		 * it is set to 1 by bcache-tools.
> +		 */
> +		d_q->nr_zones = b_q->nr_zones -
> +			(dc->sb.data_offset / d_q->limits.chunk_sectors);

Does this compile on 32bits arch ? Don't you need a do_div() here ?
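
(For illustration, a 32-bit-safe sketch; do_div() divides its u64 first
argument in place and returns the remainder:)

	u64 reserved_zones = dc->sb.data_offset;

	do_div(reserved_zones, d_q->limits.chunk_sectors);
	d_q->nr_zones = b_q->nr_zones - reserved_zones;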

> +	}
> +
> +	return 0;
> +}
> +
>  static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>  {
>  	int ret;
> @@ -1333,6 +1421,10 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>  
>  	dc->disk.stripe_size = q->limits.io_opt >> 9;
>  
> +	ret = cached_dev_data_offset_check(dc);
> +	if (ret)
> +		return ret;
> +
>  	if (dc->disk.stripe_size)
>  		dc->partial_stripes_expensive =
>  			q->limits.raid_partial_stripes_expensive;
> @@ -1355,7 +1447,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>  
>  	bch_cached_dev_request_init(dc);
>  	bch_cached_dev_writeback_init(dc);
> -	return 0;
> +	ret = bch_cached_dev_zone_init(dc);
> +
> +	return ret;
>  }
>  
>  /* Cached device - bcache superblock */
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device
  2020-05-22 12:18 ` [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device Coly Li
@ 2020-05-25  1:24   ` Damien Le Moal
  0 siblings, 0 replies; 10+ messages in thread
From: Damien Le Moal @ 2020-05-25  1:24 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, Hannes Reinecke, Johannes Thumshirn

On 2020/05/22 21:18, Coly Li wrote:
> Fortunately the zoned device management interface is designed to use
> bio operations, so the bcache code just needs a few small changes to
> handle the zone management bios.
> 
> The bcache driver needs to handle 5 zone management bios for now; all of
> them are handled by cached_dev_nodata() since their bi_size values
> are 0.
> - REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_FINISH:
>     The LBA conversion is already done in cached_dev_make_request(),
>   so simply sending such zone management bios to the backing device is
>   enough.
> - REQ_OP_ZONE_RESET, REQ_OP_ZONE_RESET_ALL:
>     If there is cached data (clean or dirty) on cache device covered
>   by the LBA range of resetting zone(s), such cached data must be

The first part of the sentence is a little unclear... Did you mean:

If the cache device holds data of the target zones, cache invalidation is needed
before forwarding...

>   invalidated from the cache device before forwarding the zone reset
>   bios to the backing device. Otherwise data inconsistency or further
>   corruption may happen from the view of the bcache device.
>     The difference between REQ_OP_ZONE_RESET_ALL and REQ_OP_ZONE_RESET
>   is: when the zone management bio is to reset all zones, the number of
>   zones reported by the bcache device (s->d->disk->queue->nr_zones) is
>   passed into bch_data_invalidate_zones() as the 'size_t nr_zones'
>   parameter; when resetting a single zone, nr_zones is just 1. (A
>   worked example of the key splitting follows this list.)
> 
> By supporting zone managememnt bios with this patch, now a bcache device

s/managememnt/management

> can be formatted by zonefs, and the zones can be reset by truncate -s 0
> on the zone files under seq/ directory. Supporting REQ_OP_ZONE_RESET_ALL
> makes the whole disk zones reset much faster. In my testing on a 14TB
> zoned SMR hard drive, resetting the 51632 seq zones one by one with
> REQ_OP_ZONE_RESET bios takes 4m34s, while sending a single
> REQ_OP_ZONE_RESET_ALL bio takes 12s, which is about 22x faster.
> 
> REQ_OP_ZONE_RESET_ALL is supported by bcache only when the backing
> device supports it. So QUEUE_FLAG_ZONE_RESETALL is set on the bcache
> queue only when the backing device's queue has this flag, which can be
> checked by calling blk_queue_zone_resetall() on the backing device's
> request queue.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> Changelog:
> v2: an improved version without any generic block layer change.
> v1: initial version depends on other generic block layer changes.
> 
>  drivers/md/bcache/request.c | 99 ++++++++++++++++++++++++++++++++++++-
>  drivers/md/bcache/super.c   |  2 +
>  2 files changed, 100 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 34f63da2338d..700b8ab5dec9 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1052,18 +1052,115 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
>  	continue_at(cl, cached_dev_write_complete, NULL);
>  }
>  
> +/*
> + * Invalidate the LBA range on cache device which is covered by
> + * the resetting zones.
> + */
> +static int bch_data_invalidate_zones(struct closure *cl,
> +				      size_t zone_sector,
> +				      size_t nr_zones)

No need for the line break after the second argument.

> +{
> +	struct search *s = container_of(cl, struct search, cl);
> +	struct data_insert_op *iop = &s->iop;
> +	int max_keys, free_keys;
> +	size_t start = zone_sector;
> +	int ret;
> +
> +	max_keys = (block_bytes(iop->c) - sizeof(struct jset)) /
> +		   sizeof(uint64_t);
> +	bch_keylist_init(&iop->insert_keys);
> +	ret = bch_keylist_realloc(&iop->insert_keys, max_keys, iop->c);
> +	if (ret < 0)
> +		return -ENOMEM;
> +
> +	while (nr_zones-- > 0) {
> +		atomic_t *journal_ref = NULL;
> +		size_t size = s->d->disk->queue->limits.chunk_sectors;
> +more_keys:
> +		bch_keylist_reset(&iop->insert_keys);
> +		free_keys = max_keys;
> +
> +		while (size > 0 && free_keys >= 2) {
> +			size_t sectors = min_t(size_t, size,
> +					       1U << (KEY_SIZE_BITS - 1));
> +
> +			bch_keylist_add(&iop->insert_keys,
> +					&KEY(iop->inode, start, sectors));
> +			start += sectors;
> +			size -= sectors;
> +			free_keys -= 2;
> +		}
> +
> +		/* Insert invalidate keys into btree */
> +		journal_ref = bch_journal(iop->c, &iop->insert_keys, NULL);
> +		if (!journal_ref) {
> +			ret = -EIO;
> +			break;
> +		}
> +
> +		ret = bch_btree_insert(iop->c,
> +				&iop->insert_keys, journal_ref, NULL);
> +		atomic_dec_bug(journal_ref);
> +		if (ret < 0)
> +			break;
> +
> +		if (size)
> +			goto more_keys;
> +	}
> +
> +	bch_keylist_free(&iop->insert_keys);
> +
> +	return ret;
> +}
> +
>  static void cached_dev_nodata(struct closure *cl)
>  {
>  	struct search *s = container_of(cl, struct search, cl);
>  	struct bio *bio = &s->bio.bio;
> +	int nr_zones = 1;
>  
>  	if (s->iop.flush_journal)
>  		bch_journal_meta(s->iop.c, cl);
>  
> -	/* If it's a flush, we send the flush to the backing device too */
> +	if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL ||
> +	    bio_op(bio) == REQ_OP_ZONE_RESET) {
> +		int err = 0;
> +		/*
> +		 * If this is REQ_OP_ZONE_RESET_ALL bio, cached data
> +		 * covered by all zones should be invalidate from the

s/invalidate/invalidated

> +		 * cache device.
> +		 */
> +		if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL)
> +			nr_zones = s->d->disk->queue->nr_zones;

Note: sending a REQ_OP_ZONE_RESET BIO to a conventional zone will be failed by
the disk... This is not allowed by the ZBC/ZAC specs. So invalidating the cache
for conventional zones is not really necessary. But as an initial support, I
think this is fine. This can be optimized later.

> +
> +		err = bch_data_invalidate_zones(
> +			cl, bio->bi_iter.bi_sector, nr_zones);
> +
> +		if (err < 0) {
> +			s->iop.status = errno_to_blk_status(err);
> +			/* debug, should be removed before post patch */
> +			bio->bi_status = BLK_STS_TIMEOUT;

You did not remove it :)

> +			/* set by bio_cnt_set() in do_bio_hook() */
> +			bio_put(bio);
> +			/*
> +			 * Invalidate cached data fails, don't send
> +			 * the zone reset bio to backing device and
> +			 * return failure. Otherwise potential data
> +			 * corruption on bcache device may happen.
> +			 */
> +			goto continue_bio_complete;
> +		}
> +
> +	}
> +
> +	/*
> +	 * For flush or zone management bios, of course
> +	 * they should be sent to the backing device too.
> +	 */
>  	bio->bi_end_io = backing_request_endio;
>  	closure_bio_submit(s->iop.c, bio, cl);

You cannot submit a REQ_OP_ZONE_RESET_ALL to the backing dev here, at least not
unconditionally. The problem is that if the backing dev does not have any
conventional zones at its LBA 0, REQ_OP_ZONE_RESET_ALL will really reset all
zones, including the first zone of the device that contains the bcache super
block. So you will lose/destroy the bcache setup. You probably did not notice this
because your test drive has conventional zones at LBA 0 and reset all does not
have any effect on conventional zones...

The easy way to deal with this is to not set the QUEUE_FLAG_ZONE_RESETALL flag
for the bcache device queue. If it is not set, the block layer will
automatically issue only single zone REQ_OP_ZONE_RESET BIOs. That is slower,
yes, but that cannot be avoided when the backend disk does not have any
conventional zones. The QUEUE_FLAG_ZONE_RESETALL flag can be kept if the
first zone of the backend disk, which contains the bcache super block, is a
conventional zone.
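
(A sketch of that conditional flag setting; first_zone_is_conventional
is a hypothetical flag that would have to be determined via a report
zones call at init time:)

	if (blk_queue_zone_resetall(b_q) && first_zone_is_conventional)
		blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, d_q);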

>  
> +continue_bio_complete:
>  	continue_at(cl, cached_dev_bio_complete, NULL);
>  }
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index d5da7ad5157d..70c31950ec1b 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1390,6 +1390,8 @@ static int bch_cached_dev_zone_init(struct cached_dev *dc)
>  		 */
>  		d_q->nr_zones = b_q->nr_zones -
>  			(dc->sb.data_offset / d_q->limits.chunk_sectors);
> +		if (blk_queue_zone_resetall(b_q))
> +			blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, d_q);

See above comment.

>  	}
>  
>  	return 0;
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device
  2020-05-22 12:18 ` [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device Coly Li
@ 2020-05-25  1:26   ` Damien Le Moal
  0 siblings, 0 replies; 10+ messages in thread
From: Damien Le Moal @ 2020-05-25  1:26 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, Hannes Reinecke, Johannes Thumshirn

On 2020/05/22 21:19, Coly Li wrote:
> Currently we don't support writeback mode for a zoned device as the
> backing device, so reject it via the sysfs interface.
> 
> This rejection will be removed once writeback cache mode support for
> zoned devices is done.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  drivers/md/bcache/sysfs.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 323276994aab..41bdbc42a17d 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -359,6 +359,11 @@ STORE(__cached_dev)
>  		if (v < 0)
>  			return v;
>  
> +		if (bdev_is_zoned(dc->bdev) &&
> +		    (unsigned int) v == CACHE_MODE_WRITEBACK) {
> +			pr_err("writeback mode is not supported for zoned backing device.\n");
> +			return -ENOTSUPP;
> +		}
>  		if ((unsigned int) v != BDEV_CACHE_MODE(&dc->sb)) {
>  			SET_BDEV_CACHE_MODE(&dc->sb, v);
>  			bch_write_bdev_super(dc, NULL);
> 

Do you have a similar check in bcache user tools at format time ? Or is the
cache mode specified only when the bcache device is started ?

Looks good.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>

-- 
Damien Le Moal
Western Digital Research


* Re: [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device
  2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
                   ` (2 preceding siblings ...)
  2020-05-22 12:18 ` [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device Coly Li
@ 2020-05-25  5:25 ` Damien Le Moal
  2020-05-25  8:14   ` Coly Li
  3 siblings, 1 reply; 10+ messages in thread
From: Damien Le Moal @ 2020-05-25  5:25 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 2020/05/22 21:19, Coly Li wrote:
> Hi folks,
> 
> [snip]

Coly,

One more thing: your patch series lacks support for REQ_OP_ZONE_APPEND. It would
be great to add that. As is, since you do not set the max_zone_append_sectors
queue limit for the bcache device, that command will not be issued by the block
layer. But zonefs (and btrfs) will use zone append (support for zonefs is
already queued for 5.8, btrfs will come later).

If bcache writethrough policy results in a data write to be issued to both the
backend device and the cache device, then some special code will be needed:
these 2 BIOs will need to be serialized since the actual write location of a
zone append command is known only on completion of the command. That is, the
zone append BIO needs to be issued to the backend device first, then to the
cache SSD device as a regular write once the zone append completes and its write
location is known.
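
(A rough sketch of that ordering, assuming a completion handler on the
zone append bio; bch_cache_insert_at() is a hypothetical stand-in for
the bcache cache insertion path:)

	static void zone_append_endio(struct bio *bio)
	{
		/*
		 * On completion of a REQ_OP_ZONE_APPEND bio, the block
		 * layer returns the actual write location in bi_sector.
		 */
		sector_t written = bio->bi_iter.bi_sector;

		/* only now can the cache device copy be written */
		if (!bio->bi_status)
			bch_cache_insert_at(bio, written);
	}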




-- 
Damien Le Moal
Western Digital Research


* Re: [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device
  2020-05-25  5:25 ` [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Damien Le Moal
@ 2020-05-25  8:14   ` Coly Li
  0 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2020-05-25  8:14 UTC (permalink / raw)
  To: Damien Le Moal, linux-bcache; +Cc: linux-block

On 2020/5/25 13:25, Damien Le Moal wrote:
> On 2020/05/22 21:19, Coly Li wrote:
>> Hi folks,
>>
>> [snip]
> 
> Coly,
> 
> One more thing: your patch series lacks support for REQ_OP_ZONE_APPEND. It would
> be great to add that. As is, since you do not set the max_zone_append_sectors
> queue limit for the bcache device, that command will not be issued by the block
> layer. But zonefs (and btrfs) will use zone append (support for zonefs is
> queued already in 5.8; btrfs will come later).

Hi Damien,

Thank you for the suggestion. I will work on it now and post it in the
next version.
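
I guess the minimal wiring is to propagate the backing device's limit
when initializing the zone attributes, e.g. (an untested sketch; d_q and
b_q are the bcache and backing request queues as in patch 1):

        /*
         * Untested sketch: without this limit the block layer will
         * never issue REQ_OP_ZONE_APPEND bios to the bcache device.
         */
        blk_queue_max_zone_append_sectors(d_q,
                        queue_max_zone_append_sectors(b_q));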

> 
> If bcache writethrough policy results in a data write to be issued to both the
> backend device and the cache device, then some special code will be needed:
> these 2 BIOs will need to be serialized since the actual write location of a
> zone append command is known only on completion of the command. That is, the
> zone append BIO needs to be issued to the backend device first, then to the
> cache SSD device as a regular write once the zone append completes and its write
> location is known.
> 

Copied. It should be OK for bcache: in writethrough mode the data is
inserted into the SSD only after the bio to the backing device has
completed, so the ordering could look like the sketch below.
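
A rough, untested sketch of that serialization for zone append (struct
bch_append_ctx and bch_cache_insert() are hypothetical names here, not
existing bcache symbols):

        struct bch_append_ctx {
                struct bio *orig_bio;   /* bio from the upper layer */
                struct cached_dev *dc;
        };

        static void bch_zone_append_endio(struct bio *bio)
        {
                struct bch_append_ctx *ctx = bio->bi_private;

                /*
                 * On completion of REQ_OP_ZONE_APPEND the block layer
                 * reports the actual write location in
                 * bio->bi_iter.bi_sector, so the regular write to the
                 * cache device can only start here, after the append
                 * to the backing device has finished.  In real code
                 * the cache insert would likely be deferred to a
                 * workqueue, since bi_end_io may run in interrupt
                 * context.
                 */
                if (!bio->bi_status)
                        bch_cache_insert(ctx->dc, ctx->orig_bio,
                                         bio->bi_iter.bi_sector);

                ctx->orig_bio->bi_status = bio->bi_status;
                bio_endio(ctx->orig_bio);
                bio_put(bio);
                kfree(ctx);
        }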

Thank you for all your comments. I will start working on them now and
reply to each of them (maybe with more questions) later.

Coly Li



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
  2020-05-25  1:10   ` Damien Le Moal
@ 2020-06-01 12:34     ` Coly Li
  0 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2020-06-01 12:34 UTC (permalink / raw)
  To: Damien Le Moal, linux-bcache
  Cc: linux-block, Hannes Reinecke, Johannes Thumshirn

On 2020/5/25 09:10, Damien Le Moal wrote:
> On 2020/05/22 21:18, Coly Li wrote:
[snip]
>> zone files, 523 of them are for convential zones (1 zone is reserved
> 
> s/convential/conventional
> 
>> for bcache super block and not reported), 51632 of them are for
>> sequential zones.
>>
>> First run to read first 4KB from all the zone files with 50 processes,
>> it takes 5 minutes 55 seconds. Second run takes 12 seconds only because
>> all the read requests hit on cache device.
> 

Hi Damien,

> Did you write anything first to the bcache device ? Otherwise, all zonefs files
> will be empty and there is not going to be any file access... Question though:
> when writing to a bcache device with writethrough mode, does the data go to the
> SSD cache too ? Or is it written only to the backend device ?
> 

Yes, I did. For all the conventional (random-access) and sequential zone
files, I wrote 8KB at the head of each and read back 4KB from offset 0
of each zone file.

In my test case, all data first goes to the backing device; only after
that write completes is it inserted into the SSD. This ordering is
guaranteed by the bcache code.

>>
>> 29 times faster is as expected for an ideal case when all READ I/Os hit
>> on NVMe cache device.
>>
>> Besides providing report_zones method of the bcache gendisk structure,
>> this patch also initializes the following zoned device attribution for
>> the bcache device,
>> - zones number: the total zones number minus reserved zone(s) for bcache
> 
> s/zones number/number of zones
> 
>>   super block.
>> - zone size: same size as reported from the underlying zoned device
>> - zone mode: same mode as reported from the underlying zoned device
> 
> s/zone mode/zoned model
> 
>> Currently blk_revalidate_disk_zones() does not accept non-mq drivers, so
>> all the above attribution are initialized mannally in bcache code.
> 
> s/mannally/manually
> 
>>
>> This patch just provides the report_zones method only. Handling all zone
>> management commands will be addressed in following patches.
>>

The above typos will be fixed in the next version.


>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Damien Le Moal <damien.lemoal@wdc.com>
>> Cc: Hannes Reinecke <hare@suse.com>
>> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>> Changelog:
>> v4: the version without any generic block layer change.
>> v3: the version depends on other generic block layer changes.
>> v2: an improved version for comments.
>> v1: initial version.
>>  drivers/md/bcache/bcache.h  | 10 ++++
>>  drivers/md/bcache/request.c | 69 ++++++++++++++++++++++++++
>>  drivers/md/bcache/super.c   | 96 ++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 174 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index 74a9849ea164..0d298b48707f 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h

[snip]
>>  
>> +/*
>> + * The callback routine to parse a specific zone from all reporting
>> + * zones. args->orig_cb() is the upper layer report zones callback,
>> + * which should be called after the LBA conversion.
>> + * Notice: all zones after zone 0 will be reported, including the
>> + * offlined zones; how to handle the different types of zones is
>> + * fully decided by the upper layer that calls for reporting zones of
>> + * the bcache device.
>> + */
>> +static int cached_dev_report_zones_cb(struct blk_zone *zone,
>> +				      unsigned int idx,
>> +				      void *data)
> 
> I do not think you need the line break for the last argument.
> 

OK, let me change the style.

>> +{
>> +	struct bch_report_zones_args *args = data;
>> +	struct bcache_device *d = args->bcache_device;
>> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>> +	unsigned long data_offset = dc->sb.data_offset;
>> +
>> +	/* Zone 0 should not be reported */
>> +	BUG_ON(zone->start < data_offset);
> 
> Wouldn't a WARN_ON_ONCE and return -EIO be better here ?

I will do it in the next version.


> 
>> +
>> +	/* convert back to LBA of the bcache device*/
>> +	zone->start -= data_offset;
>> +	zone->wp -= data_offset;
> 
> This has to be done depending on the zone type and zone condition: zone->wp is
> "invalid" for conventional zones, and sequential zones that are full, read-only
> or offline. So you need something like this:
> 
> 	/* Remap LBA to the bcache device */
> 	zone->start -= data_offset;
> 	switch(zone->cond) {
> 	case BLK_ZONE_COND_NOT_WP:
> 	case BLK_ZONE_COND_READONLY:
> 	case BLK_ZONE_COND_FULL:
> 	case BLK_ZONE_COND_OFFLINE:
> 		break;
> 	case BLK_ZONE_COND_EMPTY:
> 		zone->wp = zone->start;
> 		break;
> 	default:
> 		zone->wp -= data_offset;
> 		break;
> 	}
> 
> 	return args->orig_cb(zone, idx, args->orig_data);
> 

Hmm, do we have a unified spec for handling the wp under different zone
conditions?

In zonefs I see zone->wp set to zone->start for:
- BLK_ZONE_COND_OFFLINE
- BLK_ZONE_COND_READONLY

In sd_zbc.c, I see wp set to the end of the zone for:
- BLK_ZONE_COND_FULL

And in dm.c I see zone->wp set to zone->start for:
- BLK_ZONE_COND_EMPTY

It seems that, except for BLK_ZONE_COND_NOT_WP, the write pointer for
the other conditions should already be taken care of by the device
driver.

So, inspired by your comments, I wrote the switch-case in the following
way:

        /* Zone 0 should not be reported */
        if (WARN_ON_ONCE(zone->start < data_offset))
                return -EIO;

        /* convert back to LBA of the bcache device */
        zone->start -= data_offset;
        if (zone->cond != BLK_ZONE_COND_NOT_WP)
                zone->wp -= data_offset;

        switch (zone->cond) {
        case BLK_ZONE_COND_NOT_WP:
                /* No write pointer available */
                break;
        case BLK_ZONE_COND_EMPTY:
        case BLK_ZONE_COND_READONLY:
        case BLK_ZONE_COND_OFFLINE:
                /*
                 * If the underlying device driver does not properly
                 * set the write pointer, warn and fix it.
                 */
                if (WARN_ON_ONCE(zone->wp != zone->start))
                        zone->wp = zone->start;
                break;
        case BLK_ZONE_COND_FULL:
                /*
                 * If the underlying device driver does not properly
                 * set the write pointer, warn and fix it.
                 */
                if (WARN_ON_ONCE(zone->wp != zone->start + zone->len))
                        zone->wp = zone->start + zone->len;
                break;
        default:
                break;
        }

        return args->orig_cb(zone, idx, args->orig_data);

The above code converts zone->wp by subtracting data_offset for every
condition other than BLK_ZONE_COND_NOT_WP. For the other conditions that
need it, I just check whether the underlying device driver updated the
write pointer properly (as I observed in other drivers); if not, I emit
a warning and fix the write pointer to the expected value.

Now the write pointer is only fixed when it held a wrong value: if the
underlying device driver did not maintain the value properly, we detect
that and fix it.

>> +
>> +	return args->orig_cb(zone, idx, args->orig_data);
>> +}
>> +
>> +static int cached_dev_report_zones(struct bch_report_zones_args *args,
>> +				   unsigned int nr_zones)
>> +{
>> +	struct bcache_device *d = args->bcache_device;
>> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>> +	/* skip zone 0 which is fully occupied by bcache super block */
>> +	sector_t sector = args->sector + dc->sb.data_offset;
>> +
>> +	/* sector is real LBA of backing device */
>> +	return blkdev_report_zones(dc->bdev,
>> +				   sector,
>> +				   nr_zones,
>> +				   cached_dev_report_zones_cb,
>> +				   args);
> 
> You could have multiple arguments on a couple of lines only here...
> 

Sure, I will change the style in the next version.

>> +}
>> +
>>  void bch_cached_dev_request_init(struct cached_dev *dc)
>>  {
>>  	struct gendisk *g = dc->disk.disk;
>> @@ -1268,6 +1336,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
>>  	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
>>  	dc->disk.cache_miss			= cached_dev_cache_miss;
>>  	dc->disk.ioctl				= cached_dev_ioctl;
>> +	dc->disk.report_zones			= cached_dev_report_zones;
> 
> Why set this method unconditionally ? Should it be set only for a zoned bcache
> device ? E.g.:
> 	
> 	if (bdev_is_zoned(bcache bdev))
> 		dc->disk.report_zones = cached_dev_report_zones;
> 

Will fix it in the next version. Thanks.


>>  }
>>  
>>  /* Flash backed devices */
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index d98354fa28e3..d5da7ad5157d 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -679,10 +679,36 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
>>  	return d->ioctl(d, mode, cmd, arg);
>>  }
>>  
>> +static int report_zones_dev(struct gendisk *disk,
>> +			    sector_t sector,
>> +			    unsigned int nr_zones,
>> +			    report_zones_cb cb,
>> +			    void *data)
>> +{
>> +	struct bcache_device *d = disk->private_data;
>> +	struct bch_report_zones_args args = {
>> +		.bcache_device = d,
>> +		.sector = sector,
>> +		.orig_data = data,
>> +		.orig_cb = cb,
>> +	};
>> +
>> +	/*
>> +	 * only bcache device with backing device has
>> +	 * report_zones method, flash device does not.
>> +	 */
>> +	if (d->report_zones)
>> +		return d->report_zones(&args, nr_zones);
>> +
>> +	/* flash dev does not have report_zones method */
> 
> This comment is confusing. Report zones is called against the bcache device, not
> against its components... In any case, if the bcache device is not zoned, the

My fault. There are two kinds of bcache device for now: one with a
backing device and one without. For a bcache device without a backing
device, all I/Os go to the cache device (SSD); such a bcache device is
called a flash device or flash-only volume. Yes, this is confusing, and
I need to fix it.


> report_zones method will never be called by the block layer. So you probably
> should just check that on entry:
> 
> 	if (WARN_ON_ONCE(!blk_queue_is_zoned(disk->queue))
> 		return -EOPNOTSUPP;
> 
> 	return d->report_zones(&args, nr_zones);
> 
>> +	return -EOPNOTSUPP;
>> +}
>> +

Good point, I will use it in the next version.


>>  static const struct block_device_operations bcache_ops = {
>>  	.open		= open_dev,
>>  	.release	= release_dev,
>>  	.ioctl		= ioctl_dev,
>> +	.report_zones	= report_zones_dev,
>>  	.owner		= THIS_MODULE,
>>  };
> 
> Same here. It may be better to set the report zones method only for a zoned
> bcache dev. So you will need an additional block_device_operations struct for
> that type.
> 
> static const struct block_device_operations bcache_zoned_ops = {
>  	.open		= open_dev,
>  	.release	= release_dev,
>  	.ioctl		= ioctl_dev,
> 	.report_zones	= report_zones_dev,
>  	.owner		= THIS_MODULE,
> };
> 

I see your point. But the purpose of this coding style is to avoid an
extra block_device_operations, just as ioctl_dev() does. Let's keep the
existing style for the moment, and maybe change it in the future when
necessary.


[snipped]
>>  
>> +static inline int cached_dev_data_offset_check(struct cached_dev *dc)
>> +{
>> +	if (!bdev_is_zoned(dc->bdev))
>> +		return 0;
>> +
>> +	/*
>> +	 * If the backing hard drive is zoned device, sb.data_offset
>> +	 * should be aligned to zone size, which is automatically
>> +	 * handled by 'bcache' util of bcache-tools. If the data_offset
>> +	 * is not aligned to zone size, it means the bcache-tools is
>> +	 * outdated.
>> +	 */
>> +	if (dc->sb.data_offset & (bdev_zone_sectors(dc->bdev) - 1)) {
>> +		pr_err("data_offset %llu is not aligned to zone size %llu, please update bcache-tools and re-make the zoned backing device.\n",
> 
> Long line... May be split the pr_err in 2 calls ?

This is what I learned from the md code style, which is to not split an
error message across multiple lines. See the following commit:
commit 9d48739ef19aa8ad6026fd312b3ed81b560c8579
Author: NeilBrown <neilb@suse.com>
Date:   Wed Nov 2 14:16:49 2016 +1100

    md: change all printk() to pr_err() or pr_warn() etc.

This is just a code-style choice, and I'd like to keep the single-line
message for the following cited reason:

    3/ When strings have been split onto multiple lines, rejoin into
       a single string.
       The cost of having lines > 80 chars is less than the cost of not
       being able to easily search for a particular message.


> 
>> +			dc->sb.data_offset, bdev_zone_sectors(dc->bdev));
>> +		return -EINVAL;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
[snipped]

>> + */
>> +static int bch_cached_dev_zone_init(struct cached_dev *dc)
>> +{
>> +	struct request_queue *d_q, *b_q;
>> +	enum blk_zoned_model mode;
> 
> To be clear, may be call this variable "model" ?

Sure, it will be fixed in the next version.


[snipped]
>> +	if (mode != BLK_ZONED_NONE) {
>> +		d_q->limits.zoned = mode;
>> +		blk_queue_chunk_sectors(d_q, b_q->limits.chunk_sectors);
>> +		/*
>> +		 * (dc->sb.data_offset / q->limits.chunk_sectors) is the
>> +		 * zones number reserved for bcache super block. By default
>> +		 * it is set to 1 by bcache-tools.
>> +		 */
>> +		d_q->nr_zones = b_q->nr_zones -
>> +			(dc->sb.data_offset / d_q->limits.chunk_sectors);
> 
> Does this compile on 32bits arch ? Don't you need a do_div() here ?
> 

Yes, I will fix it in the next version, probably with div_u64() as
sketched below.
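
A minimal sketch (untested), using div_u64() from linux/math64.h to
avoid the open-coded 64-bit division on 32-bit architectures:

        /* sb.data_offset is u64; chunk_sectors is a 32-bit divisor */
        d_q->nr_zones = b_q->nr_zones -
                div_u64(dc->sb.data_offset, d_q->limits.chunk_sectors);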

[snipped]

Thank you for the very detailed review comments :-)

Coly Li




^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, back to index

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
2020-05-25  1:10   ` Damien Le Moal
2020-06-01 12:34     ` Coly Li
2020-05-22 12:18 ` [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device Coly Li
2020-05-25  1:24   ` Damien Le Moal
2020-05-22 12:18 ` [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device Coly Li
2020-05-25  1:26   ` Damien Le Moal
2020-05-25  5:25 ` [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Damien Le Moal
2020-05-25  8:14   ` Coly Li

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org
	public-inbox-index linux-block

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git