linux-block.vger.kernel.org archive mirror
* [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device
@ 2020-05-22 12:18 Coly Li
  2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Coly Li @ 2020-05-22 12:18 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Hi folks,

With this series, bcache can now support a zoned device (e.g. host managed
SMR hard drive) as the backing device. Currently writeback mode is not
supported yet, which is on the to-do list (it requires an on-disk super
block format change).

The first patch makes bcache export the zone information to upper
layer code, for example for formatting zonefs on top of the bcache device.
By default, zone 0 of the zoned device is fully reserved for the bcache
super block, therefore the reported number of zones is 1 less than the
exact number of zones of the physical SMR hard drive.

The second patch handles zone management commands for bcache. Indeed
these zone management commands are wrapped as zone management bios.
For REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL zone management bios,
before forwarding the bio to the backing device, all cached data covered
by the resetting zone(s) must be invalidated to keep data consistency.
For the rest of the zone management bios, bi_sector only needs to be
remapped by data_offset before the bio is forwarded to the zoned
backing device.
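
As a reminder of the conversion direction, this is roughly the remapping
already done by the existing cached_dev_make_request() path (sketch only,
no new code in this series):

	/* bcache device LBA -> backing device LBA (zone 0 is reserved) */
	bio->bi_iter.bi_sector += dc->sb.data_offset;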

The third patch makes sure that after the bcache device starts, the cache
mode cannot be changed to writeback via the sysfs interface. bcache-tools
is modified to notify users and convert writeback mode to the default
writethrough mode when making a bcache device.

There is one thing not addressed by this series, which is re-writing the
bcache super block after a REQ_OP_ZONE_RESET_ALL command. Devices with
only sequential zones may appear quite soon, but it is OK to make bcache
support such all-seq-zones devices a bit later.

Now a bcache device created with a zoned SMR drive can pass these test
cases,
- read /sys/block/bcache0/queue/zoned, content is 'host-managed'
- read /sys/block/bcache0/queue/nr_zones, content is number of zones
  excluding zone 0 of the backing device (reserved for bcache super
  block).
- read /sys/block/bcache0/queue/chunk_sectors, content is zone size
  in sectors.
- run 'blkzone report /dev/bcache0', all zones information displayed.
- run 'blkzone reset -o <zone LBA> -c <zones number> /dev/bcache0',
  conventional zones will reject the command, sequential zones covered
  by the command range will have their write pointers reset to the start
  LBA of their zones. If <zone LBA> is 0 and <zones number> covers all zones,
  REQ_OP_ZONE_RESET_ALL command will be received and handled by bcache
  device properly.
- zonefs can be created on top of the bcache device, with/without cache
  device attached. All sequential direct writes and random reads work well
  and zone reset by 'truncate -s 0 <zone file>' works too.
- Writeback cache mode is not supported yet.

Now all previous code review comments are addressed by this RFC version.
Please don't hesitate to offer your opinion on this version.

Thanks in advance for your help.

Coly Li
---
Changelog:
v4: another improved version without any other generic block change.
v3: an improved version depends on other generic block layer changes.
v2: the first RFC version for comments and review.
v1: the initial version posted just for information.


Coly Li (3):
  bcache: export bcache zone information for zoned backing device
  bcache: handle zone management bios for bcache device
  bcache: reject writeback cache mode for zoned backing device

 drivers/md/bcache/bcache.h  |  10 +++
 drivers/md/bcache/request.c | 168 +++++++++++++++++++++++++++++++++++-
 drivers/md/bcache/super.c   |  98 ++++++++++++++++++++-
 drivers/md/bcache/sysfs.c   |   5 ++
 4 files changed, 279 insertions(+), 2 deletions(-)

-- 
2.25.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
  2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
@ 2020-05-22 12:18 ` Coly Li
  2020-05-25  1:10   ` Damien Le Moal
  2020-05-22 12:18 ` [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device Coly Li
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Coly Li @ 2020-05-22 12:18 UTC (permalink / raw)
  To: linux-bcache
  Cc: linux-block, Coly Li, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn

When using a zoned device, e.g. an SMR hard drive, as the backing device,
if bcache can export the zoned device information then it is possible
to help the upper layer code to accelerate hot READ I/O requests.

This patch adds the report_zones method for the bcache device which has
a zoned device as its backing device. Now such a bcache device can be
treated as a zoned device; configured in writethrough, writearound or
none mode, zonefs can be formatted on top of it.

Here is some simple performance data for read requests via zonefs on top of
bcache. The cache device of bcache is a 4TB NVMe SSD, and the backing device
is a 14TB host managed SMR hard drive. The formatted zonefs has 52155
zone files, 523 of them are for convential zones (1 zone is reserved
for bcache super block and not reported), 51632 of them are for
sequential zones.

The first run, reading the first 4KB from all the zone files with 50
processes, takes 5 minutes 55 seconds. The second run takes only 12
seconds because all the read requests hit on the cache device.

Being 29 times faster is as expected for an ideal case when all READ I/Os
hit on the NVMe cache device.

Besides providing report_zones method of the bcache gendisk structure,
this patch also initializes the following zoned device attribution for
the bcache device,
- zones number: the total zones number minus reserved zone(s) for bcache
  super block.
- zone size: same size as reported from the underlying zoned device
- zone mode: same mode as reported from the underlying zoned device
Currently blk_revalidate_disk_zones() does not accept non-mq drivers, so
all the above attribution are initialized mannally in bcache code.

This patch provides the report_zones method only. Handling all zone
management commands will be addressed in the following patches.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
Changelog:
v4: the version without any generic block layer change.
v3: the version depends on other generic block layer changes.
v2: an improved version for comments.
v1: initial version.
 drivers/md/bcache/bcache.h  | 10 ++++
 drivers/md/bcache/request.c | 69 ++++++++++++++++++++++++++
 drivers/md/bcache/super.c   | 96 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 174 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 74a9849ea164..0d298b48707f 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
 struct search;
 struct btree;
 struct keybuf;
+struct bch_report_zones_args;
 
 struct keybuf_key {
 	struct rb_node		node;
@@ -277,6 +278,8 @@ struct bcache_device {
 			  struct bio *bio, unsigned int sectors);
 	int (*ioctl)(struct bcache_device *d, fmode_t mode,
 		     unsigned int cmd, unsigned long arg);
+	int (*report_zones)(struct bch_report_zones_args *args,
+			    unsigned int nr_zones);
 };
 
 struct io {
@@ -748,6 +751,13 @@ struct bbio {
 	struct bio		bio;
 };
 
+struct bch_report_zones_args {
+	struct bcache_device *bcache_device;
+	sector_t sector;
+	void *orig_data;
+	report_zones_cb orig_cb;
+};
+
 #define BTREE_PRIO		USHRT_MAX
 #define INITIAL_PRIO		32768U
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 71a90fbec314..34f63da2338d 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1233,6 +1233,30 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
 	if (dc->io_disable)
 		return -EIO;
 
+	/*
+	 * All zoned device ioctl commands are handled in
+	 * other code paths,
+	 * - BLKREPORTZONE: by report_zones method of bcache_ops.
+	 * - BLKRESETZONE/BLKOPENZONE/BLKCLOSEZONE/BLKFINISHZONE: all handled
+	 *   by bio code path.
+	 * - BLKGETZONESZ/BLKGETNRZONES:directly handled inside generic block
+	 *   ioctl handler blkdev_common_ioctl().
+	 */
+	switch (cmd) {
+	case BLKREPORTZONE:
+	case BLKRESETZONE:
+	case BLKGETZONESZ:
+	case BLKGETNRZONES:
+	case BLKOPENZONE:
+	case BLKCLOSEZONE:
+	case BLKFINISHZONE:
+		pr_warn("Zoned device ioctl cmd should not be here.\n");
+		return -EOPNOTSUPP;
+	default:
+		/* Other commands  */
+		break;
+	}
+
 	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
 }
 
@@ -1261,6 +1285,50 @@ static int cached_dev_congested(void *data, int bits)
 	return ret;
 }
 
+/*
+ * The callback routine to parse a specific zone from all reporting
+ * zones. args->orig_cb() is the upper layer report zones callback,
+ * which should be called after the LBA conversion.
+ * Notice: all zones after zone 0 will be reported, including the
+ * offlined zones, how to handle the different types of zones are
+ * fully decided by upper layer who calss for reporting zones of
+ * the bcache device.
+ */
+static int cached_dev_report_zones_cb(struct blk_zone *zone,
+				      unsigned int idx,
+				      void *data)
+{
+	struct bch_report_zones_args *args = data;
+	struct bcache_device *d = args->bcache_device;
+	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+	unsigned long data_offset = dc->sb.data_offset;
+
+	/* Zone 0 should not be reported */
+	BUG_ON(zone->start < data_offset);
+
+	/* convert back to LBA of the bcache device*/
+	zone->start -= data_offset;
+	zone->wp -= data_offset;
+
+	return args->orig_cb(zone, idx, args->orig_data);
+}
+
+static int cached_dev_report_zones(struct bch_report_zones_args *args,
+				   unsigned int nr_zones)
+{
+	struct bcache_device *d = args->bcache_device;
+	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+	/* skip zone 0 which is fully occupied by bcache super block */
+	sector_t sector = args->sector + dc->sb.data_offset;
+
+	/* sector is real LBA of backing device */
+	return blkdev_report_zones(dc->bdev,
+				   sector,
+				   nr_zones,
+				   cached_dev_report_zones_cb,
+				   args);
+}
+
 void bch_cached_dev_request_init(struct cached_dev *dc)
 {
 	struct gendisk *g = dc->disk.disk;
@@ -1268,6 +1336,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
 	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
 	dc->disk.cache_miss			= cached_dev_cache_miss;
 	dc->disk.ioctl				= cached_dev_ioctl;
+	dc->disk.report_zones			= cached_dev_report_zones;
 }
 
 /* Flash backed devices */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index d98354fa28e3..d5da7ad5157d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -679,10 +679,36 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
 	return d->ioctl(d, mode, cmd, arg);
 }
 
+static int report_zones_dev(struct gendisk *disk,
+			    sector_t sector,
+			    unsigned int nr_zones,
+			    report_zones_cb cb,
+			    void *data)
+{
+	struct bcache_device *d = disk->private_data;
+	struct bch_report_zones_args args = {
+		.bcache_device = d,
+		.sector = sector,
+		.orig_data = data,
+		.orig_cb = cb,
+	};
+
+	/*
+	 * only bcache device with backing device has
+	 * report_zones method, flash device does not.
+	 */
+	if (d->report_zones)
+		return d->report_zones(&args, nr_zones);
+
+	/* flash dev does not have report_zones method */
+	return -EOPNOTSUPP;
+}
+
 static const struct block_device_operations bcache_ops = {
 	.open		= open_dev,
 	.release	= release_dev,
 	.ioctl		= ioctl_dev,
+	.report_zones	= report_zones_dev,
 	.owner		= THIS_MODULE,
 };
 
@@ -817,6 +843,7 @@ static void bcache_device_free(struct bcache_device *d)
 
 static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
 			      sector_t sectors, make_request_fn make_request_fn)
+
 {
 	struct request_queue *q;
 	const size_t max_stripes = min_t(size_t, INT_MAX,
@@ -1307,6 +1334,67 @@ static void cached_dev_flush(struct closure *cl)
 	continue_at(cl, cached_dev_free, system_wq);
 }
 
+static inline int cached_dev_data_offset_check(struct cached_dev *dc)
+{
+	if (!bdev_is_zoned(dc->bdev))
+		return 0;
+
+	/*
+	 * If the backing hard drive is zoned device, sb.data_offset
+	 * should be aligned to zone size, which is automatically
+	 * handled by 'bcache' util of bcache-tools. If the data_offset
+	 * is not aligned to zone size, it means the bcache-tools is
+	 * outdated.
+	 */
+	if (dc->sb.data_offset & (bdev_zone_sectors(dc->bdev) - 1)) {
+		pr_err("data_offset %llu is not aligned to zone size %llu, please update bcache-tools and re-make the zoned backing device.\n",
+			dc->sb.data_offset, bdev_zone_sectors(dc->bdev));
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * Initialize zone information for the bcache device, this function
+ * assumes the bcache device has a cached device (dc != NULL), and
+ * the cached device is zoned device (bdev_is_zoned(dc->bdev) == true).
+ *
+ * The following zone information of the bcache device will be set,
+ * - zoned mode, same as the mode of zoned backing device
+ * - zone size in sectors, same as the zoned backing device
+ * - zones number, it is zones number of zoned backing device minus the
+ *   reserved zones for bcache super blocks.
+ */
+static int bch_cached_dev_zone_init(struct cached_dev *dc)
+{
+	struct request_queue *d_q, *b_q;
+	enum blk_zoned_model mode;
+
+	if (!bdev_is_zoned(dc->bdev))
+		return 0;
+
+	/* queue of bcache device */
+	d_q = dc->disk.disk->queue;
+	/* queue of backing device */
+	b_q = bdev_get_queue(dc->bdev);
+
+	mode = blk_queue_zoned_model(b_q);
+	if (mode != BLK_ZONED_NONE) {
+		d_q->limits.zoned = mode;
+		blk_queue_chunk_sectors(d_q, b_q->limits.chunk_sectors);
+		/*
+		 * (dc->sb.data_offset / q->limits.chunk_sectors) is the
+		 * zones number reserved for bcache super block. By default
+		 * it is set to 1 by bcache-tools.
+		 */
+		d_q->nr_zones = b_q->nr_zones -
+			(dc->sb.data_offset / d_q->limits.chunk_sectors);
+	}
+
+	return 0;
+}
+
 static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 {
 	int ret;
@@ -1333,6 +1421,10 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 
 	dc->disk.stripe_size = q->limits.io_opt >> 9;
 
+	ret = cached_dev_data_offset_check(dc);
+	if (ret)
+		return ret;
+
 	if (dc->disk.stripe_size)
 		dc->partial_stripes_expensive =
 			q->limits.raid_partial_stripes_expensive;
@@ -1355,7 +1447,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 
 	bch_cached_dev_request_init(dc);
 	bch_cached_dev_writeback_init(dc);
-	return 0;
+	ret = bch_cached_dev_zone_init(dc);
+
+	return ret;
 }
 
 /* Cached device - bcache superblock */
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device
  2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
  2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
@ 2020-05-22 12:18 ` Coly Li
  2020-05-25  1:24   ` Damien Le Moal
  2020-05-22 12:18 ` [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device Coly Li
  2020-05-25  5:25 ` [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Damien Le Moal
  3 siblings, 1 reply; 18+ messages in thread
From: Coly Li @ 2020-05-22 12:18 UTC (permalink / raw)
  To: linux-bcache
  Cc: linux-block, Coly Li, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn

Fortunately the zoned device management interface is designed to use
bio operations, so the bcache code just needs a small change to
handle the zone management bios.

The bcache driver needs to handle 5 zone management bio operations for
now; all of them are handled by cached_dev_nodata() since their bi_size
values are 0.
- REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_FINISH:
    The LBA conversion is already done in cached_dev_make_request(),
  so simply sending such zone management bios to the backing device is
  enough.
- REQ_OP_ZONE_RESET, REQ_OP_ZONE_RESET_ALL:
    If there is cached data (clean or dirty) on the cache device covered
  by the LBA range of the resetting zone(s), such cached data must be
  invalidated from the cache device before forwarding the zone reset
  bios to the backing device. Otherwise data inconsistency or further
  corruption may happen from the view of the bcache device.
    The difference between REQ_OP_ZONE_RESET_ALL and REQ_OP_ZONE_RESET is
  that when the zone management bio is to reset all zones, the number of
  zones reported by the bcache device (s->d->disk->queue->nr_zones) is
  passed into bch_data_invalidate_zones() as the 'size_t nr_zones'
  parameter. When only a single zone is reset, 'nr_zones' is simply 1.

By supporting zone managememnt bios with this patch, now a bcache device
can be formatted by zonefs, and the zones can be reset by truncate -s 0
on the zone files under seq/ directory. Supporting REQ_OP_ZONE_RESET_ALL
makes the whole disk zones reset much faster. In my testing on a 14TB
zoned SMR hard drive, resetting the 51632 seq zones one by one by sending
REQ_OP_ZONE_RESET bios takes 4m34s, while sending a single
REQ_OP_ZONE_RESET_ALL bio takes 12s, which is about 22 times faster.

REQ_OP_ZONE_RESET_ALL is supported by bcache only when the backing device
supports it. So QUEUE_FLAG_ZONE_RESETALL is set on the bcache queue only
when the queue of the backing device has this flag, which can be checked
by calling blk_queue_zone_resetall() on the backing device's request queue.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
Changelog:
v2: an improved version without any generic block layer change.
v1: initial version depends on other generic block layer changes.

 drivers/md/bcache/request.c | 99 ++++++++++++++++++++++++++++++++++++-
 drivers/md/bcache/super.c   |  2 +
 2 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 34f63da2338d..700b8ab5dec9 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1052,18 +1052,115 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
 	continue_at(cl, cached_dev_write_complete, NULL);
 }
 
+/*
+ * Invalidate the LBA range on cache device which is covered by the
+ * the resetting zones.
+ */
+static int bch_data_invalidate_zones(struct closure *cl,
+				      size_t zone_sector,
+				      size_t nr_zones)
+{
+	struct search *s = container_of(cl, struct search, cl);
+	struct data_insert_op *iop = &s->iop;
+	int max_keys, free_keys;
+	size_t start = zone_sector;
+	int ret;
+
+	max_keys = (block_bytes(iop->c) - sizeof(struct jset)) /
+		   sizeof(uint64_t);
+	bch_keylist_init(&iop->insert_keys);
+	ret = bch_keylist_realloc(&iop->insert_keys, max_keys, iop->c);
+	if (ret < 0)
+		return -ENOMEM;
+
+	while (nr_zones-- > 0) {
+		atomic_t *journal_ref = NULL;
+		size_t size = s->d->disk->queue->limits.chunk_sectors;
+more_keys:
+		bch_keylist_reset(&iop->insert_keys);
+		free_keys = max_keys;
+
+		while (size > 0 && free_keys >= 2) {
+			size_t sectors = min_t(size_t, size,
+					       1U << (KEY_SIZE_BITS - 1));
+
+			bch_keylist_add(&iop->insert_keys,
+					&KEY(iop->inode, start, sectors));
+			start += sectors;
+			size -= sectors;
+			free_keys -= 2;
+		}
+
+		/* Insert invalidate keys into btree */
+		journal_ref = bch_journal(iop->c, &iop->insert_keys, NULL);
+		if (!journal_ref) {
+			ret = -EIO;
+			break;
+		}
+
+		ret = bch_btree_insert(iop->c,
+				&iop->insert_keys, journal_ref, NULL);
+		atomic_dec_bug(journal_ref);
+		if (ret < 0)
+			break;
+
+		if (size)
+			goto more_keys;
+	}
+
+	bch_keylist_free(&iop->insert_keys);
+
+	return ret;
+}
+
 static void cached_dev_nodata(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
 	struct bio *bio = &s->bio.bio;
+	int nr_zones = 1;
 
 	if (s->iop.flush_journal)
 		bch_journal_meta(s->iop.c, cl);
 
-	/* If it's a flush, we send the flush to the backing device too */
+	if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL ||
+	    bio_op(bio) == REQ_OP_ZONE_RESET) {
+		int err = 0;
+		/*
+		 * If this is REQ_OP_ZONE_RESET_ALL bio, cached data
+		 * covered by all zones should be invalidate from the
+		 * cache device.
+		 */
+		if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL)
+			nr_zones = s->d->disk->queue->nr_zones;
+
+		err = bch_data_invalidate_zones(
+			cl, bio->bi_iter.bi_sector, nr_zones);
+
+		if (err < 0) {
+			s->iop.status = errno_to_blk_status(err);
+			/* debug, should be removed before post patch */
+			bio->bi_status = BLK_STS_TIMEOUT;
+			/* set by bio_cnt_set() in do_bio_hook() */
+			bio_put(bio);
+			/*
+			 * Invalidate cached data fails, don't send
+			 * the zone reset bio to backing device and
+			 * return failure. Otherwise potential data
+			 * corruption on bcache device may happen.
+			 */
+			goto continue_bio_complete;
+		}
+
+	}
+
+	/*
+	 * For flush or zone management bios, of cause
+	 * they should be sent to backing device too.
+	 */
 	bio->bi_end_io = backing_request_endio;
 	closure_bio_submit(s->iop.c, bio, cl);
 
+continue_bio_complete:
 	continue_at(cl, cached_dev_bio_complete, NULL);
 }
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index d5da7ad5157d..70c31950ec1b 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1390,6 +1390,8 @@ static int bch_cached_dev_zone_init(struct cached_dev *dc)
 		 */
 		d_q->nr_zones = b_q->nr_zones -
 			(dc->sb.data_offset / d_q->limits.chunk_sectors);
+		if (blk_queue_zone_resetall(b_q))
+			blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, d_q);
 	}
 
 	return 0;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device
  2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
  2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
  2020-05-22 12:18 ` [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device Coly Li
@ 2020-05-22 12:18 ` Coly Li
  2020-05-25  1:26   ` Damien Le Moal
  2020-05-25  5:25 ` [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Damien Le Moal
  3 siblings, 1 reply; 18+ messages in thread
From: Coly Li @ 2020-05-22 12:18 UTC (permalink / raw)
  To: linux-bcache
  Cc: linux-block, Coly Li, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn

Currently we don't support writeback mode for a zoned device as the
backing device, so reject it via the sysfs interface.

This rejection will be removed after writeback cache mode support
for zoned devices gets done.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/md/bcache/sysfs.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 323276994aab..41bdbc42a17d 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -359,6 +359,11 @@ STORE(__cached_dev)
 		if (v < 0)
 			return v;
 
+		if ((unsigned int) v == CACHE_MODE_WRITEBACK) {
+			pr_err("writeback mode is not supported for zoned backing device.\n");
+			return -ENOTSUPP;
+		}
+
 		if ((unsigned int) v != BDEV_CACHE_MODE(&dc->sb)) {
 			SET_BDEV_CACHE_MODE(&dc->sb, v);
 			bch_write_bdev_super(dc, NULL);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
  2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
@ 2020-05-25  1:10   ` Damien Le Moal
  2020-06-01 12:34     ` Coly Li
  0 siblings, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2020-05-25  1:10 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, Hannes Reinecke, Johannes Thumshirn

On 2020/05/22 21:18, Coly Li wrote:
> When using a zoned device e.g. SMR hard drive as the backing device,
> if bcache can export the zoned device information then it is possible
> to help the upper layer code to accelerate hot READ I/O requests.
> 
> This patch adds the report_zones method for the bcache device which has
> zoned device as backing device. Now such bcache device can be treated as
> a zoned device, by configured to writethrough, writearound mode or none
> mode, zonefs can be formated on top of such bcache device.
> 
> Here is a simple performance data for read requests via zonefs on top of
> bcache. The cache device of bcache is a 4TB NVMe SSD, the backing device
> is a 14TB host managed SMR hard drive. The formatted zonefs has 52155
> zone files, 523 of them are for convential zones (1 zone is reserved

s/convential/conventional

> for bcache super block and not reported), 51632 of them are for
> sequential zones.
> 
> First run to read first 4KB from all the zone files with 50 processes,
> it takes 5 minutes 55 seconds. Second run takes 12 seconds only because
> all the read requests hit on cache device.

Did you write anything first to the bcache device ? Otherwise, all zonefs files
will be empty and there is not going to be any file access... Question though:
when writing to a bcache device with writethrough mode, does the data go to the
SSD cache too ? Or is it written only to the backend device ?

> 
> 29 times faster is as expected for an ideal case when all READ I/Os hit
> on NVMe cache device.
> 
> Besides providing report_zones method of the bcache gendisk structure,
> this patch also initializes the following zoned device attribution for
> the bcache device,
> - zones number: the total zones number minus reserved zone(s) for bcache

s/zones number/number of zones

>   super block.
> - zone size: same size as reported from the underlying zoned device
> - zone mode: same mode as reported from the underlying zoned device

s/zone mode/zoned model

> Currently blk_revalidate_disk_zones() does not accept non-mq drivers, so
> all the above attribution are initialized mannally in bcache code.

s/mannally/manually

> 
> This patch just provides the report_zones method only. Handling all zone
> management commands will be addressed in following patches.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> Changelog:
> v4: the version without any generic block layer change.
> v3: the version depends on other generic block layer changes.
> v2: an improved version for comments.
> v1: initial version.
>  drivers/md/bcache/bcache.h  | 10 ++++
>  drivers/md/bcache/request.c | 69 ++++++++++++++++++++++++++
>  drivers/md/bcache/super.c   | 96 ++++++++++++++++++++++++++++++++++++-
>  3 files changed, 174 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 74a9849ea164..0d298b48707f 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
>  struct search;
>  struct btree;
>  struct keybuf;
> +struct bch_report_zones_args;
>  
>  struct keybuf_key {
>  	struct rb_node		node;
> @@ -277,6 +278,8 @@ struct bcache_device {
>  			  struct bio *bio, unsigned int sectors);
>  	int (*ioctl)(struct bcache_device *d, fmode_t mode,
>  		     unsigned int cmd, unsigned long arg);
> +	int (*report_zones)(struct bch_report_zones_args *args,
> +			    unsigned int nr_zones);
>  };
>  
>  struct io {
> @@ -748,6 +751,13 @@ struct bbio {
>  	struct bio		bio;
>  };
>  
> +struct bch_report_zones_args {
> +	struct bcache_device *bcache_device;
> +	sector_t sector;
> +	void *orig_data;
> +	report_zones_cb orig_cb;
> +};
> +
>  #define BTREE_PRIO		USHRT_MAX
>  #define INITIAL_PRIO		32768U
>  
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 71a90fbec314..34f63da2338d 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1233,6 +1233,30 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
>  	if (dc->io_disable)
>  		return -EIO;
>  
> +	/*
> +	 * All zoned device ioctl commands are handled in
> +	 * other code paths,
> +	 * - BLKREPORTZONE: by report_zones method of bcache_ops.
> +	 * - BLKRESETZONE/BLKOPENZONE/BLKCLOSEZONE/BLKFINISHZONE: all handled
> +	 *   by bio code path.
> +	 * - BLKGETZONESZ/BLKGETNRZONES:directly handled inside generic block
> +	 *   ioctl handler blkdev_common_ioctl().
> +	 */
> +	switch (cmd) {
> +	case BLKREPORTZONE:
> +	case BLKRESETZONE:
> +	case BLKGETZONESZ:
> +	case BLKGETNRZONES:
> +	case BLKOPENZONE:
> +	case BLKCLOSEZONE:
> +	case BLKFINISHZONE:
> +		pr_warn("Zoned device ioctl cmd should not be here.\n");
> +		return -EOPNOTSUPP;
> +	default:
> +		/* Other commands  */
> +		break;
> +	}
> +
>  	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
>  }
>  
> @@ -1261,6 +1285,50 @@ static int cached_dev_congested(void *data, int bits)
>  	return ret;
>  }
>  
> +/*
> + * The callback routine to parse a specific zone from all reporting
> + * zones. args->orig_cb() is the upper layer report zones callback,
> + * which should be called after the LBA conversion.
> + * Notice: all zones after zone 0 will be reported, including the
> + * offlined zones, how to handle the different types of zones are
> + * fully decided by upper layer who calss for reporting zones of
> + * the bcache device.
> + */
> +static int cached_dev_report_zones_cb(struct blk_zone *zone,
> +				      unsigned int idx,
> +				      void *data)

I do not think you need the line break for the last argument.

> +{
> +	struct bch_report_zones_args *args = data;
> +	struct bcache_device *d = args->bcache_device;
> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +	unsigned long data_offset = dc->sb.data_offset;
> +
> +	/* Zone 0 should not be reported */
> +	BUG_ON(zone->start < data_offset);

Wouldn't a WARN_ON_ONCE and return -EIO be better here ?
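
Something like this (untested sketch):

	if (WARN_ON_ONCE(zone->start < data_offset))
		return -EIO;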

> +
> +	/* convert back to LBA of the bcache device*/
> +	zone->start -= data_offset;
> +	zone->wp -= data_offset;

This has to be done depending on the zone type and zone condition: zone->wp is
"invalid" for conventional zones, and sequential zones that are full, read-only
or offline. So you need something like this:

	/* Remap LBA to the bcache device */
	zone->start -= data_offset;
	switch(zone->cond) {
	case BLK_ZONE_COND_NOT_WP:
	case BLK_ZONE_COND_READONLY:
	case BLK_ZONE_COND_FULL:
	case BLK_ZONE_COND_OFFLINE:
		break;
	case BLK_ZONE_COND_EMPTY:
		zone->wp = zone->start;
		break;
	default:
		zone->wp -= data_offset;
		break;
	}

	return args->orig_cb(zone, idx, args->orig_data);

> +
> +	return args->orig_cb(zone, idx, args->orig_data);
> +}
> +
> +static int cached_dev_report_zones(struct bch_report_zones_args *args,
> +				   unsigned int nr_zones)
> +{
> +	struct bcache_device *d = args->bcache_device;
> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +	/* skip zone 0 which is fully occupied by bcache super block */
> +	sector_t sector = args->sector + dc->sb.data_offset;
> +
> +	/* sector is real LBA of backing device */
> +	return blkdev_report_zones(dc->bdev,
> +				   sector,
> +				   nr_zones,
> +				   cached_dev_report_zones_cb,
> +				   args);

You could have multiple arguments on a couple of lines only here...

> +}
> +
>  void bch_cached_dev_request_init(struct cached_dev *dc)
>  {
>  	struct gendisk *g = dc->disk.disk;
> @@ -1268,6 +1336,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
>  	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
>  	dc->disk.cache_miss			= cached_dev_cache_miss;
>  	dc->disk.ioctl				= cached_dev_ioctl;
> +	dc->disk.report_zones			= cached_dev_report_zones;

Why set this method unconditionally ? Should it be set only for a zoned bcache
device ? E.g.:
	
	if (bdev_is_zoned(bcache bdev))
		dc->disk.report_zones = cached_dev_report_zones;

>  }
>  
>  /* Flash backed devices */
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index d98354fa28e3..d5da7ad5157d 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -679,10 +679,36 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
>  	return d->ioctl(d, mode, cmd, arg);
>  }
>  
> +static int report_zones_dev(struct gendisk *disk,
> +			    sector_t sector,
> +			    unsigned int nr_zones,
> +			    report_zones_cb cb,
> +			    void *data)
> +{
> +	struct bcache_device *d = disk->private_data;
> +	struct bch_report_zones_args args = {
> +		.bcache_device = d,
> +		.sector = sector,
> +		.orig_data = data,
> +		.orig_cb = cb,
> +	};
> +
> +	/*
> +	 * only bcache device with backing device has
> +	 * report_zones method, flash device does not.
> +	 */
> +	if (d->report_zones)
> +		return d->report_zones(&args, nr_zones);
> +
> +	/* flash dev does not have report_zones method */

This comment is confusing. Report zones is called against the bcache device, not
against its components... In any case, if the bcache device is not zoned, the
report_zones method will never be called by the block layer. So you probably
should just check that on entry:

	if (WARN_ON_ONCE(!blk_queue_is_zoned(disk->queue)))
		return -EOPNOTSUPP;

	return d->report_zones(&args, nr_zones);

> +	return -EOPNOTSUPP;
> +}
> +
>  static const struct block_device_operations bcache_ops = {
>  	.open		= open_dev,
>  	.release	= release_dev,
>  	.ioctl		= ioctl_dev,
> +	.report_zones	= report_zones_dev,
>  	.owner		= THIS_MODULE,
>  };

Same here. It may be better to set the report zones method only for a zoned
bcache dev. So you will need an additional block_device_operations struct for
that type.

static const struct block_device_operations bcache_zoned_ops = {
 	.open		= open_dev,
 	.release	= release_dev,
 	.ioctl		= ioctl_dev,
	.report_zones	= report_zones_dev,
 	.owner		= THIS_MODULE,
};

>  
> @@ -817,6 +843,7 @@ static void bcache_device_free(struct bcache_device *d)
>  
>  static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>  			      sector_t sectors, make_request_fn make_request_fn)
> +
>  {
>  	struct request_queue *q;
>  	const size_t max_stripes = min_t(size_t, INT_MAX,
> @@ -1307,6 +1334,67 @@ static void cached_dev_flush(struct closure *cl)
>  	continue_at(cl, cached_dev_free, system_wq);
>  }
>  
> +static inline int cached_dev_data_offset_check(struct cached_dev *dc)
> +{
> +	if (!bdev_is_zoned(dc->bdev))
> +		return 0;
> +
> +	/*
> +	 * If the backing hard drive is zoned device, sb.data_offset
> +	 * should be aligned to zone size, which is automatically
> +	 * handled by 'bcache' util of bcache-tools. If the data_offset
> +	 * is not aligned to zone size, it means the bcache-tools is
> +	 * outdated.
> +	 */
> +	if (dc->sb.data_offset & (bdev_zone_sectors(dc->bdev) - 1)) {
> +		pr_err("data_offset %llu is not aligned to zone size %llu, please update bcache-tools and re-make the zoned backing device.\n",

Long line... May be split the pr_err in 2 calls ?
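
E.g. something like this (sketch only, keeping the same message text):

	pr_err("data_offset %llu is not aligned to zone size %llu.\n",
	       dc->sb.data_offset, bdev_zone_sectors(dc->bdev));
	pr_err("Please update bcache-tools and re-make the zoned backing device.\n");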

> +			dc->sb.data_offset, bdev_zone_sectors(dc->bdev));
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Initialize zone information for the bcache device, this function
> + * assumes the bcache device has a cached device (dc != NULL), and
> + * the cached device is zoned device (bdev_is_zoned(dc->bdev) == true).
> + *
> + * The following zone information of the bcache device will be set,
> + * - zoned mode, same as the mode of zoned backing device
> + * - zone size in sectors, same as the zoned backing device
> + * - zones number, it is zones number of zoned backing device minus the
> + *   reserved zones for bcache super blocks.
> + */
> +static int bch_cached_dev_zone_init(struct cached_dev *dc)
> +{
> +	struct request_queue *d_q, *b_q;
> +	enum blk_zoned_model mode;

To be clear, may be call this variable "model" ?

> +
> +	if (!bdev_is_zoned(dc->bdev))
> +		return 0;
> +
> +	/* queue of bcache device */
> +	d_q = dc->disk.disk->queue;
> +	/* queue of backing device */
> +	b_q = bdev_get_queue(dc->bdev);
> +
> +	mode = blk_queue_zoned_model(b_q);
> +	if (mode != BLK_ZONED_NONE) {
> +		d_q->limits.zoned = mode;
> +		blk_queue_chunk_sectors(d_q, b_q->limits.chunk_sectors);
> +		/*
> +		 * (dc->sb.data_offset / q->limits.chunk_sectors) is the
> +		 * zones number reserved for bcache super block. By default
> +		 * it is set to 1 by bcache-tools.
> +		 */
> +		d_q->nr_zones = b_q->nr_zones -
> +			(dc->sb.data_offset / d_q->limits.chunk_sectors);

Does this compile on 32bits arch ? Don't you need a do_div() here ?
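
For example (sketch, assuming div_u64() is acceptable here):

	d_q->nr_zones = b_q->nr_zones -
		div_u64(dc->sb.data_offset, d_q->limits.chunk_sectors);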

> +	}
> +
> +	return 0;
> +}
> +
>  static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>  {
>  	int ret;
> @@ -1333,6 +1421,10 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>  
>  	dc->disk.stripe_size = q->limits.io_opt >> 9;
>  
> +	ret = cached_dev_data_offset_check(dc);
> +	if (ret)
> +		return ret;
> +
>  	if (dc->disk.stripe_size)
>  		dc->partial_stripes_expensive =
>  			q->limits.raid_partial_stripes_expensive;
> @@ -1355,7 +1447,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>  
>  	bch_cached_dev_request_init(dc);
>  	bch_cached_dev_writeback_init(dc);
> -	return 0;
> +	ret = bch_cached_dev_zone_init(dc);
> +
> +	return ret;
>  }
>  
>  /* Cached device - bcache superblock */
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device
  2020-05-22 12:18 ` [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device Coly Li
@ 2020-05-25  1:24   ` Damien Le Moal
  2020-06-01 16:06     ` Coly Li
  0 siblings, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2020-05-25  1:24 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, Hannes Reinecke, Johannes Thumshirn

On 2020/05/22 21:18, Coly Li wrote:
> Fortunately the zoned device management interface is designed to use
> bio operations, the bcache code just needs to do a little change to
> handle the zone management bios.
> 
> Bcache driver needs to handle 5 zone management bios for now, all of
> them are handled by cached_dev_nodata() since their bi_size values
> are 0.
> - REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_FINISH:
>     The LBA conversion is already done in cached_dev_make_request(),
>   just simply send such zone management bios to backing device is
>   enough.
> - REQ_OP_ZONE_RESET, REQ_OP_ZONE_RESET_ALL:
>     If thre is cached data (clean or dirty) on cache device covered
>   by the LBA range of resetting zone(s), such cached data must be

The first part of the sentence is a little unclear... Did you mean:

If the cache device holds data of the target zones, cache invalidation is needed
before forwarding...

>   invalidate from the cache device before forwarding the zone reset
>   bios to the backing device. Otherwise data inconsistency or further
>   corruption may happen from the view of bcache device.
>     The difference of REQ_OP_ZONE_RESET_ALL and REQ_OP_ZONE_RESET is
>   when the zone management bio is to reset all zones, send all zones
>   number reported by the bcache device (s->d->disk->queue->nr_zones)
>   into bch_data_invalidate_zones() as parameter 'size_t nr_zones'. If
>   only reset a single zone, just set 1 as 'size_t nr_zones'.
> 
> By supporting zone managememnt bios with this patch, now a bcache device

s/managememnt/management

> can be formatted by zonefs, and the zones can be reset by truncate -s 0
> on the zone files under seq/ directory. Supporting REQ_OP_ZONE_RESET_ALL
> makes the whole disk zones reset much faster. In my testing on a 14TB
> zoned SMR hard drive, 1 by 1 resetting 51632 seq zones by sending
> REQ_OP_ZONE_RESET bios takes 4m34s, by sending a single
> REQ_OP_ZONE_RESET_ALL bio takes 12s, which is 22x times faster.
> 
> REQ_OP_ZONE_RESET_ALL is supported by bcache only when the backing device
> supports it. So the bcache queue flag is set QUEUE_FLAG_ZONE_RESETALL on
> only when queue of backing device has such flag, which can be checked by
> calling blk_queue_zone_resetall() on backing device's request queue.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> Changelog:
> v2: an improved version without any generic block layer change.
> v1: initial version depends on other generic block layer changes.
> 
>  drivers/md/bcache/request.c | 99 ++++++++++++++++++++++++++++++++++++-
>  drivers/md/bcache/super.c   |  2 +
>  2 files changed, 100 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 34f63da2338d..700b8ab5dec9 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1052,18 +1052,115 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
>  	continue_at(cl, cached_dev_write_complete, NULL);
>  }
>  
> +/*
> + * Invalidate the LBA range on cache device which is covered by the
> + * the resetting zones.
> + */
> +static int bch_data_invalidate_zones(struct closure *cl,
> +				      size_t zone_sector,
> +				      size_t nr_zones)

No need for the line break after the second argument.

> +{
> +	struct search *s = container_of(cl, struct search, cl);
> +	struct data_insert_op *iop = &s->iop;
> +	int max_keys, free_keys;
> +	size_t start = zone_sector;
> +	int ret;
> +
> +	max_keys = (block_bytes(iop->c) - sizeof(struct jset)) /
> +		   sizeof(uint64_t);
> +	bch_keylist_init(&iop->insert_keys);
> +	ret = bch_keylist_realloc(&iop->insert_keys, max_keys, iop->c);
> +	if (ret < 0)
> +		return -ENOMEM;
> +
> +	while (nr_zones-- > 0) {
> +		atomic_t *journal_ref = NULL;
> +		size_t size = s->d->disk->queue->limits.chunk_sectors;
> +more_keys:
> +		bch_keylist_reset(&iop->insert_keys);
> +		free_keys = max_keys;
> +
> +		while (size > 0 && free_keys >= 2) {
> +			size_t sectors = min_t(size_t, size,
> +					       1U << (KEY_SIZE_BITS - 1));
> +
> +			bch_keylist_add(&iop->insert_keys,
> +					&KEY(iop->inode, start, sectors));
> +			start += sectors;
> +			size -= sectors;
> +			free_keys -= 2;
> +		}
> +
> +		/* Insert invalidate keys into btree */
> +		journal_ref = bch_journal(iop->c, &iop->insert_keys, NULL);
> +		if (!journal_ref) {
> +			ret = -EIO;
> +			break;
> +		}
> +
> +		ret = bch_btree_insert(iop->c,
> +				&iop->insert_keys, journal_ref, NULL);
> +		atomic_dec_bug(journal_ref);
> +		if (ret < 0)
> +			break;
> +
> +		if (size)
> +			goto more_keys;
> +	}
> +
> +	bch_keylist_free(&iop->insert_keys);
> +
> +	return ret;
> +}
> +
>  static void cached_dev_nodata(struct closure *cl)
>  {
>  	struct search *s = container_of(cl, struct search, cl);
>  	struct bio *bio = &s->bio.bio;
> +	int nr_zones = 1;
>  
>  	if (s->iop.flush_journal)
>  		bch_journal_meta(s->iop.c, cl);
>  
> -	/* If it's a flush, we send the flush to the backing device too */
> +	if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL ||
> +	    bio_op(bio) == REQ_OP_ZONE_RESET) {
> +		int err = 0;
> +		/*
> +		 * If this is REQ_OP_ZONE_RESET_ALL bio, cached data
> +		 * covered by all zones should be invalidate from the

s/invalidate/invalidated

> +		 * cache device.
> +		 */
> +		if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL)
> +			nr_zones = s->d->disk->queue->nr_zones;

Note: sending a REQ_OP_ZONE_RESET BIO to a conventional zone will be failed by
the disk... This is not allowed by the ZBC/ZAC specs. So invalidating the cache
for conventional zones is not really necessary. But as an initial support, I
think this is fine. This can be optimized later.

> +
> +		err = bch_data_invalidate_zones(
> +			cl, bio->bi_iter.bi_sector, nr_zones);
> +
> +		if (err < 0) {
> +			s->iop.status = errno_to_blk_status(err);
> +			/* debug, should be removed before post patch */
> +			bio->bi_status = BLK_STS_TIMEOUT;

You did not remove it :)

> +			/* set by bio_cnt_set() in do_bio_hook() */
> +			bio_put(bio);
> +			/*
> +			 * Invalidate cached data fails, don't send
> +			 * the zone reset bio to backing device and
> +			 * return failure. Otherwise potential data
> +			 * corruption on bcache device may happen.
> +			 */
> +			goto continue_bio_complete;
> +		}
> +
> +	}
> +
> +	/*
> +	 * For flush or zone management bios, of cause
> +	 * they should be sent to backing device too.
> +	 */
>  	bio->bi_end_io = backing_request_endio;
>  	closure_bio_submit(s->iop.c, bio, cl);

You cannot submit a REQ_OP_ZONE_RESET_ALL to the backing dev here, at least not
unconditionally. The problem is that if the backing dev does not have any
conventional zones at its LBA 0, REQ_OP_ZONE_RESET_ALL will really reset all
zones, including the first zone of the device that contains bcache super block.
So you will lose/destroy the bcache setup. You probably did not notice this
because your test drive has conventional zones at LBA 0 and reset all does not
have any effect on conventional zones...

The easy way to deal with this is to not set the QUEUE_FLAG_ZONE_RESETALL flag
for the bcache device queue. If it is not set, the block layer will
automatically issue only single zone REQ_OP_ZONE_RESET BIOs. That is slower,
yes, but that cannot be avoided when the backend disk does not have any
conventional zones. The QUEUE_FLAG_ZONE_RESETALL flag can be kept if the backend
disk's first zone, which contains the bcache super block, is a conventional zone.
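
A rough sketch of that idea (zone0_is_conventional() is a hypothetical
helper that would check the type of the zone holding the bcache super
block, it does not exist yet):

	/* sketch: keep RESET_ALL only if the super block zone cannot be reset */
	if (blk_queue_zone_resetall(b_q) &&
	    zone0_is_conventional(dc->bdev))
		blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, d_q);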

>  
> +continue_bio_complete:
>  	continue_at(cl, cached_dev_bio_complete, NULL);
>  }
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index d5da7ad5157d..70c31950ec1b 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1390,6 +1390,8 @@ static int bch_cached_dev_zone_init(struct cached_dev *dc)
>  		 */
>  		d_q->nr_zones = b_q->nr_zones -
>  			(dc->sb.data_offset / d_q->limits.chunk_sectors);
> +		if (blk_queue_zone_resetall(b_q))
> +			blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, d_q);

See above comment.

>  	}
>  
>  	return 0;
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device
  2020-05-22 12:18 ` [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device Coly Li
@ 2020-05-25  1:26   ` Damien Le Moal
  2020-06-01 16:09     ` Coly Li
  0 siblings, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2020-05-25  1:26 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, Hannes Reinecke, Johannes Thumshirn

On 2020/05/22 21:19, Coly Li wrote:
> Currently we don't support writeback mode for zoned device as backing
> device. So reject it by sysfs interface.
> 
> This rejection will be removed after the writeback cache mode support
> for zoned device gets done.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  drivers/md/bcache/sysfs.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 323276994aab..41bdbc42a17d 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -359,6 +359,11 @@ STORE(__cached_dev)
>  		if (v < 0)
>  			return v;
>  
> +		if ((unsigned int) v == CACHE_MODE_WRITEBACK) {
> +			pr_err("writeback mode is not supported for zoned backing device.\n");
> +			return -ENOTSUPP;
> +		}
> +
>  		if ((unsigned int) v != BDEV_CACHE_MODE(&dc->sb)) {
>  			SET_BDEV_CACHE_MODE(&dc->sb, v);
>  			bch_write_bdev_super(dc, NULL);
> 

Do you have a similar check in bcache user tools at format time ? Or is the
cache mode specified only when the bcache device is started ?

Looks good.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device
  2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
                   ` (2 preceding siblings ...)
  2020-05-22 12:18 ` [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device Coly Li
@ 2020-05-25  5:25 ` Damien Le Moal
  2020-05-25  8:14   ` Coly Li
  3 siblings, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2020-05-25  5:25 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 2020/05/22 21:19, Coly Li wrote:
> Hi folks,
> 
> This is series, now bcache can support zoned device (e.g. host managed
> SMR hard drive) as the backing deice. Currently writeback mode is not
> support yet, which is on the to-do list (requires on-disk super block
> format change).
> 
> The first patch makes bcache to export the zoned information to upper
> layer code, for example formatting zonefs on top of the bcache device.
> By default, zone 0 of the zoned device is fully reserved for bcache
> super block, therefore the reported zones number is 1 less than the
> exact zones number of the physical SMR hard drive.
> 
> The second patch handles zone management command for bcache. Indeed
> these zone management commands are wrappered as zone management bios.
> For REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL zone management bios,
> before forwarding the bio to backing device, all cached data covered
> by the resetting zone(s) must be invalidated to keep data consistency.
> For rested zone management bios just minus the bi_sector by data_offset
> and simply forward to the zoned backing device.
> 
> The third patch is to make sure after bcache device starts, the cache
> mode cannot be changed to writeback via sysfs interface. Bcache-tools
> is modified to notice users and convert to writeback mode to the default
> writethrough mode when making a bcache device.
> 
> There is one thing not addressed by this series, that is re-write the
> bcache super block after REQ_OP_ZONE_RESET_ALL command. There will be
> quite soon that all seq zones device may appear, but it is OK to make
> bcache support such all seq-zones device a bit later.
> 
> Now a bcache device created with a zoned SMR drive can pass these test
> cases,
> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>   excluding zone 0 of the backing device (reserved for bcache super
>   block).
> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>   in sectors.
> - run 'blkzone report /dev/bcache0', all zones information displayed.
> - run 'blkzone reset -o <zone LBA> -c <zones number> /dev/bcache0',
>   conventional zones will reject the command, seqential zones covered
>   by the command range will reset its write pointer to start LBA of
>   their zones. If <zone LBA> is 0 and <zones number> covers all zones,
>   REQ_OP_ZONE_RESET_ALL command will be received and handled by bcache
>   device properly.
> - zonefs can be created on top of the bcache device, with/without cache
>   device attached. All sequential direct write and random read work well
>   and zone reset by 'truncate -s 0 <zone file>' works too.
> - Writeback cache mode does not support yet.
> 
> Now all prevous code review comments are addressed by this RFC version.
> Please don't hesitate to offer your opinion on this version.
> 
> Thanks in advance for your help.

Coly,

One more thing: your patch series lacks support for REQ_OP_ZONE_APPEND. It would
be great to add that. As is, since you do not set the max_zone_append_sectors
queue limit for the bcache device, that command will not be issued by the block
layer. But zonefs (and btrfs) will use zone append (support for zonefs is
already queued for 5.8, btrfs will come later).

If bcache writethrough policy results in a data write to be issued to both the
backend device and the cache device, then some special code will be needed:
these 2 BIOs will need to be serialized since the actual write location of a
zone append command is known only on completion of the command. That is, the
zone append BIO needs to be issued to the backend device first, then to the
cache SSD device as a regular write once the zone append completes and its write
location is known.
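
Advertising the limit itself would only be one more call in
bch_cached_dev_zone_init(), something like this sketch (the writethrough
serialization described above would still be needed on top of it):

	blk_queue_max_zone_append_sectors(d_q,
			queue_max_zone_append_sectors(b_q));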


> 
> Coly Li
> ---
> Changelog:
> v4: another improved version without any other generic block change.
> v3: an improved version depends on other generic block layer changes.
> v2: the first RFC version for comments and review.
> v1: the initial version posted just for information.
> 
> 
> Coly Li (3):
>   bcache: export bcache zone information for zoned backing device
>   bcache: handle zone management bios for bcache device
>   bcache: reject writeback cache mode for zoned backing device
> 
>  drivers/md/bcache/bcache.h  |  10 +++
>  drivers/md/bcache/request.c | 168 +++++++++++++++++++++++++++++++++++-
>  drivers/md/bcache/super.c   |  98 ++++++++++++++++++++-
>  drivers/md/bcache/sysfs.c   |   5 ++
>  4 files changed, 279 insertions(+), 2 deletions(-)
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device
  2020-05-25  5:25 ` [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Damien Le Moal
@ 2020-05-25  8:14   ` Coly Li
  0 siblings, 0 replies; 18+ messages in thread
From: Coly Li @ 2020-05-25  8:14 UTC (permalink / raw)
  To: Damien Le Moal, linux-bcache; +Cc: linux-block

On 2020/5/25 13:25, Damien Le Moal wrote:
> On 2020/05/22 21:19, Coly Li wrote:
>> Hi folks,
>>
>> This is series, now bcache can support zoned device (e.g. host managed
>> SMR hard drive) as the backing deice. Currently writeback mode is not
>> support yet, which is on the to-do list (requires on-disk super block
>> format change).
>>
>> The first patch makes bcache to export the zoned information to upper
>> layer code, for example formatting zonefs on top of the bcache device.
>> By default, zone 0 of the zoned device is fully reserved for bcache
>> super block, therefore the reported zones number is 1 less than the
>> exact zones number of the physical SMR hard drive.
>>
>> The second patch handles zone management command for bcache. Indeed
>> these zone management commands are wrappered as zone management bios.
>> For REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL zone management bios,
>> before forwarding the bio to backing device, all cached data covered
>> by the resetting zone(s) must be invalidated to keep data consistency.
>> For rested zone management bios just minus the bi_sector by data_offset
>> and simply forward to the zoned backing device.
>>
>> The third patch is to make sure after bcache device starts, the cache
>> mode cannot be changed to writeback via sysfs interface. Bcache-tools
>> is modified to notice users and convert to writeback mode to the default
>> writethrough mode when making a bcache device.
>>
>> There is one thing not addressed by this series, that is re-write the
>> bcache super block after REQ_OP_ZONE_RESET_ALL command. There will be
>> quite soon that all seq zones device may appear, but it is OK to make
>> bcache support such all seq-zones device a bit later.
>>
>> Now a bcache device created with a zoned SMR drive can pass these test
>> cases,
>> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
>> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>>   excluding zone 0 of the backing device (reserved for bcache super
>>   block).
>> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>>   in sectors.
>> - run 'blkzone report /dev/bcache0', all zones information displayed.
>> - run 'blkzone reset -o <zone LBA> -c <zones number> /dev/bcache0',
>>   conventional zones will reject the command, seqential zones covered
>>   by the command range will reset its write pointer to start LBA of
>>   their zones. If <zone LBA> is 0 and <zones number> covers all zones,
>>   REQ_OP_ZONE_RESET_ALL command will be received and handled by bcache
>>   device properly.
>> - zonefs can be created on top of the bcache device, with/without cache
>>   device attached. All sequential direct write and random read work well
>>   and zone reset by 'truncate -s 0 <zone file>' works too.
>> - Writeback cache mode does not support yet.
>>
>> Now all prevous code review comments are addressed by this RFC version.
>> Please don't hesitate to offer your opinion on this version.
>>
>> Thanks in advance for your help.
> 
> Coly,
> 
> One more thing: your patch series lacks support for REQ_OP_ZONE_APPEND. It would
> be great to add that. As is, since you do not set the max_zone_append_sectors
> queue limit for the bcache device, that command will not be issued by the block
> layer. But zonefs (and btrfs) will use zone append in (support for zonefs is
> queued already in 5.8, btrfs will come later).

Hi Damien,

Thank you for the suggestion, I will work on it now and post it in the
next version.

> 
> If bcache writethrough policy results in a data write to be issued to both the
> backend device and the cache device, then some special code will be needed:
> these 2 BIOs will need to be serialized since the actual write location of a
> zone append command is known only on completion of the command. That is, the
> zone append BIO needs to be issued to the backend device first, then to the
> cache SSD device as a regular write once the zone append completes and its write
> location is known.
> 

Copied. It should be OK for bcache. For writethrough mode the data will
be inserted into the SSD only after the bio to the backing storage has
completed.
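
Just to make the ordering concrete, here is a minimal sketch of what the
completion path of such a serialized zone append could look like. This is
only an illustration under stated assumptions, not the actual bcache
implementation: struct zone_append_ctx, its orig_bio field and
bch_cache_insert() are hypothetical names, and it assumes the block layer
reports the written sector of a completed REQ_OP_ZONE_APPEND bio in
bi_iter.bi_sector.

/*
 * Sketch only: completion handler of the zone append bio sent to the
 * backing device. Only after the append completes (and the final write
 * location is known) is the data inserted into the cache device.
 */
struct zone_append_ctx {
	struct cached_dev	*dc;
	struct bio		*orig_bio;	/* bio issued to /dev/bcacheN */
	void			*data;		/* payload to cache */
	unsigned int		len;
};

static void zone_append_backing_endio(struct bio *bio)
{
	struct zone_append_ctx *ctx = bio->bi_private;

	if (!bio->bi_status) {
		/* actual write location, reported by the drive at completion */
		sector_t written = bio->bi_iter.bi_sector;

		/* hypothetical helper: cache the copy at the final LBA */
		bch_cache_insert(ctx->dc, written, ctx->data, ctx->len);
	}

	/* complete the original bio issued to the bcache device */
	ctx->orig_bio->bi_status = bio->bi_status;
	bio_endio(ctx->orig_bio);

	bio_put(bio);	/* drop our reference on the cloned bio */
	kfree(ctx);
}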

Thank you for all your comments. I will start working through them on the
series and reply (maybe with more questions) later.

Coly Li



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
  2020-05-25  1:10   ` Damien Le Moal
@ 2020-06-01 12:34     ` Coly Li
  2020-06-02  8:48       ` Damien Le Moal
  0 siblings, 1 reply; 18+ messages in thread
From: Coly Li @ 2020-06-01 12:34 UTC (permalink / raw)
  To: Damien Le Moal, linux-bcache
  Cc: linux-block, Hannes Reinecke, Johannes Thumshirn

On 2020/5/25 09:10, Damien Le Moal wrote:
> On 2020/05/22 21:18, Coly Li wrote:
[snip]
>> zone files, 523 of them are for convential zones (1 zone is reserved
> 
> s/convential/conventional
> 
>> for bcache super block and not reported), 51632 of them are for
>> sequential zones.
>>
>> First run to read first 4KB from all the zone files with 50 processes,
>> it takes 5 minutes 55 seconds. Second run takes 12 seconds only because
>> all the read requests hit on cache device.
> 

Hi Damien,

> Did you write anything first to the bcache device ? Otherwise, all zonefs files
> will be empty and there is not going to be any file access... Question though:
> when writing to a bcache device with writethrough mode, does the data go to the
> SSD cache too ? Or is it written only to the backend device ?
> 

Yes, I did. For all random and sequential zone files, I write 8KB at
their start and read back 4KB from offset 0 of each zone file.

In my test case, all data first goes to the backing device; after that
completes, it is inserted into the SSD. This order is guaranteed by the
bcache code.

>>
>> 29 times faster is as expected for an ideal case when all READ I/Os hit
>> on NVMe cache device.
>>
>> Besides providing report_zones method of the bcache gendisk structure,
>> this patch also initializes the following zoned device attribution for
>> the bcache device,
>> - zones number: the total zones number minus reserved zone(s) for bcache
> 
> s/zones number/number of zones
> 
>>   super block.
>> - zone size: same size as reported from the underlying zoned device
>> - zone mode: same mode as reported from the underlying zoned device
> 
> s/zone mode/zoned model
> 
>> Currently blk_revalidate_disk_zones() does not accept non-mq drivers, so
>> all the above attribution are initialized mannally in bcache code.
> 
> s/mannally/manually
> 
>>
>> This patch just provides the report_zones method only. Handling all zone
>> management commands will be addressed in following patches.
>>

The above typos will be fixed in next version.


>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Damien Le Moal <damien.lemoal@wdc.com>
>> Cc: Hannes Reinecke <hare@suse.com>
>> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>> Changelog:
>> v4: the version without any generic block layer change.
>> v3: the version depends on other generic block layer changes.
>> v2: an improved version for comments.
>> v1: initial version.
>>  drivers/md/bcache/bcache.h  | 10 ++++
>>  drivers/md/bcache/request.c | 69 ++++++++++++++++++++++++++
>>  drivers/md/bcache/super.c   | 96 ++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 174 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index 74a9849ea164..0d298b48707f 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h

[snip]
>>  
>> +/*
>> + * The callback routine to parse a specific zone from all reporting
>> + * zones. args->orig_cb() is the upper layer report zones callback,
>> + * which should be called after the LBA conversion.
>> + * Notice: all zones after zone 0 will be reported, including the
>> + * offlined zones, how to handle the different types of zones are
>> + * fully decided by upper layer who calss for reporting zones of
>> + * the bcache device.
>> + */
>> +static int cached_dev_report_zones_cb(struct blk_zone *zone,
>> +				      unsigned int idx,
>> +				      void *data)
> 
> I do not think you need the line break for the last argument.
> 

OK, let me change the style.

>> +{
>> +	struct bch_report_zones_args *args = data;
>> +	struct bcache_device *d = args->bcache_device;
>> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>> +	unsigned long data_offset = dc->sb.data_offset;
>> +
>> +	/* Zone 0 should not be reported */
>> +	BUG_ON(zone->start < data_offset);
> 
> Wouldn't a WARN_ON_ONCE and return -EIO be better here ?

Do it in next version.


> 
>> +
>> +	/* convert back to LBA of the bcache device*/
>> +	zone->start -= data_offset;
>> +	zone->wp -= data_offset;
> 
> This has to be done depending on the zone type and zone condition: zone->wp is
> "invalid" for conventional zones, and sequential zones that are full, read-only
> or offline. So you need something like this:
> 
> 	/* Remap LBA to the bcache device */
> 	zone->start -= data_offset;
> 	switch(zone->cond) {
> 	case BLK_ZONE_COND_NOT_WP:
> 	case BLK_ZONE_COND_READONLY:
> 	case BLK_ZONE_COND_FULL:
> 	case BLK_ZONE_COND_OFFLINE:
> 		break;
> 	case BLK_ZONE_COND_EMPTY:
> 		zone->wp = zone->start;
> 		break;
> 	default:
> 		zone->wp -= data_offset;
> 		break;
> 	}
> 
> 	return args->orig_cb(zone, idx, args->orig_data);
> 

Hmm, do we have a unified spec to handle the wp on different zone
condition ?

In zonefs I see zone->wp sets to zone->start for,
- BLK_ZONE_COND_OFFLINE
- BLK_ZONE_COND_READONLY

In sd_zbc.c, I see wp sets to end of the zone for
- BLK_ZONE_COND_FULL

And in dm.c I see zone->wp is set to zone->start for,
- BLK_ZONE_COND_EMPTY

It seems that except for BLK_ZONE_COND_NOT_WP, for the other conditions the
write pointer should already be taken care of by the device driver.

So I wrote the switch-case in the following way, inspired by your
comments,

        /* Zone 0 should not be reported */
        if(WARN_ON_ONCE(zone->start < data_offset))
                return -EIO;

        /* convert back to LBA of the bcache device*/
        zone->start -= data_offset;
        if (zone->cond != BLK_ZONE_COND_NOT_WP)
                zone->wp -= data_offset;

        switch (zone->cond) {
        case BLK_ZONE_COND_NOT_WP:
                /* No write pointer available */
                break;
        case BLK_ZONE_COND_EMPTY:
        case BLK_ZONE_COND_READONLY:
        case BLK_ZONE_COND_OFFLINE:
                /*
                 * If underlying device driver does not properly
                 * set writer pointer, warn and fix it.
                 */
                if (WARN_ON_ONCE(zone->wp != zone->start))
                        zone->wp = zone->start;
                break;
        case BLK_ZONE_COND_FULL:
                /*
                 * If underlying device driver does not properly
                 * set writer pointer, warn and fix it.
                 */
                if (WARN_ON_ONCE(zone->wp != zone->start + zone->len))
                        zone->wp = zone->start + zone->len;
                break;
        default:
                break;
        }

        return args->orig_cb(zone, idx, args->orig_data);

The above code converts zone->wp by minus data_offset for
non-BLK_ZONE_COND_NOT_WP condition. And for other necessary conditions,
I just check whether the underlying device driver updates write pointer
properly (as I observed in other drivers), if not then drop a warning
and fix the write pointer to the expected value.

Now the write pointer is only fixed when it was wrong value. If the
underlying device driver does not maintain the value properly, we figure
out and fix it.

>> +
>> +	return args->orig_cb(zone, idx, args->orig_data);
>> +}
>> +
>> +static int cached_dev_report_zones(struct bch_report_zones_args *args,
>> +				   unsigned int nr_zones)
>> +{
>> +	struct bcache_device *d = args->bcache_device;
>> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>> +	/* skip zone 0 which is fully occupied by bcache super block */
>> +	sector_t sector = args->sector + dc->sb.data_offset;
>> +
>> +	/* sector is real LBA of backing device */
>> +	return blkdev_report_zones(dc->bdev,
>> +				   sector,
>> +				   nr_zones,
>> +				   cached_dev_report_zones_cb,
>> +				   args);
> 
> You could have multiple arguments on a couple of lines only here...
> 

Sure, I will change the style in the next version.

>> +}
>> +
>>  void bch_cached_dev_request_init(struct cached_dev *dc)
>>  {
>>  	struct gendisk *g = dc->disk.disk;
>> @@ -1268,6 +1336,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
>>  	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
>>  	dc->disk.cache_miss			= cached_dev_cache_miss;
>>  	dc->disk.ioctl				= cached_dev_ioctl;
>> +	dc->disk.report_zones			= cached_dev_report_zones;
> 
> Why set this method unconditionally ? Should it be set only for a zoned bcache
> device ? E.g.:
> 	
> 	if (bdev_is_zoned(bcache bdev))
> 		dc->disk.report_zones = cached_dev_report_zones;
> 

Will fix it in next version. Thanks.


>>  }
>>  
>>  /* Flash backed devices */
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index d98354fa28e3..d5da7ad5157d 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -679,10 +679,36 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
>>  	return d->ioctl(d, mode, cmd, arg);
>>  }
>>  
>> +static int report_zones_dev(struct gendisk *disk,
>> +			    sector_t sector,
>> +			    unsigned int nr_zones,
>> +			    report_zones_cb cb,
>> +			    void *data)
>> +{
>> +	struct bcache_device *d = disk->private_data;
>> +	struct bch_report_zones_args args = {
>> +		.bcache_device = d,
>> +		.sector = sector,
>> +		.orig_data = data,
>> +		.orig_cb = cb,
>> +	};
>> +
>> +	/*
>> +	 * only bcache device with backing device has
>> +	 * report_zones method, flash device does not.
>> +	 */
>> +	if (d->report_zones)
>> +		return d->report_zones(&args, nr_zones);
>> +
>> +	/* flash dev does not have report_zones method */
> 
> This comment is confusing. Report zones is called against the bcache device, not
> against its components... In any case, if the bcache device is not zoned, the

My fault. There are currently two kinds of bcache devices: one with a
backing device and one without. For a bcache device without a backing
device, all I/Os go to the cache device (SSD); such a device is called a
flash device or flash-only volume IMHO. Yes, this is confusing, I need to
fix it.


> report_zones method will never be called by the block layer. So you probably
> should just check that on entry:
> 
> 	if (WARN_ON_ONCE(!blk_queue_is_zoned(disk->queue))
> 		return -EOPNOTSUPP;
> 
> 	return d->report_zones(&args, nr_zones);
> 
>> +	return -EOPNOTSUPP;
>> +}
>> +

Good point, I will use it in the next version.


>>  static const struct block_device_operations bcache_ops = {
>>  	.open		= open_dev,
>>  	.release	= release_dev,
>>  	.ioctl		= ioctl_dev,
>> +	.report_zones	= report_zones_dev,
>>  	.owner		= THIS_MODULE,
>>  };
> 
> Same here. It may be better to set the report zones method only for a zoned
> bcache dev. So you will need an additional block_device_operations struct for
> that type.
> 
> static const struct block_device_operations bcache_zoned_ops = {
>  	.open		= open_dev,
>  	.release	= release_dev,
>  	.ioctl		= ioctl_dev,
> 	.report_zones	= report_zones_dev,
>  	.owner		= THIS_MODULE,
> };
> 

I see your point. But the purpose of this coding style is to avoid an
extra block_device_operations, just as ioctl_dev() does. Let's keep the
existing style for now, and change it in the future if necessary.


[snipped]
>>  
>> +static inline int cached_dev_data_offset_check(struct cached_dev *dc)
>> +{
>> +	if (!bdev_is_zoned(dc->bdev))
>> +		return 0;
>> +
>> +	/*
>> +	 * If the backing hard drive is zoned device, sb.data_offset
>> +	 * should be aligned to zone size, which is automatically
>> +	 * handled by 'bcache' util of bcache-tools. If the data_offset
>> +	 * is not aligned to zone size, it means the bcache-tools is
>> +	 * outdated.
>> +	 */
>> +	if (dc->sb.data_offset & (bdev_zone_sectors(dc->bdev) - 1)) {
>> +		pr_err("data_offset %llu is not aligned to zone size %llu, please update bcache-tools and re-make the zoned backing device.\n",
> 
> Long line... May be split the pr_err in 2 calls ?

This follows the md code style, which is to not split an error message
across multiple lines. See the following commit,

commit 9d48739ef19aa8ad6026fd312b3ed81b560c8579
Author: NeilBrown <neilb@suse.com>
Date:   Wed Nov 2 14:16:49 2016 +1100

    md: change all printk() to pr_err() or pr_warn() etc.

This is just a choice of code style, and I'd like to take the single
line message for the following cited reason,

    3/ When strings have been split onto multiple lines, rejoin into
       a single string.
       The cost of having lines > 80 chars is less than the cost of not
       being able to easily search for a particular message.


> 
>> +			dc->sb.data_offset, bdev_zone_sectors(dc->bdev));
>> +		return -EINVAL;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
[snipped]

>> + */
>> +static int bch_cached_dev_zone_init(struct cached_dev *dc)
>> +{
>> +	struct request_queue *d_q, *b_q;
>> +	enum blk_zoned_model mode;
> 
> To be clear, may be call this variable "model" ?

Sure, it will be fixed in next version.


[snipped]
>> +	if (mode != BLK_ZONED_NONE) {
>> +		d_q->limits.zoned = mode;
>> +		blk_queue_chunk_sectors(d_q, b_q->limits.chunk_sectors);
>> +		/*
>> +		 * (dc->sb.data_offset / q->limits.chunk_sectors) is the
>> +		 * zones number reserved for bcache super block. By default
>> +		 * it is set to 1 by bcache-tools.
>> +		 */
>> +		d_q->nr_zones = b_q->nr_zones -
>> +			(dc->sb.data_offset / d_q->limits.chunk_sectors);
> 
> Does this compile on 32bits arch ? Don't you need a do_div() here ?
> 

Yes, I will fix it in the next version.

[snipped]

Thank you for the very detailed review comments :-)

Coly Li




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device
  2020-05-25  1:24   ` Damien Le Moal
@ 2020-06-01 16:06     ` Coly Li
  2020-06-02  8:54       ` Damien Le Moal
  0 siblings, 1 reply; 18+ messages in thread
From: Coly Li @ 2020-06-01 16:06 UTC (permalink / raw)
  To: Damien Le Moal, linux-bcache
  Cc: linux-block, Hannes Reinecke, Johannes Thumshirn

On 2020/5/25 09:24, Damien Le Moal wrote:
> On 2020/05/22 21:18, Coly Li wrote:
>> Fortunately the zoned device management interface is designed to use
>> bio operations, the bcache code just needs to do a little change to
>> handle the zone management bios.
>>
>> Bcache driver needs to handle 5 zone management bios for now, all of
>> them are handled by cached_dev_nodata() since their bi_size values
>> are 0.
>> - REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_FINISH:
>>     The LBA conversion is already done in cached_dev_make_request(),
>>   just simply send such zone management bios to backing device is
>>   enough.
>> - REQ_OP_ZONE_RESET, REQ_OP_ZONE_RESET_ALL:
>>     If thre is cached data (clean or dirty) on cache device covered
>>   by the LBA range of resetting zone(s), such cached data must be
> 
> The first part of the sentence is a little unclear... Did you mean:
> 
> If the cache device holds data of the target zones, cache invalidation is needed
> before forwarding...

It will be fixed in next version.

> 
>>   invalidate from the cache device before forwarding the zone reset
>>   bios to the backing device. Otherwise data inconsistency or further
>>   corruption may happen from the view of bcache device.
>>     The difference of REQ_OP_ZONE_RESET_ALL and REQ_OP_ZONE_RESET is
>>   when the zone management bio is to reset all zones, send all zones
>>   number reported by the bcache device (s->d->disk->queue->nr_zones)
>>   into bch_data_invalidate_zones() as parameter 'size_t nr_zones'. If
>>   only reset a single zone, just set 1 as 'size_t nr_zones'.
>>
>> By supporting zone managememnt bios with this patch, now a bcache device
> 
> s/managememnt/management
> 

It will be fixed in next version.

>> can be formatted by zonefs, and the zones can be reset by truncate -s 0
>> on the zone files under seq/ directory. Supporting REQ_OP_ZONE_RESET_ALL
>> makes the whole disk zones reset much faster. In my testing on a 14TB
>> zoned SMR hard drive, 1 by 1 resetting 51632 seq zones by sending
>> REQ_OP_ZONE_RESET bios takes 4m34s, by sending a single
>> REQ_OP_ZONE_RESET_ALL bio takes 12s, which is 22x times faster.
>>
>> REQ_OP_ZONE_RESET_ALL is supported by bcache only when the backing device
>> supports it. So the bcache queue flag is set QUEUE_FLAG_ZONE_RESETALL on
>> only when queue of backing device has such flag, which can be checked by
>> calling blk_queue_zone_resetall() on backing device's request queue.

The above part about REQ_OP_ZONE_RESET_ALL will be removed.


>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Damien Le Moal <damien.lemoal@wdc.com>
>> Cc: Hannes Reinecke <hare@suse.com>
>> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>> Changelog:
>> v2: an improved version without any generic block layer change.
>> v1: initial version depends on other generic block layer changes.
>>
>>  drivers/md/bcache/request.c | 99 ++++++++++++++++++++++++++++++++++++-
>>  drivers/md/bcache/super.c   |  2 +
>>  2 files changed, 100 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>> index 34f63da2338d..700b8ab5dec9 100644
>> --- a/drivers/md/bcache/request.c
>> +++ b/drivers/md/bcache/request.c
>> @@ -1052,18 +1052,115 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
>>  	continue_at(cl, cached_dev_write_complete, NULL);
>>  }
>>  
>> +/*
>> + * Invalidate the LBA range on cache device which is covered by the
>> + * the resetting zones.
>> + */
>> +static int bch_data_invalidate_zones(struct closure *cl,
>> +				      size_t zone_sector,
>> +				      size_t nr_zones)
> 
> No need for the line break after the second argument.

Believe me, I tried hard to fit them onto 2 lines. But the second line
becomes a bit too long; I tried to persuade myself to accept that, but
the unaligned tails are still quite uncomfortable. I chose to keep the
current aligned format ....

[snipped]

>> +
>>  static void cached_dev_nodata(struct closure *cl)
>>  {
>>  	struct search *s = container_of(cl, struct search, cl);
>>  	struct bio *bio = &s->bio.bio;
>> +	int nr_zones = 1;
>>  
>>  	if (s->iop.flush_journal)
>>  		bch_journal_meta(s->iop.c, cl);
>>  
>> -	/* If it's a flush, we send the flush to the backing device too */
>> +	if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL ||
>> +	    bio_op(bio) == REQ_OP_ZONE_RESET) {
>> +		int err = 0;
>> +		/*
>> +		 * If this is REQ_OP_ZONE_RESET_ALL bio, cached data
>> +		 * covered by all zones should be invalidate from the
> 
> s/invalidate/invalidated

It will be fixed in next version.

> 
>> +		 * cache device.
>> +		 */
>> +		if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL)
>> +			nr_zones = s->d->disk->queue->nr_zones;
> 
> Not: sending a REQ_OP_ZONE_RESET BIO to a conventional zone will be failed by
> the disk... This is not allowed by the ZBC/ZAC specs. So invalidation the cache
> for conventional zones is not really necessary. But as an initial support, I
> think this is fine. This can be optimized later.
>
Copied, I will think about how to optimize this later. So far in my testing,
resetting conventional zones may receive an error and time out in the
underlying drivers, and the bcache code just forwards such errors to the
upper layer. What I see is that the reset command hangs for quite a long
time and then fails. I will find a way to make a zone reset command on a
conventional zone fail immediately.


>> +
>> +		err = bch_data_invalidate_zones(
>> +			cl, bio->bi_iter.bi_sector, nr_zones);
>> +
>> +		if (err < 0) {
>> +			s->iop.status = errno_to_blk_status(err);
>> +			/* debug, should be removed before post patch */
>> +			bio->bi_status = BLK_STS_TIMEOUT;
> 
> You did not remove it :)

Remove it in next version LOL

> 
>> +			/* set by bio_cnt_set() in do_bio_hook() */
>> +			bio_put(bio);
>> +			/*
>> +			 * Invalidate cached data fails, don't send
>> +			 * the zone reset bio to backing device and
>> +			 * return failure. Otherwise potential data
>> +			 * corruption on bcache device may happen.
>> +			 */
>> +			goto continue_bio_complete;
>> +		}
>> +
>> +	}
>> +
>> +	/*
>> +	 * For flush or zone management bios, of cause
>> +	 * they should be sent to backing device too.
>> +	 */
>>  	bio->bi_end_io = backing_request_endio;
>>  	closure_bio_submit(s->iop.c, bio, cl);
> 
> You cannot submit a REQ_OP_ZONE_RESET_ALL to the backing dev here, at least not
> unconditionally. The problem is that if the backing dev doe not have any
> conventional zones at its LBA 0, REQ_OP_ZONE_RESET_ALL will really reset all
> zones, including the first zone of the device that contains bcache super block.
> So you will loose/destroy the bcache setup. You probably did not notice this
> because your test drive has conventional zones at LBA 0 and reset all does not
> have any effect on conventional zones...
> 
> The easy way to deal with this is to not set the QUEUE_FLAG_ZONE_RESETALL flag
> for the bcache device queue. If it is not set, the block layer will
> automatically issue only single zone REQ_OP_ZONE_RESET BIOs. That is slower,
> yes, but that cannot be avoided when the backend disk does not have any
> conventional zones. The QUEUE_FLAG_ZONE_RESETALL flag can be kept if the backend
> disk first zone containing the bcache super block is a conventional zone.
> 

You are totally right. Finally I realize why zonefs does not use
REQ_OP_ZONE_RESET_ALL and instead resets each non-conventional,
non-offline and non-read-only zone one by one. Yes, I will remove the
REQ_OP_ZONE_RESET_ALL support in the next version.
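
For illustration only, a minimal sketch of the conditional setup
suggested above, which would keep REQ_OP_ZONE_RESET_ALL only when the
backing device zone holding the bcache super block is conventional. The
helpers bch_sb_zone_type_cb() and bch_sb_zone_is_conventional() are
hypothetical names, not part of this series.

/* hypothetical callback: record whether the reported zone is conventional */
static int bch_sb_zone_type_cb(struct blk_zone *zone, unsigned int idx,
			       void *data)
{
	bool *conv = data;

	*conv = (zone->type == BLK_ZONE_TYPE_CONVENTIONAL);
	return 0;
}

/*
 * hypothetical helper: is the backing device zone at LBA 0, which holds
 * the bcache super block, a conventional zone?
 */
static bool bch_sb_zone_is_conventional(struct cached_dev *dc)
{
	bool conv = false;

	/* report only the first zone of the backing device */
	if (blkdev_report_zones(dc->bdev, 0, 1, bch_sb_zone_type_cb, &conv) != 1)
		return false;

	return conv;
}

	/*
	 * In bch_cached_dev_zone_init(): only advertise reset-all support
	 * when a REQ_OP_ZONE_RESET_ALL forwarded to the backing device
	 * cannot wipe the super block.
	 */
	if (blk_queue_zone_resetall(b_q) && bch_sb_zone_is_conventional(dc))
		blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, d_q);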


>> +continue_bio_complete:
>>  	continue_at(cl, cached_dev_bio_complete, NULL);
>>  }
>>  
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index d5da7ad5157d..70c31950ec1b 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1390,6 +1390,8 @@ static int bch_cached_dev_zone_init(struct cached_dev *dc)
>>  		 */
>>  		d_q->nr_zones = b_q->nr_zones -
>>  			(dc->sb.data_offset / d_q->limits.chunk_sectors);
>> +		if (blk_queue_zone_resetall(b_q))
>> +			blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, d_q);
> 
> See above comment.
> 

Removed in next version.

>>  	}
>>  
>>  	return 0;
>>

Thank you for the detailed review comments.

Coly Li

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device
  2020-05-25  1:26   ` Damien Le Moal
@ 2020-06-01 16:09     ` Coly Li
  0 siblings, 0 replies; 18+ messages in thread
From: Coly Li @ 2020-06-01 16:09 UTC (permalink / raw)
  To: Damien Le Moal, linux-bcache
  Cc: linux-block, Hannes Reinecke, Johannes Thumshirn

On 2020/5/25 09:26, Damien Le Moal wrote:
> On 2020/05/22 21:19, Coly Li wrote:
>> Currently we don't support writeback mode for zoned device as backing
>> device. So reject it by sysfs interface.
>>
>> This rejection will be removed after the writeback cache mode support
>> for zoned device gets done.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Damien Le Moal <damien.lemoal@wdc.com>
>> Cc: Hannes Reinecke <hare@suse.com>
>> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>>  drivers/md/bcache/sysfs.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
>> index 323276994aab..41bdbc42a17d 100644
>> --- a/drivers/md/bcache/sysfs.c
>> +++ b/drivers/md/bcache/sysfs.c
>> @@ -359,6 +359,11 @@ STORE(__cached_dev)
>>  		if (v < 0)
>>  			return v;
>>  
>> +		if ((unsigned int) v == CACHE_MODE_WRITEBACK) {
>> +			pr_err("writeback mode is not supported for zoned backing device.\n");
>> +			return -ENOTSUPP;
>> +		}
>> +
>>  		if ((unsigned int) v != BDEV_CACHE_MODE(&dc->sb)) {
>>  			SET_BDEV_CACHE_MODE(&dc->sb, v);
>>  			bch_write_bdev_super(dc, NULL);
>>
> 
> Do you have a similar check in bcache user tools at format time ? Or is the
> cache mode specified only when the bcache device is started ?

Yes, I do the cache mode check in bcache-tools: if the user sets
writeback mode, bcache-tools will inform the user and explicitly switch
it to writethrough mode.

> 
> Looks good.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> 

Thank you for the review!

Coly Li

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
  2020-06-01 12:34     ` Coly Li
@ 2020-06-02  8:48       ` Damien Le Moal
  2020-06-02 12:50         ` Coly Li
  0 siblings, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2020-06-02  8:48 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: Johannes Thumshirn, linux-block, hare

On Mon, 2020-06-01 at 20:34 +0800, Coly Li wrote:
> [...]
> > > +
> > > +	/* convert back to LBA of the bcache device*/
> > > +	zone->start -= data_offset;
> > > +	zone->wp -= data_offset;
> > 
> > This has to be done depending on the zone type and zone condition: zone->wp is
> > "invalid" for conventional zones, and sequential zones that are full, read-only
> > or offline. So you need something like this:
> > 
> > 	/* Remap LBA to the bcache device */
> > 	zone->start -= data_offset;
> > 	switch(zone->cond) {
> > 	case BLK_ZONE_COND_NOT_WP:
> > 	case BLK_ZONE_COND_READONLY:
> > 	case BLK_ZONE_COND_FULL:
> > 	case BLK_ZONE_COND_OFFLINE:
> > 		break;
> > 	case BLK_ZONE_COND_EMPTY:
> > 		zone->wp = zone->start;
> > 		break;
> > 	default:
> > 		zone->wp -= data_offset;
> > 		break;
> > 	}
> > 
> > 	return args->orig_cb(zone, idx, args->orig_data);
> > 
> 
> Hmm, do we have a unified spec to handle the wp on different zone
> condition ?

This comes from SCSI ZBC and ATA ZAC specifications, version r05 of the
documents for both (ratified specifications). These 2 specifications
define identical semantic and states for SMR zones. Upcoming NVMe ZNS
has similar and compatible zone management too.

> In zonefs I see zone->wp sets to zone->start for,
> - BLK_ZONE_COND_OFFLINE
> - BLK_ZONE_COND_READONLY

This is because zonefs uses a zone write pointer position for the file
size: file size = zone->wp - zone->start;
For the read-only and offline zone conditions, since the wp given by
the drive is defined as "invalid", so unknown, zonefs changes the wp to
zone start to make the file size 0 and prevent any read IO issuing.

> 
> In sd_zbc.c, I see wp sets to end of the zone for
> - BLK_ZONE_COND_FULL

This is only to be nice to host software users, so that the wp position
relative to the zone start is exactly the zone size, indicating that
the entire zone was written. This allows to have equivalence between
the tests

zone->cond == BLK_ZONE_COND_FULL

and

zone->wp == zone->start + zone->len

to make coding easier. But the fact remain that the value given for the
wp of a full zone by the disk is undefined. Generally, drives return
0xfffff...ffff.

> And in dm.c I see zone->wp is set to zone->start for,
> - BLK_ZONE_COND_EMPTY

That is because like bcache, dm may remap zones to indicate sector
values that make sense for the target device. The values given by the
physical underlying disk cannot be used as is. The code snippet above I
sent does the same for the same reason.

> It seems that except for BLK_ZONE_COND_NOT_WP, for the other conditions the
> write pointer should already be taken care of by the device driver.

The physical drive device driver gives you what the disk indicated,
modulo the change for full zone. Bcache *is* a device driver too,
driver of the target device. The report zones method for that device
needs to take care of the zone remapping if needed. That is not for the
underlying physical device driver lower in the stack to do this as that
layer does not know how the drive is being used.

> So I wrote the switch-case in the following way, inspired by your
> comments,
> 
>         /* Zone 0 should not be reported */
>         if(WARN_ON_ONCE(zone->start < data_offset))
>                 return -EIO;

That depends on the start sector specified when blkdev_report_zones()
is called. Beware to have the start sector being >= zone size. That is,
that report start sector needs remapping too.

>         /* convert back to LBA of the bcache device*/
>         zone->start -= data_offset;
>         if (zone->cond != BLK_ZONE_COND_NOT_WP)
>                 zone->wp -= data_offset;

Since you have the BLK_ZONE_COND_NOT_WP condition in the switch-case,
this is not needed. 

> 
>         switch (zone->cond) {
>         case BLK_ZONE_COND_NOT_WP:
>                 /* No write pointer available */
>                 break;
>         case BLK_ZONE_COND_EMPTY:

zone->wp = zone->start;

needs to be added here so that your wp is remapped.

>         case BLK_ZONE_COND_READONLY:
>         case BLK_ZONE_COND_OFFLINE:
>                 /*
>                  * If underlying device driver does not properly
>                  * set writer pointer, warn and fix it.
>                  */
>                 if (WARN_ON_ONCE(zone->wp != zone->start))
>                         zone->wp = zone->start;

This is not needed. The wp value is undefined for these conditions.
Touching it, looking at it or even thinking of it does not make sense
:) So leave the wp as is for these 2 cases.

>                 break;
>         case BLK_ZONE_COND_FULL:
>                 /*
>                  * If underlying device driver does not properly
>                  * set writer pointer, warn and fix it.
>                  */
>                 if (WARN_ON_ONCE(zone->wp != zone->start + zone->len))
>                         zone->wp = zone->start + zone->len;

Simply unconditionally set the wp to zone->start + zone->len. No WARN
ON needed at all. I actually forgot this case in the code snippet I
sent.

>                 break;
>         default:
>                 break;
>         }
> 
>         return args->orig_cb(zone, idx, args->orig_data);
> 
> The above code converts zone->wp by minus data_offset for
> non-BLK_ZONE_COND_NOT_WP condition. And for other necessary conditions,
> I just check whether the underlying device driver updates write pointer
> properly (as I observed in other drivers), if not then drop a warning
> and fix the write pointer to the expected value.

The wp essentially comes from the drive. Except for a full zone, the
value is passed on as is by the driver. The physical device driver does
not "set" the wp.

> Now the write pointer is only fixed when it was wrong value. If the
> underlying device driver does not maintain the value properly, we figure
> out and fix it.

The wp will be a wrong value only if the drive you are using has a
firmware bug. The "fixing" needed is because bcache device needs
remapping (off-by-one-zone) of the physical device zone
information. That is for bcache to do. And that remapping needs to be
done carefully since some wp values are undefined. Your code above as
is will for instance change the wp value from "undefined"
(0xffffffffffffffff for drives out there, generally, but it could be
anything) to "undefined - data_offset" for any readonly or offline
zone. The result is still "undefined"... Doing that kind of operation
on the wp is not necessary at all and does not serve any useful
purpose.
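
Putting the remapping rules of this discussion together, a minimal
sketch of the callback could look as follows. This is only a sketch of
the behaviour agreed above, reusing struct bch_report_zones_args from
the posted patch; the final patch may differ.

static int cached_dev_report_zones_cb(struct blk_zone *zone,
				      unsigned int idx, void *data)
{
	struct bch_report_zones_args *args = data;
	struct bcache_device *d = args->bcache_device;
	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
	unsigned long data_offset = dc->sb.data_offset;

	/* Zone 0 is reserved for the bcache super block, never reported */
	if (WARN_ON_ONCE(zone->start < data_offset))
		return -EIO;

	/* Remap the zone to the LBA space of the bcache device */
	zone->start -= data_offset;

	switch (zone->cond) {
	case BLK_ZONE_COND_NOT_WP:
	case BLK_ZONE_COND_READONLY:
	case BLK_ZONE_COND_OFFLINE:
		/* wp is undefined for these conditions, leave it alone */
		break;
	case BLK_ZONE_COND_EMPTY:
		zone->wp = zone->start;
		break;
	case BLK_ZONE_COND_FULL:
		zone->wp = zone->start + zone->len;
		break;
	default:
		zone->wp -= data_offset;
		break;
	}

	return args->orig_cb(zone, idx, args->orig_data);
}

With this, upper layers calling report zones on /dev/bcacheN see all
zones after zone 0, remapped so that the first reported zone starts at
LBA 0 of the bcache device.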



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device
  2020-06-01 16:06     ` Coly Li
@ 2020-06-02  8:54       ` Damien Le Moal
  2020-06-02 10:18         ` Coly Li
  0 siblings, 1 reply; 18+ messages in thread
From: Damien Le Moal @ 2020-06-02  8:54 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: Johannes Thumshirn, linux-block, hare

On Tue, 2020-06-02 at 00:06 +0800, Coly Li wrote:
> > > +		 * cache device.
> > > +		 */
> > > +		if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL)
> > > +			nr_zones = s->d->disk->queue->nr_zones;
> > 
> > Not: sending a REQ_OP_ZONE_RESET BIO to a conventional zone will be failed by
> > the disk... This is not allowed by the ZBC/ZAC specs. So invalidation the cache
> > for conventional zones is not really necessary. But as an initial support, I
> > think this is fine. This can be optimized later.
> > 
> Copied, I will think about how to optimize this later. So far in my testing,
> resetting conventional zones may receive an error and time out in the
> underlying drivers, and the bcache code just forwards such errors to the
> upper layer. What I see is that the reset command hangs for quite a long
> time and then fails. I will find a way to make a zone reset command on a
> conventional zone fail immediately.

It is 100% guaranteed that a zone reset issued to a conventional zone
will fail. That is defined in ZBC/ZAC specifications. Resetting a
single conventional zone is an error. We know the command will fail and
the failure is instantaneous from the drive. The scsi layer should not
retry these failed reset zone command, we have special error handling
code preventing retries since we know that the command can only fail
again. So I am not sure why you are seeing hang/long time before the
failure is signaled... This may need investigation.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device
  2020-06-02  8:54       ` Damien Le Moal
@ 2020-06-02 10:18         ` Coly Li
  2020-06-03  0:51           ` Damien Le Moal
  0 siblings, 1 reply; 18+ messages in thread
From: Coly Li @ 2020-06-02 10:18 UTC (permalink / raw)
  To: Damien Le Moal, linux-bcache; +Cc: Johannes Thumshirn, linux-block, hare

On 2020/6/2 16:54, Damien Le Moal wrote:
> On Tue, 2020-06-02 at 00:06 +0800, Coly Li wrote:
>>>> +		 * cache device.
>>>> +		 */
>>>> +		if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL)
>>>> +			nr_zones = s->d->disk->queue->nr_zones;
>>>
>>> Not: sending a REQ_OP_ZONE_RESET BIO to a conventional zone will be failed by
>>> the disk... This is not allowed by the ZBC/ZAC specs. So invalidation the cache
>>> for conventional zones is not really necessary. But as an initial support, I
>>> think this is fine. This can be optimized later.
>>>
>> Copied, I will think about how to optimize this later. So far in my testing,
>> resetting conventional zones may receive an error and time out in the
>> underlying drivers, and the bcache code just forwards such errors to the
>> upper layer. What I see is that the reset command hangs for quite a long
>> time and then fails. I will find a way to make a zone reset command on a
>> conventional zone fail immediately.
> 
> It is 100% guaranteed that a zone reset issued to a conventional zone
> will fail. That is defined in ZBC/ZAC specifications. Resetting a
> single conventional zone is an error. We know the command will fail and
> the failure is instantaneous from the drive. The scsi layer should not
> retry these failed reset zone command, we have special error handling
> code preventing retries since we know that the command can only fail
> again. So I am not sure why you are seeing hang/long time before the
> failure is signaled... This may need investigation.
> 
> 

Copied. Currently I plan to add a first_seq_zone_nr to the bcache on-disk
super block; its value will be set by the user space bcache-tools when
the backing device is formatted for bcache. Then a zone reset bio
targeting a smaller zone number will be rejected immediately by the
bcache code.

This requires an on-disk format change, so I will do it later together
with the other on-disk format changes.

Coly Li

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
  2020-06-02  8:48       ` Damien Le Moal
@ 2020-06-02 12:50         ` Coly Li
  2020-06-03  0:58           ` Damien Le Moal
  0 siblings, 1 reply; 18+ messages in thread
From: Coly Li @ 2020-06-02 12:50 UTC (permalink / raw)
  To: Damien Le Moal, linux-bcache; +Cc: Johannes Thumshirn, linux-block, hare

On 2020/6/2 16:48, Damien Le Moal wrote:
> On Mon, 2020-06-01 at 20:34 +0800, Coly Li wrote:
>> [...]
>>>> +
>>>> +	/* convert back to LBA of the bcache device*/
>>>> +	zone->start -= data_offset;
>>>> +	zone->wp -= data_offset;
>>>
>>> This has to be done depending on the zone type and zone condition: zone->wp is
>>> "invalid" for conventional zones, and sequential zones that are full, read-only
>>> or offline. So you need something like this:
>>>
>>> 	/* Remap LBA to the bcache device */
>>> 	zone->start -= data_offset;
>>> 	switch(zone->cond) {
>>> 	case BLK_ZONE_COND_NOT_WP:
>>> 	case BLK_ZONE_COND_READONLY:
>>> 	case BLK_ZONE_COND_FULL:
>>> 	case BLK_ZONE_COND_OFFLINE:
>>> 		break;
>>> 	case BLK_ZONE_COND_EMPTY:
>>> 		zone->wp = zone->start;
>>> 		break;
>>> 	default:
>>> 		zone->wp -= data_offset;
>>> 		break;
>>> 	}
>>>
>>> 	return args->orig_cb(zone, idx, args->orig_data);
>>>
>>
>> Hmm, do we have a unified spec to handle the wp on different zone
>> condition ?
> 
> This comes from SCSI ZBC and ATA ZAC specifications, version r05 of the
> documents for both (ratified specifications). These 2 specifications
> define identical semantic and states for SMR zones. Upcoming NVMe ZNS
> has similar and compatible zone management too.
> 
>> In zonefs I see zone->wp sets to zone->start for,
>> - BLK_ZONE_COND_OFFLINE
>> - BLK_ZONE_COND_READONLY
> 
> This is because zonefs uses a zone write pointer position for the file
> size: file size = zone->wp - zone->start;
> For the read-only and offline zone conditions, since the wp given by
> the drive is defined as "invalid", so unknown, zonefs changes the wp to
> zone start to make the file size 0 and prevent any read IO issuing.
> 
>>
>> In sd_zbc.c, I see wp sets to end of the zone for
>> - BLK_ZONE_COND_FULL
> 
> This is only to be nice to host software users, so that the wp position
> relative to the zone start is exactly the zone size, indicating that
> the entire zone was written. This allows to have equivalence between
> the tests
> 
> zone->cond == BLK_ZONE_COND_FULL
> 
> and
> 
> zone->wp == zone->start + zone->len
> 
> to make coding easier. But the fact remain that the value given for the
> wp of a full zone by the disk is undefined. Generally, drives return
> 0xfffff...ffff.
> 

Copied. Thanks for the above information.


>> And in dm.c I see zone->wp is set to zone->start for,
>> - BLK_ZONE_COND_EMPTY
> 
> That is because like bcache, dm may remap zones to indicate sector
> values that make sense for the target device. The values given by the
> physical underlying disk cannot be used as is. The code snippet above I
> sent does the same for the same reason.
> 
>> It seems that except for BLK_ZONE_COND_NOT_WP, for the other conditions the
>> write pointer should already be taken care of by the device driver.
> 
> The physical drive device driver gives you what the disk indicated,
> modulo the change for full zone. Bcache *is* a device driver too,
> driver of the target device. The report zones method for that device
> needs to take care of the zone remapping if needed. That is not for the
> underlying physical device driver lower in the stack to do this as that
> layer does not know how the drive is being used.
> 


Copied, thanks for the information :-)

>> So I wrote the switch-case in the following way, inspired by your
>> comments,
>>
>>         /* Zone 0 should not be reported */
>>         if(WARN_ON_ONCE(zone->start < data_offset))
>>                 return -EIO;
> 
> That depends on the start sector specified when blkdev_report_zones()
> is called. Beware to have the start sector being >= zone size. That is,
> that report start sector needs remapping too.
> 

Because zones before data_offset should not be visible to upper layer
code, such zones should not be sent to cached_dev_report_zones_cb().
This is a double check that the caller of cached_dev_report_zones_cb()
does things in the correct way.


>>         /* convert back to LBA of the bcache device*/
>>         zone->start -= data_offset;
>>         if (zone->cond != BLK_ZONE_COND_NOT_WP)
>>                 zone->wp -= data_offset;
> 
> Since you have the BLK_ZONE_COND_NOT_WP condition in the switch-case,
> this is not needed. 
> 
>>
>>         switch (zone->cond) {
>>         case BLK_ZONE_COND_NOT_WP:
>>                 /* No write pointer available */
>>                 break;
>>         case BLK_ZONE_COND_EMPTY:
> 
> zone->wp = zone->start;
> 

Correct me if I am wrong. I assume that when the zone is in the empty
condition, the LBA of zone->wp should be exactly equal to zone->start
before bcache does the conversion. Therefore 'zone->wp - data_offset'
should still be equal to 'zone->start - data_offset', and explicitly
handling it in 'case BLK_ZONE_COND_EMPTY:' should be equivalent to
handling it in the 'default:' part.

Just want to double check that I understand the wp correctly, thanks.


> needs to ba added here so that your wp is remapped.
> 
>>         case BLK_ZONE_COND_READONLY:
>>         case BLK_ZONE_COND_OFFLINE:
>>                 /*
>>                  * If underlying device driver does not properly
>>                  * set writer pointer, warn and fix it.
>>                  */
>>                 if (WARN_ON_ONCE(zone->wp != zone->start))
>>                         zone->wp = zone->start;
> 
> This is not needed. The wp value is undefined for these conditions.
> touching it, looking at it or even thinking of it does not make sense
> :) So leave the wp as is for these 2 cases.

Copied.

> 
>>                 break;
>>         case BLK_ZONE_COND_FULL:
>>                 /*
>>                  * If underlying device driver does not properly
>>                  * set writer pointer, warn and fix it.
>>                  */
>>                 if (WARN_ON_ONCE(zone->wp != zone->start + zone->len))
>>                         zone->wp = zone->start + zone->len;
> 
> Simply unconditionally set the wp to zone->start + zone->len. No WARN
> ON needed at all. I actually forgot this case in the code snippet I
> sent.

Copied.


> 
>>                 break;
>>         default:
>>                 break;
>>         }
>>
>>         return args->orig_cb(zone, idx, args->orig_data);
>>
>> The above code converts zone->wp by minus data_offset for
>> non-BLK_ZONE_COND_NOT_WP condition. And for other necessary conditions,
>> I just check whether the underlying device driver updates write pointer
>> properly (as I observed in other drivers), if not then drop a warning
>> and fix the write pointer to the expected value.
> 
> The wp essentially comes from the drive. Except for a full zone, the
> value is passed on as is by the driver. The physical device driver does
> not "set" the wp.
> 
>> Now the write pointer is only fixed when it was wrong value. If the
>> underlying device driver does not maintain the value properly, we figure
>> out and fix it.
> 
> The wp will be a wrong value only if the drive you are using has a
> firmware bug. The "fixing" needed is because bcache device needs
> remapping (off-by-one-zone) of the physical device zone
> information. That is for bcache to do. And that remapping needs to be
> done carefully since some wp values are undefined. Your code above as
> is will for instance change the wp value from "undefined"
> (0xffffffffffffffff for drives out there, generally, but it could be
> anything) to "undefined - data_offset" for any readonly or offline
> zone. The result is still "undefined"... Doing that kind of operation
> on the wp is not necessary at all and does not serve any useful
> purpose.

I see. Now it is more clear to me. Thanks.

Coly Li


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device
  2020-06-02 10:18         ` Coly Li
@ 2020-06-03  0:51           ` Damien Le Moal
  0 siblings, 0 replies; 18+ messages in thread
From: Damien Le Moal @ 2020-06-03  0:51 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: Johannes Thumshirn, linux-block, hare

On 2020/06/02 19:18, Coly Li wrote:
> On 2020/6/2 16:54, Damien Le Moal wrote:
>> On Tue, 2020-06-02 at 00:06 +0800, Coly Li wrote:
>>>>> +		 * cache device.
>>>>> +		 */
>>>>> +		if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL)
>>>>> +			nr_zones = s->d->disk->queue->nr_zones;
>>>>
>>>> Not: sending a REQ_OP_ZONE_RESET BIO to a conventional zone will be failed by
>>>> the disk... This is not allowed by the ZBC/ZAC specs. So invalidation the cache
>>>> for conventional zones is not really necessary. But as an initial support, I
>>>> think this is fine. This can be optimized later.
>>>>
>>> Copied, I will think about how to optimize this later. So far in my testing,
>>> resetting conventional zones may receive an error and time out in the
>>> underlying drivers, and the bcache code just forwards such errors to the
>>> upper layer. What I see is that the reset command hangs for quite a long
>>> time and then fails. I will find a way to make a zone reset command on a
>>> conventional zone fail immediately.
>>
>> It is 100% guaranteed that a zone reset issued to a conventional zone
>> will fail. That is defined in ZBC/ZAC specifications. Resetting a
>> single conventional zone is an error. We know the command will fail and
>> the failure is instantaneous from the drive. The scsi layer should not
>> retry these failed reset zone command, we have special error handling
>> code preventing retries since we know that the command can only fail
>> again. So I am not sure why you are seeing hang/long time before the
>> failure is signaled... This may need investigation.
>>
>>
> 
> Copied. Currently I plan to add a first_seq_zone_nr to the bcache on-disk
> super block; its value will be set by the user space bcache-tools when
> the backing device is formatted for bcache. Then a zone reset bio
> targeting a smaller zone number will be rejected immediately by the
> bcache code.
> 
> This requires an on-disk format change, so I will do it later together
> with the other on-disk format changes.

I do not think you actually need an on-disk format change for this since that
information can simply live in memory. Also, it is not entirely correct to
assume that all conventional zones are at LBA 0 onward up to the first
sequential zone. It just happens that this is what drives on the market look
like today, but the standard allows conventional zones to be anywhere in the
disk address space. The safe way to handle this is to do something like the
block layer does using a bitmap indicating if a zone is sequential or
conventional. Not that using the bitmap already attached to the device request
queue is possible but not safe since the backend device could be a DM target, so
a BIO device with a request queue that does not have zone bitmaps. So allocating
your own bitmap and doing a report zones to initialize it when the device is
started is safer and will give you something very solid with no on-disk format
change needed.

See fs/f2fs/super.c function init_blkz_info(), that is exactly what is done:
allocation and initialization of a bitmap to identify zone types.
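
For illustration, a minimal sketch of such a bitmap setup along those
lines. The names bch_init_zone_bitmap(), bch_zone_bitmap_cb() and the
dc->conv_zones_bitmap field are hypothetical; the report starts from
sector 0 of the backing device so that the callback index equals the
zone number.

/* hypothetical callback: mark conventional zones in the bitmap */
static int bch_zone_bitmap_cb(struct blk_zone *zone, unsigned int idx,
			      void *data)
{
	unsigned long *bitmap = data;

	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
		set_bit(idx, bitmap);
	return 0;
}

/* hypothetical helper: build a conventional-zone bitmap at device start */
static int bch_init_zone_bitmap(struct cached_dev *dc)
{
	unsigned int nr_zones = dc->bdev->bd_disk->queue->nr_zones;
	unsigned long *bitmap;
	int ret;

	bitmap = bitmap_zalloc(nr_zones, GFP_KERNEL);
	if (!bitmap)
		return -ENOMEM;

	ret = blkdev_report_zones(dc->bdev, 0, nr_zones,
				  bch_zone_bitmap_cb, bitmap);
	if (ret < 0) {
		bitmap_free(bitmap);
		return ret;
	}

	dc->conv_zones_bitmap = bitmap;	/* hypothetical field */
	return 0;
}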

> 
> Coly Li
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device
  2020-06-02 12:50         ` Coly Li
@ 2020-06-03  0:58           ` Damien Le Moal
  0 siblings, 0 replies; 18+ messages in thread
From: Damien Le Moal @ 2020-06-03  0:58 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: Johannes Thumshirn, linux-block, hare

On 2020/06/02 21:50, Coly Li wrote:
>>>         /* convert back to LBA of the bcache device*/
>>>         zone->start -= data_offset;
>>>         if (zone->cond != BLK_ZONE_COND_NOT_WP)
>>>                 zone->wp -= data_offset;
>>
>> Since you have the BLK_ZONE_COND_NOT_WP condition in the switch-case,
>> this is not needed. 
>>
>>>
>>>         switch (zone->cond) {
>>>         case BLK_ZONE_COND_NOT_WP:
>>>                 /* No write pointer available */
>>>                 break;
>>>         case BLK_ZONE_COND_EMPTY:
>>
>> zone->wp = zone->start;
>>
> 
> Correct me if I am wrong. I assume that when the zone is in the empty
> condition, the LBA of zone->wp should be exactly equal to zone->start
> before bcache does the conversion. Therefore 'zone->wp - data_offset'
> should still be equal to 'zone->start - data_offset', and explicitly
> handling it in 'case BLK_ZONE_COND_EMPTY:' should be equivalent to
> handling it in the 'default:' part.

Yes, for an empty zone, the zone wp is always equal to the start sector of the
zone. So yes, for an empty zone, wp - data_offset is correct. But it is not
necessary: since you already need to do zone->start -= data_offset, all you need
to do for an empty zone is zone->wp = zone->start. That is less arithmetic (no
need for the subtraction), so it is faster :)

This code you have:

	if (zone->cond != BLK_ZONE_COND_NOT_WP)
		zone->wp -= data_offset;

will indeed lead to the correct result for an empty zone, but as I explained, it
will also do the subtraction for zone conditions with an undefined value. No
need to do all this. A switch case handling all conditions makes things very clear.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-06-03  0:58 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-22 12:18 [RFC PATCH v4 0/3] bcache: support zoned device as bcache backing device Coly Li
2020-05-22 12:18 ` [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned " Coly Li
2020-05-25  1:10   ` Damien Le Moal
2020-06-01 12:34     ` Coly Li
2020-06-02  8:48       ` Damien Le Moal
2020-06-02 12:50         ` Coly Li
2020-06-03  0:58           ` Damien Le Moal
2020-05-22 12:18 ` [RFC PATCH v4 2/3] bcache: handle zone management bios for bcache device Coly Li
2020-05-25  1:24   ` Damien Le Moal
2020-06-01 16:06     ` Coly Li
2020-06-02  8:54       ` Damien Le Moal
2020-06-02 10:18         ` Coly Li
2020-06-03  0:51           ` Damien Le Moal
2020-05-22 12:18 ` [RFC PATCH v4 3/3] bcache: reject writeback cache mode for zoned backing device Coly Li
2020-05-25  1:26   ` Damien Le Moal
2020-06-01 16:09     ` Coly Li
2020-05-25  5:25 ` [RFC PATCH v4 0/3] bcache: support zoned device as bcache " Damien Le Moal
2020-05-25  8:14   ` Coly Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).