linux-block.vger.kernel.org archive mirror
* [RFC PATCH] bcache: enable zoned device support
From: Coly Li @ 2019-12-05 15:25 UTC
  To: linux-bcache; +Cc: linux-block, Coly Li, Damien Le Moal, Hannes Reinecke

This is a very basic zoned device support. With this patch, the
bcache device is able to:
- Export zoned device attributes via sysfs
- Respond to report zones requests, e.g. via the command 'blkzone report'
But the bcache device is still NOT able to:
- Respond to any zoned device management request or IOCTL command

Here are the tests I have done:
- read /sys/block/bcache0/queue/zoned, content is 'host-managed'
- read /sys/block/bcache0/queue/nr_zones, content is the number of
  zones, including all zone types.
- read /sys/block/bcache0/queue/chunk_sectors, content is the zone
  size in sectors.
- run 'blkzone report /dev/bcache0', all zone information is displayed.
- run 'blkzone reset /dev/bcache0', the operation is rejected with the
  error: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
  Operation not supported"
- sequential writes by dd; I can see some zones' write pointer 'wptr'
  values updated.

All of these are very basic tests; if you have better testing tools
or cases, please give me a hint.

Thanks in advance for your review and comments.

Signed-off-by: Coly Li <colyli@suse.de>
CC: Damien Le Moal <damien.lemoal@wdc.com>
CC: Hannes Reinecke <hare@suse.com>
---
 drivers/md/bcache/bcache.h  | 10 ++++++
 drivers/md/bcache/request.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/bcache/super.c   | 41 +++++++++++++++++++++++--
 3 files changed, 122 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 9198c1b480d9..77c2040c99ee 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
 struct search;
 struct btree;
 struct keybuf;
+struct bch_report_zones_args;
 
 struct keybuf_key {
 	struct rb_node		node;
@@ -277,6 +278,8 @@ struct bcache_device {
 			  struct bio *bio, unsigned int sectors);
 	int (*ioctl)(struct bcache_device *d, fmode_t mode,
 		     unsigned int cmd, unsigned long arg);
+	int (*report_zones)(struct bch_report_zones_args *args,
+			    unsigned int nr_zones);
 };
 
 struct io {
@@ -743,6 +746,13 @@ struct bbio {
 	struct bio		bio;
 };
 
+struct bch_report_zones_args {
+	struct bcache_device *bcache_device;
+	sector_t sector;
+	void *orig_data;
+	report_zones_cb orig_cb;
+};
+
 #define BTREE_PRIO		USHRT_MAX
 #define INITIAL_PRIO		32768U
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 7555e4a93145..d82425300383 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1162,6 +1162,19 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
 		}
 	}
 
+	/*
+	 * Zone management requests may implicitly change the data layout
+	 * and content on the backing device, which introduces unacceptable
+	 * inconsistency between the backing device and the cache device.
+	 * Therefore all zone management related requests are denied here.
+	 */
+	if (op_is_zone_mgmt(bio->bi_opf)) {
+		pr_err_ratelimited("zoned device management request denied.");
+		bio->bi_status = BLK_STS_NOTSUPP;
+		bio_endio(bio);
+		return BLK_QC_T_NONE;
+	}
+
 	generic_start_io_acct(q,
 			      bio_op(bio),
 			      bio_sectors(bio),
@@ -1205,6 +1218,24 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
 	if (dc->io_disable)
 		return -EIO;
 
+	/*
+	 * Zone management ioctl commands may implicitly change the
+	 * data layout and content on the backing device, which
+	 * introduces unacceptable inconsistency between the backing
+	 * device and the cache device. Therefore all zone management
+	 * related ioctl commands are denied here.
+	 */
+	switch (cmd) {
+	case BLKRESETZONE:
+	case BLKOPENZONE:
+	case BLKCLOSEZONE:
+	case BLKFINISHZONE:
+		return -EOPNOTSUPP;
+	default:
+		/* Other commands are passed through */
+		break;
+	}
+
 	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
 }
 
@@ -1233,6 +1264,48 @@ static int cached_dev_congested(void *data, int bits)
 	return ret;
 }
 
+static int cached_dev_report_zones_cb(struct blk_zone *zone,
+				      unsigned int idx,
+				      void *data)
+{
+	struct bch_report_zones_args *args = data;
+	struct bcache_device *d = args->bcache_device;
+	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+	unsigned long data_offset = dc->sb.data_offset;
+
+	if (zone->start >= data_offset) {
+		zone->start -= data_offset;
+		zone->wp -= data_offset;
+	} else {
+		/*
+		 * Normally the first zone is a conventional zone,
+		 * so we don't need to worry about maintaining
+		 * the write pointer here.
+		 * But zone->len is special, because sector 0 on
+		 * the bcache device actually sits at an 8KB
+		 * offset on the backing device.
+		 */
+		zone->len -= data_offset;
+	}
+
+	return args->orig_cb(zone, idx, args->orig_data);
+}
+
+static int cached_dev_report_zones(struct bch_report_zones_args *args,
+				   unsigned int nr_zones)
+{
+	struct bcache_device *d = args->bcache_device;
+	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+	sector_t sector = args->sector + dc->sb.data_offset;
+
+	/* sector is real LBA of backing device */
+	return blkdev_report_zones(dc->bdev,
+				   sector,
+				   nr_zones,
+				   cached_dev_report_zones_cb,
+				   args);
+}
+
 void bch_cached_dev_request_init(struct cached_dev *dc)
 {
 	struct gendisk *g = dc->disk.disk;
@@ -1241,6 +1314,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
 	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
 	dc->disk.cache_miss			= cached_dev_cache_miss;
 	dc->disk.ioctl				= cached_dev_ioctl;
+	dc->disk.report_zones			= cached_dev_report_zones;
 }
 
 /* Flash backed devices */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 77e9869345e7..7222fcafaf50 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -672,10 +672,32 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
 	return d->ioctl(d, mode, cmd, arg);
 }
 
+
+static int report_zones_dev(struct gendisk *disk,
+			    sector_t sector,
+			    unsigned int nr_zones,
+			    report_zones_cb cb,
+			    void *data)
+{
+	struct bcache_device *d = disk->private_data;
+	struct bch_report_zones_args args = {
+		.bcache_device = d,
+		.sector = sector,
+		.orig_data = data,
+		.orig_cb = cb,
+	};
+
+	if (d->report_zones)
+		return d->report_zones(&args, nr_zones);
+
+	return -EOPNOTSUPP;
+}
+
 static const struct block_device_operations bcache_ops = {
 	.open		= open_dev,
 	.release	= release_dev,
 	.ioctl		= ioctl_dev,
+	.report_zones	= report_zones_dev,
 	.owner		= THIS_MODULE,
 };
 
@@ -808,7 +830,9 @@ static void bcache_device_free(struct bcache_device *d)
 	closure_debug_destroy(&d->cl);
 }
 
-static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
+static int bcache_device_init(struct cached_dev *dc,
+			      struct bcache_device *d,
+			      unsigned int block_size,
 			      sector_t sectors)
 {
 	struct request_queue *q;
@@ -882,6 +906,17 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
 
 	blk_queue_write_cache(q, true, true);
 
+	/*
+	 * If this is for backing device registration, and it is a
+	 * zoned device (e.g. a host-managed SMR hard drive), set up
+	 * the zoned device attributes properly for the sysfs interface.
+	 */
+	if (dc && bdev_is_zoned(dc->bdev)) {
+		q->limits.zoned = bdev_zoned_model(dc->bdev);
+		q->nr_zones = blkdev_nr_zones(dc->bdev);
+		q->limits.chunk_sectors = bdev_zone_sectors(dc->bdev);
+	}
+
 	return 0;
 
 err:
@@ -1328,7 +1363,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 		dc->partial_stripes_expensive =
 			q->limits.raid_partial_stripes_expensive;
 
-	ret = bcache_device_init(&dc->disk, block_size,
+	ret = bcache_device_init(dc, &dc->disk, block_size,
 			 dc->bdev->bd_part->nr_sects - dc->sb.data_offset);
 	if (ret)
 		return ret;
@@ -1445,7 +1480,7 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u)
 
 	kobject_init(&d->kobj, &bch_flash_dev_ktype);
 
-	if (bcache_device_init(d, block_bytes(c), u->sectors))
+	if (bcache_device_init(NULL, d, block_bytes(c), u->sectors))
 		goto err;
 
 	bcache_device_attach(d, c, u - c->uuids);
-- 
2.16.4



* Re: [RFC PATCH] bcache: enable zoned device support
From: Eric Wheeler @ 2019-12-06  0:21 UTC
  To: Coly Li; +Cc: linux-bcache, linux-block, Damien Le Moal, Hannes Reinecke

On Thu, 5 Dec 2019, Coly Li wrote:
> This is a very basic zoned device support. With this patch, the
> bcache device is able to:
> - Export zoned device attributes via sysfs
> - Respond to report zones requests, e.g. via the command 'blkzone report'
> But the bcache device is still NOT able to:
> - Respond to any zoned device management request or IOCTL command
> 
> Here are the tests I have done:
> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
> - read /sys/block/bcache0/queue/nr_zones, content is the number of
>   zones, including all zone types.
> - read /sys/block/bcache0/queue/chunk_sectors, content is the zone
>   size in sectors.
> - run 'blkzone report /dev/bcache0', all zone information is displayed.
> - run 'blkzone reset /dev/bcache0', the operation is rejected with the
>   error: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
>   Operation not supported"
> - sequential writes by dd; I can see some zones' write pointer 'wptr'
>   values updated.
> 
> All of these are very basic tests; if you have better testing tools
> or cases, please give me a hint.

Interesting. 

1. should_writeback() could benefit by hinting true when an IO would fall 
   in a zoned region.

2. The writeback thread could write back such that it prefers
   fully- (or mostly-) populated zones when choosing what to write out.
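
For 1., roughly the kind of hint I have in mind, as an untested sketch
(seq_zone_bitmap is made up here; it would be a new per-cached_dev bitmap
of sequential-write-required zones, filled in from a zone report at
registration time):

	static bool bio_in_seq_zone(struct cached_dev *dc, struct bio *bio)
	{
		sector_t sector = bio->bi_iter.bi_sector + dc->sb.data_offset;
		unsigned int zno = sector >> ilog2(bdev_zone_sectors(dc->bdev));

		/* hypothetical bitmap, see above */
		return test_bit(zno, dc->seq_zone_bitmap);
	}

should_writeback() could then return true early whenever this hits.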

--
Eric Wheeler





* Re: [RFC PATCH] bcache: enable zoned device support
From: Damien Le Moal @ 2019-12-06  0:30 UTC
  To: Eric Wheeler, Coly Li; +Cc: linux-bcache, linux-block, Hannes Reinecke

On 2019/12/06 9:22, Eric Wheeler wrote:
> On Thu, 5 Dec 2019, Coly Li wrote:
[ .. ]
> 
> Interesting. 
> 
> 1. should_writeback() could benefit by hinting true when an IO would fall 
>    in a zoned region.
> 
> 2. The writeback thread could write back such that it prefers
>    fully- (or mostly-) populated zones when choosing what to write out.

That definitely would be a good idea since that would certainly benefit
backend-GC (that will be needed).

However, I do not see the point in exposing the /dev/bcacheX block
device itself as a zoned disk. In fact, I think we want exactly the
opposite: expose it as a regular disk so that any FS or application can
run. If the bcache backend disk is zoned, then the writeback handles
sequential writes. This would be in the end a solution similar to
dm-zoned, that is, a zoned disk becomes usable as a regular block
device (random writes anywhere are possible), but likely far more
efficient and faster. That may result in imposing some limitations on
bcache operations though, e.g. it can only be set up with writeback, no
writethrough allowed (not sure though...).
Thoughts?



-- 
Damien Le Moal
Western Digital Research


* Re: [RFC PATCH] bcache: enable zoned device support
From: Coly Li @ 2019-12-06  4:37 UTC
  To: Damien Le Moal, Eric Wheeler; +Cc: linux-bcache, linux-block, Hannes Reinecke

On 2019/12/6 8:30 AM, Damien Le Moal wrote:
> On 2019/12/06 9:22, Eric Wheeler wrote:
>> On Thu, 5 Dec 2019, Coly Li wrote:
[ .. ]
>>
>> Interesting. 
>>
>> 1. should_writeback() could benefit by hinting true when an IO would fall 
>>    in a zoned region.
>>
>> 2. The writeback thread could write back such that it prefers
>>    fully- (or mostly-) populated zones when choosing what to write out.
> 
> That definitely would be a good idea since that would certainly benefit
> backend-GC (that will be needed).
> 
> However, I do not see the point in exposing the /dev/bcacheX block
> device itself as a zoned disk. In fact, I think we want exactly the
> opposite: expose it as a regular disk so that any FS or application can
> run. If the bcache backend disk is zoned, then the writeback handles
> sequential writes. This would be in the end a solution similar to
> dm-zoned, that is, a zoned disk becomes usable as a regular block
> device (random writes anywhere are possible), but likely far more
> efficient and faster. That may result in imposing some limitations on
> bcache operations though, e.g. it can only be set up with writeback, no
> writethrough allowed (not sure though...).
> Thoughts?
> 

I have come to realize this is really an idea in the opposite
direction. Let me try to explain what I understand, please correct me
if I am wrong. The idea you proposed is indeed to make bcache act as
something like an FTL for the backend zoned SMR drive, that is, bcache
may convert all random writes into sequential writes onto the backend
zoned SMR drive. In the meantime, if there is hot data, bcache
continues to act as a caching device to accelerate read requests.

Yes, if I understand your proposal correctly, writeback mode might be
mandatory and a backend GC will be needed. The idea is interesting; it
looks like adding a log-structured storage layer between the current
bcache B+tree indexing and the zoned SMR hard drive.

-- 

Coly Li


* Re: [RFC PATCH] bcache: enable zoned device support
From: Hannes Reinecke @ 2019-12-06  7:09 UTC
  To: Coly Li, Damien Le Moal, Eric Wheeler
  Cc: linux-bcache, linux-block, Hannes Reinecke

On 12/6/19 5:37 AM, Coly Li wrote:
> On 2019/12/6 8:30 AM, Damien Le Moal wrote:
>> On 2019/12/06 9:22, Eric Wheeler wrote:
>>> On Thu, 5 Dec 2019, Coly Li wrote:
[ .. ]
>>>
>>> Interesting. 
>>>
>>> 1. should_writeback() could benefit by hinting true when an IO would fall 
>>>    in a zoned region.
>>>
>>> 2. The writeback thread could write back such that it prefers
>>>    fully- (or mostly-) populated zones when choosing what to write out.
>>
>> That definitely would be a good idea since that would certainly benefit
>> backend-GC (that will be needed).
>>
>> However, I do not see the point in exposing the /dev/bcacheX block
>> device itself as a zoned disk. In fact, I think we want exactly the
>> opposite: expose it as a regular disk so that any FS or application can
>> run. If the bcache backend disk is zoned, then the writeback handles
>> sequential writes. This would be in the end a solution similar to
>> dm-zoned, that is, a zoned disk becomes usable as a regular block
>> device (random writes anywhere are possible), but likely far more
>> efficient and faster. That may result in imposing some limitations on
>> bcache operations though, e.g. it can only be set up with writeback, no
>> writethrough allowed (not sure though...).
>> Thoughts?
>>
> 
> I have come to realize this is really an idea in the opposite
> direction. Let me try to explain what I understand, please correct me
> if I am wrong. The idea you proposed is indeed to make bcache act as
> something like an FTL for the backend zoned SMR drive, that is, bcache
> may convert all random writes into sequential writes onto the backend
> zoned SMR drive. In the meantime, if there is hot data, bcache
> continues to act as a caching device to accelerate read requests.
> 
> Yes, if I understand your proposal correctly, writeback mode might be
> mandatory and a backend GC will be needed. The idea is interesting; it
> looks like adding a log-structured storage layer between the current
> bcache B+tree indexing and the zoned SMR hard drive.
> 
Well, not sure if that's required.

Or, to be correct, we actually have _two_ use-cases:
1) Have a SMR drive as a backing device. This was my primary goal for
handling these devices, as SMR devices are typically not _that_ fast.
(Damien once proudly reported getting the incredible speed of 1 IOPS :-)
So having bcache running on top of those will be a clear win.
But in this scenario the cache device will be a normal device (typically
an SSD), and we shouldn't need much modification here.
In fact, a good testcase would be the btrfs patches which got posted
earlier this week. With them you should be able to create a btrfs
filesystem on the SMR drive, and use an SSD as a cache device.
Getting this scenario to run would indeed be my primary goal, and I
guess your patches should be more or less sufficient for that.
2) Using a SMR drive as a _cache_ device. This seems to be contrary to
the above statement of SMR drive not being fast, but then the NVMe WG is
working on a similar mechanism for flash devices called 'ZNS' (zoned
namespaces). And for those it really would make sense to have bcache
being able to handle zoned devices as a cache device.
But this is to my understanding really in the early stages, with no real
hardware being available. Damien might disagree, though :-)
And the implementation is still in the works on the Linux side, so it's
more of a long-term goal.

But the first use-case is definitely something we should be looking at;
SMR drives are available _and_ with large capacity, so any speedup there
would be greatly appreciated.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: [RFC PATCH] bcache: enable zoned device support
From: Damien Le Moal @ 2019-12-06  7:42 UTC
  To: Hannes Reinecke, Coly Li, Eric Wheeler
  Cc: linux-bcache, linux-block, Hannes Reinecke

On 2019/12/06 16:09, Hannes Reinecke wrote:
> On 12/6/19 5:37 AM, Coly Li wrote:
[ .. ]
> Well, not sure if that's required.
> 
> Or, to be correct, we actually have _two_ use-cases:
> 1) Have a SMR drive as a backing device. This was my primary goal for
> handling these devices, as SMR devices are typically not _that_ fast.
> (Damien once proudly reported getting the incredible speed of 1 IOPS :-)

Yes, it can get to that with dm-zoned if one goes crazy with sustained
random writes :) The physical drive itself does a lot more than 1 iops
in that case though and is as fast as any other HDD. But from the DM
logical drive side, the user can sometimes fall into the 1 iops
territory for really nasty workloads. Tests for well behaved users like
f2fs show that SMR and regular HDDs are on par for performance.

> So having bcache running on top of those will be a clear win.
> But in this scenario the cache device will be a normal device (typically
> an SSD), and we shouldn't need much modification here.

I agree. That should work mostly as is since the user will be zone aware
and already be issuing sequential writes. bcache write-through only
needs to follow the same pattern, not reordering any write, and
write-back only has to replay the same.

> In fact, a good testcase would be the btrfs patches which got posted
> earlier this week. With them you should be able to create a btrfs
> filesystem on the SMR drive, and use an SSD as a cache device.
> Getting this scenario to run would indeed be my primary goal, and I
> guess your patches should be more or less sufficient for that.

+ Will need the zone revalidation and zone type & write lock bitmaps to
prevent reordering from the block IO stack, unless bcache is a BIO
driver? My knowledge of bcache is limited. Would need to look into the
details a little more to be able to comment.

> 2) Using a SMR drive as a _cache_ device. This seems to be contrary to
> the above statement of SMR drive not being fast, but then the NVMe WG is
> working on a similar mechanism for flash devices called 'ZNS' (zoned
> namespaces). And for those it really would make sense to have bcache
> being able to handle zoned devices as a cache device.
> But this is to my understanding really in the early stages, with no real
> hardware being available. Damien might disagree, though :-)

Yes, that would be another potential use case and ZNS indeed could fit
this model, assuming that zone sizes align (multiples) between front and
back devices.

> And the implementation is still in the works on the Linux side, so it's
> more of a long-term goal.
> But the first use-case is definitely something we should be looking at;
> SMR drives are available _and_ with large capacity, so any speedup there
> would be greatly appreciated.

Yes. And what I was talking about in my earlier email is actually a
third use case:
3) SMR drive as backend + regular SSD as frontend and the resulting
bcache device advertising itself as a regular disk, hiding all the zone
& sequential write constraint to the user. Since bcache already has some
form of indirection table for cached blocks, I thought we could hijack
this to implement a sort of FTL that would allow serializing random
writes to the backend with the help of the frontend as a write staging
buffer. Doing so, we get full random write capability with the benefit
of "hot" blocks staying in the cache. But again, not knowing enough
details about bcache, I may be talking too lightly here. Not sure if
that is reasonably easily feasible with the current bcache code.

Cheers.

> 
> Cheers,
> 
> Hannes
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [RFC PATCH] bcache: enable zoned device support
From: Coly Li @ 2019-12-06  8:29 UTC
  To: Damien Le Moal, Hannes Reinecke, Eric Wheeler
  Cc: linux-bcache, linux-block, Hannes Reinecke

On 2019/12/6 3:42 PM, Damien Le Moal wrote:
> On 2019/12/06 16:09, Hannes Reinecke wrote:
>> On 12/6/19 5:37 AM, Coly Li wrote:
[ .. ]
>> Well, not sure if that's required.
>>
>> Or, to be correct, we actually have _two_ use-cases:
>> 1) Have a SMR drive as a backing device. This was my primary goal for
>> handling these devices, as SMR devices are typically not _that_ fast.
>> (Damien once proudly reported getting the incredible speed of 1 IOPS :-)
> 
> Yes, it can get to that with dm-zoned if one goes crazy with sustained
> random writes :) The physical drive itself does a lot more than 1 iops
> in that case though and is as fast as any other HDD. But from the DM
> logical drive side, the user can sometimes fall into the 1 iops
> territory for really nasty workloads. Tests for well behaved users like
> f2fs show that SMR and regular HDDs are on par for performance.
> 
>> So having bcache running on top of those will be a clear win.
>> But in this scenario the cache device will be a normal device (typically
>> an SSD), and we shouldn't need much modification here.
> 
> I agree. That should work mostly as is since the user will be zone aware
> and already be issuing sequential writes. bcache write-through only
> needs to follow the same pattern, not reordering any write, and
> write-back only has to replay the same.
> 
>> In fact, a good testcase would be the btrfs patches which got posted
>> earlier this week. With them you should be able to create a btrfs
>> filesystem on the SMR drive, and use an SSD as a cache device.
>> Getting this scenario to run would indeed be my primary goal, and I
>> guess your patches should be more or less sufficient for that.
> 
> + Will need the zone revalidation and zone type & write lock bitmaps to
> prevent reordering from the block IO stack, unless bcache is a BIO
> driver? My knowledge of bcache is limited. Would need to look into the
> details a little more to be able to comment.

Hi Damien,

Bcache is a bio-based driver; it splits and clones bios, and submits
them via generic_make_request() to the underlying block layer code.

So are zone revalidation and the zone type & write lock bitmaps
unnecessary for bcache?

> 
>> 2) Using a SMR drive as a _cache_ device. This seems to be contrary to
>> the above statement of SMR drive not being fast, but then the NVMe WG is
>> working on a similar mechanism for flash devices called 'ZNS' (zoned
>> namespaces). And for those it really would make sense to have bcache
>> being able to handle zoned devices as a cache device.
>> But this is to my understanding really in the early stages, with no real
>> hardware being available. Damien might disagree, though :-)
> 
> Yes, that would be another potential use case and ZNS indeed could fit
> this model, assuming that zone sizes align (multiples) between front and
> back devices.
> 
>> And the implementation is still in the works on the Linux side, so it's
>> more of a long-term goal.
>> But the first use-case is definitely something we should be looking at;
>> SMR drives are available _and_ with large capacity, so any speedup there
>> would be greatly appreciated.
> 
> Yes. And what I was talking about in my earlier email is actually a
> third use case:
> 3) SMR drive as backend + regular SSD as frontend and the resulting
> bcache device advertising itself as a regular disk, hiding all the zone
> & sequential write constraint to the user. Since bcache already has some
> form of indirection table for cached blocks, I thought we could hijack
> this to implement a sort of FTL that would allow serializing random
> writes to the backend with the help of the frontend as a write staging
> buffer. Doing so, we get full random write capability with the benefit
> of "hot" blocks staying in the cache. But again, not knowing enough
> details about bcache, I may be talking too lightly here. Not sure if
> that is reasonably easily feasible with the current bcache code.

There are three addresses involved in the above proposal:
1) User space LBA address: the LBA of the block device which is
composed from bcache+SMR.
2) Cache device LBA address: where the randomly written cached data
blocks are stored on the SSD.
3) SMR drive LBA address: where the sequentially written data blocks
are stored on the zoned SMR drive.

Therefore we need at least two layers of mapping to connect these 3
addresses together. Currently the single mapping from the bcache B+tree
is not enough.
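
Just to illustrate with made-up names (nothing like this exists in the
current bcache code), the two mapping layers could look like:

	/* layer 1: today's bcache B+tree, user LBA -> cached block on SSD */
	struct user_to_cache_map {
		u64	user_lba;
		u64	cache_lba;
	};

	/* layer 2: a hypothetical FTL-like table, user LBA -> current
	 * sequential location on the zoned SMR drive; entries would
	 * move whenever backend GC rewrites a zone
	 */
	struct user_to_smr_map {
		u64	user_lba;
		u64	smr_lba;
	};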

Maybe stacking the bcache backing device on top of a dm-zoned target is
a solution for proposal 3); let me try whether it works.

-- 

Coly Li


* Re: [RFC PATCH] bcache: enable zoned device support
From: Hannes Reinecke @ 2019-12-06  9:22 UTC
  To: Damien Le Moal, Coly Li, Eric Wheeler
  Cc: linux-bcache, linux-block, Hannes Reinecke

On 12/6/19 8:42 AM, Damien Le Moal wrote:
> On 2019/12/06 16:09, Hannes Reinecke wrote:
[ .. ]
>> So having bcache running on top of those will be a clear win.
>> But in this scenario the cache device will be a normal device (typically
>> an SSD), and we shouldn't need much modification here.
> 
> I agree. That should work mostly as is since the user will be zone aware
> and already be issuing sequential writes. bcache write-through only
> needs to follow the same pattern, not reordering any write, and
> write-back only has to replay the same.
> 
Bcache really should just act as a block-based cache; the only trick
here is to ensure that the internal bcache buckets align with zone
sizes, so that in the optimal case only full zones will be written.
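
E.g. a registration-time sanity check along these lines (illustrative
only, not tested):

	/* buckets can only line up with zones if the bucket size
	 * divides the zone size evenly
	 */
	if (bdev_is_zoned(dc->bdev) &&
	    bdev_zone_sectors(dc->bdev) % dc->sb.bucket_size)
		pr_err("bucket size does not divide the zone size");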

>> In fact, a good testcase would be the btrfs patches which got posted
>> earlier this week. With them you should be able to create a btrfs
>> filesystem on the SMR drive, and use an SSD as a cache device.
>> Getting this scenario to run would indeed be my primary goal, and I
>> guess your patches should be more or less sufficient for that.
> 
> + Will need the zone revalidation and zone type & write lock bitmaps to
> prevent reordering from the block IO stack, unless bcache is a BIO
>> driver? My knowledge of bcache is limited. Would need to look into the
> details a little more to be able to comment.
> 
bcache is a bio-based driver, so it won't do any request reordering itself.
So from that perspective we should be fine.

>> 2) Using a SMR drive as a _cache_ device. This seems to be contrary to
>> the above statement of SMR drive not being fast, but then the NVMe WG is
>> working on a similar mechanism for flash devices called 'ZNS' (zoned
>> namespaces). And for those it really would make sense to have bcache
>> being able to handle zoned devices as a cache device.
>> But this is to my understanding really in the early stages, with no real
>> hardware being available. Damien might disagree, though :-)
> 
> Yes, that would be another potential use case and ZNS indeed could fit
> this model, assuming that zone sizes align (multiples) between front and
> back devices.
> 
Indeed, but I would defer to my friendly drive manufacturer to figure
that out :-)

>> And the implementation is still in the works on the Linux side, so it's
>> more of a long-term goal.
>> But the first use-case is definitely something we should be looking at;
>> SMR drives are available _and_ with large capacity, so any speedup there
>> would be greatly appreciated.
> 
> Yes. And what I was talking about in my earlier email is actually a
> third use case:
> 3) SMR drive as backend + regular SSD as frontend and the resulting
> bcache device advertising itself as a regular disk, hiding all the zone
> & sequential write constraint to the user. Since bcache already has some
> form of indirection table for cached blocks, I thought we could hijack
> this to implement a sort of FTL that would allow serializing random
> writes to the backend with the help of the frontend as a write staging
> buffer. Doing so, we get full random write capability with the benefit
> of "hot" blocks staying in the cache. But again, not knowing enough
> details about bcache, I may be talking too lightly here. Not sure if
> that is reasonably easily feasible with the current bcache code.
> 
That, however, will be tricky, as the underlying drive will _still_ have
to contain a normal filesystem.
While this mode of operation should be trivial for btrfs with the
hmzoned patches, others like ext4 or xfs will be ... interesting.
I wouldn't discount it out of hand, but there's a fair chance that it'll
lead to intense cache-thrashing as we'd need to cover up for random
writes within zones, _and_ would have to read in entire zones.
But sure, worth a shot anyway.

Once we get the btrfs case working, that is.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

