linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] Integrate bio/request ZBC ops with zone cache
@ 2016-08-22  4:31 Shaun Tancheff
  2016-08-22  4:31 ` [PATCH v2 1/4] Enable support for Seagate HostAware drives Shaun Tancheff
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Shaun Tancheff @ 2016-08-22  4:31 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Damien Le Moal,
	Hannes Reinecke, Josh Bingaman, Dan Williams, Sagi Grimberg,
	Mike Christie, Toshi Kani, Ming Lei

Hi,

As per Christoph's request, this series incorporates Hannes' cache of zone
information.

This approach is to have REQ_OP_ZONE_REPORT return data in the same
format regardless of the availability of the zone cache. So whether the
kernel is built with or without BLK_DEV_ZONED [and SCSI_ZBC],
users of blkdev_issue_zone_report() and/or REQ_OP_ZONE_REPORT bios
will have a consistent data format to digest.

Additionally it seems reasonable to allow the REQ_OP_ZONE_* ops to
indicate whether the command *must* be delivered to the device
[updating the zone cache accordingly]. REQ_META is used for this
because REQ_FUA can be dropped, causing sd_done to be skipped.
Rather than special-case the current code I chose an otherwise
non-applicable flag.

This series is based off of Linus's v4.8-rc2 and builds on top of the
previous series of block layer support:
    Add ioctl to issue ZBC/ZAC commands via block layer
    Add bio/request flags to issue ZBC/ZAC commands
as well as the series posted by Hannes
    sd_zbc: Fix handling of ZBC read after write pointer
    sd: Limit messages for ZBC disks capacity change
    sd: Implement support for ZBC devices
    sd: Implement new RESET_WP provisioning mode
    sd: configure ZBC devices
...

Patches for util-linux can be found here:
    git@github.com:stancheff/util-linux.git v2.28.1+biof

    https://github.com/stancheff/util-linux/tree/v2.28.1%2Bbiof

This series is available here:
    https://github.com/stancheff/linux/tree/v4.8-rc2%2Bbiof.v9

    git@github.com:stancheff/linux.git v4.8-rc2+biof.v9

v2:
 - Fully integrated bio <-> zone cache [<-> device]
 - Added discard -> write same for conventional zones.
 - Merged disparate constants into a canonical set.

Shaun Tancheff (4):
  Enable support for Seagate HostAware drives (testing).
  On Discard either do Reset WP or Write Same
  Merge ZBC constants
  Integrate ZBC command requests with zone cache.

 block/blk-lib.c                   |  16 -
 drivers/scsi/sd.c                 | 111 +++--
 drivers/scsi/sd.h                 |  49 ++-
 drivers/scsi/sd_zbc.c             | 904 ++++++++++++++++++++++++++++++++++----
 include/linux/blkdev.h            |  22 +-
 include/scsi/scsi_proto.h         |  17 -
 include/uapi/linux/blkzoned_api.h | 167 ++++---
 7 files changed, 1032 insertions(+), 254 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/4] Enable support for Seagate HostAware drives
  2016-08-22  4:31 [PATCH v2 0/4] Integrate bio/request ZBC ops with zone cache Shaun Tancheff
@ 2016-08-22  4:31 ` Shaun Tancheff
  2016-08-22  4:31 ` [PATCH v2 2/4] On Discard either do Reset WP or Write Same Shaun Tancheff
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Shaun Tancheff @ 2016-08-22  4:31 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Damien Le Moal,
	Hannes Reinecke, Josh Bingaman, Dan Williams, Sagi Grimberg,
	Mike Christie, Toshi Kani, Ming Lei, Shaun Tancheff

Seagate drives report a SAME code of 0 because they have:
  - Zones of different types (CMR zones in the low LBA space).
  - Zones of different sizes (a terminating 'runt' zone in the high
    LBA space).

Support loading the zone topology into the zone cache.
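
As a sketch of why a runt zone forces SAME=0: when the capacity is not an
exact multiple of the typical zone length, the last zone is shorter than
the rest, so the drive cannot claim all zones are the same size. The
helper below is purely illustrative (names and units are assumptions, with
sizes in 512-byte block-layer sectors):

```c
#include <stdint.h>

struct zone_layout {
	uint64_t full_zones;	/* zones of the typical length */
	uint64_t runt_len;	/* 0 when capacity divides evenly */
};

/*
 * Hypothetical helper: split a capacity into full zones plus an
 * optional terminating runt zone.
 */
static struct zone_layout compute_zone_layout(uint64_t capacity,
					      uint64_t zone_len)
{
	struct zone_layout l = {
		.full_zones = capacity / zone_len,
		.runt_len = capacity % zone_len,
	};
	return l;
}
```

A nonzero `runt_len` corresponds to the `this->len != zlen` termination
check added to the zone parsing loop in this patch.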

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
---
 drivers/scsi/sd.c      |  22 +++---
 drivers/scsi/sd.h      |  20 ++++--
 drivers/scsi/sd_zbc.c  | 183 +++++++++++++++++++++++++++++++++++--------------
 include/linux/blkdev.h |  16 +++--
 4 files changed, 170 insertions(+), 71 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 059a57f..7903e21 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -693,8 +693,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 		break;
 
 	case SD_ZBC_RESET_WP:
-		max_blocks = sdkp->unmap_granularity;
 		q->limits.discard_zeroes_data = 1;
+		q->limits.discard_granularity =
+			sd_zbc_discard_granularity(sdkp);
+
+		max_blocks = min_not_zero(sdkp->unmap_granularity,
+					  q->limits.discard_granularity >>
+						ilog2(logical_block_size));
 		break;
 
 	case SD_LBP_ZERO:
@@ -1955,13 +1960,12 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 			good_bytes = blk_rq_bytes(req);
 			scsi_set_resid(SCpnt, 0);
 		} else {
-#ifdef CONFIG_SCSI_ZBC
 			if (op == ZBC_OUT)
 				/* RESET WRITE POINTER failed */
 				sd_zbc_update_zones(sdkp,
 						    blk_rq_pos(req),
-						    512, true);
-#endif
+						    512, SD_ZBC_RESET_WP_ERR);
+
 			good_bytes = 0;
 			scsi_set_resid(SCpnt, blk_rq_bytes(req));
 		}
@@ -2034,7 +2038,6 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 				good_bytes = blk_rq_bytes(req);
 				scsi_set_resid(SCpnt, 0);
 			}
-#ifdef CONFIG_SCSI_ZBC
 			/*
 			 * ZBC: Unaligned write command.
 			 * Write did not start a write pointer position.
@@ -2042,8 +2045,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 			if (sshdr.ascq == 0x04)
 				sd_zbc_update_zones(sdkp,
 						    blk_rq_pos(req),
-						    512, true);
-#endif
+						    512, SD_ZBC_WRITE_ERR);
 		}
 		break;
 	default:
@@ -2270,7 +2272,7 @@ static void sd_read_zones(struct scsi_disk *sdkp, unsigned char *buffer)
 	 * supports equal zone sizes.
 	 */
 	same = buffer[4] & 0xf;
-	if (same == 0 || same > 3) {
+	if (same > 3) {
 		sd_printk(KERN_WARNING, sdkp,
 			  "REPORT ZONES SAME type %d not supported\n", same);
 		return;
@@ -2282,9 +2284,9 @@ static void sd_read_zones(struct scsi_disk *sdkp, unsigned char *buffer)
 	sdkp->unmap_granularity = zone_len;
 	blk_queue_chunk_sectors(sdkp->disk->queue,
 				logical_to_sectors(sdkp->device, zone_len));
-	sd_config_discard(sdkp, SD_ZBC_RESET_WP);
 
-	sd_zbc_setup(sdkp, buffer, SD_BUF_SIZE);
+	sd_zbc_setup(sdkp, zone_len, buffer, SD_BUF_SIZE);
+	sd_config_discard(sdkp, SD_ZBC_RESET_WP);
 }
 
 static void read_capacity_error(struct scsi_disk *sdkp, struct scsi_device *sdp,
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 6ae4505..ef6c132 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -283,19 +283,24 @@ static inline void sd_dif_complete(struct scsi_cmnd *cmd, unsigned int a)
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+
+#define SD_ZBC_INIT		0
+#define SD_ZBC_RESET_WP_ERR	1
+#define SD_ZBC_WRITE_ERR	2
+
 #ifdef CONFIG_SCSI_ZBC
 
 extern int sd_zbc_report_zones(struct scsi_disk *, unsigned char *, int,
 			       sector_t, enum zbc_zone_reporting_options, bool);
-extern int sd_zbc_setup(struct scsi_disk *, char *, int);
+extern int sd_zbc_setup(struct scsi_disk *, u64 zlen, char *buf, int buf_len);
 extern void sd_zbc_remove(struct scsi_disk *);
 extern void sd_zbc_reset_zones(struct scsi_disk *);
 extern int sd_zbc_setup_discard(struct scsi_disk *, struct request *,
 				sector_t, unsigned int);
 extern int sd_zbc_setup_read_write(struct scsi_disk *, struct request *,
 				   sector_t, unsigned int *);
-extern void sd_zbc_update_zones(struct scsi_disk *, sector_t, int, bool);
-extern void sd_zbc_refresh_zone_work(struct work_struct *);
+extern void sd_zbc_update_zones(struct scsi_disk *, sector_t, int, int reason);
+extern unsigned int sd_zbc_discard_granularity(struct scsi_disk *sdkp);
 
 #else /* CONFIG_SCSI_ZBC */
 
@@ -308,7 +313,7 @@ static inline int sd_zbc_report_zones(struct scsi_disk *sdkp,
 	return -EOPNOTSUPP;
 }
 
-static inline int sd_zbc_setup(struct scsi_disk *sdkp,
+static inline int sd_zbc_setup(struct scsi_disk *sdkp, u64 zlen,
 			       unsigned char *buf, int buf_len)
 {
 	return 0;
@@ -328,6 +333,13 @@ static inline int sd_zbc_setup_read_write(struct scsi_disk *sdkp,
 	return BLKPREP_OK;
 }
 
+static inline unsigned int sd_zbc_discard_granularity(struct scsi_disk *sdkp)
+{
+	return sdkp->device->sector_size;
+}
+
+static inline void sd_zbc_update_zones(struct scsi_disk *sdkp, sector_t s,
+				       int buf_sz, int reason) {}
 static inline void sd_zbc_remove(struct scsi_disk *sdkp) {}
 #endif /* CONFIG_SCSI_ZBC */
 
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index f953d16..17414fb 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -36,17 +36,6 @@
 #include "sd.h"
 #include "scsi_priv.h"
 
-enum zbc_zone_cond {
-	ZBC_ZONE_COND_NO_WP,
-	ZBC_ZONE_COND_EMPTY,
-	ZBC_ZONE_COND_IMPLICIT_OPEN,
-	ZBC_ZONE_COND_EXPLICIT_OPEN,
-	ZBC_ZONE_COND_CLOSED,
-	ZBC_ZONE_COND_READONLY = 0xd,
-	ZBC_ZONE_COND_FULL,
-	ZBC_ZONE_COND_OFFLINE,
-};
-
 #define SD_ZBC_BUF_SIZE 524288
 
 #define sd_zbc_debug(sdkp, fmt, args...)				\
@@ -69,10 +58,10 @@ struct zbc_update_work {
 	char		zone_buf[0];
 };
 
+static
 struct blk_zone *zbc_desc_to_zone(struct scsi_disk *sdkp, unsigned char *rec)
 {
 	struct blk_zone *zone;
-	enum zbc_zone_cond zone_cond;
 	sector_t wp = (sector_t)-1;
 
 	zone = kzalloc(sizeof(struct blk_zone), GFP_KERNEL);
@@ -81,37 +70,27 @@ struct blk_zone *zbc_desc_to_zone(struct scsi_disk *sdkp, unsigned char *rec)
 
 	spin_lock_init(&zone->lock);
 	zone->type = rec[0] & 0xf;
-	zone_cond = (rec[1] >> 4) & 0xf;
+	zone->state = (rec[1] >> 4) & 0xf;
 	zone->len = logical_to_sectors(sdkp->device,
 				       get_unaligned_be64(&rec[8]));
 	zone->start = logical_to_sectors(sdkp->device,
 					 get_unaligned_be64(&rec[16]));
 
-	if (blk_zone_is_smr(zone)) {
+	if (blk_zone_is_smr(zone))
 		wp = logical_to_sectors(sdkp->device,
 					get_unaligned_be64(&rec[24]));
-		if (zone_cond == ZBC_ZONE_COND_READONLY) {
-			zone->state = BLK_ZONE_READONLY;
-		} else if (zone_cond == ZBC_ZONE_COND_OFFLINE) {
-			zone->state = BLK_ZONE_OFFLINE;
-		} else {
-			zone->state = BLK_ZONE_OPEN;
-		}
-	} else
-		zone->state = BLK_ZONE_NO_WP;
-
 	zone->wp = wp;
 	/*
 	 * Fixup block zone state
 	 */
-	if (zone_cond == ZBC_ZONE_COND_EMPTY &&
+	if (zone->state == BLK_ZONE_EMPTY &&
 	    zone->wp != zone->start) {
 		sd_zbc_debug(sdkp,
 			     "zone %zu state EMPTY wp %zu: adjust wp\n",
 			     zone->start, zone->wp);
 		zone->wp = zone->start;
 	}
-	if (zone_cond == ZBC_ZONE_COND_FULL &&
+	if (zone->state == BLK_ZONE_FULL &&
 	    zone->wp != zone->start + zone->len) {
 		sd_zbc_debug(sdkp,
 			     "zone %zu state FULL wp %zu: adjust wp\n",
@@ -122,7 +101,8 @@ struct blk_zone *zbc_desc_to_zone(struct scsi_disk *sdkp, unsigned char *rec)
 	return zone;
 }
 
-sector_t zbc_parse_zones(struct scsi_disk *sdkp, unsigned char *buf,
+static
+sector_t zbc_parse_zones(struct scsi_disk *sdkp, u64 zlen, unsigned char *buf,
 			 unsigned int buf_len)
 {
 	struct request_queue *q = sdkp->disk->queue;
@@ -149,6 +129,11 @@ sector_t zbc_parse_zones(struct scsi_disk *sdkp, unsigned char *buf,
 		if (!this)
 			break;
 
+		if (same == 0 && this->len != zlen) {
+			next_sector = this->start + this->len;
+			break;
+		}
+
 		next_sector = this->start + this->len;
 		old = blk_insert_zone(q, this);
 		if (old) {
@@ -171,29 +156,58 @@ sector_t zbc_parse_zones(struct scsi_disk *sdkp, unsigned char *buf,
 	return next_sector;
 }
 
-void sd_zbc_refresh_zone_work(struct work_struct *work)
+static void sd_zbc_refresh_zone_work(struct work_struct *work)
 {
 	struct zbc_update_work *zbc_work =
 		container_of(work, struct zbc_update_work, zone_work);
 	struct scsi_disk *sdkp = zbc_work->sdkp;
 	struct request_queue *q = sdkp->disk->queue;
-	unsigned int zone_buflen;
+	unsigned char *zone_buf = zbc_work->zone_buf;
+	unsigned int zone_buflen = zbc_work->zone_buflen;
 	int ret;
+	u8 same;
+	u64 zlen = 0;
 	sector_t last_sector;
 	sector_t capacity = logical_to_sectors(sdkp->device, sdkp->capacity);
 
-	zone_buflen = zbc_work->zone_buflen;
-	ret = sd_zbc_report_zones(sdkp, zbc_work->zone_buf, zone_buflen,
+	ret = sd_zbc_report_zones(sdkp, zone_buf, zone_buflen,
 				  zbc_work->zone_sector,
 				  ZBC_ZONE_REPORTING_OPTION_ALL, true);
 	if (ret)
 		goto done_free;
 
-	last_sector = zbc_parse_zones(sdkp, zbc_work->zone_buf, zone_buflen);
+	/* This whole path is unlikely, so the extra reports shouldn't
+	 * have a large impact. */
+	same = zone_buf[4] & 0xf;
+	if (same == 0) {
+		unsigned char *desc = &zone_buf[64];
+		unsigned int blen = zone_buflen;
+
+		/* just pull the first zone */
+		if (blen > 512)
+			blen = 512;
+		ret = sd_zbc_report_zones(sdkp, zone_buf, blen, 0,
+					  ZBC_ZONE_REPORTING_OPTION_ALL, true);
+		if (ret)
+			goto done_free;
+
+		/* Read the zone length from the first zone descriptor */
+		zlen = logical_to_sectors(sdkp->device,
+					  get_unaligned_be64(&desc[8]));
+
+		ret = sd_zbc_report_zones(sdkp, zone_buf, zone_buflen,
+					  zbc_work->zone_sector,
+					  ZBC_ZONE_REPORTING_OPTION_ALL, true);
+		if (ret)
+			goto done_free;
+	}
+
+	last_sector = zbc_parse_zones(sdkp, zlen, zone_buf, zone_buflen);
+	capacity = logical_to_sectors(sdkp->device, sdkp->capacity);
 	if (last_sector != -1 && last_sector < capacity) {
 		if (test_bit(SD_ZBC_ZONE_RESET, &sdkp->zone_flags)) {
 			sd_zbc_debug(sdkp,
-				     "zones in reset, cancelling refresh\n");
+				     "zones in reset, canceling refresh\n");
 			ret = -EAGAIN;
 			goto done_free;
 		}
@@ -207,7 +221,7 @@ done_free:
 	kfree(zbc_work);
 	if (test_and_clear_bit(SD_ZBC_ZONE_INIT, &sdkp->zone_flags) && ret) {
 		sd_zbc_debug(sdkp,
-			     "Cancelling zone initialisation\n");
+			     "Canceling zone initialization\n");
 	}
 done_start_queue:
 	if (q->mq_ops)
@@ -226,10 +240,10 @@ done_start_queue:
  * @sdkp: SCSI disk for which the zone information needs to be updated
  * @sector: sector to be updated
  * @bufsize: buffersize to be allocated
- * @update: true if existing zones should be updated
+ * @reason: non-zero if existing zones should be updated
  */
 void sd_zbc_update_zones(struct scsi_disk *sdkp, sector_t sector, int bufsize,
-			 bool update)
+			 int reason)
 {
 	struct request_queue *q = sdkp->disk->queue;
 	struct zbc_update_work *zbc_work;
@@ -240,13 +254,24 @@ void sd_zbc_update_zones(struct scsi_disk *sdkp, sector_t sector, int bufsize,
 
 	if (test_bit(SD_ZBC_ZONE_RESET, &sdkp->zone_flags)) {
 		sd_zbc_debug(sdkp,
-			     "zones in reset, not starting update\n");
+			     "zones in reset, not starting update\n");
 		return;
 	}
 
+	if (reason != SD_ZBC_INIT) {
+		/* if the sector is in a sequential-preferred zone, ignore */
+		struct blk_zone *zone = blk_lookup_zone(q, sector);
+
+		if (reason == SD_ZBC_RESET_WP_ERR)
+			sd_zbc_debug(sdkp, "RESET WP failed %lx\n", sector);
+
+		if (zone && blk_zone_is_seq_pref(zone))
+			return;
+	}
+
 retry:
 	zbc_work = kzalloc(sizeof(struct zbc_update_work) + bufsize,
-			   update ? GFP_NOWAIT : GFP_KERNEL);
+			   reason != SD_ZBC_INIT ? GFP_NOWAIT : GFP_KERNEL);
 	if (!zbc_work) {
 		if (bufsize > 512) {
 			sd_zbc_debug(sdkp,
@@ -256,7 +281,7 @@ retry:
 		}
 		sd_zbc_debug(sdkp,
 			     "failed to allocate %d bytes\n", bufsize);
-		if (!update)
+		if (reason == SD_ZBC_INIT)
 			clear_bit(SD_ZBC_ZONE_INIT, &sdkp->zone_flags);
 		return;
 	}
@@ -269,7 +294,7 @@ retry:
 	/*
 	 * Mark zones under update as BUSY
 	 */
-	if (update) {
+	if (reason != SD_ZBC_INIT) {
 		for (node = rb_first(&q->zones); node; node = rb_next(node)) {
 			unsigned long flags;
 
@@ -333,8 +358,7 @@ int sd_zbc_report_zones(struct scsi_disk *sdkp, unsigned char *buffer,
 	if (!scsi_device_online(sdp))
 		return -ENODEV;
 
-	sd_zbc_debug(sdkp, "REPORT ZONES lba %zu len %d\n",
-		     start_lba, bufflen);
+	sd_zbc_debug(sdkp, "REPORT ZONES lba %zu len %d\n", start_lba, bufflen);
 
 	memset(cmd, 0, 16);
 	cmd[0] = ZBC_IN;
@@ -460,9 +484,37 @@ int sd_zbc_setup_read_write(struct scsi_disk *sdkp, struct request *rq,
 		goto out;
 	}
 
-	if (req_op(rq) == REQ_OP_WRITE || req_op(rq) == REQ_OP_WRITE_SAME) {
-		if (zone->type != BLK_ZONE_TYPE_SEQWRITE_REQ)
-			goto out;
+	if (blk_zone_is_cmr(zone))
+		goto out;
+
+	if (blk_zone_is_seq_pref(zone)) {
+		if (op_is_write(req_op(rq))) {
+			u64 nwp = sector + sectors;
+
+			while (nwp > (zone->start + zone->len)) {
+				struct rb_node *node = rb_next(&zone->node);
+
+				zone->wp = zone->start + zone->len;
+				sector = zone->wp;
+				sectors = nwp - zone->wp;
+				spin_unlock_irqrestore(&zone->lock, flags);
+
+				if (!node)
+					return BLKPREP_OK;
+				zone = rb_entry(node, struct blk_zone, node);
+				if (!zone)
+					return BLKPREP_OK;
+
+				spin_lock_irqsave(&zone->lock, flags);
+				nwp = sector + sectors;
+			}
+			if (nwp > zone->wp)
+				zone->wp = nwp;
+		}
+		goto out;
+	}
+
+	if (op_is_write(req_op(rq))) {
 		if (zone->state == BLK_ZONE_READONLY)
 			goto out;
 		if (blk_zone_is_full(zone)) {
@@ -480,8 +532,7 @@ int sd_zbc_setup_read_write(struct scsi_disk *sdkp, struct request *rq,
 			goto out;
 		}
 		zone->wp += sectors;
-	} else if (zone->type == BLK_ZONE_TYPE_SEQWRITE_REQ &&
-		   zone->wp <= sector + sectors) {
+	} else if (zone->wp <= sector + sectors) {
 		if (zone->wp <= sector) {
 			/* Read beyond WP: clear request buffer */
 			struct req_iterator iter;
@@ -513,14 +564,18 @@ out:
 	return ret;
 }
 
-int sd_zbc_setup(struct scsi_disk *sdkp, char *buf, int buf_len)
+/**
+ * sd_zbc_setup - Load zones of matching zlen size into rb tree.
+ *
+ */
+int sd_zbc_setup(struct scsi_disk *sdkp, u64 zlen, char *buf, int buf_len)
 {
 	sector_t capacity = logical_to_sectors(sdkp->device, sdkp->capacity);
 	sector_t last_sector;
 
 	if (test_and_set_bit(SD_ZBC_ZONE_INIT, &sdkp->zone_flags)) {
 		sdev_printk(KERN_WARNING, sdkp->device,
-			    "zone initialisation already running\n");
+			    "zone initialization already running\n");
 		return 0;
 	}
 
@@ -539,15 +594,20 @@ int sd_zbc_setup(struct scsi_disk *sdkp, char *buf, int buf_len)
 		clear_bit(SD_ZBC_ZONE_RESET, &sdkp->zone_flags);
 	}
 
-	last_sector = zbc_parse_zones(sdkp, buf, buf_len);
+	last_sector = zbc_parse_zones(sdkp, zlen, buf, buf_len);
+	capacity = logical_to_sectors(sdkp->device, sdkp->capacity);
 	if (last_sector != -1 && last_sector < capacity) {
-		sd_zbc_update_zones(sdkp, last_sector, SD_ZBC_BUF_SIZE, false);
+		sd_zbc_update_zones(sdkp, last_sector,
+				    SD_ZBC_BUF_SIZE, SD_ZBC_INIT);
 	} else
 		clear_bit(SD_ZBC_ZONE_INIT, &sdkp->zone_flags);
 
 	return 0;
 }
 
+/**
+ * sd_zbc_remove - Flush and destroy the zone update work queue.
+ */
 void sd_zbc_remove(struct scsi_disk *sdkp)
 {
 	if (sdkp->zone_work_q) {
@@ -557,3 +617,24 @@ void sd_zbc_remove(struct scsi_disk *sdkp)
 		destroy_workqueue(sdkp->zone_work_q);
 	}
 }
+/**
+ * sd_zbc_discard_granularity - Determine discard granularity.
+ * @sdkp: SCSI disk used to calculate discard granularity.
+ *
+ * Discard granularity should match the (maximum non-CMR) zone
+ * size reported on the drive.
+ */
+unsigned int sd_zbc_discard_granularity(struct scsi_disk *sdkp)
+{
+	unsigned int bytes = 1;
+	struct request_queue *q = sdkp->disk->queue;
+	struct rb_node *node = rb_first(&q->zones);
+
+	if (node) {
+		struct blk_zone *zone = rb_entry(node, struct blk_zone, node);
+
+		bytes = zone->len;
+	}
+	bytes <<= ilog2(sdkp->device->sector_size);
+	return bytes;
+}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9084a9e..68198eb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -272,12 +272,16 @@ enum blk_zone_type {
 };
 
 enum blk_zone_state {
-	BLK_ZONE_UNKNOWN,
 	BLK_ZONE_NO_WP,
+	BLK_ZONE_EMPTY,
 	BLK_ZONE_OPEN,
-	BLK_ZONE_READONLY,
+	BLK_ZONE_OPEN_EXPLICIT,
+	BLK_ZONE_CLOSED,
+	BLK_ZONE_UNKNOWN = 5,
+	BLK_ZONE_READONLY = 0xd,
+	BLK_ZONE_FULL,
 	BLK_ZONE_OFFLINE,
-	BLK_ZONE_BUSY,
+	BLK_ZONE_BUSY = 0x20,
 };
 
 struct blk_zone {
@@ -291,9 +295,9 @@ struct blk_zone {
 	void *private_data;
 };
 
-#define blk_zone_is_smr(z) ((z)->type == BLK_ZONE_TYPE_SEQWRITE_REQ ||	\
-			    (z)->type == BLK_ZONE_TYPE_SEQWRITE_PREF)
-
+#define blk_zone_is_seq_req(z) ((z)->type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+#define blk_zone_is_seq_pref(z) ((z)->type == BLK_ZONE_TYPE_SEQWRITE_PREF)
+#define blk_zone_is_smr(z) (blk_zone_is_seq_req(z) || blk_zone_is_seq_pref(z))
 #define blk_zone_is_cmr(z) ((z)->type == BLK_ZONE_TYPE_CONVENTIONAL)
 #define blk_zone_is_full(z) ((z)->wp == (z)->start + (z)->len)
 #define blk_zone_is_empty(z) ((z)->wp == (z)->start)
-- 
2.9.3


* [PATCH v2 2/4] On Discard either do Reset WP or Write Same
  2016-08-22  4:31 [PATCH v2 0/4] Integrate bio/request ZBC ops with zone cache Shaun Tancheff
  2016-08-22  4:31 ` [PATCH v2 1/4] Enable support for Seagate HostAware drives Shaun Tancheff
@ 2016-08-22  4:31 ` Shaun Tancheff
  2016-08-22 23:57   ` Damien Le Moal
  2016-08-22  4:31 ` [PATCH v2 3/4] Merge ZBC constants Shaun Tancheff
  2016-08-22  4:31 ` [PATCH v2 4/4] Integrate ZBC command requests with zone cache Shaun Tancheff
  3 siblings, 1 reply; 9+ messages in thread
From: Shaun Tancheff @ 2016-08-22  4:31 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Damien Le Moal,
	Hannes Reinecke, Josh Bingaman, Dan Williams, Sagi Grimberg,
	Mike Christie, Toshi Kani, Ming Lei, Shaun Tancheff

Based on the zone type, perform either a Reset WP (for sequential
zones) or a Write Same (for conventional zones).

Also detect and handle the runt zone, if there is one.

An additional check rejects discard requests that do not cover all
of the active data in a zone. For example, when the WP indicates
that 2000 blocks in the zone are in use but the discard covers only
1000 blocks, the discard must fail, since a Reset WP would unmap
all 2000 blocks in the zone.
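
The safety check described above can be sketched as follows (a minimal
userspace illustration with assumed names; the patch itself computes
`wp_offset = zone->wp - zone->start` and kills the request when it
exceeds the discard length):

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * A Reset WP unmaps the whole zone, so refuse a discard that covers
 * fewer sectors than are currently in use in the zone.  All values
 * are in block-layer sectors.
 */
static bool reset_wp_is_safe(uint64_t zone_start, uint64_t zone_wp,
			     uint64_t nr_discard_sectors)
{
	uint64_t wp_offset = zone_wp - zone_start;	/* sectors in use */

	return wp_offset <= nr_discard_sectors;
}
```

With 2000 sectors written and only 1000 discarded, the check fails and
the request is rejected rather than silently destroying the extra data.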

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
---
 drivers/scsi/sd.c     |  45 ++++++-----------
 drivers/scsi/sd.h     |   9 ++--
 drivers/scsi/sd_zbc.c | 135 +++++++++++++++++++++++++++++++++++---------------
 3 files changed, 114 insertions(+), 75 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7903e21..d5ef6d8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -729,21 +729,19 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
 	sector_t sector = blk_rq_pos(rq);
 	unsigned int nr_sectors = blk_rq_sectors(rq);
 	unsigned int nr_bytes = blk_rq_bytes(rq);
-	unsigned int len;
-	int ret = 0;
+	int ret;
 	char *buf;
-	struct page *page = NULL;
+	struct page *page;
 
 	sector >>= ilog2(sdp->sector_size) - 9;
 	nr_sectors >>= ilog2(sdp->sector_size) - 9;
 
-	if (sdkp->provisioning_mode != SD_ZBC_RESET_WP) {
-		page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
-		if (!page)
-			return BLKPREP_DEFER;
-	}
+	page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+	if (!page)
+		return BLKPREP_DEFER;
 
 	rq->completion_data = page;
+	rq->timeout = SD_TIMEOUT;
 
 	switch (sdkp->provisioning_mode) {
 	case SD_LBP_UNMAP:
@@ -758,7 +756,7 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
 		put_unaligned_be64(sector, &buf[8]);
 		put_unaligned_be32(nr_sectors, &buf[16]);
 
-		len = 24;
+		cmd->transfersize = 24;
 		break;
 
 	case SD_LBP_WS16:
@@ -768,7 +766,7 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
 		put_unaligned_be64(sector, &cmd->cmnd[2]);
 		put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
 
-		len = sdkp->device->sector_size;
+		cmd->transfersize = sdp->sector_size;
 		break;
 
 	case SD_LBP_WS10:
@@ -777,35 +775,24 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
 		cmd->cmnd[0] = WRITE_SAME;
 		if (sdkp->provisioning_mode == SD_LBP_WS10)
 			cmd->cmnd[1] = 0x8; /* UNMAP */
+		else
+			rq->timeout = SD_WRITE_SAME_TIMEOUT;
 		put_unaligned_be32(sector, &cmd->cmnd[2]);
 		put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
 
-		len = sdkp->device->sector_size;
+		cmd->transfersize = sdp->sector_size;
 		break;
 
 	case SD_ZBC_RESET_WP:
-		/* sd_zbc_setup_discard uses block layer sector units */
-		ret = sd_zbc_setup_discard(sdkp, rq, blk_rq_pos(rq),
-					   blk_rq_sectors(rq));
+		ret = sd_zbc_setup_discard(cmd);
 		if (ret != BLKPREP_OK)
 			goto out;
-		cmd->cmd_len = 16;
-		cmd->cmnd[0] = ZBC_OUT;
-		cmd->cmnd[1] = ZO_RESET_WRITE_POINTER;
-		put_unaligned_be64(sector, &cmd->cmnd[2]);
-		/* Reset Write Pointer doesn't have a payload */
-		len = 0;
-		cmd->sc_data_direction = DMA_NONE;
 		break;
-
 	default:
 		ret = BLKPREP_INVALID;
 		goto out;
 	}
 
-	rq->timeout = SD_TIMEOUT;
-
-	cmd->transfersize = len;
 	cmd->allowed = SD_MAX_RETRIES;
 
 	/*
@@ -816,17 +803,15 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
 	 * discarded on disk. This allows us to report completion on the full
 	 * amount of blocks described by the request.
 	 */
-	if (len) {
-		blk_add_request_payload(rq, page, 0, len);
+	if (cmd->transfersize) {
+		blk_add_request_payload(rq, page, 0, cmd->transfersize);
 		ret = scsi_init_io(cmd);
 	}
 	rq->__data_len = nr_bytes;
 
 out:
-	if (page && ret != BLKPREP_OK) {
-		rq->completion_data = NULL;
+	if (ret != BLKPREP_OK)
 		__free_page(page);
-	}
 	return ret;
 }
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index ef6c132..2792c10 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -295,8 +295,7 @@ extern int sd_zbc_report_zones(struct scsi_disk *, unsigned char *, int,
 extern int sd_zbc_setup(struct scsi_disk *, u64 zlen, char *buf, int buf_len);
 extern void sd_zbc_remove(struct scsi_disk *);
 extern void sd_zbc_reset_zones(struct scsi_disk *);
-extern int sd_zbc_setup_discard(struct scsi_disk *, struct request *,
-				sector_t, unsigned int);
+extern int sd_zbc_setup_discard(struct scsi_cmnd *cmd);
 extern int sd_zbc_setup_read_write(struct scsi_disk *, struct request *,
 				   sector_t, unsigned int *);
 extern void sd_zbc_update_zones(struct scsi_disk *, sector_t, int, int reason);
@@ -319,11 +318,9 @@ static inline int sd_zbc_setup(struct scsi_disk *sdkp, u64 zlen,
 	return 0;
 }
 
-static inline int sd_zbc_setup_discard(struct scsi_disk *sdkp,
-				       struct request *rq, sector_t sector,
-				       unsigned int num_sectors)
+static inline int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
 {
-	return BLKPREP_OK;
+	return BLKPREP_KILL;
 }
 
 static inline int sd_zbc_setup_read_write(struct scsi_disk *sdkp,
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 17414fb..0780118 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -382,23 +382,45 @@ int sd_zbc_report_zones(struct scsi_disk *sdkp, unsigned char *buffer,
 	return 0;
 }
 
-int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
-			 sector_t sector, unsigned int num_sectors)
+int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
 {
-	struct blk_zone *zone;
+	struct request *rq = cmd->request;
+	struct scsi_device *sdp = cmd->device;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	sector_t sector = blk_rq_pos(rq);
+	unsigned int nr_sectors = blk_rq_sectors(rq);
 	int ret = BLKPREP_OK;
+	struct blk_zone *zone;
 	unsigned long flags;
+	u32 wp_offset;
+	bool use_write_same = false;
 
 	zone = blk_lookup_zone(rq->q, sector);
-	if (!zone)
+	if (!zone) {
+		/* Test for a runt zone before giving up */
+		if (sdp->type != TYPE_ZBC) {
+			struct request_queue *q = rq->q;
+			struct rb_node *node;
+
+			node = rb_last(&q->zones);
+			if (node)
+				zone = rb_entry(node, struct blk_zone, node);
+			if (zone) {
+				spin_lock_irqsave(&zone->lock, flags);
+				if ((zone->start + zone->len) <= sector)
+					goto out;
+				spin_unlock_irqrestore(&zone->lock, flags);
+				zone = NULL;
+			}
+		}
 		return BLKPREP_KILL;
+	}
 
 	spin_lock_irqsave(&zone->lock, flags);
-
 	if (zone->state == BLK_ZONE_UNKNOWN ||
 	    zone->state == BLK_ZONE_BUSY) {
 		sd_zbc_debug_ratelimit(sdkp,
-				       "Discarding zone %zu state %x, deferring\n",
+				       "Discarding zone %zx state %x, deferring\n",
 				       zone->start, zone->state);
 		ret = BLKPREP_DEFER;
 		goto out;
@@ -406,46 +428,80 @@ int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
 	if (zone->state == BLK_ZONE_OFFLINE) {
 		/* let the drive fail the command */
 		sd_zbc_debug_ratelimit(sdkp,
-				       "Discarding offline zone %zu\n",
+				       "Discarding offline zone %zx\n",
 				       zone->start);
 		goto out;
 	}
-
-	if (!blk_zone_is_smr(zone)) {
+	if (blk_zone_is_cmr(zone)) {
+		use_write_same = true;
 		sd_zbc_debug_ratelimit(sdkp,
-				       "Discarding %s zone %zu\n",
-				       blk_zone_is_cmr(zone) ? "CMR" : "unknown",
+				       "Discarding CMR zone %zx\n",
 				       zone->start);
-		ret = BLKPREP_DONE;
 		goto out;
 	}
-	if (blk_zone_is_empty(zone)) {
-		sd_zbc_debug_ratelimit(sdkp,
-				       "Discarding empty zone %zu\n",
-				       zone->start);
-		ret = BLKPREP_DONE;
+	if (zone->start != sector || zone->len < nr_sectors) {
+		sd_printk(KERN_ERR, sdkp,
+			  "Misaligned RESET WP %zx/%x on zone %zx/%zx\n",
+			  sector, nr_sectors, zone->start, zone->len);
+		ret = BLKPREP_KILL;
 		goto out;
 	}
-
-	if (zone->start != sector ||
-	    zone->len < num_sectors) {
+	/* Protect against Reset WP when more data had been written to the
+	 * zone than is being discarded.
+	 */
+	wp_offset = zone->wp - zone->start;
+	if (wp_offset > nr_sectors) {
 		sd_printk(KERN_ERR, sdkp,
-			  "Misaligned RESET WP, start %zu/%zu "
-			  "len %zu/%u\n",
-			  zone->start, sector, zone->len, num_sectors);
+			  "Will Corrupt RESET WP %zx/%x/%x on zone %zx/%zx/%zx\n",
+			  sector, wp_offset, nr_sectors,
+			  zone->start, zone->wp, zone->len);
 		ret = BLKPREP_KILL;
 		goto out;
 	}
-
-	/*
-	 * Opportunistic setting, will be fixed up with
-	 * zone update if RESET WRITE POINTER fails.
-	 */
-	zone->wp = zone->start;
+	if (blk_zone_is_empty(zone)) {
+		sd_zbc_debug_ratelimit(sdkp,
+				       "Discarding empty zone %zx [WP: %zx]\n",
+				       zone->start, zone->wp);
+		ret = BLKPREP_DONE;
+		goto out;
+	}
 
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 
+	if (ret != BLKPREP_OK)
+		goto done;
+	/*
+	 * blk_zone cache uses block layer sector units
+	 * but commands use device units
+	 */
+	sector >>= ilog2(sdp->sector_size) - 9;
+	nr_sectors >>= ilog2(sdp->sector_size) - 9;
+
+	if (use_write_same) {
+		cmd->cmd_len = 16;
+		cmd->cmnd[0] = WRITE_SAME_16;
+		cmd->cmnd[1] = 0; /* UNMAP (not set) */
+		put_unaligned_be64(sector, &cmd->cmnd[2]);
+		put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
+		cmd->transfersize = sdp->sector_size;
+		rq->timeout = SD_WRITE_SAME_TIMEOUT;
+	} else {
+		cmd->cmd_len = 16;
+		cmd->cmnd[0] = ZBC_OUT;
+		cmd->cmnd[1] = ZO_RESET_WRITE_POINTER;
+		put_unaligned_be64(sector, &cmd->cmnd[2]);
+		/* Reset Write Pointer doesn't have a payload */
+		cmd->transfersize = 0;
+		cmd->sc_data_direction = DMA_NONE;
+		/*
+		 * Opportunistic setting, will be fixed up with
+		 * zone update if RESET WRITE POINTER fails.
+		 */
+		zone->wp = zone->start;
+	}
+
+done:
 	return ret;
 }
 
@@ -468,6 +524,9 @@ int sd_zbc_setup_read_write(struct scsi_disk *sdkp, struct request *rq,
 
 	spin_lock_irqsave(&zone->lock, flags);
 
+	if (blk_zone_is_cmr(zone))
+		goto out;
+
 	if (zone->state == BLK_ZONE_UNKNOWN ||
 	    zone->state == BLK_ZONE_BUSY) {
 		sd_zbc_debug_ratelimit(sdkp,
@@ -476,16 +535,6 @@ int sd_zbc_setup_read_write(struct scsi_disk *sdkp, struct request *rq,
 		ret = BLKPREP_DEFER;
 		goto out;
 	}
-	if (zone->state == BLK_ZONE_OFFLINE) {
-		/* let the drive fail the command */
-		sd_zbc_debug_ratelimit(sdkp,
-				       "zone %zu offline\n",
-				       zone->start);
-		goto out;
-	}
-
-	if (blk_zone_is_cmr(zone))
-		goto out;
 
 	if (blk_zone_is_seq_pref(zone)) {
 		if (op_is_write(req_op(rq))) {
@@ -514,6 +563,14 @@ int sd_zbc_setup_read_write(struct scsi_disk *sdkp, struct request *rq,
 		goto out;
 	}
 
+	if (zone->state == BLK_ZONE_OFFLINE) {
+		/* let the drive fail the command */
+		sd_zbc_debug_ratelimit(sdkp,
+				       "zone %zu offline\n",
+				       zone->start);
+		goto out;
+	}
+
 	if (op_is_write(req_op(rq))) {
 		if (zone->state == BLK_ZONE_READONLY)
 			goto out;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 3/4] Merge ZBC constants
  2016-08-22  4:31 [PATCH v2 0/4] Integrate bio/request ZBC ops with zone cache Shaun Tancheff
  2016-08-22  4:31 ` [PATCH v2 1/4] Enable support for Seagate HostAware drives Shaun Tancheff
  2016-08-22  4:31 ` [PATCH v2 2/4] On Discard either do Reset WP or Write Same Shaun Tancheff
@ 2016-08-22  4:31 ` Shaun Tancheff
  2016-08-22  4:31 ` [PATCH v2 4/4] Integrate ZBC command requests with zone cache Shaun Tancheff
  3 siblings, 0 replies; 9+ messages in thread
From: Shaun Tancheff @ 2016-08-22  4:31 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Damien Le Moal,
	Hannes Reinecke, Josh Bingaman, Dan Williams, Sagi Grimberg,
	Mike Christie, Toshi Kani, Ming Lei, Shaun Tancheff

Dedupe ZBC/ZAC constants used for reporting options, same code,
zone condition and zone type.

These are all useful to programs consuming zone information from
user space as well, so include them in a uapi header.

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
---
 block/blk-lib.c                   |   4 +-
 drivers/scsi/sd.c                 |   2 +-
 include/linux/blkdev.h            |  20 -----
 include/scsi/scsi_proto.h         |  17 ----
 include/uapi/linux/blkzoned_api.h | 167 +++++++++++++++++++++++---------------
 5 files changed, 103 insertions(+), 107 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e92bd56..67b9258 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -316,8 +316,8 @@ int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags,
 		__be64 blksz = cpu_to_be64(bdev->bd_part->nr_sects);
 
 		conv->maximum_lba = blksz;
-		conv->descriptors[0].type = ZTYP_CONVENTIONAL;
-		conv->descriptors[0].flags = ZCOND_CONVENTIONAL << 4;
+		conv->descriptors[0].type = BLK_ZONE_TYPE_CONVENTIONAL;
+		conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4;
 		conv->descriptors[0].length = blksz;
 		conv->descriptors[0].lba_start = 0;
 		conv->descriptors[0].lba_wptr = blksz;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d5ef6d8..b76ffbb 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1201,7 +1201,7 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd)
 		src = kmap_atomic(bio->bi_io_vec->bv_page);
 		conv = src + bio->bi_io_vec->bv_offset;
 		conv->descriptor_count = cpu_to_be32(1);
-		conv->same_field = ZS_ALL_SAME;
+		conv->same_field = BLK_ZONE_SAME_ALL;
 		conv->maximum_lba = cpu_to_be64(disk->part0.nr_sects);
 		kunmap_atomic(src);
 		goto out;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 68198eb..d5cdb5d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -263,26 +263,6 @@ struct blk_queue_tag {
 #define BLK_SCSI_CMD_PER_LONG	(BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))
 
 #ifdef CONFIG_BLK_DEV_ZONED
-enum blk_zone_type {
-	BLK_ZONE_TYPE_UNKNOWN,
-	BLK_ZONE_TYPE_CONVENTIONAL,
-	BLK_ZONE_TYPE_SEQWRITE_REQ,
-	BLK_ZONE_TYPE_SEQWRITE_PREF,
-	BLK_ZONE_TYPE_RESERVED,
-};
-
-enum blk_zone_state {
-	BLK_ZONE_NO_WP,
-	BLK_ZONE_EMPTY,
-	BLK_ZONE_OPEN,
-	BLK_ZONE_OPEN_EXPLICIT,
-	BLK_ZONE_CLOSED,
-	BLK_ZONE_UNKNOWN = 5,
-	BLK_ZONE_READONLY = 0xd,
-	BLK_ZONE_FULL,
-	BLK_ZONE_OFFLINE,
-	BLK_ZONE_BUSY = 0x20,
-};
 
 struct blk_zone {
 	struct rb_node node;
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 6ba66e0..d1defd1 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -299,21 +299,4 @@ struct scsi_lun {
 #define SCSI_ACCESS_STATE_MASK        0x0f
 #define SCSI_ACCESS_STATE_PREFERRED   0x80
 
-/* Reporting options for REPORT ZONES */
-enum zbc_zone_reporting_options {
-	ZBC_ZONE_REPORTING_OPTION_ALL = 0,
-	ZBC_ZONE_REPORTING_OPTION_EMPTY,
-	ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN,
-	ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN,
-	ZBC_ZONE_REPORTING_OPTION_CLOSED,
-	ZBC_ZONE_REPORTING_OPTION_FULL,
-	ZBC_ZONE_REPORTING_OPTION_READONLY,
-	ZBC_ZONE_REPORTING_OPTION_OFFLINE,
-	ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10,
-	ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE,
-	ZBC_ZONE_REPORTING_OPTION_NON_WP = 0x3f,
-};
-
-#define ZBC_REPORT_ZONE_PARTIAL 0x80
-
 #endif /* _SCSI_PROTO_H_ */
diff --git a/include/uapi/linux/blkzoned_api.h b/include/uapi/linux/blkzoned_api.h
index cd81a9f..fa12976 100644
--- a/include/uapi/linux/blkzoned_api.h
+++ b/include/uapi/linux/blkzoned_api.h
@@ -16,97 +16,123 @@
 
 #include <linux/types.h>
 
+#define ZBC_REPORT_OPTION_MASK  0x3f
+#define ZBC_REPORT_ZONE_PARTIAL 0x80
+
 /**
  * enum zone_report_option - Report Zones types to be included.
  *
- * @ZOPT_NON_SEQ_AND_RESET: Default (all zones).
- * @ZOPT_ZC1_EMPTY: Zones which are empty.
- * @ZOPT_ZC2_OPEN_IMPLICIT: Zones open but not explicitly opened
- * @ZOPT_ZC3_OPEN_EXPLICIT: Zones opened explicitly
- * @ZOPT_ZC4_CLOSED: Zones closed for writing.
- * @ZOPT_ZC5_FULL: Zones that are full.
- * @ZOPT_ZC6_READ_ONLY: Zones that are read-only
- * @ZOPT_ZC7_OFFLINE: Zones that are offline
- * @ZOPT_RESET: Zones that are empty
- * @ZOPT_NON_SEQ: Zones that have HA media-cache writes pending
- * @ZOPT_NON_WP_ZONES: Zones that do not have Write Pointers (conventional)
- * @ZOPT_PARTIAL_FLAG: Modifies the definition of the Zone List Length field.
+ * @ZBC_ZONE_REPORTING_OPTION_ALL: Default (all zones).
+ * @ZBC_ZONE_REPORTING_OPTION_EMPTY: Zones which are empty.
+ * @ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN:
+ *	Zones open but not explicitly opened
+ * @ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN: Zones opened explicitly
+ * @ZBC_ZONE_REPORTING_OPTION_CLOSED: Zones closed for writing.
+ * @ZBC_ZONE_REPORTING_OPTION_FULL: Zones that are full.
+ * @ZBC_ZONE_REPORTING_OPTION_READONLY: Zones that are read-only
+ * @ZBC_ZONE_REPORTING_OPTION_OFFLINE: Zones that are offline
+ * @ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP: Zones with Reset WP Recommended
+ * @ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE: Zones with Non-Sequential
+ *	Write Resources Active
+ * @ZBC_ZONE_REPORTING_OPTION_NON_WP: Zones that do not have Write Pointers
+ *	(conventional)
+ * @ZBC_ZONE_REPORTING_OPTION_RESERVED: Undefined
+ * @ZBC_ZONE_REPORTING_OPTION_PARTIAL: Modifies the definition of the Zone List
+ *	Length field.
  *
  * Used by Report Zones in bdev_zone_get_report: report_option
  */
-enum bdev_zone_report_option {
-	ZOPT_NON_SEQ_AND_RESET   = 0x00,
-	ZOPT_ZC1_EMPTY,
-	ZOPT_ZC2_OPEN_IMPLICIT,
-	ZOPT_ZC3_OPEN_EXPLICIT,
-	ZOPT_ZC4_CLOSED,
-	ZOPT_ZC5_FULL,
-	ZOPT_ZC6_READ_ONLY,
-	ZOPT_ZC7_OFFLINE,
-	ZOPT_RESET               = 0x10,
-	ZOPT_NON_SEQ             = 0x11,
-	ZOPT_NON_WP_ZONES        = 0x3f,
-	ZOPT_PARTIAL_FLAG        = 0x80,
+enum zbc_zone_reporting_options {
+	ZBC_ZONE_REPORTING_OPTION_ALL = 0,
+	ZBC_ZONE_REPORTING_OPTION_EMPTY,
+	ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN,
+	ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN,
+	ZBC_ZONE_REPORTING_OPTION_CLOSED,
+	ZBC_ZONE_REPORTING_OPTION_FULL,
+	ZBC_ZONE_REPORTING_OPTION_READONLY,
+	ZBC_ZONE_REPORTING_OPTION_OFFLINE,
+	ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10,
+	ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE,
+	ZBC_ZONE_REPORTING_OPTION_NON_WP = 0x3f,
+	ZBC_ZONE_REPORTING_OPTION_RESERVED = 0x40,
+	ZBC_ZONE_REPORTING_OPTION_PARTIAL = ZBC_REPORT_ZONE_PARTIAL
 };
 
 /**
- * enum bdev_zone_type - Type of zone in descriptor
+ * enum blk_zone_type - Types of zones allowed in a zoned device.
  *
- * @ZTYP_RESERVED: Reserved
- * @ZTYP_CONVENTIONAL: Conventional random write zone (No Write Pointer)
- * @ZTYP_SEQ_WRITE_REQUIRED: Non-sequential writes are rejected.
- * @ZTYP_SEQ_WRITE_PREFERRED: Non-sequential writes allowed but discouraged.
+ * @BLK_ZONE_TYPE_RESERVED: Reserved.
+ * @BLK_ZONE_TYPE_CONVENTIONAL: Zone has no WP. Zone commands are not available.
+ * @BLK_ZONE_TYPE_SEQWRITE_REQ: Zone must be written sequentially
+ * @BLK_ZONE_TYPE_SEQWRITE_PREF: Zone may be written non-sequentially
  *
- * Returned from Report Zones. See bdev_zone_descriptor* type.
+ * TBD: Move to blkzoned_api - we don't need pointless duplication
+ * and user space needs to handle the same information in the
+ * same format -- so let's make it easy
  */
-enum bdev_zone_type {
-	ZTYP_RESERVED            = 0,
-	ZTYP_CONVENTIONAL        = 1,
-	ZTYP_SEQ_WRITE_REQUIRED  = 2,
-	ZTYP_SEQ_WRITE_PREFERRED = 3,
+enum blk_zone_type {
+	BLK_ZONE_TYPE_RESERVED,
+	BLK_ZONE_TYPE_CONVENTIONAL,
+	BLK_ZONE_TYPE_SEQWRITE_REQ,
+	BLK_ZONE_TYPE_SEQWRITE_PREF,
+	BLK_ZONE_TYPE_UNKNOWN,
 };
 
 /**
- * enum bdev_zone_condition - Condition of zone in descriptor
- *
- * @ZCOND_CONVENTIONAL: N/A
- * @ZCOND_ZC1_EMPTY: Empty
- * @ZCOND_ZC2_OPEN_IMPLICIT: Opened via write to zone.
- * @ZCOND_ZC3_OPEN_EXPLICIT: Opened via open zone command.
- * @ZCOND_ZC4_CLOSED: Closed
- * @ZCOND_ZC6_READ_ONLY:
- * @ZCOND_ZC5_FULL: No remaining space in zone.
- * @ZCOND_ZC7_OFFLINE: Offline
- *
- * Returned from Report Zones. See bdev_zone_descriptor* flags.
+ * enum blk_zone_state - State [condition] of a zone in a zoned device.
+ *
+ * @BLK_ZONE_NO_WP: Zone has no write pointer; it is CMR/Conventional.
+ * @BLK_ZONE_EMPTY: Zone is empty. Write pointer is at the start of the zone.
+ * @BLK_ZONE_OPEN: Zone is open, but not explicitly opened by a zone open cmd.
+ * @BLK_ZONE_OPEN_EXPLICIT: Zone was explicitly opened by a zone open cmd.
+ * @BLK_ZONE_CLOSED: Zone was [explicitly] closed for writing.
+ * @BLK_ZONE_UNKNOWN: Zone states 0x5 through 0xc are reserved by standard.
+ * @BLK_ZONE_FULL: Zone was [explicitly] marked full by a zone finish cmd.
+ * @BLK_ZONE_READONLY: Zone is read-only.
+ * @BLK_ZONE_OFFLINE: Zone is offline.
+ * @BLK_ZONE_BUSY: [INTERNAL] Kernel zone cache for this zone is being updated.
+ *
+ * The Zone Condition state machine also maps the above definitions as:
+ *   - ZC1: Empty         | BLK_ZONE_EMPTY
+ *   - ZC2: Implicit Open | BLK_ZONE_OPEN
+ *   - ZC3: Explicit Open | BLK_ZONE_OPEN_EXPLICIT
+ *   - ZC4: Closed        | BLK_ZONE_CLOSED
+ *   - ZC5: Full          | BLK_ZONE_FULL
+ *   - ZC6: Read Only     | BLK_ZONE_READONLY
+ *   - ZC7: Offline       | BLK_ZONE_OFFLINE
+ *
+ * States 0x5 to 0xC are reserved by the current ZBC/ZAC spec.
  */
-enum bdev_zone_condition {
-	ZCOND_CONVENTIONAL       = 0,
-	ZCOND_ZC1_EMPTY          = 1,
-	ZCOND_ZC2_OPEN_IMPLICIT  = 2,
-	ZCOND_ZC3_OPEN_EXPLICIT  = 3,
-	ZCOND_ZC4_CLOSED         = 4,
-	/* 0x5 to 0xC are reserved */
-	ZCOND_ZC6_READ_ONLY      = 0xd,
-	ZCOND_ZC5_FULL           = 0xe,
-	ZCOND_ZC7_OFFLINE        = 0xf,
+enum blk_zone_state {
+	BLK_ZONE_NO_WP,
+	BLK_ZONE_EMPTY,
+	BLK_ZONE_OPEN,
+	BLK_ZONE_OPEN_EXPLICIT,
+	BLK_ZONE_CLOSED,
+	BLK_ZONE_UNKNOWN = 0x5,
+	BLK_ZONE_READONLY = 0xd,
+	BLK_ZONE_FULL = 0xe,
+	BLK_ZONE_OFFLINE = 0xf,
+	BLK_ZONE_BUSY = 0x10,
 };
 
 /**
  * enum bdev_zone_same - Report Zones same code.
  *
- * @ZS_ALL_DIFFERENT: All zones differ in type and size.
- * @ZS_ALL_SAME: All zones are the same size and type.
- * @ZS_LAST_DIFFERS: All zones are the same size and type except the last zone.
- * @ZS_SAME_LEN_DIFF_TYPES: All zones are the same length but types differ.
+ * @BLK_ZONE_SAME_ALL_DIFFERENT: All zones differ in type and size.
+ * @BLK_ZONE_SAME_ALL: All zones are the same size and type.
+ * @BLK_ZONE_SAME_LAST_DIFFERS: All zones are the same size and type
+ *    except the last zone which differs by size.
+ * @BLK_ZONE_SAME_LEN_TYPES_DIFFER: All zones are the same length
+ *    but zone types differ.
  *
  * Returned from Report Zones. See bdev_zone_report* same_field.
  */
-enum bdev_zone_same {
-	ZS_ALL_DIFFERENT        = 0,
-	ZS_ALL_SAME             = 1,
-	ZS_LAST_DIFFERS         = 2,
-	ZS_SAME_LEN_DIFF_TYPES  = 3,
+enum blk_zone_same {
+	BLK_ZONE_SAME_ALL_DIFFERENT     = 0,
+	BLK_ZONE_SAME_ALL               = 1,
+	BLK_ZONE_SAME_LAST_DIFFERS      = 2,
+	BLK_ZONE_SAME_LEN_TYPES_DIFFER  = 3,
 };
 
 /**
@@ -198,6 +224,13 @@ struct bdev_zone_report_io {
 	} data;
 } __packed;
 
+
+static inline u32 max_report_entries(size_t bytes)
+{
+	bytes -= sizeof(struct bdev_zone_report);
+	return bytes / sizeof(struct bdev_zone_descriptor);
+}
+
 /* continuing from uapi/linux/fs.h: */
 #define BLKREPORT	_IOWR(0x12, 130, struct bdev_zone_report_io)
 #define BLKZONEACTION	_IOW(0x12, 131, struct bdev_zone_action)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 4/4] Integrate ZBC command requests with zone cache.
  2016-08-22  4:31 [PATCH v2 0/4] Integrate bio/request ZBC ops with zone cache Shaun Tancheff
                   ` (2 preceding siblings ...)
  2016-08-22  4:31 ` [PATCH v2 3/4] Merge ZBC constants Shaun Tancheff
@ 2016-08-22  4:31 ` Shaun Tancheff
  3 siblings, 0 replies; 9+ messages in thread
From: Shaun Tancheff @ 2016-08-22  4:31 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Damien Le Moal,
	Hannes Reinecke, Josh Bingaman, Dan Williams, Sagi Grimberg,
	Mike Christie, Toshi Kani, Ming Lei, Shaun Tancheff

Block layer (bio/request) commands can use or update the
sd_zbc zone cache as appropriate for each command.

Report Zones [REQ_OP_ZONE_REPORT] by default uses the current
zone cache data to generate a device (ZBC spec) formatted response.
REQ_META can also be specified to force the command to the device
and the result will be used to refresh the zone cache.

Reset WP [REQ_OP_ZONE_RESET] by default will attempt to translate
the request into a discard following the SD_ZBC_RESET_WP provisioning
mode. REQ_META can also be specified to force the command to be sent
to the device.

Open, Close and Finish zone actions, having no discard analog, are sent
directly to the device.

On successful completion each zone action will update the zone cache
as appropriate.

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
---
 block/blk-lib.c       |  16 --
 drivers/scsi/sd.c     |  42 +++-
 drivers/scsi/sd.h     |  22 +-
 drivers/scsi/sd_zbc.c | 672 +++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 698 insertions(+), 54 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 67b9258..8cc5893 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -307,22 +307,6 @@ int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags,
 	bio_set_op_attrs(bio, REQ_OP_ZONE_REPORT, op_flags);
 	ret = submit_bio_wait(bio);
 
-	/*
-	 * When our request it nak'd the underlying device maybe conventional
-	 * so ... report a single conventional zone the size of the device.
-	 */
-	if (ret == -EIO && conv->descriptor_count) {
-		/* Adjust the conventional to the size of the partition ... */
-		__be64 blksz = cpu_to_be64(bdev->bd_part->nr_sects);
-
-		conv->maximum_lba = blksz;
-		conv->descriptors[0].type = BLK_ZONE_TYPE_CONVENTIONAL;
-		conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4;
-		conv->descriptors[0].length = blksz;
-		conv->descriptors[0].lba_start = 0;
-		conv->descriptors[0].lba_wptr = blksz;
-		ret = 0;
-	}
 	bio_put(bio);
 	return ret;
 }
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b76ffbb..9a649fa 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1181,9 +1181,10 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd)
 	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	struct bio *bio = rq->bio;
 	sector_t sector = blk_rq_pos(rq);
-	struct gendisk *disk = rq->rq_disk;
 	unsigned int nr_bytes = blk_rq_bytes(rq);
 	int ret = BLKPREP_KILL;
+	bool is_fua = (rq->cmd_flags & REQ_META) ? true : false;
+	u8 rpt_opt = ZBC_ZONE_REPORTING_OPTION_ALL;
 
 	WARN_ON(nr_bytes == 0);
 
@@ -1194,18 +1195,35 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd)
 	if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC) {
 		void *src;
 		struct bdev_zone_report *conv;
+		__be64 blksz = cpu_to_be64(sdkp->capacity);
 
-		if (nr_bytes < sizeof(struct bdev_zone_report))
+		if (nr_bytes < 512)
 			goto out;
 
 		src = kmap_atomic(bio->bi_io_vec->bv_page);
 		conv = src + bio->bi_io_vec->bv_offset;
 		conv->descriptor_count = cpu_to_be32(1);
 		conv->same_field = BLK_ZONE_SAME_ALL;
-		conv->maximum_lba = cpu_to_be64(disk->part0.nr_sects);
+		conv->maximum_lba = blksz;
+		conv->descriptors[0].type = BLK_ZONE_TYPE_CONVENTIONAL;
+		conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4;
+		conv->descriptors[0].length = blksz;
+		conv->descriptors[0].lba_start = 0;
+		conv->descriptors[0].lba_wptr = blksz;
 		kunmap_atomic(src);
+		ret = BLKPREP_DONE;
 		goto out;
 	}
+	/* FUTURE ... when streamid is available */
+	/* rpt_opt = bio_get_streamid(bio); */
+
+	if (!is_fua) {
+		ret = sd_zbc_setup_zone_report_cmnd(cmd, rpt_opt);
+		if (ret == BLKPREP_DONE || ret == BLKPREP_DEFER)
+			goto out;
+		if (ret == BLKPREP_KILL)
+			pr_err("No Zone Cache, query media.\n");
+	}
 
 	ret = scsi_init_io(cmd);
 	if (ret != BLKPREP_OK)
@@ -1224,8 +1242,7 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd)
 	cmd->cmnd[1] = ZI_REPORT_ZONES;
 	put_unaligned_be64(sector, &cmd->cmnd[2]);
 	put_unaligned_be32(nr_bytes, &cmd->cmnd[10]);
-	/* FUTURE ... when streamid is available */
-	/* cmd->cmnd[14] = bio_get_streamid(bio); */
+	cmd->cmnd[14] = rpt_opt;
 
 	cmd->sc_data_direction = DMA_FROM_DEVICE;
 	cmd->sdb.length = nr_bytes;
@@ -1243,11 +1260,22 @@ static int sd_setup_zone_action_cmnd(struct scsi_cmnd *cmd)
 	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t sector = blk_rq_pos(rq);
 	int ret = BLKPREP_KILL;
+	bool is_fua = (rq->cmd_flags & REQ_META) ? true : false;
 	u8 allbit = 0;
 
 	if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC)
 		goto out;
 
+	rq->timeout = SD_TIMEOUT;
+	rq->completion_data = NULL;
+
+	/* Allow the ZBC zone cache an opportunity to hijack the request */
+	if (!is_fua) {
+		ret = sd_zbc_setup_zone_action(cmd);
+		if (ret == BLKPREP_OK || ret == BLKPREP_DEFER)
+			goto out;
+	}
+
 	if (sector == ~0ul) {
 		allbit = 1;
 		sector = 0;
@@ -1316,6 +1344,8 @@ static void sd_uninit_command(struct scsi_cmnd *SCpnt)
 {
 	struct request *rq = SCpnt->request;
 
+	sd_zbc_uninit_command(SCpnt);
+
 	if (req_op(rq) == REQ_OP_DISCARD &&
 	    rq->completion_data)
 		__free_page(rq->completion_data);
@@ -2044,6 +2074,8 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 	if (rq_data_dir(SCpnt->request) == READ && scsi_prot_sg_count(SCpnt))
 		sd_dif_complete(SCpnt, good_bytes);
 
+	sd_zbc_done(SCpnt, good_bytes);
+
 	return good_bytes;
 }
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 2792c10..adbf3e0 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -293,11 +293,15 @@ static inline void sd_dif_complete(struct scsi_cmnd *cmd, unsigned int a)
 extern int sd_zbc_report_zones(struct scsi_disk *, unsigned char *, int,
 			       sector_t, enum zbc_zone_reporting_options, bool);
 extern int sd_zbc_setup(struct scsi_disk *, u64 zlen, char *buf, int buf_len);
-extern void sd_zbc_remove(struct scsi_disk *);
-extern void sd_zbc_reset_zones(struct scsi_disk *);
+extern int sd_zbc_setup_zone_report_cmnd(struct scsi_cmnd *cmd, u8 rpt_opt);
+extern int sd_zbc_setup_zone_action(struct scsi_cmnd *cmd);
 extern int sd_zbc_setup_discard(struct scsi_cmnd *cmd);
 extern int sd_zbc_setup_read_write(struct scsi_disk *, struct request *,
 				   sector_t, unsigned int *);
+extern void sd_zbc_done(struct scsi_cmnd *cmd, int good_bytes);
+extern void sd_zbc_uninit_command(struct scsi_cmnd *cmd);
+extern void sd_zbc_remove(struct scsi_disk *);
+extern void sd_zbc_reset_zones(struct scsi_disk *);
 extern void sd_zbc_update_zones(struct scsi_disk *, sector_t, int, int reason);
 extern unsigned int sd_zbc_discard_granularity(struct scsi_disk *sdkp);
 
@@ -318,7 +322,19 @@ static inline int sd_zbc_setup(struct scsi_disk *sdkp, u64 zlen,
 	return 0;
 }
 
-static inline int int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
+static inline void sd_zbc_done(struct scsi_cmnd *cmd, int good_bytes) {}
+
+static inline int sd_zbc_setup_zone_report_cmnd(struct scsi_cmnd *cmd, u8 rpt_opt)
+{
+	return BLKPREP_KILL;
+}
+
+static inline int sd_zbc_setup_zone_action(struct scsi_cmnd *cmd)
+{
+	return BLKPREP_KILL;
+}
+
+static inline int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
 {
 	return BLKPREP_KILL;
 }
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 0780118..0259bda 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -58,6 +58,43 @@ struct zbc_update_work {
 	char		zone_buf[0];
 };
 
+/**
+ * get_len_from_desc() - Decode zone length as # of 512-byte sectors
+ * @bzde: Zone descriptor entry.
+ *
+ * Return: Zone length as a number of 512-byte sectors.
+ */
+static inline sector_t get_len_from_desc(struct scsi_disk *sdkp,
+					 struct bdev_zone_descriptor *bzde)
+{
+	return logical_to_sectors(sdkp->device, be64_to_cpu(bzde->length));
+}
+
+/**
+ * get_wp_from_desc() - Decode write pointer as # of 512-byte sectors from start
+ * @bzde: Zone descriptor entry.
+ *
+ * Return: Write Pointer as a number of 512-byte sectors from the zone start.
+ */
+static inline sector_t get_wp_from_desc(struct scsi_disk *sdkp,
+					struct bdev_zone_descriptor *bzde)
+{
+	return logical_to_sectors(sdkp->device,
+		be64_to_cpu(bzde->lba_wptr) - be64_to_cpu(bzde->lba_start));
+}
+
+/**
+ * get_start_from_desc() - Decode zone start LBA as a 512-byte sector offset
+ * @bzde: Zone descriptor entry.
+ *
+ * Return: Zone start as a number of 512-byte sectors from the device start.
+ */
+static inline sector_t get_start_from_desc(struct scsi_disk *sdkp,
+					   struct bdev_zone_descriptor *bzde)
+{
+	return logical_to_sectors(sdkp->device, be64_to_cpu(bzde->lba_start));
+}
+
 static
 struct blk_zone *zbc_desc_to_zone(struct scsi_disk *sdkp, unsigned char *rec)
 {
@@ -289,7 +326,7 @@ retry:
 	zbc_work->zone_buflen = bufsize;
 	zbc_work->sdkp = sdkp;
 	INIT_WORK(&zbc_work->zone_work, sd_zbc_refresh_zone_work);
-	num_rec = (bufsize / 64) - 1;
+	num_rec = max_report_entries(bufsize);
 
 	/*
 	 * Mark zones under update as BUSY
@@ -382,6 +419,59 @@ int sd_zbc_report_zones(struct scsi_disk *sdkp, unsigned char *buffer,
 	return 0;
 }
 
+/**
+ * discard_or_write_same - Wrapper to setup Write Same or Reset WP for ZBC dev
+ * @cmd: SCSI command / request to setup
+ * @sector: Block layer sector (512 byte sector) to map to device.
+ * @nr_sectors: Number of 512 byte sectors.
+ * @use_write_same: When true setup WRITE_SAME_16 w/o UNMAP set.
+ *                  When false setup RESET WP for zone starting at sector.
+ */
+static void discard_or_write_same(struct scsi_cmnd *cmd, sector_t sector,
+				  unsigned int nr_sectors, bool use_write_same)
+{
+	struct scsi_device *sdp = cmd->device;
+
+	/*
+	 * blk_zone cache uses block layer sector units
+	 * but commands use device units
+	 */
+	sector >>= ilog2(sdp->sector_size) - 9;
+	nr_sectors >>= ilog2(sdp->sector_size) - 9;
+
+	if (use_write_same) {
+		cmd->cmd_len = 16;
+		cmd->cmnd[0] = WRITE_SAME_16;
+		cmd->cmnd[1] = 0; /* UNMAP (not set) */
+		put_unaligned_be64(sector, &cmd->cmnd[2]);
+		put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
+		cmd->transfersize = sdp->sector_size;
+		cmd->request->timeout = SD_WRITE_SAME_TIMEOUT;
+	} else {
+		cmd->cmd_len = 16;
+		cmd->cmnd[0] = ZBC_OUT;
+		cmd->cmnd[1] = ZO_RESET_WRITE_POINTER;
+		put_unaligned_be64(sector, &cmd->cmnd[2]);
+		/* Reset Write Pointer doesn't have a payload */
+		cmd->transfersize = 0;
+		cmd->sc_data_direction = DMA_NONE;
+	}
+}
+
+/**
+ * sd_zbc_setup_discard() - ZBC device hook for sd_setup_discard
+ * @cmd: SCSI command/request being setup
+ *
+ * Handle SD_ZBC_RESET_WP provisioning mode.
+ * If zone is sequential and discard matches zone size issue RESET WP
+ * If zone is conventional issue WRITE_SAME_16 w/o UNMAP.
+ *
+ * Return:
+ *  BLKPREP_OK    - Discard command was set up here (continue processing)
+ *  BLKPREP_DONE  - Discard was completed here (no command to issue)
+ *  BLKPREP_DEFER - Zone action should be handled here but memory
+ *                  allocation failed. Retry.
+ */
 int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
 {
 	struct request *rq = cmd->request;
@@ -467,44 +557,430 @@ int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
 	}
 
 out:
-	spin_unlock_irqrestore(&zone->lock, flags);
-
-	if (ret != BLKPREP_OK)
-		goto done;
 	/*
-	 * blk_zone cache uses block layer sector units
-	 * but commands use device units
+	 * Opportunistic setting, will be fixed up with
+	 * zone update if RESET WRITE POINTER fails.
 	 */
-	sector >>= ilog2(sdp->sector_size) - 9;
-	nr_sectors >>= ilog2(sdp->sector_size) - 9;
+	if (ret == BLKPREP_OK && !use_write_same)
+		zone->wp = zone->start;
+	spin_unlock_irqrestore(&zone->lock, flags);
 
-	if (use_write_same) {
-		cmd->cmd_len = 16;
-		cmd->cmnd[0] = WRITE_SAME_16;
-		cmd->cmnd[1] = 0; /* UNMAP (not set) */
-		put_unaligned_be64(sector, &cmd->cmnd[2]);
-		put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
-		cmd->transfersize = sdp->sector_size;
-		rq->timeout = SD_WRITE_SAME_TIMEOUT;
+	if (ret == BLKPREP_OK)
+		discard_or_write_same(cmd, sector, nr_sectors, use_write_same);
+
+	return ret;
+}
+
+
+static void __set_zone_state(struct blk_zone *zone, int op)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	if (blk_zone_is_cmr(zone))
+		goto out_unlock;
+
+	switch (op) {
+	case REQ_OP_ZONE_OPEN:
+		zone->state = BLK_ZONE_OPEN_EXPLICIT;
+		break;
+	case REQ_OP_ZONE_FINISH:
+		zone->state = BLK_ZONE_FULL;
+		zone->wp = zone->start + zone->len;
+		break;
+	case REQ_OP_ZONE_CLOSE:
+		zone->state = BLK_ZONE_CLOSED;
+		break;
+	case REQ_OP_ZONE_RESET:
+		zone->wp = zone->start;
+		break;
+	default:
+		WARN_ONCE(1, "%s: invalid op code: %u\n", __func__, op);
+	}
+out_unlock:
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+static void update_zone_state(struct request *rq, sector_t lba, unsigned int op)
+{
+	struct request_queue *q = rq->q;
+	struct blk_zone *zone = NULL;
+
+	if (lba == ~0ul) {
+		struct rb_node *node;
+
+		for (node = rb_first(&q->zones); node; node = rb_next(node)) {
+			zone = rb_entry(node, struct blk_zone, node);
+			__set_zone_state(zone, op);
+		}
+		return;
 	} else {
-		cmd->cmd_len = 16;
-		cmd->cmnd[0] = ZBC_OUT;
-		cmd->cmnd[1] = ZO_RESET_WRITE_POINTER;
-		put_unaligned_be64(sector, &cmd->cmnd[2]);
-		/* Reset Write Pointer doesn't have a payload */
-		cmd->transfersize = 0;
-		cmd->sc_data_direction = DMA_NONE;
+		zone = blk_lookup_zone(q, lba);
+		if (zone)
+			__set_zone_state(zone, op);
+	}
+}
+
+/**
+ * sd_zbc_setup_zone_action() - ZBC device hook for sd_setup_zone_action
+ * @cmd: SCSI command/request being setup
+ *
+ * Currently for RESET WP (REQ_OP_ZONE_RESET) if the META flag is cleared
+ * the command may be translated to follow SD_ZBC_RESET_WP provisioning mode.
+ *
+ * Return:
+ *  BLKPREP_OK    - Zone action handled here (Skip further processing)
+ *  BLKPREP_DONE  - Zone action not handled here (Caller must process)
+ *  BLKPREP_DEFER - Zone action should be handled here but memory
+ *                  allocation failed. Retry.
+ */
+int sd_zbc_setup_zone_action(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	sector_t sector = blk_rq_pos(rq);
+	struct blk_zone *zone;
+	unsigned long flags;
+	unsigned int nr_sectors;
+	int ret = BLKPREP_DONE;
+	int op = req_op(rq);
+	bool is_fua = (rq->cmd_flags & REQ_META) ? true : false;
+	bool use_write_same = false;
+
+	if (is_fua || op != REQ_OP_ZONE_RESET)
+		goto out;
+
+	zone = blk_lookup_zone(rq->q, sector);
+	if (!zone || sdkp->provisioning_mode != SD_ZBC_RESET_WP)
+		goto out;
+
+	/* Map a Reset WP w/o FUA to a discard request */
+	spin_lock_irqsave(&zone->lock, flags);
+	sector = zone->start;
+	nr_sectors = zone->len;
+	if (blk_zone_is_cmr(zone))
+		use_write_same = true;
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	rq->completion_data = NULL;
+	if (use_write_same) {
+		struct page *page;
+
+		page = alloc_page(GFP_ATOMIC | GFP_DMA | __GFP_ZERO);
+		if (!page)
+			return BLKPREP_DEFER;
+		rq->completion_data = page;
+		rq->timeout = SD_TIMEOUT;
+		cmd->sc_data_direction = DMA_TO_DEVICE;
+	}
+	rq->__sector = sector;
+	rq->__data_len = nr_sectors << 9;
+	ret = sd_zbc_setup_discard(cmd);
+	if (ret != BLKPREP_OK)
+		goto out;
+
+	cmd->allowed = SD_MAX_RETRIES;
+	if (cmd->transfersize) {
+		blk_add_request_payload(rq, rq->completion_data,
+					0, cmd->transfersize);
+		ret = scsi_init_io(cmd);
+	}
+	rq->__data_len = nr_sectors << 9;
+	if (ret != BLKPREP_OK && rq->completion_data) {
+		__free_page(rq->completion_data);
+		rq->completion_data = NULL;
+	}
+out:
+	return ret;
+}
+
+
+/**
+ * bzrpt_fill() - Fill a REPORT ZONES request in pages
+ * @rq: Request where zone cache lives
+ * @bzrpt: Zone report header.
+ * @bzd: Zone report descriptors.
+ * @sz: allocated size of bzrpt or bzd.
+ * @lba: LBA to start filling in.
+ * @opt: Report option.
+ *
+ * Returns: next_lba
+ */
+static sector_t bzrpt_fill(struct request *rq,
+			   struct bdev_zone_report *bzrpt,
+			   struct bdev_zone_descriptor *bzd,
+			   size_t sz, sector_t lba, u8 opt)
+{
+	struct request_queue *q = rq->q;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	struct blk_zone *zone = NULL;
+	struct rb_node *node = NULL;
+	sector_t progress = lba;
+	sector_t clen = ~0ul;
+	unsigned long flags;
+	u32 max_entries = bzrpt ? max_report_entries(sz) : sz / sizeof(*bzd);
+	u32 entry = 0;
+	int len_diffs = 0;
+	int type_diffs = 0;
+	u8 ctype;
+	u8 same = 0;
+
+	zone = blk_lookup_zone(q, lba);
+	if (zone)
+		node = &zone->node;
+
+	for (entry = 0; entry < max_entries && node; node = rb_next(node)) {
+		u64 z_len, z_start, z_wp_abs;
+		u8 cond = 0;
+		u8 flgs = 0;
+
+		spin_lock_irqsave(&zone->lock, flags);
+		z_len = zone->len;
+		z_start = zone->start;
+		z_wp_abs = zone->wp;
+		progress = z_start + z_len;
+		cond = zone->state;
+		if (blk_zone_is_cmr(zone))
+			flgs |= 0x02;
+		else if (zone->wp != zone->start)
+			flgs |= 0x01; /* flag as RWP recommended? */
+		spin_unlock_irqrestore(&zone->lock, flags);
+
+		switch (opt & ZBC_REPORT_OPTION_MASK) {
+		case ZBC_ZONE_REPORTING_OPTION_EMPTY:
+			if (z_wp_abs != z_start)
+				continue;
+			break;
+		case ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN:
+			if (cond != BLK_ZONE_OPEN)
+				continue;
+			break;
+		case ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN:
+			if (cond != BLK_ZONE_OPEN_EXPLICIT)
+				continue;
+			break;
+		case ZBC_ZONE_REPORTING_OPTION_CLOSED:
+			if (cond != BLK_ZONE_CLOSED)
+				continue;
+			break;
+		case ZBC_ZONE_REPORTING_OPTION_FULL:
+			if (cond != BLK_ZONE_FULL)
+				continue;
+			break;
+		case ZBC_ZONE_REPORTING_OPTION_READONLY:
+			if (cond != BLK_ZONE_READONLY)
+				continue;
+			break;
+		case ZBC_ZONE_REPORTING_OPTION_OFFLINE:
+			if (cond != BLK_ZONE_OFFLINE)
+				continue;
+			break;
+		case ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP:
+			if (z_wp_abs == z_start)
+				continue;
+			break;
+		case ZBC_ZONE_REPORTING_OPTION_NON_WP:
+			if (cond != BLK_ZONE_NO_WP)
+				continue;
+			break;
+		case ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE:
+			/* this can only be reported by the HW */
+			break;
+		case ZBC_ZONE_REPORTING_OPTION_ALL:
+		default:
+			break;
+		}
+
+		/* if same code only applies to returned zones */
+		if (opt & ZBC_REPORT_ZONE_PARTIAL) {
+			if (clen == ~0ul) {
+				clen = z_len;
+				ctype = zone->type;
+			}
+			if (z_len != clen)
+				len_diffs++;
+			if (zone->type != ctype)
+				type_diffs++;
+			ctype = zone->type;
+		}
+
+		/* shift to device units */
+		z_start >>= ilog2(sdkp->device->sector_size) - 9;
+		z_len >>= ilog2(sdkp->device->sector_size) - 9;
+		z_wp_abs >>= ilog2(sdkp->device->sector_size) - 9;
+
+		if (!bzd) {
+			if (bzrpt)
+				bzrpt->descriptor_count =
+					cpu_to_be32(++entry);
+			continue;
+		}
+
+		bzd[entry].lba_start = cpu_to_be64(z_start);
+		bzd[entry].length = cpu_to_be64(z_len);
+		bzd[entry].lba_wptr = cpu_to_be64(z_wp_abs);
+		bzd[entry].type = zone->type;
+		bzd[entry].flags = cond << 4 | flgs;
+		entry++;
+		if (bzrpt)
+			bzrpt->descriptor_count = cpu_to_be32(entry);
+	}
+
+	/* SAME field code computed over all zones */
+	if (bzrpt && !(opt & ZBC_REPORT_ZONE_PARTIAL)) {
+		for (node = rb_first(&q->zones); node; node = rb_next(node)) {
+			zone = rb_entry(node, struct blk_zone, node);
+
+			spin_lock_irqsave(&zone->lock, flags);
+			if (clen == ~0ul) {
+				clen = zone->len;
+				ctype = zone->type;
+			}
+			if (zone->len != clen)
+				len_diffs++;
+			if (zone->type != ctype)
+				type_diffs++;
+			ctype = zone->type;
+			spin_unlock_irqrestore(&zone->lock, flags);
+		}
+	}
+
+	if (bzrpt) {
+		/* calculate the SAME field code */
+		if (len_diffs == 0) {
+			if (type_diffs == 0)
+				same = BLK_ZONE_SAME_ALL;
+			else
+				same = BLK_ZONE_SAME_LEN_TYPES_DIFFER;
+		} else if (len_diffs == 1 && type_diffs == 0) {
+			same = BLK_ZONE_SAME_LAST_DIFFERS;
+		} else {
+			same = BLK_ZONE_SAME_ALL_DIFFERENT;
+		}
+		bzrpt->same_field = same;
+		bzrpt->maximum_lba = cpu_to_be64(
+			logical_to_bytes(sdkp->device, sdkp->capacity));
+	}
+
+	return progress;
+}
+
+/**
+ * copy_buffer_into_request() - Copy a buffer into (part of) a request
+ * @rq: Request to copy into
+ * @src: Buffer to copy from
+ * @bytes: Number of bytes in src
+ * @skip: Number of bytes of request to 'seek' into before overwriting with src
+ *
+ * Return: Number of bytes copied into the request.
+ */
+static int copy_buffer_into_request(struct request *rq, void *src, uint bytes,
+				    off_t skip)
+{
+	struct req_iterator iter;
+	struct bio_vec bvec;
+	off_t skipped = 0;
+	uint copied = 0;
+
+	rq_for_each_segment(bvec, rq, iter) {
+		void *buf;
+		unsigned long flags;
+		uint len;
+		uint remain = 0;
+
+		buf = bvec_kmap_irq(&bvec, &flags);
+		if (skip > skipped)
+			remain = skip - skipped;
+
+		if (remain < bvec.bv_len) {
+			len = min_t(uint, bvec.bv_len - remain, bytes - copied);
+			memcpy(buf + remain, src + copied, len);
+			copied += len;
+		}
+		skipped += min(remain, bvec.bv_len);
+		bvec_kunmap_irq(buf, &flags);
+		if (copied >= bytes)
+			break;
+	}
+	return copied;
+}
+
+/**
+ * sd_zbc_setup_zone_report_cmnd() - Handle a zone report request
+ * @cmd: SCSI command request
+ * @ropt: Current report option flags to use
+ *
+ * Use the zone cache to fill in a report zones request.
+ *
+ * Return: BLKPREP_DONE if copy was successful.
+ *         BLKPREP_KILL if payload size was invalid.
+ *         BLKPREP_DEFER if unable to acquire a temporary page of working data.
+ */
+int sd_zbc_setup_zone_report_cmnd(struct scsi_cmnd *cmd, u8 ropt)
+{
+	struct request *rq = cmd->request;
+	sector_t sector = blk_rq_pos(rq);
+	struct bdev_zone_descriptor *bzd;
+	struct bdev_zone_report *bzrpt;
+	void *pbuf;
+	off_t skip = 0;
+	unsigned int nr_bytes = blk_rq_bytes(rq);
+	int rnum;
+	int ret = BLKPREP_DEFER;
+	unsigned int chunk;
+
+	if (nr_bytes < 512) {
+		ret = BLKPREP_KILL;
+		goto out;
+	}
+
+	pbuf = (void *)__get_free_page(GFP_ATOMIC);
+	if (!pbuf)
+		goto out;
+
+	bzrpt = pbuf;
+	/* fill in the header and first chunk of data */
+	bzrpt_fill(rq, bzrpt, NULL, nr_bytes, sector, ropt);
+
+	bzd = pbuf + sizeof(struct bdev_zone_report);
+	chunk = min_t(unsigned int, nr_bytes, PAGE_SIZE) -
+		sizeof(struct bdev_zone_report);
+	sector = bzrpt_fill(rq, NULL, bzd, chunk, sector, ropt);
+	chunk += sizeof(struct bdev_zone_report);
+	do {
+		rnum = copy_buffer_into_request(rq, pbuf, chunk, skip);
+		if (rnum != chunk) {
+			pr_err("copy_buffer_into_request() failed to copy "
+			       "zone report to command data [%u != %u]\n",
+				chunk, rnum);
+			ret = BLKPREP_KILL;
+			goto out;
+		}
+		skip += chunk;
+		if (skip >= nr_bytes)
+			break;
 		/*
-		 * Opportunistic setting, will be fixed up with
-		 * zone update if RESET WRITE POINTER fails.
+		 * keep loading more descriptors until nr_bytes have been
+		 * copied to the command
 		 */
-		zone->wp = zone->start;
-	}
+		chunk = min_t(unsigned int, nr_bytes - skip, PAGE_SIZE);
+		sector = bzrpt_fill(rq, NULL, pbuf, chunk, sector, ropt);
+	} while (skip < nr_bytes);
+	ret = BLKPREP_DONE;
 
-done:
+out:
+	if (pbuf)
+		free_page((unsigned long) pbuf);
 	return ret;
 }
 
+/**
+ * sd_zbc_setup_read_write() - ZBC device hook for sd_setup_read_write
+ * @sdkp: SCSI Disk
+ * @rq: Request being setup
+ * @sector: Request sector
+ * @num_sectors: Request size
+ */
 int sd_zbc_setup_read_write(struct scsi_disk *sdkp, struct request *rq,
 			    sector_t sector, unsigned int *num_sectors)
 {
@@ -622,6 +1098,140 @@ out:
 }
 
 /**
+ * update_zones_from_report() - Update zone WP's and state [condition]
+ * @cmd: SCSI command request
+ * @nr_bytes: Number of 'good' bytes in this request.
+ *
+ * Read the result data from a REPORT ZONES in PAGE_SIZE chunks until
+ * the full report is digested into the zone cache.
+ */
+static void update_zones_from_report(struct scsi_cmnd *cmd, u32 nr_bytes)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	struct bdev_zone_descriptor *bzde;
+	u32 nread = 0;
+	u32 dmax = 0;
+	void *pbuf = (void *)__get_free_page(GFP_ATOMIC);
+
+	if (!pbuf)
+		goto done;
+
+	/* read a page at a time and update the zone cache's WPs */
+	while (nr_bytes > nread) {
+		u32 iter, count;
+		u32 len = min_t(u32, nr_bytes - nread, PAGE_SIZE);
+		size_t rnum = sg_pcopy_to_buffer(scsi_sglist(cmd),
+						 scsi_sg_count(cmd), pbuf, len,
+						 nread);
+		if (rnum != len) {
+			pr_err("%s: FAIL sg_pcopy_to_buffer: %u of %u bytes\n",
+				__func__, (uint)rnum, len);
+			goto done;
+		}
+		bzde = pbuf;
+		count = len / sizeof(struct bdev_zone_descriptor);
+		if (nread == 0) {
+			const struct bdev_zone_report *bzrpt = pbuf;
+
+			dmax = min_t(u32, be32_to_cpu(bzrpt->descriptor_count),
+					  max_report_entries(nr_bytes));
+			bzde = pbuf + sizeof(struct bdev_zone_report);
+			count--;
+		}
+		for (iter = 0; iter < count && dmax > 0; dmax--, iter++) {
+			struct blk_zone *zone;
+			struct bdev_zone_descriptor *entry = &bzde[iter];
+			sector_t s = get_start_from_desc(sdkp, entry);
+			sector_t z_len = get_len_from_desc(sdkp, entry);
+			unsigned long flags;
+
+			if (!z_len)
+				goto done;
+
+			zone = blk_lookup_zone(rq->q, s);
+			if (!zone)
+				goto done;
+
+			spin_lock_irqsave(&zone->lock, flags);
+			zone->type = entry->type & 0xF;
+			zone->state = (entry->flags >> 4) & 0xF;
+			zone->wp = get_wp_from_desc(sdkp, entry);
+			zone->len = z_len;
+			spin_unlock_irqrestore(&zone->lock, flags);
+		}
+		nread += len;
+		if (!dmax)
+			goto done;
+	}
+done:
+	if (pbuf)
+		free_page((unsigned long) pbuf);
+}
+
+/**
+ * sd_zbc_done() - ZBC device hook for sd_done
+ * @cmd: SCSI command that is done.
+ * @good_bytes: number of bytes (successfully) completed.
+ *
+ * Cleanup or sync zone cache with cmd.
+ * Currently a successful REPORT ZONES w/FUA flag will pull data
+ * from the media. On done the zone cache is re-sync'd with the
+ * result which is presumed to be the most accurate picture of
+ * the zone condition and WP location.
+ */
+void sd_zbc_done(struct scsi_cmnd *cmd, int good_bytes)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	sector_t sector = blk_rq_pos(rq);
+	int result = cmd->result;
+	int op = req_op(rq);
+	bool is_fua = (rq->cmd_flags & REQ_META) ? true : false;
+
+	if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC)
+		return;
+
+	switch (op) {
+	case REQ_OP_ZONE_REPORT:
+		if (is_fua && good_bytes > 0)
+			update_zones_from_report(cmd, good_bytes);
+		break;
+	case REQ_OP_ZONE_OPEN:
+	case REQ_OP_ZONE_CLOSE:
+	case REQ_OP_ZONE_FINISH:
+	case REQ_OP_ZONE_RESET:
+		if (result == 0)
+			update_zone_state(rq, sector, op);
+		break;
+	default:
+		break;
+	}
+}
+
+/**
+ * sd_zbc_uninit_command() - ZBC device hook for sd_uninit_command
+ * @cmd: SCSI command that is done.
+ *
+ * On uninit if a RESET WP (w/o FUA flag) was translated to a
+ * DISCARD and then to an SCT Write Same then there may be a
+ * page of completion_data that was allocated and needs to be freed.
+ */
+void sd_zbc_uninit_command(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	int op = req_op(rq);
+	bool is_fua = (rq->cmd_flags & REQ_META) ? true : false;
+
+	if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC)
+		return;
+
+	if (!is_fua && op == REQ_OP_ZONE_RESET && rq->completion_data)
+		__free_page(rq->completion_data);
+}
+
+/**
  * sd_zbc_setup - Load zones of matching zlen size into rb tree.
  *
  */
@@ -663,7 +1273,8 @@ int sd_zbc_setup(struct scsi_disk *sdkp, u64 zlen, char *buf, int buf_len)
 }
 
 /**
- * sd_zbc_remove -
+ * sd_zbc_remove - Prepare for device removal.
+ * @sdkp: SCSI Disk being removed.
  */
 void sd_zbc_remove(struct scsi_disk *sdkp)
 {
@@ -674,6 +1285,7 @@ void sd_zbc_remove(struct scsi_disk *sdkp)
 		destroy_workqueue(sdkp->zone_work_q);
 	}
 }
+
 /**
  * sd_zbc_discard_granularity - Determine discard granularity.
  * @sdkp: SCSI disk used to calculate discard granularity.
-- 
2.9.3


* Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same
  2016-08-22  4:31 ` [PATCH v2 2/4] On Discard either do Reset WP or Write Same Shaun Tancheff
@ 2016-08-22 23:57   ` Damien Le Moal
  2016-08-23  0:22     ` Shaun Tancheff
  0 siblings, 1 reply; 9+ messages in thread
From: Damien Le Moal @ 2016-08-22 23:57 UTC (permalink / raw)
  To: Shaun Tancheff, linux-block, linux-scsi, linux-kernel
  Cc: Jens Axboe, Christoph Hellwig, James E . J . Bottomley,
	Martin K . Petersen, Hannes Reinecke, Josh Bingaman,
	Dan Williams, Sagi Grimberg, Mike Christie, Toshi Kani, Ming Lei,
	Shaun Tancheff


Shaun,

On 8/22/16 13:31, Shaun Tancheff wrote:
[...]
> -int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
> -			 sector_t sector, unsigned int num_sectors)
> +int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
>  {
> -	struct blk_zone *zone;
> +	struct request *rq = cmd->request;
> +	struct scsi_device *sdp = cmd->device;
> +	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
> +	sector_t sector = blk_rq_pos(rq);
> +	unsigned int nr_sectors = blk_rq_sectors(rq);
>  	int ret = BLKPREP_OK;
> +	struct blk_zone *zone;
>  	unsigned long flags;
> +	u32 wp_offset;
> +	bool use_write_same = false;
>  
>  	zone = blk_lookup_zone(rq->q, sector);
> -	if (!zone)
> +	if (!zone) {
> +		/* Test for a runt zone before giving up */
> +		if (sdp->type != TYPE_ZBC) {
> +			struct request_queue *q = rq->q;
> +			struct rb_node *node;
> +
> +			node = rb_last(&q->zones);
> +			if (node)
> +				zone = rb_entry(node, struct blk_zone, node);
> +			if (zone) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				if ((zone->start + zone->len) <= sector)
> +					goto out;
> +				spin_unlock_irqrestore(&zone->lock, flags);
> +				zone = NULL;
> +			}
> +		}
>  		return BLKPREP_KILL;
> +	}

I do not understand the point of this code here to test for the runt
zone. As long as sector is within the device maximum capacity (in 512B
unit), blk_lookup_zone will return the pointer to the zone structure
containing that sector (the RB-tree does not have any constraint
regarding zone size). The only case where NULL would be returned is if
discard is issued super early right after the disk is probed and before
the zone refresh work has completed. We can certainly protect against
that by delaying the discard.
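[Editorial aside: the lookup semantics under discussion can be sketched as a
small userspace model. This is a hypothetical sketch, not the kernel code:
blk_lookup_zone() walks an rb-tree keyed by zone->start, while this model does
the equivalent binary search over a sorted array. The point it illustrates is
Damien's: the lookup imposes no constraint on zone size, so a runt last zone
is found like any other.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model of a zone lookup: find the zone containing @sector.
 * Zones need not share one size -- only start/len ordering matters.
 */
struct zone_model {
	unsigned long start;	/* first 512B sector of the zone */
	unsigned long len;	/* zone length in 512B sectors */
};

static struct zone_model *lookup_zone(struct zone_model *z, size_t n,
				      unsigned long sector)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (sector < z[mid].start)
			hi = mid;
		else if (sector >= z[mid].start + z[mid].len)
			lo = mid + 1;
		else
			return &z[mid];	/* start <= sector < start + len */
	}
	return NULL;	/* sector is past the device capacity */
}

/* three zones, the last one a 100-sector runt */
static struct zone_model zones[] = {
	{ 0, 256 }, { 256, 256 }, { 512, 100 },
};
```

In this model a sector inside the runt (e.g. 600) still resolves to a zone;
only sectors at or past start + len of the last zone return NULL.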


>  
>  	spin_lock_irqsave(&zone->lock, flags);
> -
>  	if (zone->state == BLK_ZONE_UNKNOWN ||
>  	    zone->state == BLK_ZONE_BUSY) {
>  		sd_zbc_debug_ratelimit(sdkp,
> -				       "Discarding zone %zu state %x, deferring\n",
> +				       "Discarding zone %zx state %x, deferring\n",

Sector values are usually displayed in decimal. Why use hex here? At
least a "0x" prefix would be needed to avoid confusion, I think.

>  				       zone->start, zone->state);
>  		ret = BLKPREP_DEFER;
>  		goto out;
> @@ -406,46 +428,80 @@ int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
>  	if (zone->state == BLK_ZONE_OFFLINE) {
>  		/* let the drive fail the command */
>  		sd_zbc_debug_ratelimit(sdkp,
> -				       "Discarding offline zone %zu\n",
> +				       "Discarding offline zone %zx\n",
>  				       zone->start);
>  		goto out;
>  	}
> -
> -	if (!blk_zone_is_smr(zone)) {
> +	if (blk_zone_is_cmr(zone)) {
> +		use_write_same = true;
>  		sd_zbc_debug_ratelimit(sdkp,
> -				       "Discarding %s zone %zu\n",
> -				       blk_zone_is_cmr(zone) ? "CMR" : "unknown",
> +				       "Discarding CMR zone %zx\n",
>  				       zone->start);
> -		ret = BLKPREP_DONE;
>  		goto out;
>  	}

Some 10TB host managed disks out there have 1% conventional zone space,
that is 100GB of capacity. When issuing a "reset all", doing a write
same in these zones will take forever... If the user really wants zeroes
in those zones, let it issue a zeroout.

I think that it would be a better choice to simply not report
discard_zeroes_data as true and do nothing for conventional zone resets.
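[Editorial aside: the policy suggested here can be captured in a minimal model.
The enum and function names below are hypothetical, not the driver's API; the
sketch only encodes the decision table: with discard_zeroes_data not reported,
a discard of a conventional zone becomes a no-op, while write-pointer zones
still get a RESET WRITE POINTER.]

```c
#include <assert.h>

/* Hypothetical model of the proposed discard policy. */
enum zm_type { ZM_CONVENTIONAL, ZM_SEQUENTIAL };
enum zm_action { ZM_NOP, ZM_RESET_WP, ZM_WRITE_SAME };

static enum zm_action discard_action(enum zm_type type, int zeroes_data)
{
	if (type == ZM_CONVENTIONAL)
		/*
		 * Zeroing ~100GB of conventional space via WRITE SAME is
		 * the cost being objected to; without the zeroes_data
		 * promise the reset can simply be dropped.
		 */
		return zeroes_data ? ZM_WRITE_SAME : ZM_NOP;
	return ZM_RESET_WP;	/* write-pointer zones always reset */
}
```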

> -	if (blk_zone_is_empty(zone)) {
> -		sd_zbc_debug_ratelimit(sdkp,
> -				       "Discarding empty zone %zu\n",
> -				       zone->start);
> -		ret = BLKPREP_DONE;
> +	if (zone->start != sector || zone->len < nr_sectors) {
> +		sd_printk(KERN_ERR, sdkp,
> +			  "Misaligned RESET WP %zx/%x on zone %zx/%zx\n",
> +			  sector, nr_sectors, zone->start, zone->len);
> +		ret = BLKPREP_KILL;
>  		goto out;
>  	}
> -
> -	if (zone->start != sector ||
> -	    zone->len < num_sectors) {
> +	/* Protect against Reset WP when more data had been written to the
> +	 * zone than is being discarded.
> +	 */
> +	wp_offset = zone->wp - zone->start;
> +	if (wp_offset > nr_sectors) {
>  		sd_printk(KERN_ERR, sdkp,
> -			  "Misaligned RESET WP, start %zu/%zu "
> -			  "len %zu/%u\n",
> -			  zone->start, sector, zone->len, num_sectors);
> +			  "Will Corrupt RESET WP %zx/%x/%x on zone %zx/%zx/%zx\n",
> +			  sector, wp_offset, nr_sectors,
> +			  zone->start, zone->wp, zone->len);
>  		ret = BLKPREP_KILL;
>  		goto out;
>  	}
> -
> -	/*
> -	 * Opportunistic setting, will be fixed up with
> -	 * zone update if RESET WRITE POINTER fails.
> -	 */
> -	zone->wp = zone->start;
> +	if (blk_zone_is_empty(zone)) {
> +		sd_zbc_debug_ratelimit(sdkp,
> +				       "Discarding empty zone %zx [WP: %zx]\n",
> +				       zone->start, zone->wp);
> +		ret = BLKPREP_DONE;
> +		goto out;
> +	}
>  
>  out:
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
> +	if (ret != BLKPREP_OK)
> +		goto done;
> +	/*
> +	 * blk_zone cache uses block layer sector units
> +	 * but commands use device units
> +	 */
> +	sector >>= ilog2(sdp->sector_size) - 9;
> +	nr_sectors >>= ilog2(sdp->sector_size) - 9;
> +
> +	if (use_write_same) {
> +		cmd->cmd_len = 16;
> +		cmd->cmnd[0] = WRITE_SAME_16;
> +		cmd->cmnd[1] = 0; /* UNMAP (not set) */
> +		put_unaligned_be64(sector, &cmd->cmnd[2]);
> +		put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
> +		cmd->transfersize = sdp->sector_size;
> +		rq->timeout = SD_WRITE_SAME_TIMEOUT;
> +	} else {
> +		cmd->cmd_len = 16;
> +		cmd->cmnd[0] = ZBC_OUT;
> +		cmd->cmnd[1] = ZO_RESET_WRITE_POINTER;
> +		put_unaligned_be64(sector, &cmd->cmnd[2]);
> +		/* Reset Write Pointer doesn't have a payload */
> +		cmd->transfersize = 0;
> +		cmd->sc_data_direction = DMA_NONE;
> +		/*
> +		 * Opportunistic setting, will be fixed up with
> +		 * zone update if RESET WRITE POINTER fails.
> +		 */
> +		zone->wp = zone->start;
> +	}
> +
> +done:
>  	return ret;
>  }
>  
> @@ -468,6 +524,9 @@ int sd_zbc_setup_read_write(struct scsi_disk *sdkp, struct request *rq,
>  
>  	spin_lock_irqsave(&zone->lock, flags);
>  
> +	if (blk_zone_is_cmr(zone))
> +		goto out;
> +
>  	if (zone->state == BLK_ZONE_UNKNOWN ||
>  	    zone->state == BLK_ZONE_BUSY) {
>  		sd_zbc_debug_ratelimit(sdkp,
> @@ -476,16 +535,6 @@ int sd_zbc_setup_read_write(struct scsi_disk *sdkp, struct request *rq,
>  		ret = BLKPREP_DEFER;
>  		goto out;
>  	}
> -	if (zone->state == BLK_ZONE_OFFLINE) {
> -		/* let the drive fail the command */
> -		sd_zbc_debug_ratelimit(sdkp,
> -				       "zone %zu offline\n",
> -				       zone->start);
> -		goto out;
> -	}
> -
> -	if (blk_zone_is_cmr(zone))
> -		goto out;
>  
>  	if (blk_zone_is_seq_pref(zone)) {
>  		if (op_is_write(req_op(rq))) {
> @@ -514,6 +563,14 @@ int sd_zbc_setup_read_write(struct scsi_disk *sdkp, struct request *rq,
>  		goto out;
>  	}
>  
> +	if (zone->state == BLK_ZONE_OFFLINE) {
> +		/* let the drive fail the command */
> +		sd_zbc_debug_ratelimit(sdkp,
> +				       "zone %zu offline\n",
> +				       zone->start);
> +		goto out;
> +	}
> +
>  	if (op_is_write(req_op(rq))) {
>  		if (zone->state == BLK_ZONE_READONLY)
>  			goto out;
> 

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com


* Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same
  2016-08-22 23:57   ` Damien Le Moal
@ 2016-08-23  0:22     ` Shaun Tancheff
  2016-08-23  1:25       ` Damien Le Moal
  0 siblings, 1 reply; 9+ messages in thread
From: Shaun Tancheff @ 2016-08-23  0:22 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe,
	Christoph Hellwig, James E . J . Bottomley, Martin K . Petersen,
	Hannes Reinecke, Josh Bingaman, Dan Williams, Sagi Grimberg,
	Mike Christie, Toshi Kani, Ming Lei

On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal <damien.lemoal@hgst.com> wrote:
>
> Shaun,
>
> On 8/22/16 13:31, Shaun Tancheff wrote:
> [...]
>> -int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
>> -                      sector_t sector, unsigned int num_sectors)
>> +int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
>>  {
>> -     struct blk_zone *zone;
>> +     struct request *rq = cmd->request;
>> +     struct scsi_device *sdp = cmd->device;
>> +     struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
>> +     sector_t sector = blk_rq_pos(rq);
>> +     unsigned int nr_sectors = blk_rq_sectors(rq);
>>       int ret = BLKPREP_OK;
>> +     struct blk_zone *zone;
>>       unsigned long flags;
>> +     u32 wp_offset;
>> +     bool use_write_same = false;
>>
>>       zone = blk_lookup_zone(rq->q, sector);
>> -     if (!zone)
>> +     if (!zone) {
>> +             /* Test for a runt zone before giving up */
>> +             if (sdp->type != TYPE_ZBC) {
>> +                     struct request_queue *q = rq->q;
>> +                     struct rb_node *node;
>> +
>> +                     node = rb_last(&q->zones);
>> +                     if (node)
>> +                             zone = rb_entry(node, struct blk_zone, node);
>> +                     if (zone) {
>> +                             spin_lock_irqsave(&zone->lock, flags);
>> +                             if ((zone->start + zone->len) <= sector)
>> +                                     goto out;
>> +                             spin_unlock_irqrestore(&zone->lock, flags);
>> +                             zone = NULL;
>> +                     }
>> +             }
>>               return BLKPREP_KILL;
>> +     }
>
> I do not understand the point of this code here to test for the runt
> zone. As long as sector is within the device maximum capacity (in 512B
> unit), blk_lookup_zone will return the pointer to the zone structure
> containing that sector (the RB-tree does not have any constraint
> regarding zone size). The only case where NULL would be returned is if
> discard is issued super early right after the disk is probed and before
> the zone refresh work has completed. We can certainly protect against
> that by delaying the discard.

As you can see I am not including Host Managed in the
runt check.

Also you may note that in my patch to get Host Aware working
with the zone cache I do not include the runt zone in the cache.
So as it sits I need this fallback, otherwise doing blkdiscard over
the whole device ends in an error, as does mkfs.f2fs et al.

>>       spin_lock_irqsave(&zone->lock, flags);
>> -
>>       if (zone->state == BLK_ZONE_UNKNOWN ||
>>           zone->state == BLK_ZONE_BUSY) {
>>               sd_zbc_debug_ratelimit(sdkp,
>> -                                    "Discarding zone %zu state %x, deferring\n",
>> +                                    "Discarding zone %zx state %x, deferring\n",
>
> Sector values are usually displayed in decimal. Why use Hex here ? At
> least "0x" would be needed to avoid confusion I think.

Yeah, my brain is lazy about converting very large
numbers to powers of 2. So it's much easier to spot
zone alignment here.



>>                                      zone->start, zone->state);
>>               ret = BLKPREP_DEFER;
>>               goto out;
>> @@ -406,46 +428,80 @@ int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
>>       if (zone->state == BLK_ZONE_OFFLINE) {
>>               /* let the drive fail the command */
>>               sd_zbc_debug_ratelimit(sdkp,
>> -                                    "Discarding offline zone %zu\n",
>> +                                    "Discarding offline zone %zx\n",
>>                                      zone->start);
>>               goto out;
>>       }
>> -
>> -     if (!blk_zone_is_smr(zone)) {
>> +     if (blk_zone_is_cmr(zone)) {
>> +             use_write_same = true;
>>               sd_zbc_debug_ratelimit(sdkp,
>> -                                    "Discarding %s zone %zu\n",
>> -                                    blk_zone_is_cmr(zone) ? "CMR" : "unknown",
>> +                                    "Discarding CMR zone %zx\n",
>>                                      zone->start);
>> -             ret = BLKPREP_DONE;
>>               goto out;
>>       }
>
> Some 10TB host managed disks out there have 1% conventional zone space,
> that is 100GB of capacity. When issuing a "reset all", doing a write
> same in these zones will take forever... If the user really wants zeroes
> in those zones, let it issue a zeroout.
>
> I think that it would a better choice to simply not report
> discard_zeroes_data as true and do nothing for conventional zones reset.

I think that would be unfortunate for Host Managed but I think it's
the right choice for Host Aware at this time. So either we base
it on disk type or we have some other config flag added to sysfs.

>> -     if (blk_zone_is_empty(zone)) {
>> -             sd_zbc_debug_ratelimit(sdkp,
>> -                                    "Discarding empty zone %zu\n",
>> -                                    zone->start);
>> -             ret = BLKPREP_DONE;
>> +     if (zone->start != sector || zone->len < nr_sectors) {
>> +             sd_printk(KERN_ERR, sdkp,
>> +                       "Misaligned RESET WP %zx/%x on zone %zx/%zx\n",
>> +                       sector, nr_sectors, zone->start, zone->len);
>> +             ret = BLKPREP_KILL;
>>               goto out;
>>       }
>> -
>> -     if (zone->start != sector ||
>> -         zone->len < num_sectors) {
>> +     /* Protect against Reset WP when more data had been written to the
>> +      * zone than is being discarded.
>> +      */
>> +     wp_offset = zone->wp - zone->start;
>> +     if (wp_offset > nr_sectors) {
>>               sd_printk(KERN_ERR, sdkp,
>> -                       "Misaligned RESET WP, start %zu/%zu "
>> -                       "len %zu/%u\n",
>> -                       zone->start, sector, zone->len, num_sectors);
>> +                       "Will Corrupt RESET WP %zx/%x/%x on zone %zx/%zx/%zx\n",
>> +                       sector, wp_offset, nr_sectors,
>> +                       zone->start, zone->wp, zone->len);
>>               ret = BLKPREP_KILL;
>>               goto out;
>>       }

---
Shaun Tancheff


* Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same
  2016-08-23  0:22     ` Shaun Tancheff
@ 2016-08-23  1:25       ` Damien Le Moal
  2016-08-24  5:19         ` Shaun Tancheff
  0 siblings, 1 reply; 9+ messages in thread
From: Damien Le Moal @ 2016-08-23  1:25 UTC (permalink / raw)
  To: Shaun Tancheff
  Cc: Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe,
	Christoph Hellwig, James E . J . Bottomley, Martin K . Petersen,
	Hannes Reinecke, Josh Bingaman, Dan Williams, Sagi Grimberg,
	Mike Christie, Toshi Kani, Ming Lei


Shaun,

On 8/23/16 09:22, Shaun Tancheff wrote:
> On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal <damien.lemoal@hgst.com> wrote:
>>
>> Shaun,
>>
>> On 8/22/16 13:31, Shaun Tancheff wrote:
>> [...]
>>> -int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
>>> -                      sector_t sector, unsigned int num_sectors)
>>> +int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
>>>  {
>>> -     struct blk_zone *zone;
>>> +     struct request *rq = cmd->request;
>>> +     struct scsi_device *sdp = cmd->device;
>>> +     struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
>>> +     sector_t sector = blk_rq_pos(rq);
>>> +     unsigned int nr_sectors = blk_rq_sectors(rq);
>>>       int ret = BLKPREP_OK;
>>> +     struct blk_zone *zone;
>>>       unsigned long flags;
>>> +     u32 wp_offset;
>>> +     bool use_write_same = false;
>>>
>>>       zone = blk_lookup_zone(rq->q, sector);
>>> -     if (!zone)
>>> +     if (!zone) {
>>> +             /* Test for a runt zone before giving up */
>>> +             if (sdp->type != TYPE_ZBC) {
>>> +                     struct request_queue *q = rq->q;
>>> +                     struct rb_node *node;
>>> +
>>> +                     node = rb_last(&q->zones);
>>> +                     if (node)
>>> +                             zone = rb_entry(node, struct blk_zone, node);
>>> +                     if (zone) {
>>> +                             spin_lock_irqsave(&zone->lock, flags);
>>> +                             if ((zone->start + zone->len) <= sector)
>>> +                                     goto out;
>>> +                             spin_unlock_irqrestore(&zone->lock, flags);
>>> +                             zone = NULL;
>>> +                     }
>>> +             }
>>>               return BLKPREP_KILL;
>>> +     }
>>
>> I do not understand the point of this code here to test for the runt
>> zone. As long as sector is within the device maximum capacity (in 512B
>> unit), blk_lookup_zone will return the pointer to the zone structure
>> containing that sector (the RB-tree does not have any constraint
>> regarding zone size). The only case where NULL would be returned is if
>> discard is issued super early right after the disk is probed and before
>> the zone refresh work has completed. We can certainly protect against
>> that by delaying the discard.
> 
> As you can see I am not including Host Managed in the
> runt check.

Indeed, but having a runt zone could also happen with a host-managed disk.

> Also you may note that in my patch to get Host Aware working
> with the zone cache I do not include the runt zone in the cache.

Why not? The RB-tree will handle it just fine (the insert and lookup
code as Hannes had it did not rely on a constant zone size).

> So as it sits I need this fallback otherwise doing blkdiscard over
> the whole device ends in a error, as well as mkfs.f2fs et. al.

Got it, but I do not see a problem with including it. I have not checked
the code, but the split of a big discard call into "chunks" should
already handle the last chunk and make sure that the operation does not
exceed the device capacity (in any case, that's easy to fix in the
sd_zbc_setup_discard code).
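[Editorial aside: the last-chunk handling mentioned here can be sketched as a
standalone model. This is not the block layer's actual splitting code; the
function and parameter names are hypothetical. It shows each chunk capped at a
per-command limit (cf. max_discard_sectors) with the tail clamped to the
device capacity, which is why a whole-device blkdiscard ends exactly at the
possibly-runt last zone.]

```c
#include <assert.h>

/*
 * Model of splitting a big discard into chunks: returns the size of the
 * chunk starting at @sector, never running past @capacity.
 */
static unsigned long discard_chunk(unsigned long sector, unsigned long nr,
				   unsigned long max_chunk,
				   unsigned long capacity)
{
	unsigned long n = nr < max_chunk ? nr : max_chunk;

	if (sector >= capacity)
		return 0;		/* nothing left to discard */
	if (sector + n > capacity)
		n = capacity - sector;	/* clamp the final chunk */
	return n;
}
```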

>>>                                      zone->start, zone->state);
>>>               ret = BLKPREP_DEFER;
>>>               goto out;
>>> @@ -406,46 +428,80 @@ int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
>>>       if (zone->state == BLK_ZONE_OFFLINE) {
>>>               /* let the drive fail the command */
>>>               sd_zbc_debug_ratelimit(sdkp,
>>> -                                    "Discarding offline zone %zu\n",
>>> +                                    "Discarding offline zone %zx\n",
>>>                                      zone->start);
>>>               goto out;
>>>       }
>>> -
>>> -     if (!blk_zone_is_smr(zone)) {
>>> +     if (blk_zone_is_cmr(zone)) {
>>> +             use_write_same = true;
>>>               sd_zbc_debug_ratelimit(sdkp,
>>> -                                    "Discarding %s zone %zu\n",
>>> -                                    blk_zone_is_cmr(zone) ? "CMR" : "unknown",
>>> +                                    "Discarding CMR zone %zx\n",
>>>                                      zone->start);
>>> -             ret = BLKPREP_DONE;
>>>               goto out;
>>>       }
>>
>> Some 10TB host managed disks out there have 1% conventional zone space,
>> that is 100GB of capacity. When issuing a "reset all", doing a write
>> same in these zones will take forever... If the user really wants zeroes
>> in those zones, let it issue a zeroout.
>>
>> I think that it would a better choice to simply not report
>> discard_zeroes_data as true and do nothing for conventional zones reset.
> 
> I think that would be unfortunate for Host Managed but I think it's
> the right choice for Host Aware at this time. So either we base
> it on disk type or we have some other config flag added to sysfs.

I do not see any difference between host managed and host aware. Both
define the same behavior for reset, and both end up in a NOP for
conventional zone reset (no data "erasure" required by the standard).
For write pointer zones, reading unwritten LBAs returns the
initialization pattern, with the exception of host-managed disks with
the URSWRZ bit set to 0. But that case is covered in sd.c, so the
behavior is consistent across all models. So why force data zeroing
when the standards do not mandate it?

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com


* Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same
  2016-08-23  1:25       ` Damien Le Moal
@ 2016-08-24  5:19         ` Shaun Tancheff
  0 siblings, 0 replies; 9+ messages in thread
From: Shaun Tancheff @ 2016-08-24  5:19 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe,
	Christoph Hellwig, James E . J . Bottomley, Martin K . Petersen,
	Hannes Reinecke, Josh Bingaman, Dan Williams, Sagi Grimberg,
	Mike Christie, Toshi Kani, Ming Lei

On Mon, Aug 22, 2016 at 8:25 PM, Damien Le Moal <damien.lemoal@hgst.com> wrote:
>
> Shaun,
>
> On 8/23/16 09:22, Shaun Tancheff wrote:
>> On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal <damien.lemoal@hgst.com> wrote:

>> Also you may note that in my patch to get Host Aware working
>> with the zone cache I do not include the runt zone in the cache.
>
> Why not ? The RB-tree will handle it just fine (the insert and lookup
> code as Hannes had them was not relying on a constant zone size).

A good point. I didn't pay too much attention while bringing this
forward. I think a few of my hacks may be pointless now. I'll
try to rework it and get rid of the runt check.

>> So as it sits I need this fallback otherwise doing blkdiscard over
>> the whole device ends in a error, as well as mkfs.f2fs et. al.
>
> Got it, but I do not see a problem with including it. I have not checked
> the code, but the split of a big discard call into "chunks" should be
> already handling the last chunk and make sure that the operation does
> not exceed the device capacity (in any case, that's easy to fix in the
> sd_zbc_setup_discard code).

Yes, I agree the split of big discards does handle the last chunk correctly.
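For readers following along, the "last chunk" handling discussed above amounts to clamping a discard range so it never runs past the device capacity. The following is a minimal userspace sketch of that check; the function name and types are illustrative, not the actual sd_zbc_setup_discard() code:

```c
#include <assert.h>

typedef unsigned long long sector_t;

/*
 * Hypothetical helper: a large discard is split into zone-sized
 * chunks, and the final chunk must be trimmed so that it does not
 * exceed the device capacity. Returns the number of sectors that
 * may actually be discarded starting at 'start'.
 */
static sector_t clamp_discard_chunk(sector_t start, sector_t nr_sects,
				    sector_t capacity)
{
	if (start >= capacity)
		return 0;	/* chunk starts past end of device */
	if (start + nr_sects > capacity)
		nr_sects = capacity - start;	/* trim the runt */
	return nr_sects;
}
```

With this in place, a blkdiscard over the whole device simply shortens its final chunk instead of erroring out.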

>>> Some 10TB host managed disks out there have 1% conventional zone space,
>>> that is 100GB of capacity. When issuing a "reset all", doing a write
>>> same in these zones will take forever... If the user really wants zeroes
>>> in those zones, let it issue a zeroout.
>>>
>>> I think that it would be a better choice to simply not report
>>> discard_zeroes_data as true and do nothing for conventional zones reset.
>>
>> I think that would be unfortunate for Host Managed but I think it's
>> the right choice for Host Aware at this time. So either we base
>> it on disk type or we have some other config flag added to sysfs.
>
> I do not see any difference between host managed and host aware. Both
> define the same behavior for reset, and both end up in a NOP for
> conventional zone reset (no data "erasure" required by the standard).
> For write pointer zones, reading unwritten LBAs returns the
> initialization pattern, with the exception of host-managed disks with
> the URSWRZ bit set to 0. But that case is covered in sd.c, so the
> behavior is consistent across all models. So why force data zeroing
> when the standards do not mandate it?

Well, you do have a point.
It appears that only mkfs and similar tools really rely on
discard zeroes data at the moment.

I did a quick test:

mkfs -t ext4 -b 4096 -g 32768 -G 32 \
 -E lazy_itable_init=0,lazy_journal_init=0,offset=0,num_backup_sb=0,packed_meta_blocks=1,discard \
 -O flex_bg,extent,sparse_super2

   - discard zeroes data true:  3 minutes
   - discard zeroes data false: 6 minutes

So for the smaller conventional space on the current HA drive
there is some advantage to enabling discard zeroes data.

However for a larger conventional space you are correct the overall
impact is worse performance.

For some reason I had been assuming that some file systems
used or relied on discard zeroes data during normal operation.
Now that I am looking for it, I am not finding any evidence
of that, so aside from mkfs I don't have as good an argument
for discard zeroes data as I thought I did.

Regards,
Shaun


end of thread, other threads:[~2016-08-24  5:20 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-22  4:31 [PATCH v2 0/4] Integrate bio/request ZBC ops with zone cache Shaun Tancheff
2016-08-22  4:31 ` [PATCH v2 1/4] Enable support for Seagate HostAware drives Shaun Tancheff
2016-08-22  4:31 ` [PATCH v2 2/4] On Discard either do Reset WP or Write Same Shaun Tancheff
2016-08-22 23:57   ` Damien Le Moal
2016-08-23  0:22     ` Shaun Tancheff
2016-08-23  1:25       ` Damien Le Moal
2016-08-24  5:19         ` Shaun Tancheff
2016-08-22  4:31 ` [PATCH v2 3/4] Merge ZBC constants Shaun Tancheff
2016-08-22  4:31 ` [PATCH v2 4/4] Integrate ZBC command requests with zone cache Shaun Tancheff
