* [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-07-29 19:02 Shaun Tancheff
  2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-07-29 19:02 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

Hi Jens,

This series is based on Linus' current tip after the merge of 'for-4.8/core'.

As Host Aware drives are becoming available, we would like to be able
to make use of them. This series is also intended to be suitable
for use by Host Managed drives.

ZAC/ZBC drives add new commands for discovering and working with Zones.

This extends the ZAC/ZBC support up to the block layer, allowing direct control
by file systems or device mapper targets. By deferring the zone handling to the
authoritative subsystem there is an overall lower memory usage for holding the
active zone information, and it is clear which party is responsible for
maintaining the write pointer of each active zone.
By way of example, a DM target may have several writes in progress. The sector
(or LBA) for each of those writes will depend on the previous write. While the
drive's write pointer will be updated as writes are completed, the DM target
will be maintaining both where the next write should be scheduled from and
where the write pointer is, based on writes completed without errors.
Knowing the drive's zone topology enables DM targets and file systems to
extend their block allocation schemes and issue write pointer resets (or
discards) that are zone aligned.
A perhaps non-obvious detail is that a conventional drive will
return a zone report with a single large conventional zone.
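
For illustration only, a DM target or file system could then issue a
zone-aligned reset through the helper added in patch 1 (the function name
below is made up):

    /* Sketch: reset one zone whose start sector is already zone-aligned. */
    static int example_reset_zone(struct block_device *bdev,
                                  sector_t zone_start)
    {
        return blkdev_issue_zone_action(bdev, REQ_OP_ZONE_RESET, 0,
                                        zone_start, GFP_KERNEL);
    }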

Patches for util-linux can be found here:
https://github.com/Seagate/ZDM-Device-Mapper/tree/master/patches/util-linux

This series is available here:
    https://github.com/stancheff/linux/tree/zbc.bio.v6

    git@github.com:stancheff/linux.git zbc.bio.v6

v6:
 - Fix page alloc to include DMA flag for ioctl.
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
 - Dropped ata16 hackery
V3:
 - Rebase on Mike Christie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
 - Use zoned report reserved bit for ata-passthrough flag.

V2:
 - Changed bi_rw to op_flags to clarify the separation of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to put_bio().
 - Documented opt in blkdev_issue_zone_report.
 - Moved include/uapi/linux/fs.h changes to patch 3
 - Fixed commit message for first patch in series.

Shaun Tancheff (2):
  Add bio/request flags to issue ZBC/ZAC commands
  Add ioctl to issue ZBC/ZAC commands via block layer

 MAINTAINERS                       |   9 ++
 block/blk-lib.c                   |  95 ++++++++++++++++
 block/ioctl.c                     | 110 +++++++++++++++++++
 drivers/scsi/sd.c                 | 118 ++++++++++++++++++++
 drivers/scsi/sd.h                 |   1 +
 include/linux/bio.h               |   7 +-
 include/linux/blk_types.h         |   6 +-
 include/linux/blkzoned_api.h      |  25 +++++
 include/uapi/linux/Kbuild         |   1 +
 include/uapi/linux/blkzoned_api.h | 220 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h           |   1 +
 11 files changed, 591 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

-- 
2.8.1

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands
  2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff
@ 2016-07-29 19:02 ` Shaun Tancheff
  2016-07-29 19:02 ` [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer Shaun Tancheff
  2016-08-01  9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig
  2 siblings, 0 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-07-29 19:02 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman, Shaun Tancheff

Add op flags to access zone information as well as to open, close
and reset zones:
  - REQ_OP_ZONE_REPORT - Query zone information (Report Zones)
  - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing
  - REQ_OP_ZONE_CLOSE - Explicitly close a zone
  - REQ_OP_ZONE_RESET - Reset the Write Pointer to the start of a zone

These op flags can be used to create bios to control zoned devices
through the block layer.

This is useful for filesystems and device mappers that need explicit
control of zoned devices such as Host Managed and Host Aware SMR drives.

Report Zones is a device read that requires a buffer.

Open, Close and Reset are device commands that have no associated
data transfer. Sending an LBA of ~0 will attempt to operate on all
zones. This is typically used with Reset to wipe a drive, as a Reset
behaves similarly to TRIM in that all data in the zone(s) is deleted.

The Finish Zone command is intentionally not implemented as there is no
current use case for that operation.

Report Zones currently defaults to reporting on all zones. It is expected
that support for the zone option flag will piggy-back on streamid
support. The report option flag is useful as it can reduce the number
of zones in each report, but it is not critical.
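
For illustration only (not part of the diff below), a kernel-side caller
might use blkdev_issue_zone_report() roughly as follows; the function name
example_dump_zones() and the clamping of the descriptor count to what fits
in the single page are just a sketch:

    #include <linux/blkdev.h>
    #include <linux/blkzoned_api.h>

    /* Sketch only: report zones starting at sector 0 and log each zone. */
    static void example_dump_zones(struct block_device *bdev)
    {
        struct bdev_zone_report *rpt;
        struct page *page = alloc_page(GFP_KERNEL);
        u32 i, count, max_fit;

        if (!page)
            return;

        if (blkdev_issue_zone_report(bdev, 0, 0, 0, page, PAGE_SIZE,
                                     GFP_KERNEL))
            goto out;

        rpt = page_address(page);
        /* Clamp to the descriptors that fit in the page we passed in. */
        max_fit = (PAGE_SIZE - sizeof(*rpt)) / sizeof(rpt->descriptors[0]);
        count = min_t(u32, be32_to_cpu(rpt->descriptor_count), max_fit);
        for (i = 0; i < count; i++)
            pr_info("zone %u: start %llu wp %llu\n", i,
                    (unsigned long long)be64_to_cpu(rpt->descriptors[i].lba_start),
                    (unsigned long long)be64_to_cpu(rpt->descriptors[i].lba_wptr));
    out:
        __free_page(page);
    }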

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
---
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
V3:
 - Rebase on Mike Christie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
V2:
 - Changed bi_rw to op_flags to clarify the separation of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to put_bio().
 - Documented opt in blkdev_issue_zone_report.
 - Removed include/uapi/linux/fs.h from this patch.
---
 MAINTAINERS                       |   9 ++
 block/blk-lib.c                   |  95 +++++++++++++++++
 drivers/scsi/sd.c                 | 118 +++++++++++++++++++++
 drivers/scsi/sd.h                 |   1 +
 include/linux/bio.h               |   7 +-
 include/linux/blk_types.h         |   6 +-
 include/linux/blkzoned_api.h      |  25 +++++
 include/uapi/linux/Kbuild         |   1 +
 include/uapi/linux/blkzoned_api.h | 214 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 474 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 771c31c..32f5598 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12785,6 +12785,15 @@ F:	Documentation/networking/z8530drv.txt
 F:	drivers/net/hamradio/*scc.c
 F:	drivers/net/hamradio/z8530.h
 
+ZBC AND ZBC BLOCK DEVICES
+M:	Shaun Tancheff <shaun.tancheff@seagate.com>
+W:	http://seagate.com
+W:	https://github.com/Seagate/ZDM-Device-Mapper
+L:	linux-block@vger.kernel.org
+S:	Maintained
+F:	include/linux/blkzoned_api.h
+F:	include/uapi/linux/blkzoned_api.h
+
 ZBUD COMPRESSED PAGE ALLOCATOR
 M:	Seth Jennings <sjenning@redhat.com>
 L:	linux-mm@kvack.org
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 083e56f..6dcdcbf 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -6,6 +6,7 @@
 #include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/scatterlist.h>
+#include <linux/blkzoned_api.h>
 
 #include "blk.h"
 
@@ -266,3 +267,97 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_issue_zone_report - queue a report zones operation
+ * @bdev:	target blockdev
+ * @op_flags:	extra bio rw flags. If unsure, use 0.
+ * @sector:	starting sector (report will include this sector).
+ * @opt:	See: zone_report_option, default is 0 (all zones).
+ * @page:	one or more contiguous pages.
+ * @pgsz:	up to size of page in bytes, size of report.
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *    Issue a zone report request for the sectors in question.
+ */
+int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags,
+			     sector_t sector, u8 opt, struct page *page,
+			     size_t pgsz, gfp_t gfp_mask)
+{
+	struct bdev_zone_report *conv = page_address(page);
+	struct bio *bio;
+	unsigned int nr_iovecs = 1;
+	int ret = 0;
+
+	if (pgsz < (sizeof(struct bdev_zone_report) +
+		    sizeof(struct bdev_zone_descriptor)))
+		return -EINVAL;
+
+	bio = bio_alloc(gfp_mask, nr_iovecs);
+	if (!bio)
+		return -ENOMEM;
+
+	conv->descriptor_count = 0;
+	bio->bi_iter.bi_sector = sector;
+	bio->bi_bdev = bdev;
+	bio->bi_vcnt = 0;
+	bio->bi_iter.bi_size = 0;
+
+	bio_add_page(bio, page, pgsz, 0);
+	bio_set_op_attrs(bio, REQ_OP_ZONE_REPORT, op_flags);
+	ret = submit_bio_wait(bio);
+
+	/*
+	 * When our request is nak'd the underlying device may be conventional
+	 * so ... report a single conventional zone the size of the device.
+	 */
+	if (ret == -EIO && conv->descriptor_count) {
+		/* Adjust the conventional to the size of the partition ... */
+		__be64 blksz = cpu_to_be64(bdev->bd_part->nr_sects);
+
+		conv->maximum_lba = blksz;
+		conv->descriptors[0].type = ZTYP_CONVENTIONAL;
+		conv->descriptors[0].flags = ZCOND_CONVENTIONAL << 4;
+		conv->descriptors[0].length = blksz;
+		conv->descriptors[0].lba_start = 0;
+		conv->descriptors[0].lba_wptr = blksz;
+		ret = 0;
+	}
+	bio_put(bio);
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_zone_report);
+
+/**
+ * blkdev_issue_zone_action - issue a zone action (open, close or reset)
+ * @bdev:	target blockdev
+ * @op:		One of REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_RESET
+ * @op_flags:	extra bio rw flags. If unsure, use 0.
+ * @sector:	starting lba of sector
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *    Issue a zone action (open, close or reset) for the zone containing @sector.
+ */
+int blkdev_issue_zone_action(struct block_device *bdev, unsigned int op,
+			     unsigned int op_flags, sector_t sector,
+			     gfp_t gfp_mask)
+{
+	int ret;
+	struct bio *bio;
+
+	bio = bio_alloc(gfp_mask, 1);
+	if (!bio)
+		return -ENOMEM;
+
+	bio->bi_iter.bi_sector = sector;
+	bio->bi_bdev = bdev;
+	bio->bi_vcnt = 0;
+	bio->bi_iter.bi_size = 0;
+	bio_set_op_attrs(bio, op, op_flags);
+	ret = submit_bio_wait(bio);
+	bio_put(bio);
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_zone_action);
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d3e852a..b54f359 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -52,6 +52,7 @@
 #include <linux/slab.h>
 #include <linux/pm_runtime.h>
 #include <linux/pr.h>
+#include <linux/blkzoned_api.h>
 #include <asm/uaccess.h>
 #include <asm/unaligned.h>
 
@@ -1134,6 +1135,115 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt)
 	return ret;
 }
 
+static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_device *sdp = cmd->device;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	struct bio *bio = rq->bio;
+	sector_t sector = blk_rq_pos(rq);
+	struct gendisk *disk = rq->rq_disk;
+	unsigned int nr_bytes = blk_rq_bytes(rq);
+	int ret = BLKPREP_KILL;
+
+	WARN_ON(nr_bytes == 0);
+
+	/*
+	 * For conventional drives generate a report that shows a
+	 * large single conventional zone the size of the block device
+	 */
+	if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC) {
+		void *src;
+		struct bdev_zone_report *conv;
+
+		if (nr_bytes < sizeof(struct bdev_zone_report))
+			goto out;
+
+		src = kmap_atomic(bio->bi_io_vec->bv_page);
+		conv = src + bio->bi_io_vec->bv_offset;
+		conv->descriptor_count = cpu_to_be32(1);
+		conv->same_field = ZS_ALL_SAME;
+		conv->maximum_lba = cpu_to_be64(disk->part0.nr_sects);
+		kunmap_atomic(src);
+		goto out;
+	}
+
+	ret = scsi_init_io(cmd);
+	if (ret != BLKPREP_OK)
+		goto out;
+
+	cmd = rq->special;
+	if (sdp->changed) {
+		pr_err("SCSI disk has been changed or is not present.");
+		ret = BLKPREP_KILL;
+		goto out;
+	}
+
+	cmd->cmd_len = 16;
+	memset(cmd->cmnd, 0, cmd->cmd_len);
+	cmd->cmnd[0] = ZBC_IN;
+	cmd->cmnd[1] = ZI_REPORT_ZONES;
+	put_unaligned_be64(sector, &cmd->cmnd[2]);
+	put_unaligned_be32(nr_bytes, &cmd->cmnd[10]);
+	/* FUTURE ... when streamid is available */
+	/* cmd->cmnd[14] = bio_get_streamid(bio); */
+
+	cmd->sc_data_direction = DMA_FROM_DEVICE;
+	cmd->sdb.length = nr_bytes;
+	cmd->transfersize = sdp->sector_size;
+	cmd->underflow = 0;
+	cmd->allowed = SD_MAX_RETRIES;
+	ret = BLKPREP_OK;
+out:
+	return ret;
+}
+
+static int sd_setup_zone_action_cmnd(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	sector_t sector = blk_rq_pos(rq);
+	int ret = BLKPREP_KILL;
+	u8 allbit = 0;
+
+	if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC)
+		goto out;
+
+	if (sector == ~0ul) {
+		allbit = 1;
+		sector = 0;
+	}
+
+	cmd->cmd_len = 16;
+	memset(cmd->cmnd, 0, cmd->cmd_len);
+	memset(&cmd->sdb, 0, sizeof(cmd->sdb));
+	cmd->cmnd[0] = ZBC_OUT;
+	switch (req_op(rq)) {
+	case REQ_OP_ZONE_OPEN:
+		cmd->cmnd[1] = ZO_OPEN_ZONE;
+		break;
+	case REQ_OP_ZONE_CLOSE:
+		cmd->cmnd[1] = ZO_CLOSE_ZONE;
+		break;
+	case REQ_OP_ZONE_RESET:
+		cmd->cmnd[1] = ZO_RESET_WRITE_POINTER;
+		break;
+	default:
+		goto out;
+	}
+	cmd->cmnd[14] = allbit;
+	put_unaligned_be64(sector, &cmd->cmnd[2]);
+
+	cmd->transfersize = 0;
+	cmd->underflow = 0;
+	cmd->allowed = SD_MAX_RETRIES;
+	cmd->sc_data_direction = DMA_NONE;
+
+	ret = BLKPREP_OK;
+out:
+	return ret;
+}
+
 static int sd_init_command(struct scsi_cmnd *cmd)
 {
 	struct request *rq = cmd->request;
@@ -1148,6 +1258,12 @@ static int sd_init_command(struct scsi_cmnd *cmd)
 	case REQ_OP_READ:
 	case REQ_OP_WRITE:
 		return sd_setup_read_write_cmnd(cmd);
+	case REQ_OP_ZONE_REPORT:
+		return sd_setup_zone_report_cmnd(cmd);
+	case REQ_OP_ZONE_OPEN:
+	case REQ_OP_ZONE_CLOSE:
+	case REQ_OP_ZONE_RESET:
+		return sd_setup_zone_action_cmnd(cmd);
 	default:
 		BUG();
 	}
@@ -2737,6 +2853,8 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
 		queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, sdkp->disk->queue);
 	}
 
+	sdkp->zoned = (buffer[8] >> 4) & 3;
+
  out:
 	kfree(buffer);
 }
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 765a6f1..f782990 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -94,6 +94,7 @@ struct scsi_disk {
 	unsigned	lbpvpd : 1;
 	unsigned	ws10 : 1;
 	unsigned	ws16 : 1;
+	unsigned	zoned: 2;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 583c108..ffa3179 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -79,7 +79,12 @@ static inline bool bio_has_data(struct bio *bio)
 
 static inline bool bio_no_advance_iter(struct bio *bio)
 {
-	return bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_WRITE_SAME;
+	return bio_op(bio) == REQ_OP_DISCARD ||
+	       bio_op(bio) == REQ_OP_WRITE_SAME ||
+	       bio_op(bio) == REQ_OP_ZONE_REPORT ||
+	       bio_op(bio) == REQ_OP_ZONE_OPEN ||
+	       bio_op(bio) == REQ_OP_ZONE_CLOSE ||
+	       bio_op(bio) == REQ_OP_ZONE_RESET;
 }
 
 static inline bool bio_is_rw(struct bio *bio)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index f254eb2..35e3813 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -231,13 +231,17 @@ enum rq_flag_bits {
 enum req_op {
 	REQ_OP_READ,
 	REQ_OP_WRITE,
+	REQ_OP_ZONE_REPORT,
+	REQ_OP_ZONE_OPEN,
+	REQ_OP_ZONE_CLOSE,
+	REQ_OP_ZONE_RESET,
 	REQ_OP_DISCARD,		/* request to discard sectors */
 	REQ_OP_SECURE_ERASE,	/* request to securely erase sectors */
 	REQ_OP_WRITE_SAME,	/* write same block many times */
 	REQ_OP_FLUSH,		/* request for cache flush */
 };
 
-#define REQ_OP_BITS 3
+#define REQ_OP_BITS 4
 
 typedef unsigned int blk_qc_t;
 #define BLK_QC_T_NONE	-1U
diff --git a/include/linux/blkzoned_api.h b/include/linux/blkzoned_api.h
new file mode 100644
index 0000000..47c091a
--- /dev/null
+++ b/include/linux/blkzoned_api.h
@@ -0,0 +1,25 @@
+/*
+ * Functions for zone based SMR devices.
+ *
+ * Copyright (C) 2015 Seagate Technology PLC
+ *
+ * Written by:
+ * Shaun Tancheff <shaun.tancheff@seagate.com>
+ *
+ * This file is licensed under  the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#ifndef _BLKZONED_API_H
+#define _BLKZONED_API_H
+
+#include <uapi/linux/blkzoned_api.h>
+
+extern int blkdev_issue_zone_action(struct block_device *, unsigned int op,
+				    unsigned int op_flags, sector_t, gfp_t);
+extern int blkdev_issue_zone_report(struct block_device *, unsigned int op_flgs,
+				    sector_t, u8 opt, struct page *, size_t,
+				    gfp_t);
+
+#endif /* _BLKZONED_API_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index ec10cfe..b94461c 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -70,6 +70,7 @@ header-y += bfs_fs.h
 header-y += binfmts.h
 header-y += blkpg.h
 header-y += blktrace_api.h
+header-y += blkzoned_api.h
 header-y += bpf_common.h
 header-y += bpf.h
 header-y += bpqether.h
diff --git a/include/uapi/linux/blkzoned_api.h b/include/uapi/linux/blkzoned_api.h
new file mode 100644
index 0000000..48c17ad
--- /dev/null
+++ b/include/uapi/linux/blkzoned_api.h
@@ -0,0 +1,214 @@
+/*
+ * Functions for zone based SMR devices.
+ *
+ * Copyright (C) 2015 Seagate Technology PLC
+ *
+ * Written by:
+ * Shaun Tancheff <shaun.tancheff@seagate.com>
+ *
+ * This file is licensed under  the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#ifndef _UAPI_BLKZONED_API_H
+#define _UAPI_BLKZONED_API_H
+
+#include <linux/types.h>
+
+/**
+ * enum zone_report_option - Report Zones types to be included.
+ *
+ * @ZOPT_NON_SEQ_AND_RESET: Default (all zones).
+ * @ZOPT_ZC1_EMPTY: Zones which are empty.
+ * @ZOPT_ZC2_OPEN_IMPLICIT: Zones open but not explicitly opened
+ * @ZOPT_ZC3_OPEN_EXPLICIT: Zones opened explicitly
+ * @ZOPT_ZC4_CLOSED: Zones closed for writing.
+ * @ZOPT_ZC5_FULL: Zones that are full.
+ * @ZOPT_ZC6_READ_ONLY: Zones that are read-only
+ * @ZOPT_ZC7_OFFLINE: Zones that are offline
+ * @ZOPT_RESET: Zones that are empty
+ * @ZOPT_NON_SEQ: Zones that have HA media-cache writes pending
+ * @ZOPT_NON_WP_ZONES: Zones that do not have Write Pointers (conventional)
+ * @ZOPT_PARTIAL_FLAG: Modifies the definition of the Zone List Length field.
+ *
+ * Used by Report Zones in bdev_zone_get_report: report_option
+ */
+enum zone_report_option {
+	ZOPT_NON_SEQ_AND_RESET   = 0x00,
+	ZOPT_ZC1_EMPTY,
+	ZOPT_ZC2_OPEN_IMPLICIT,
+	ZOPT_ZC3_OPEN_EXPLICIT,
+	ZOPT_ZC4_CLOSED,
+	ZOPT_ZC5_FULL,
+	ZOPT_ZC6_READ_ONLY,
+	ZOPT_ZC7_OFFLINE,
+	ZOPT_RESET               = 0x10,
+	ZOPT_NON_SEQ             = 0x11,
+	ZOPT_NON_WP_ZONES        = 0x3f,
+	ZOPT_PARTIAL_FLAG        = 0x80,
+};
+
+/**
+ * enum bdev_zone_type - Type of zone in descriptor
+ *
+ * @ZTYP_RESERVED: Reserved
+ * @ZTYP_CONVENTIONAL: Conventional random write zone (No Write Pointer)
+ * @ZTYP_SEQ_WRITE_REQUIRED: Non-sequential writes are rejected.
+ * @ZTYP_SEQ_WRITE_PREFERRED: Non-sequential writes allowed but discouraged.
+ *
+ * Returned from Report Zones. See bdev_zone_descriptor* type.
+ */
+enum bdev_zone_type {
+	ZTYP_RESERVED            = 0,
+	ZTYP_CONVENTIONAL        = 1,
+	ZTYP_SEQ_WRITE_REQUIRED  = 2,
+	ZTYP_SEQ_WRITE_PREFERRED = 3,
+};
+
+
+/**
+ * enum bdev_zone_condition - Condition of zone in descriptor
+ *
+ * @ZCOND_CONVENTIONAL: N/A
+ * @ZCOND_ZC1_EMPTY: Empty
+ * @ZCOND_ZC2_OPEN_IMPLICIT: Opened via write to zone.
+ * @ZCOND_ZC3_OPEN_EXPLICIT: Opened via open zone command.
+ * @ZCOND_ZC4_CLOSED: Closed
+ * @ZCOND_ZC6_READ_ONLY:
+ * @ZCOND_ZC5_FULL: No remaining space in zone.
+ * @ZCOND_ZC7_OFFLINE: Offline
+ *
+ * Returned from Report Zones. See bdev_zone_descriptor* flags.
+ */
+enum bdev_zone_condition {
+	ZCOND_CONVENTIONAL       = 0,
+	ZCOND_ZC1_EMPTY          = 1,
+	ZCOND_ZC2_OPEN_IMPLICIT  = 2,
+	ZCOND_ZC3_OPEN_EXPLICIT  = 3,
+	ZCOND_ZC4_CLOSED         = 4,
+	/* 0x5 to 0xC are reserved */
+	ZCOND_ZC6_READ_ONLY      = 0xd,
+	ZCOND_ZC5_FULL           = 0xe,
+	ZCOND_ZC7_OFFLINE        = 0xf,
+};
+
+
+/**
+ * enum bdev_zone_same - Report Zones same code.
+ *
+ * @ZS_ALL_DIFFERENT: All zones differ in type and size.
+ * @ZS_ALL_SAME: All zones are the same size and type.
+ * @ZS_LAST_DIFFERS: All zones are the same size and type except the last zone.
+ * @ZS_SAME_LEN_DIFF_TYPES: All zones are the same length but types differ.
+ *
+ * Returned from Report Zones. See bdev_zone_report* same_field.
+ */
+enum bdev_zone_same {
+	ZS_ALL_DIFFERENT        = 0,
+	ZS_ALL_SAME             = 1,
+	ZS_LAST_DIFFERS         = 2,
+	ZS_SAME_LEN_DIFF_TYPES  = 3,
+};
+
+
+/**
+ * struct bdev_zone_get_report - ioctl: Report Zones request
+ *
+ * @zone_locator_lba: starting lba for first [reported] zone
+ * @return_page_count: number of *bytes* allocated for result
+ * @report_option: see: zone_report_option enum
+ *
+ * Used to issue report zones command to connected device
+ */
+struct bdev_zone_get_report {
+	__u64 zone_locator_lba;
+	__u32 return_page_count;
+	__u8  report_option;
+} __packed;
+
+/**
+ * struct bdev_zone_descriptor_le - See: bdev_zone_descriptor
+ */
+struct bdev_zone_descriptor_le {
+	__u8 type;
+	__u8 flags;
+	__u8 reserved1[6];
+	__le64 length;
+	__le64 lba_start;
+	__le64 lba_wptr;
+	__u8 reserved[32];
+} __packed;
+
+
+/**
+ * struct bdev_zone_report_le - See: bdev_zone_report
+ */
+struct bdev_zone_report_le {
+	__le32 descriptor_count;
+	__u8 same_field;
+	__u8 reserved1[3];
+	__le64 maximum_lba;
+	__u8 reserved2[48];
+	struct bdev_zone_descriptor_le descriptors[0];
+} __packed;
+
+
+/**
+ * struct bdev_zone_descriptor - A Zone descriptor entry from report zones
+ *
+ * @type: see zone_type enum
+ * @flags: Bits 0:reset, 1:non-seq, 2-3: resv, 4-7: see zone_condition enum
+ * @reserved1: padding
+ * @length: length of zone in sectors
+ * @lba_start: lba where the zone starts.
+ * @lba_wptr: lba of the current write pointer.
+ * @reserved: padding
+ *
+ */
+struct bdev_zone_descriptor {
+	__u8 type;
+	__u8 flags;
+	__u8  reserved1[6];
+	__be64 length;
+	__be64 lba_start;
+	__be64 lba_wptr;
+	__u8 reserved[32];
+} __packed;
+
+
+/**
+ * struct bdev_zone_report - Report Zones result
+ *
+ * @descriptor_count: Number of descriptor entries that follow
+ * @same_field: bits 0-3: enum zone_same (MASK: 0x0F)
+ * @reserved1: padding
+ * @maximum_lba: LBA of the last logical sector on the device, inclusive
+ *               of all logical sectors in all zones.
+ * @reserved2: padding
+ * @descriptors: array of descriptors follows.
+ */
+struct bdev_zone_report {
+	__be32 descriptor_count;
+	__u8 same_field;
+	__u8 reserved1[3];
+	__be64 maximum_lba;
+	__u8 reserved2[48];
+	struct bdev_zone_descriptor descriptors[0];
+} __packed;
+
+
+/**
+ * struct bdev_zone_report_io - Report Zones ioctl argument.
+ *
+ * @in: Report Zones inputs
+ * @out: Report Zones output
+ */
+struct bdev_zone_report_io {
+	union {
+		struct bdev_zone_get_report in;
+		struct bdev_zone_report out;
+	} data;
+} __packed;
+
+#endif /* _UAPI_BLKZONED_API_H */
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer
  2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff
  2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff
@ 2016-07-29 19:02 ` Shaun Tancheff
  2016-08-01  9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig
  2 siblings, 0 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-07-29 19:02 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman, Shaun Tancheff

Add support for ZBC ioctls (a rough userspace usage sketch follows the list):
    BLKREPORT    - Issue Report Zones to device.
    BLKOPENZONE  - Issue Zone Action: Open Zone command.
    BLKCLOSEZONE - Issue Zone Action: Close Zone command.
    BLKRESETZONE - Issue Zone Action: Reset Zone command.

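A minimal userspace sketch of the report path (assuming the uapi header from
this series is installed as <linux/blkzoned_api.h>):

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <arpa/inet.h>          /* ntohl() for the __be32 count */
    #include <linux/blkzoned_api.h>

    int main(int argc, char **argv)
    {
        size_t sz = 64 * 1024;      /* bytes reserved for the report */
        struct bdev_zone_report_io *zi = calloc(1, sz);
        int fd, ret;

        if (argc != 2 || !zi)
            return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;

        zi->data.in.zone_locator_lba = 0;    /* start of device */
        zi->data.in.return_page_count = sz;  /* bytes allocated for result */
        zi->data.in.report_option = 0;       /* all zones */

        ret = ioctl(fd, BLKREPORT, zi);
        if (!ret)
            printf("%u zone descriptors returned\n",
                   ntohl(zi->data.out.descriptor_count));
        close(fd);
        free(zi);
        return ret ? 1 : 0;
    }
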
Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
---
v6:
 - Added GFP_DMA to gfp mask.
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
---
 block/ioctl.c                     | 110 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/blkzoned_api.h |   6 +++
 include/uapi/linux/fs.h           |   1 +
 3 files changed, 117 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index ed2397f..a2a6c2c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -7,6 +7,7 @@
 #include <linux/backing-dev.h>
 #include <linux/fs.h>
 #include <linux/blktrace_api.h>
+#include <linux/blkzoned_api.h>
 #include <linux/pr.h>
 #include <asm/uaccess.h>
 
@@ -194,6 +195,109 @@ int blkdev_reread_part(struct block_device *bdev)
 }
 EXPORT_SYMBOL(blkdev_reread_part);
 
+static int blk_zoned_report_ioctl(struct block_device *bdev, fmode_t mode,
+		void __user *parg)
+{
+	int error = -EFAULT;
+	gfp_t gfp = GFP_KERNEL | GFP_DMA;
+	struct bdev_zone_report_io *zone_iodata = NULL;
+	int order = 0;
+	struct page *pgs = NULL;
+	u32 alloc_size = PAGE_SIZE;
+	unsigned long op_flags = 0;
+	u8 opt = 0;
+
+	if (!(mode & FMODE_READ))
+		return -EBADF;
+
+	zone_iodata = (void *)get_zeroed_page(gfp);
+	if (!zone_iodata) {
+		error = -ENOMEM;
+		goto report_zones_out;
+	}
+	if (copy_from_user(zone_iodata, parg, sizeof(*zone_iodata))) {
+		error = -EFAULT;
+		goto report_zones_out;
+	}
+	if (zone_iodata->data.in.return_page_count > alloc_size) {
+		int npages;
+
+		alloc_size = zone_iodata->data.in.return_page_count;
+		npages = (alloc_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		pgs = alloc_pages(gfp, ilog2(npages));
+		if (pgs) {
+			void *mem = page_address(pgs);
+
+			if (!mem) {
+				error = -ENOMEM;
+				goto report_zones_out;
+			}
+			order = ilog2(npages);
+			memset(mem, 0, alloc_size);
+			memcpy(mem, zone_iodata, sizeof(*zone_iodata));
+			free_page((unsigned long)zone_iodata);
+			zone_iodata = mem;
+		} else {
+			/* Result requires DMA capable memory */
+			pr_err("Not enough memory available for request.\n");
+			error = -ENOMEM;
+			goto report_zones_out;
+		}
+	}
+	opt = zone_iodata->data.in.report_option;
+	error = blkdev_issue_zone_report(bdev, op_flags,
+			zone_iodata->data.in.zone_locator_lba, opt,
+			pgs ? pgs : virt_to_page(zone_iodata),
+			alloc_size, GFP_KERNEL);
+
+	if (error)
+		goto report_zones_out;
+
+	if (copy_to_user(parg, zone_iodata, alloc_size))
+		error = -EFAULT;
+
+report_zones_out:
+	if (pgs)
+		__free_pages(pgs, order);
+	else if (zone_iodata)
+		free_page((unsigned long)zone_iodata);
+	return error;
+}
+
+static int blk_zoned_action_ioctl(struct block_device *bdev, fmode_t mode,
+				  unsigned int cmd, unsigned long arg)
+{
+	unsigned int op = 0;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	/*
+	 * When acting on zones we explicitly disallow using a partition.
+	 */
+	if (bdev != bdev->bd_contains) {
+		pr_err("%s: All zone operations disallowed on this device\n",
+			__func__);
+		return -EFAULT;
+	}
+
+	switch (cmd) {
+	case BLKOPENZONE:
+		op = REQ_OP_ZONE_OPEN;
+		break;
+	case BLKCLOSEZONE:
+		op = REQ_OP_ZONE_CLOSE;
+		break;
+	case BLKRESETZONE:
+		op = REQ_OP_ZONE_RESET;
+		break;
+	default:
+		pr_err("%s: Unknown action: %u\n", __func__, cmd);
+		return -EINVAL;
+	}
+	return blkdev_issue_zone_action(bdev, op, 0, arg, GFP_KERNEL);
+}
+
 static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 		unsigned long arg, unsigned long flags)
 {
@@ -568,6 +672,12 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	case BLKTRACESETUP:
 	case BLKTRACETEARDOWN:
 		return blk_trace_ioctl(bdev, cmd, argp);
+	case BLKREPORT:
+		return blk_zoned_report_ioctl(bdev, mode, argp);
+	case BLKOPENZONE:
+	case BLKCLOSEZONE:
+	case BLKRESETZONE:
+		return blk_zoned_action_ioctl(bdev, mode, cmd, arg);
 	case IOC_PR_REGISTER:
 		return blkdev_pr_register(bdev, argp);
 	case IOC_PR_RESERVE:
diff --git a/include/uapi/linux/blkzoned_api.h b/include/uapi/linux/blkzoned_api.h
index 48c17ad..3566de0 100644
--- a/include/uapi/linux/blkzoned_api.h
+++ b/include/uapi/linux/blkzoned_api.h
@@ -211,4 +211,10 @@ struct bdev_zone_report_io {
 	} data;
 } __packed;
 
+/* continuing from uapi/linux/fs.h: */
+#define BLKREPORT	_IOWR(0x12, 130, struct bdev_zone_report_io)
+#define BLKOPENZONE	_IO(0x12, 131)
+#define BLKCLOSEZONE	_IO(0x12, 132)
+#define BLKRESETZONE	_IO(0x12, 133)
+
 #endif /* _UAPI_BLKZONED_API_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3b00f7c..c0b565b 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -222,6 +222,7 @@ struct fsxattr {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+/* A jump here: See blkzoned_api.h, Reserving 130 to 133. */
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff
  2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff
  2016-07-29 19:02 ` [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer Shaun Tancheff
@ 2016-08-01  9:41 ` Christoph Hellwig
  2016-08-01 17:07     ` Shaun Tancheff
  2 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2016-08-01  9:41 UTC (permalink / raw)
  To: Shaun Tancheff
  Cc: linux-block, linux-scsi, linux-kernel, Jens Axboe, Jens Axboe,
	Christoph Hellwig, James E . J . Bottomley, Martin K . Petersen,
	Jeff Layton, J . Bruce Fields, Josh Bingaman


Can you please integrate this with Hannes series so that it uses
his cache of the zone information?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-01 17:07     ` Shaun Tancheff
  0 siblings, 0 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-08-01 17:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe,
	Jens Axboe, James E . J . Bottomley, Martin K . Petersen,
	Jeff Layton, J . Bruce Fields, Josh Bingaman, Hannes Reinecke,
	Damien Le Moal

On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>
> Can you please integrate this with Hannes series so that it uses
> his cache of the zone information?

Adding Hannes and Damien to Cc.

Christoph,

I can make a patch that marshals Hannes' RB-Tree into a block report; that is
quite simple. I can even have the open/close/reset zone commands update the
RB-Tree .. the non-private parts anyway. I would prefer to do this around the
CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
not need the RB-Tree to function with zoned media.

I do still have concerns with the approach which I have shared in smaller
forums but perhaps I have to bring them to this group.

First is the memory consumption. This isn't really much of a concern for large
servers with few drives but I think the embedded NAS market will grumble as
well as the large data pods trying to stuff 300+ drives in a chassis.

As of now the RB-Tree needs to hold ~30000 zones.
sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
around 3.5 MB per zoned drive attached.
Which is fine if it is really needed, but most of it is fixed information
and it can be significantly condensed (I have proposed 8 bytes per zone held
in an array as more than adequate). Worse is that the crucial piece of
information, the current wp needed for scheduling the next write, is mostly
out of date because it is updated only after the write completes and zones
being actively written to must work off of the last location / size that was
submitted, not completed. The workaround is for that tracking to be handled
in the private_data member. I am not saying that updating the wp on
completing a write isn’t important, I am saying that the bi_end_io hook is
the existing hook that works just fine.
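
(Purely illustrative, not from any posted patch, an 8-bytes-per-zone entry
could look something like the following, with the zone's start and length
implied by its index in the array:)

    /* Hypothetical compact per-zone entry; field widths are illustrative. */
    struct compact_zone {
        u64 wp_offset : 40;  /* write pointer, sectors from zone start */
        u64 type      :  4;  /* enum bdev_zone_type */
        u64 cond      :  4;  /* enum bdev_zone_condition */
        u64 reserved  : 16;
    };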

This all tails into domain responsibility. With the RB-Tree doing half of the
work and the ‘responsible’ domain handling the active path via private_data,
why have the split at all? It seems to be double work to have a second object
tracking the first so that I/O scheduling can function.

Finally, there is the error handling path: when the RB-Tree encounters an error
it attempts to re-query the drive topology, virtually guaranteeing that the
private_data is now out-of-sync with the RB-Tree. Again this is something
that can be better encapsulated in the bi_end_io to be informed of the
failed I/O and schedule the appropriate recovery (including re-querying the
zone information of the affected zone(s)).
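
(Again illustrative only, a bi_end_io hook along these lines, with a
hypothetical example_zone bookkeeping structure, is the sort of thing I
have in mind:)

    struct example_zone {               /* hypothetical per-zone bookkeeping */
        sector_t wp;
        struct work_struct requery_work;
    };

    static void example_zone_write_endio(struct bio *bio)
    {
        struct example_zone *zone = bio->bi_private;

        if (bio->bi_error)   /* WP now uncertain: re-report this zone */
            schedule_work(&zone->requery_work);
        bio_put(bio);
    }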

Anyway those are my concerns and why I am still reluctant to drop this line of
support. I have incorporated Hannes' changes at various points. Hence the
SCT Write Same to attempt to work around some of the flaws in mapping
discard to reset write pointer.

Thanks and Regards,
Shaun




-- 
Shaun Tancheff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-01 17:07     ` Shaun Tancheff
  (?)
@ 2016-08-02 14:35     ` Hannes Reinecke
  2016-08-03  1:29         ` Damien Le Moal
  -1 siblings, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2016-08-02 14:35 UTC (permalink / raw)
  To: Shaun Tancheff, Christoph Hellwig
  Cc: Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe,
	Jens Axboe, James E . J . Bottomley, Martin K . Petersen,
	Jeff Layton, J . Bruce Fields, Josh Bingaman, Damien Le Moal

On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>
>> Can you please integrate this with Hannes series so that it uses
>> his cache of the zone information?
> 
> Adding Hannes and Damien to Cc.
> 
> Christoph,
> 
> I can make a patch the marshal Hannes' RB-Tree into to a block report, that is
> quite simple. I can even have the open/close/reset zone commands update the
> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
> not need the RB-Tree to function with zoned media.
> 
> I do still have concerns with the approach which I have shared in smaller
> forums but perhaps I have to bring them to this group.
> 
> First is the memory consumption. This isn't really much of a concern for large
> servers with few drives but I think the embedded NAS market will grumble as
> well as the large data pods trying to stuff 300+ drives in a chassis.
> 
> As of now the RB-Tree needs to hold ~30000 zones.
> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
> around 3.5 MB per zoned drive attached.
> Which is fine if it is really needed, but most of it is fixed information
> and it can be significantly condensed (I have proposed 8 bytes per zone held
> in an array as more than adequate). Worse is that the crucial piece of
> information, the current wp needed for scheduling the next write, is mostly
> out of date because it is updated only after the write completes and zones
> being actively written to must work off of the last location / size that was
> submitted, not completed. The work around is for that tracking to be handled
> in the private_data member. I am not saying that updating the wp on
> completing a write isn’t important, I am saying that the bi_end_io hook is
> the existing hook that works just fine.
> 
Which _actually_ is not true; with my patches I'll update the write
pointer prior to submitting the I/O (on the reasoning that most of the time
the I/O will succeed) and re-read the zone information if an I/O fails.
(Which I'll have to do anyway, as after an I/O failure the write pointer
status is not clearly defined.)

I have thought about condensing the RB tree information, but then I
figured that for 'real' SMR handling we cannot assume all zones are of
fixed size, and hence we need all the information there.
Any condensing method would assume a given structure of the zones, which
the standard just doesn't provide.
Or am I missing something here?

As for write pointer handling: yes, the write pointer on the zones is
not really useful for upper-level usage.
Where we do need it is to detect I/O which is crossing the write pointer
(eg when doing reads over the entire zone).
As per spec you will be getting an I/O error here, so we need to split
the I/O on the write pointer to get valid results back.
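
(A rough sketch of that splitting, not taken from my posted patches; "wp" is
the absolute write pointer sector and the caller deals with the remainder,
e.g. by zero-filling it:)

    static struct bio *example_split_read_at_wp(struct bio *bio, sector_t wp,
                                                struct bio_set *bs)
    {
        if (bio->bi_iter.bi_sector < wp && bio_end_sector(bio) > wp)
            return bio_split(bio, wp - bio->bi_iter.bi_sector,
                             GFP_NOIO, bs);
        return NULL;    /* the read does not cross the WP, no split needed */
    }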

> This all tails into domain responsibility. With the RB-Tree doing half of the
> work and the ‘responsible’ domain handling the active path via private_data
> why have the split at all? It seems to be a double work to have second object
> tracking the first so that I/O scheduling can function.
> 
Tracking the zone states via RB trees is mainly to handle host-managed
drives seamlessly. Without an RB tree we will be subjected to I/O errors
during boot, as eg with new drives we inevitably will try to read beyond
the write pointer, which in turn will generate I/O errors on certain drives.
I do agree that there is no strict need to setup an RB tree for
host-aware drives; I can easily add an attribute/flag to disable RB
trees for those.
However, tracking zone information in the RB tree _dramatically_ speeds
up device initialisation; issuing a blkdev_discard() for the entire
drive will take _ages_ without it.

> Finally is the error handling path when the RB-Tree encounters and error it
> attempts to requery the drive topology virtually guaranteeing that the
> private_data is now out-of-sync with the RB-Tree. Again this is something
> that can be better encapsulated in the bi_end_io to be informed of the
> failed I/O and schedule the appropriate recovery (including re-querying the
> zone information of the affected zone(s)).
> 
Well, if we have an RB tree with write pointers of course we need to
re-read the zone information on failure (and I thought I did that?).
Plus the error information will be propagated via the usual mechanism,
so they need to take care of updating any information in ->private_data.

I'm perfectly fine with integrating your patches for the various
blkdev_XX callouts and associated ioctls; Jens favours that approach,
too, so we should be combining those efforts.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-03  1:29         ` Damien Le Moal
  0 siblings, 0 replies; 26+ messages in thread
From: Damien Le Moal @ 2016-08-03  1:29 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Shaun Tancheff, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

Hannes, Shaun,

Let me add some more comments.

> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> 
>>> Can you please integrate this with Hannes series so that it uses
>>> his cache of the zone information?
>> 
>> Adding Hannes and Damien to Cc.
>> 
>> Christoph,
>> 
>> I can make a patch to marshal Hannes' RB-Tree into a block report, that is
>> quite simple. I can even have the open/close/reset zone commands update the
>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>> not need the RB-Tree to function with zoned media.
>> 
>> I do still have concerns with the approach which I have shared in smaller
>> forums but perhaps I have to bring them to this group.
>> 
>> First is the memory consumption. This isn't really much of a concern for large
>> servers with few drives but I think the embedded NAS market will grumble as
>> well as the large data pods trying to stuff 300+ drives in a chassis.
>> 
>> As of now the RB-Tree needs to hold ~30000 zones.
>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
>> around 3.5 MB per zoned drive attached.
>> Which is fine if it is really needed, but most of it is fixed information
>> and it can be significantly condensed (I have proposed 8 bytes per zone held
>> in an array as more than adequate). Worse is that the crucial piece of
>> information, the current wp needed for scheduling the next write, is mostly
>> out of date because it is updated only after the write completes and zones
>> being actively written to must work off of the last location / size that was
>> submitted, not completed. The work around is for that tracking to be handled
>> in the private_data member. I am not saying that updating the wp on
>> completing a write isn’t important, I am saying that the bi_end_io hook is
>> the existing hook that works just fine.
>> 
> Which _actually_ is not true; with my patches I'll update the write
> pointer prior to submit the I/O (on the reasoning that most of the time
> I/O will succeed) and re-read the zone information if an I/O failed.
> (Which I'll have to do anyway as after an I/O failure the write pointer
> status is not clearly defined.)
> 
> I have thought about condensing the RB tree information, but then I
> figured that for 'real' SMR handling we cannot assume all zones are of
> fixed size, and hence we need all the information there.
> Any condensing method would assume a given structure of the zones, which
> the standard just doesn't provide.
> Or am I missing something here?

Indeed, the standards do not mandate any particular zone configuration,
constant zone size, etc. So writing the code so that this can be handled is
certainly the right way of doing things. However, if we decide to go forward
with mapping the RESET WRITE POINTER command to DISCARD, then at least a
constant zone size (minus the last zone, as you said) must be assumed, and
that information can be removed from the entries in the RB tree (as it will
be saved in the sysfs "zone_size" file anyway). Adding a little code to
handle an eventual last runt zone with a different size is not a big problem.

> 
> As for write pointer handling: yes, the write pointer on the zones is
> not really useful for upper-level usage.
> Where we do need it is to detect I/O which is crossing the write pointer
> (eg when doing reads over the entire zone).
> As per spec you will be getting an I/O error here, so we need to split
> the I/O on the write pointer to get valid results back.

To be precise here, the I/O splitting will be handled by the block layer
thanks to the "chunk_sectors" setting. But that relies on a constant zone
size assumption too.
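
As a rough illustration of that chunk_sectors splitting (a sketch only; the
helper name is made up, not from any posted patch):

#include <linux/blkdev.h>

/*
 * Sketch only, not from any posted patch: with a constant zone size,
 * exporting it as the queue's chunk_sectors makes the block layer
 * split any bio that would otherwise straddle a zone boundary, which
 * is the splitting behaviour referred to above.
 */
static void zbc_set_zone_geometry(struct request_queue *q,
				  unsigned int zone_sectors)
{
	/* chunk_sectors is in 512-byte sectors and must be a power of two */
	blk_queue_chunk_sectors(q, zone_sectors);

	/* if RESET WRITE POINTER is mapped to discard, align discards too */
	blk_queue_max_discard_sectors(q, zone_sectors);
	q->limits.discard_granularity = zone_sectors << 9;
}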

The RB-tree here is most useful for reads over or after the write pointer as
this can have different behavior on different drives (URSWRZ bit). The RB-tree
allows us to hide these differences from upper layers and simplify support at
those levels.

I too agree that the write pointer value is not very useful to upper layers
(e.g. FS). What matters most of the time for these layers is an "allocation
pointer" or "submission pointer" which indicates the LBA to use for the next
BIO submission. That LBA will most of the time be in advance of the zone WP
because of request queueing in the block scheduler.
Note that from what we have done already, in many cases, upper layers do not
even need this (e.g. F2FS, btrfs) and work fine without needing *any* 
zone->private_data because that "allocation pointer" information generally
already exists in a different form (e.g. a bit in a bitmap).
For these cases, not having the RB-tree of zones would force a lot
more code to be added to file systems for SMR support.
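
To make the allocation pointer vs. write pointer distinction concrete, a toy
sketch (names invented for illustration, not code from either series):

#include <linux/types.h>

/*
 * Illustration only: the "allocation pointer" an FS or DM target
 * advances at submission time versus the drive write pointer that is
 * only known to have moved once the write completes.
 */
struct zone_alloc_state {
	sector_t start;		/* first sector of the zone */
	sector_t alloc_ptr;	/* next LBA to hand out (submission side) */
	sector_t wp;		/* last known hardware write pointer */
};

/* pick the LBA for the next write and advance the allocation pointer */
static sector_t zone_alloc(struct zone_alloc_state *z, unsigned int nr_sects)
{
	sector_t lba = z->alloc_ptr;

	z->alloc_ptr += nr_sects;	/* may run ahead of z->wp */
	return lba;
}

/* completion side: only now do we know the drive's WP has advanced */
static void zone_write_done(struct zone_alloc_state *z, sector_t lba,
			    unsigned int nr_sects)
{
	if (lba + nr_sects > z->wp)
		z->wp = lba + nr_sects;
}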

> 
>> This all tails into domain responsibility. With the RB-Tree doing half of the
>> work and the ‘responsible’ domain handling the active path via private_data
>> why have the split at all? It seems to be a double work to have second object
>> tracking the first so that I/O scheduling can function.
>> 
> Tracking the zone states via RB trees is mainly to handle host-managed
> drives seamlessly. Without an RB tree we will be subjected to I/O errors
> during boot, as eg with new drives we inevitably will try to read beyond
> the write pointer, which in turn will generate I/O errors on certain drives.
> I do agree that there is no strict need to setup an RB tree for
> host-aware drives; I can easily add an attribute/flag to disable RB
> trees for those.
> However, tracking zone information in the RB tree _dramatically_ speeds
> up device initialisation; issuing a blkdev_discard() for the entire
> drive will take _ages_ without it.

And the RB-tree will also be very useful for speeding up report zones
ioctls too. Unless we want those to force an update of the RB-tree information ?

> 
>> Finally is the error handling path when the RB-Tree encounters and error it
>> attempts to requery the drive topology virtually guaranteeing that the
>> private_data is now out-of-sync with the RB-Tree. Again this is something
>> that can be better encapsulated in the bi_end_io to be informed of the
>> failed I/O and schedule the appropriate recovery (including re-querying the
>> zone information of the affected zone(s)).
>> 
> Well, if we have an RB tree with write pointers of course we need to
> re-read the zone information on failure (and I thought I did that?).
> Plus the error information will be propagated via the usual mechanism,
> so they need to take care of updating any information in ->private_data.
> 
> I'm perfectly fine with integrating your patches for the various
> blkdev_XX callouts and associated ioctls; Jens favours that approach,
> too, so we should be combining those efforts.

Can we agree on an interface?
Summarizing all the discussions we had, I am all in favor of the following:

1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
The already existing device type file can tell us if this is a host
managed disk (type==20) or a host aware disk (type==0). Drive managed
disks could also be detected, but since no information on their zone
configuration can be obtained, let's treat those as regular drives.

2) Add ioctls for zone management:
Report zones (get information from RB tree), reset zone (simple wrapper
to ioctl for block discard), open zone, close zone and finish zone. That
will allow mkfs like tools to get zone information without having to parse
thousands of sysfs files (and can also be integrated in libzbc block backend
driver for a unified interface with the direct SG_IO path for kernels without
the ZBC code enabled).

3) Try to condense the blkzone data structure to save memory:
I think that we can at the very least remove the zone length, and maybe also
the per-zone spinlock (a single spinlock and proper state flags can
be used).
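
For the sake of discussion, a condensed per-zone entry along the lines of
point 3 could look like this (a sketch with invented names, not the
struct blk_zone from the posted series):

#include <linux/types.h>

/*
 * Sketch of a condensed per-zone entry: no per-zone length (a constant
 * zone size is kept once, globally) and no per-zone spinlock, only
 * state/condition flags protected by a single lock or by atomic bit
 * operations.  Total size: 8 bytes per zone.
 */
struct blk_zone_small {
	u64 wp    : 40;		/* write pointer offset within the zone */
	u64 type  :  4;		/* conventional / seq-required / seq-preferred */
	u64 cond  :  4;		/* empty / open / full / offline ... */
	u64 flags : 16;		/* in-flight, needs-refresh, locked, ... */
};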

Best regards.

------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com 

Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer:

This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-05 20:35           ` Shaun Tancheff
  0 siblings, 0 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-08-05 20:35 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
> Hannes, Shaun,
>
> Let me add some more comments.
>
>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
>>
>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>>>
>>>> Can you please integrate this with Hannes series so that it uses
>>>> his cache of the zone information?
>>>
>>> Adding Hannes and Damien to Cc.
>>>
>>> Christoph,
>>>
>>> I can make a patch to marshal Hannes' RB-Tree into a block report, that is
>>> quite simple. I can even have the open/close/reset zone commands update the
>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>>> not need the RB-Tree to function with zoned media.

I have posted patches to integrate with the zone cache, hopefully they
make sense.

>>>
>>> I do still have concerns with the approach which I have shared in smaller
>>> forums but perhaps I have to bring them to this group.
>>>
>>> First is the memory consumption. This isn't really much of a concern for large
>>> servers with few drives but I think the embedded NAS market will grumble as
>>> well as the large data pods trying to stuff 300+ drives in a chassis.
>>>
>>> As of now the RB-Tree needs to hold ~30000 zones.
>>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
>>> around 3.5 MB per zoned drive attached.
>>> Which is fine if it is really needed, but most of it is fixed information
>>> and it can be significantly condensed (I have proposed 8 bytes per zone held
>>> in an array as more than adequate). Worse is that the crucial piece of
>>> information, the current wp needed for scheduling the next write, is mostly
>>> out of date because it is updated only after the write completes and zones
>>> being actively written to must work off of the last location / size that was
>>> submitted, not completed. The work around is for that tracking to be handled
>>> in the private_data member. I am not saying that updating the wp on
>>> completing a write isn’t important, I am saying that the bi_end_io hook is
>>> the existing hook that works just fine.
>>>
>> Which _actually_ is not true; with my patches I'll update the write
>> pointer prior to submit the I/O (on the reasoning that most of the time
>> I/O will succeed) and re-read the zone information if an I/O failed.
>> (Which I'll have to do anyway as after an I/O failure the write pointer
>> status is not clearly defined.)

Apologies for my mis-characterization.

>> I have thought about condensing the RB tree information, but then I
>> figured that for 'real' SMR handling we cannot assume all zones are of
>> fixed size, and hence we need all the information there.
>> Any condensing method would assume a given structure of the zones, which
>> the standard just doesn't provide.
>> Or am I missing something here?
>
> Indeed, the standards do not mandate any particular zone configuration,
> constant zone size, etc. So writing code so that can be handled is certainly
> the right way of doing things. However, if we decide to go forward with
> mapping RESET WRITE POINTER command to DISCARD, then at least a constant
> zone size (minus the last zone as you said) must be assumed, and that
> information can be removed from the entries in the RB tree (as it will be
> saved for the sysfs "zone_size" file anyway. Adding a little code to handle
> that eventual last runt zone with a different size is not a big problem.

>> As for write pointer handling: yes, the write pointer on the zones is
>> not really useful for upper-level usage.
>> Where we do need it is to detect I/O which is crossing the write pointer
>> (eg when doing reads over the entire zone).
>> As per spec you will be getting an I/O error here, so we need to split
>> the I/O on the write pointer to get valid results back.
>
> To be precise here, the I/O splitting will be handled by the block layer
> thanks to the "chunk_sectors" setting. But that relies on a constant zone
> size assumption too.
>
> The RB-tree here is most useful for reads over or after the write pointer as
> this can have different behavior on different drives (URSWRZ bit). The RB-tree
> allows us to hide these differences to upper layers and simplify support at
> those levels.

It is unfortunate that that path was chosen rather than returning Made Up
Data. However, I don't think this is a file system or device mapper problem,
as neither of those layers attempts to read blocks it has not written
(or discarded).
All I can think of is something that probes the drive media for a partition
table or other identification ... isn't the conventional space sufficient to
cover those cases?
Anyway, you could just handle the error and sense codes [CHECK CONDITION /
ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD
when URSWRZ is 0. It would have the same effect as using the zone cache,
without the possibility of the zone cache being out-of-sync.
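
A rough sketch of that completion-side handling (the sense key and ASC used
for the check are assumptions to be verified against the ZBC spec;
zero_fill_bio() is the existing block layer helper):

#include <linux/bio.h>
#include <scsi/scsi_proto.h>

/*
 * Sketch only, not from any posted patch: complete a failed read with
 * zero-filled "Made Up Data" when the failure was a read above the
 * write pointer on a drive reporting URSWRZ == 0.  The sense key and
 * ASC values below are assumptions that must be checked against ZBC.
 */
static bool zbc_read_as_mud(struct bio *bio, u8 sense_key, u8 asc)
{
	if (sense_key != ILLEGAL_REQUEST || asc != 0x21)
		return false;		/* not a read-above-WP failure */

	zero_fill_bio(bio);		/* hand back zeroes instead of -EIO */
	return true;
}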

> I too agree that the write pointer value is not very useful to upper layers
> (e.g. FS). What matters most of the times for these layers is an "allocation
> pointer" or "submission pointer" which indicates the LBA to use for the next
> BIO submission. That LBA will most of the time be in advance of the zone WP
> because of request queing in the block scheduler.

Which is kind of my point.

> Note that from what we have done already, in many cases, upper layers do not
> even need this (e.g. F2FS, btrfs) and work fine without needing *any*
> zone->private_data because that "allocation pointer" information generally
> already exists in a different form (e.g. a bit in a bitmap).

Agreed. Log-structured and copy-on-write file systems are a natural fit for
these types of drives.

> For these cases, not having the RB-tree of zones would force a lot
> more code to be added to file systems for SMR support.

I don't understand how the zone cache (RB-tree) is "helping" F2FS or btrfs.
Certainly any of these could trigger the zone cache to be loaded at mkfs
and/or mount time?

>>> This all tails into domain responsibility. With the RB-Tree doing half of the
>>> work and the ‘responsible’ domain handling the active path via private_data
>>> why have the split at all? It seems to be a double work to have second object
>>> tracking the first so that I/O scheduling can function.
>>>
>> Tracking the zone states via RB trees is mainly to handle host-managed
>> drives seamlessly. Without an RB tree we will be subjected to I/O errors
>> during boot, as eg with new drives we inevitably will try to read beyond
>> the write pointer, which in turn will generate I/O errors on certain drives.

If the zone cache is needed to boot then clearly my objection to reading
in the zone report information on drive attach is unfounded. I was under
the impression that this was not the case.

>> I do agree that there is no strict need to setup an RB tree for
>> host-aware drives; I can easily add an attribute/flag to disable RB
>> trees for those.
>> However, tracking zone information in the RB tree _dramatically_ speeds
>> up device initialisation; issuing a blkdev_discard() for the entire
>> drive will take _ages_ without it.
>
> And the RB-tree will also be very useful for speeding up report zones
> ioctls too. Unless we want those to force an update of the RB-tree information ?

That is debatable. I would tend to argue for both. *If* there must be a zone
cache, then reading from the zone cache is a must. Forcing an update of the
zone cache from the media seems like a reasonable thing to be able to do.

Also, the zone report is 'slow' in that there is an overhead for the report
itself, but the number of zones per query can be quite large, so 4 or 5 I/Os
that run into the hundreds of milliseconds to cache the entire drive isn't
really unworkable for something that is used infrequently.

>>> Finally is the error handling path when the RB-Tree encounters and error it
>>> attempts to requery the drive topology virtually guaranteeing that the
>>> private_data is now out-of-sync with the RB-Tree. Again this is something
>>> that can be better encapsulated in the bi_end_io to be informed of the
>>> failed I/O and schedule the appropriate recovery (including re-querying the
>>> zone information of the affected zone(s)).
>>>
>> Well, if we have an RB tree with write pointers of course we need to
>> re-read the zone information on failure (and I thought I did that?).
>> Plus the error information will be propagated via the usual mechanism,
>> so they need to take care of updating any information in ->private_data.
>>
>> I'm perfectly fine with integrating your patches for the various
>> blkdev_XX callouts and associated ioctls; Jens favours that approach,
>> too, so we should be combining those efforts.
>
> Can we agree on an interface ?
> Summarizing all the discussions we had, I am all in favor of the following:
>
> 1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
> The already existing device type file can tell us if this is an host
> managed disk (type==20) or a host aware disk (type==0). Drive managed
> disks could also be detected, but since no information on their zone
> configuration can be obtained, lets treat those as regular drives.

Since disk type == 0 for everything that isn't HM, I would prefer the
sysfs 'zoned' file just report whether the drive is HA or HM.

> 2) Add ioctls for zone management:
> Report zones (get information from RB tree), reset zone (simple wrapper
> to ioctl for block discard), open zone, close zone and finish zone. That
> will allow mkfs like tools to get zone information without having to parse
> thousands of sysfs files (and can also be integrated in libzbc block backend
> driver for a unified interface with the direct SG_IO path for kernels without
> the ZBC code enabled).

I can add finish zone ... but I really can't think of a use for it, myself.

> 3) Try to condense the blkzone data structure to save memory:
> I think that we can at the very least remove the zone length, and also
> may be the per zone spinlock too (a single spinlock and proper state flags can
> be used).

I have a variant that is an array of descriptors roughly mimicking the
API from blk-zoned.c, which I did a few months ago as an example.
I should be able to update that to the current kernel + patches.

-- 
Shaun Tancheff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-05 20:35           ` Shaun Tancheff
  (?)
@ 2016-08-09  6:47           ` Hannes Reinecke
  2016-08-09  8:09               ` Damien Le Moal
                               ` (2 more replies)
  -1 siblings, 3 replies; 26+ messages in thread
From: Hannes Reinecke @ 2016-08-09  6:47 UTC (permalink / raw)
  To: Shaun Tancheff, Damien Le Moal
  Cc: Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML,
	Jens Axboe, Jens Axboe, James E . J . Bottomley,
	Martin K . Petersen, Jeff Layton, J . Bruce Fields,
	Josh Bingaman

On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>> Hannes, Shaun,
>>
>> Let me add some more comments.
>>
>>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
>>>
>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>>>>
>>>>> Can you please integrate this with Hannes series so that it uses
>>>>> his cache of the zone information?
>>>>
>>>> Adding Hannes and Damien to Cc.
>>>>
>>>> Christoph,
>>>>
>>>> I can make a patch to marshal Hannes' RB-Tree into a block report, that is
>>>> quite simple. I can even have the open/close/reset zone commands update the
>>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>>>> not need the RB-Tree to function with zoned media.
> 
> I have posted patches to integrate with the zone cache, hopefully they
> make sense.
> 
[ .. ]
>>> I have thought about condensing the RB tree information, but then I
>>> figured that for 'real' SMR handling we cannot assume all zones are of
>>> fixed size, and hence we need all the information there.
>>> Any condensing method would assume a given structure of the zones, which
>>> the standard just doesn't provide.
>>> Or am I missing something here?
>>
>> Indeed, the standards do not mandate any particular zone configuration,
>> constant zone size, etc. So writing code so that can be handled is certainly
>> the right way of doing things. However, if we decide to go forward with
>> mapping RESET WRITE POINTER command to DISCARD, then at least a constant
>> zone size (minus the last zone as you said) must be assumed, and that
>> information can be removed from the entries in the RB tree (as it will be
>> saved for the sysfs "zone_size" file anyway. Adding a little code to handle
>> that eventual last runt zone with a different size is not a big problem.
> 
>>> As for write pointer handling: yes, the write pointer on the zones is
>>> not really useful for upper-level usage.
>>> Where we do need it is to detect I/O which is crossing the write pointer
>>> (eg when doing reads over the entire zone).
>>> As per spec you will be getting an I/O error here, so we need to split
>>> the I/O on the write pointer to get valid results back.
>>
>> To be precise here, the I/O splitting will be handled by the block layer
>> thanks to the "chunk_sectors" setting. But that relies on a constant zone
>> size assumption too.
>>
>> The RB-tree here is most useful for reads over or after the write pointer as
>> this can have different behavior on different drives (URSWRZ bit). The RB-tree
>> allows us to hide these differences to upper layers and simplify support at
>> those levels.
> 
> It is unfortunate that path was chosen rather than returning Made Up Data.
But how would you return made up data without knowing that you _have_ to
generate some?
Of course it's trivial to generate made-up data once an I/O error has
occurred, but by then it's too late: the I/O error has already happened.

The general idea behind this is that I _really_ do want to avoid
triggering an I/O error on initial access. This kind of thing really
tends to freak out users (connect a new drive and getting I/O errors;
not the best user experience ...)

> However I don't think this is a file system or device mapper problem as
> neither of those layers attempt to read blocks they have not written
> (or discarded).
> All I can think of something that probes the drive media for a partition table
> or other identification...isn't the conventional space sufficient to cover
> those cases?
Upon device scan the kernel will attempt to read the partition tables,
which consists of reading the first few megabytes and the last sector
(for GPT shadow partition tables).
And depending on the drive manufacturer you might or might not have
enough conventional space at the right locations.

> Anyway you could just handle the error and sense codes [CHECK CONDITION /
> ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD
> when URSWRZ is 0. It would have the same effect as using the zone cache
> without the possibility of the zone cache being out-of-sync.
> 
Yes, but then we would already have registered an I/O error.
And as indicated above, I really do _not_ want to handle all the
customer calls for supposedly faulty new SMR drives.

>> I too agree that the write pointer value is not very useful to upper layers
>> (e.g. FS). What matters most of the times for these layers is an "allocation
>> pointer" or "submission pointer" which indicates the LBA to use for the next
>> BIO submission. That LBA will most of the time be in advance of the zone WP
>> because of request queing in the block scheduler.
> 
> Which is kind of my point.
> 
>> Note that from what we have done already, in many cases, upper layers do not
>> even need this (e.g. F2FS, btrfs) and work fine without needing *any*
>> zone->private_data because that "allocation pointer" information generally
>> already exists in a different form (e.g. a bit in a bitmap).
> 
> Agreed. Log structured and copy on write file systems are a natural fit for
> these types of drives
> 
>> For these cases, not having the RB-tree of zones would force a lot
>> more code to be added to file systems for SMR support.
> 
> I don't understand how the zone cache (RB-tree) is "helping" F2FS or btrfs.
> Certainly any of these could trigger the zone cache to be loaded at mkfs
> and/or mount time?
> 
>>>> This all tails into domain responsibility. With the RB-Tree doing half of the
>>>> work and the ‘responsible’ domain handling the active path via private_data
>>>> why have the split at all? It seems to be a double work to have second object
>>>> tracking the first so that I/O scheduling can function.
>>>>
>>> Tracking the zone states via RB trees is mainly to handle host-managed
>>> drives seamlessly. Without an RB tree we will be subjected to I/O errors
>>> during boot, as eg with new drives we inevitably will try to read beyond
>>> the write pointer, which in turn will generate I/O errors on certain drives.
> 
> If the zone cache is needed to boot then clearly my objection to reading
> in the zone report information on drive attach is unfounded. I was under
> the impression that this was not the case.
> 
>>> I do agree that there is no strict need to setup an RB tree for
>>> host-aware drives; I can easily add an attribute/flag to disable RB
>>> trees for those.
>>> However, tracking zone information in the RB tree _dramatically_ speeds
>>> up device initialisation; issuing a blkdev_discard() for the entire
>>> drive will take _ages_ without it.
>>
>> And the RB-tree will also be very useful for speeding up report zones
>> ioctls too. Unless we want those to force an update of the RB-tree information ?
> 
> That is debatable. I would tend to argue for both. *If* there must be
> a zone cache then reading from the zone cache is a must. Forcing and
> update of the zone cache from the media seems like a reasonable thing
> to be able to do.
> 
> Also the zone report is 'slow' in that there is an overhead for the
> report itself but
> the number of zones per query can be quite large so 4 or 5 I/Os that
> run into the
> hundreds if milliseconds to cache the entire drive isn't really unworkable for
> something that is used infrequently.
> 
No, surely not.
But one of the _big_ advantages of the RB tree is blkdev_discard().
Without the RB tree any mkfs program will issue a 'discard' for every
sector. We will be able to coalesce those into one discard per zone, but
we still need to issue one for _every_ zone.
Which is (as indicated) really slow, and easily takes several minutes.
With the RB tree we can short-circuit discards to empty zones, and speed
up processing time dramatically.
Sure we could be moving the logic into mkfs and friends, but that would
require us to change the programs and agree on a library (libzbc?) which
should be handling that.
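
Roughly, the short-circuit described above (illustration only; the cached
zone structure is an invented stand-in for the actual zone cache entry):

#include <linux/blkdev.h>

/*
 * Sketch of the discard short-circuit: an empty zone (wp == start)
 * needs no reset at all, and a non-empty one needs exactly one,
 * zone-aligned discard.  "struct cached_zone" is invented for this
 * illustration.
 */
struct cached_zone {
	sector_t start;
	sector_t len;
	sector_t wp;
};

static int zoned_issue_discard(struct block_device *bdev,
			       struct cached_zone *zones, unsigned int nr)
{
	unsigned int i;
	int ret;

	for (i = 0; i < nr; i++) {
		if (zones[i].wp == zones[i].start)
			continue;			/* already empty */

		ret = blkdev_issue_discard(bdev, zones[i].start,
					   zones[i].len, GFP_KERNEL, 0);
		if (ret)
			return ret;
	}
	return 0;
}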

>>>> Finally is the error handling path when the RB-Tree encounters and error it
>>>> attempts to requery the drive topology virtually guaranteeing that the
>>>> private_data is now out-of-sync with the RB-Tree. Again this is something
>>>> that can be better encapsulated in the bi_end_io to be informed of the
>>>> failed I/O and schedule the appropriate recovery (including re-querying the
>>>> zone information of the affected zone(s)).
>>>>
>>> Well, if we have an RB tree with write pointers of course we need to
>>> re-read the zone information on failure (and I thought I did that?).
>>> Plus the error information will be propagated via the usual mechanism,
>>> so they need to take care of updating any information in ->private_data.
>>>
>>> I'm perfectly fine with integrating your patches for the various
>>> blkdev_XX callouts and associated ioctls; Jens favours that approach,
>>> too, so we should be combining those efforts.
>>
>> Can we agree on an interface ?
>> Summarizing all the discussions we had, I am all in favor of the following:
>>
>> 1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
>> The already existing device type file can tell us if this is an host
>> managed disk (type==20) or a host aware disk (type==0). Drive managed
>> disks could also be detected, but since no information on their zone
>> configuration can be obtained, lets treat those as regular drives.
> 
> Since disk type == 0 for everything that isn't HM so I would prefer the
> sysfs 'zoned' file just report if the drive is HA or HM.
> 
Okay. So let's put in the 'zoned' attribute the device type:
'host-managed', 'host-aware', or 'device managed'.

>> 2) Add ioctls for zone management:
>> Report zones (get information from RB tree), reset zone (simple wrapper
>> to ioctl for block discard), open zone, close zone and finish zone. That
>> will allow mkfs like tools to get zone information without having to parse
>> thousands of sysfs files (and can also be integrated in libzbc block backend
>> driver for a unified interface with the direct SG_IO path for kernels without
>> the ZBC code enabled).
> 
> I can add finish zone ... but I really can't think of a use for it, myself.
> 
Which is not the point. The standard defines this, so clearly someone
found it a reasonable addendum. So let's add this for completeness.

>> 3) Try to condense the blkzone data structure to save memory:
>> I think that we can at the very least remove the zone length, and also
>> may be the per zone spinlock too (a single spinlock and proper state flags can
>> be used).
> 
> I have a variant that is an array of descriptors that roughly mimics the
> api from blk-zoned.c that I did a few months ago as an example.
> I should be able to update that to the current kernel + patches.
> 
Okay. If we restrict the in-kernel SMR drive handling to devices with
identical zone sizes of course we can remove the zone length.
And we can do away with the per-zone spinlock and use a global one instead.

I will be updating my patchset accordingly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-09  8:09               ` Damien Le Moal
  0 siblings, 0 replies; 26+ messages in thread
From: Damien Le Moal @ 2016-08-09  8:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Shaun Tancheff, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

Hannes,

> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote:
[...]
>>> 
>>> Can we agree on an interface ?
>>> Summarizing all the discussions we had, I am all in favor of the following:
>>> 
>>> 1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
>>> The already existing device type file can tell us if this is an host
>>> managed disk (type==20) or a host aware disk (type==0). Drive managed
>>> disks could also be detected, but since no information on their zone
>>> configuration can be obtained, lets treat those as regular drives.
>> 
>> Since disk type == 0 for everything that isn't HM so I would prefer the
>> sysfs 'zoned' file just report if the drive is HA or HM.
>> 
> Okay. So let's put in the 'zoned' attribute the device type:
> 'host-managed', 'host-aware', or 'device managed'.

I hacked your patches and simply put a "0" or "1" in the sysfs zoned file.
Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
else. This means that drive managed models are not exposed as zoned block
devices. For HM vs HA differentiation, an application can look at the
device type file since it is already present.

We could indeed set the "zoned" file to the device type, but HA drives and
regular drives will both have "0" in it, so no differentiation would be
possible. The other choice could be the "zoned" bits defined by ZBC, but these
do not define a value for host managed drives, and the drive managed value
being not "0" could be confusing too. So I settled for a simple 0/1 boolean.
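
Presumably something along these lines (a sketch of such a boolean attribute;
the names are illustrative, not the actual patch):

/*
 * Sketch only: a boolean "zoned" sysfs attribute as described above,
 * reporting 1 for any drive with ZBC/ZAC command support (host-aware
 * or host-managed) and 0 otherwise.  struct scsi_disk and
 * to_scsi_disk() are the ones from drivers/scsi/sd; sdkp->zoned is an
 * assumed new flag, not an existing field.
 */
static ssize_t zoned_show(struct device *dev, struct device_attribute *attr,
			  char *buf)
{
	struct scsi_disk *sdkp = to_scsi_disk(dev);

	return snprintf(buf, 4, "%u\n", sdkp->zoned ? 1 : 0);
}
static DEVICE_ATTR_RO(zoned);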

> 
>>> 2) Add ioctls for zone management:
>>> Report zones (get information from RB tree), reset zone (simple wrapper
>>> to ioctl for block discard), open zone, close zone and finish zone. That
>>> will allow mkfs like tools to get zone information without having to parse
>>> thousands of sysfs files (and can also be integrated in libzbc block backend
>>> driver for a unified interface with the direct SG_IO path for kernels without
>>> the ZBC code enabled).
>> 
>> I can add finish zone ... but I really can't think of a use for it, myself.
>> 
> Which is not the point. The standard defines this, so clearly someone
> found it a reasonable addendum. So let's add this for completeness.

Done: I hacked Shaun's ioctl code and added finish zone too. The
difference with Shaun's initial code is that the ioctls are propagated down to
the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for
BIO request definitions for the zone operations. So a lot less code is added.
The ioctls do not mimic the ZBC standard exactly. For instance, there are no
reporting options for report zones, nor is the "all" bit supported for open,
close or finish zone commands. But the information provided on zones is complete
and maps to the standard definitions.

I also added a reset_wp ioctl for completeness, but this one simply calls
blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD
ioctl call with a range exactly aligned on a zone.
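
For readers not following the patches, that plumbing would look roughly as
below (a sketch; the ioctl numbers and the zbc_zone_action() helper are
placeholders, not the final interface):

#include <linux/ioctl.h>
#include <linux/errno.h>

/*
 * Sketch only: the ioctl falls through __blkdev_driver_ioctl() into
 * sd_ioctl(), which issues the matching ZBC/ZAC command directly,
 * with no new BIO/request types.  struct scsi_disk is the one from
 * drivers/scsi/sd.h; everything else here is a placeholder.
 */
#define BLKOPENZONE	_IO(0x12, 131)	/* placeholder numbers */
#define BLKCLOSEZONE	_IO(0x12, 132)
#define BLKFINISHZONE	_IO(0x12, 133)

enum zbc_zone_op { ZBC_OP_OPEN, ZBC_OP_CLOSE, ZBC_OP_FINISH };

/* placeholder: would build and send the ZBC OPEN/CLOSE/FINISH command */
static int zbc_zone_action(struct scsi_disk *sdkp, enum zbc_zone_op op,
			   sector_t sector)
{
	return 0;
}

/* called from sd_ioctl() for the zone commands */
static int sd_zbc_ioctl(struct scsi_disk *sdkp, unsigned int cmd,
			unsigned long arg)
{
	sector_t sector = (sector_t)arg;	/* zone start LBA from user */

	switch (cmd) {
	case BLKOPENZONE:
		return zbc_zone_action(sdkp, ZBC_OP_OPEN, sector);
	case BLKCLOSEZONE:
		return zbc_zone_action(sdkp, ZBC_OP_CLOSE, sector);
	case BLKFINISHZONE:
		return zbc_zone_action(sdkp, ZBC_OP_FINISH, sector);
	default:
		return -ENOTTY;
	}
}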

> 
>>> 3) Try to condense the blkzone data structure to save memory:
>>> I think that we can at the very least remove the zone length, and also
>>> may be the per zone spinlock too (a single spinlock and proper state flags can
>>> be used).
>> 
>> I have a variant that is an array of descriptors that roughly mimics the
>> api from blk-zoned.c that I did a few months ago as an example.
>> I should be able to update that to the current kernel + patches.
>> 
> Okay. If we restrict the in-kernel SMR drive handling to devices with
> identical zone sizes of course we can remove the zone length.
> And we can do away with the per-zone spinlock and use a global one instead.

Did that too. The blk_zone struct is now exactly 64B. I removed the per-zone
spinlock and replaced it with a flag so that zones can still be locked
independently using wait_on_bit_lock & friends (I added the functions
blk_zone_lock, blk_zone_trylock and blk_zone_unlock to do that). This per-zone
locking comes in handy to implement the ioctl code without stalling the command
queue for reads, writes and discards on different zones.
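
For reference, flag-based zone locking of that kind can be sketched as
follows (assuming a flags word in the zone entry and a lock bit; not the
exact code from the series):

#include <linux/wait.h>
#include <linux/bitops.h>
#include <linux/sched.h>

#define BLK_ZONE_LOCKED	0	/* bit number in the zone flags word (assumed) */

/* Sketch of bit-based per-zone locking, replacing a per-zone spinlock. */
static inline void blk_zone_lock(unsigned long *flags)
{
	/* sleeps until the bit is clear, then sets it */
	wait_on_bit_lock(flags, BLK_ZONE_LOCKED, TASK_UNINTERRUPTIBLE);
}

static inline int blk_zone_trylock(unsigned long *flags)
{
	return !test_and_set_bit_lock(BLK_ZONE_LOCKED, flags);
}

static inline void blk_zone_unlock(unsigned long *flags)
{
	clear_bit_unlock(BLK_ZONE_LOCKED, flags);
	smp_mb__after_atomic();
	wake_up_bit(flags, BLK_ZONE_LOCKED);
}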

However, I kept the zone length in the structure. The reason for doing so is that
this allows supporting drives with non-constant zone sizes, albeit with a more
limited interface, since in that case the device chunk_sectors is not set (that
is how a user or application can detect that the zones are not of constant size).
For these drives, the block layer may happily merge BIOs across zone boundaries
and the discard code will not split and align calls on the zones. But upper layers
(an FS or a device mapper) can still do all of this by themselves if they want
to/can support non-constant zone sizes.

The only exception is drives like the Seagate one with only the last zone of a
different size. This case is handled exactly as if all zones are the same size
simply because any operation on the last smaller zone will naturally align, as
the checks of the operation size against the drive capacity will do the right thing.

The ioctls work for all cases (drives with constant zone size or not). This is again
to allow supporting any unusual drives at the application level. I integrated all
these ioctls into the libzbc block device backend driver and everything is fine. I can't
tell the difference from direct-to-drive SG_IO accesses. But unlike those, the zone
ioctls keep the zone information RB-tree cache up to date.

> 
> I will be updating my patchset accordingly.

I need to clean up my code and rebase on top of 4.8-rc1. Let me do this and I will send
everything for review. If you have any comments on the above, please let me know and
I will be happy to incorporate changes.

Best regards.


------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-09  6:47           ` Hannes Reinecke
  2016-08-09  8:09               ` Damien Le Moal
@ 2016-08-10  3:26             ` Shaun Tancheff
  2016-08-14  0:09               ` Shaun Tancheff
  2 siblings, 0 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-08-10  3:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Damien Le Moal, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke <hare@suse.de> wrote:
> On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
>> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>>>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
>>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:

[trim]
>> Also the zone report is 'slow' in that there is an overhead for the
>> report itself but
>> the number of zones per query can be quite large so 4 or 5 I/Os that
>> run into the
>> hundreds if milliseconds to cache the entire drive isn't really unworkable for
>> something that is used infrequently.
>>
> No, surely not.
> But one of the _big_ advantages for the RB tree is blkdev_discard().
> Without the RB tree any mkfs program will issue a 'discard' for every
> sector. We will be able to coalesce those into one discard per zone, but
> we still need to issue one for _every_ zone.
> Which is (as indicated) really slow, and easily takes several minutes.
> With the RB tree we can short-circuit discards to empty zones, and speed
> up processing time dramatically.
> Sure we could be moving the logic into mkfs and friends, but that would
> require us to change the programs and agree on a library (libzbc?) which
> should be handling that.

Adding an additional library dependency seems overkill for a program
that is already doing ioctls and raw block I/O ... but I would leave that
up to each file system. As it sits, issuing the ioctl and walking the array
of data returned [see blkreport.c] is already quite trivial.
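
The core of it is little more than this (struct names here are
placeholders, not the actual definitions in blkzoned_api.h):

#include <stdio.h>
#include <linux/types.h>

/* Placeholder layout: a count followed by an array of fixed-size
 * descriptors, which is the general shape of the report returned by
 * the ioctl. */
struct zone_report_hdr { __u32 nr_zones; };
struct zone_descr { __u64 start; __u64 len; __u64 wp; __u8 type; __u8 cond; };

static void dump_zones(const struct zone_report_hdr *hdr)
{
        const struct zone_descr *z = (const struct zone_descr *)(hdr + 1);
        __u32 i;

        for (i = 0; i < hdr->nr_zones; i++, z++)
                printf("zone %u: start %llu len %llu wp %llu type %u cond %u\n",
                       i, (unsigned long long)z->start,
                       (unsigned long long)z->len,
                       (unsigned long long)z->wp, z->type, z->cond);
}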

I believe the goal here is for F2FS, and perhaps NILFS, to "just
work" with the DISCARD-to-Reset-WP mapping and the zone cache in place.

I am still quite skeptical about other common file systems
"just working" without their respective mkfs et al. being
zone aware and handling the topology of the media at mkfs time.
Perhaps there is something I am unaware of?

[trim]

>> I can add finish zone ... but I really can't think of a use for it, myself.
>>
> Which is not the point. The standard defines this, so clearly someone
> found it a reasonable addendum. So let's add this for completeness.

Agreed and queued for the next version.

Regards,
Shaun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-10  3:58                 ` Shaun Tancheff
  0 siblings, 0 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-08-10  3:58 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote:

[trim]

>>> Since disk type == 0 for everything that isn't HM so I would prefer the
>>> sysfs 'zoned' file just report if the drive is HA or HM.
>>>
>> Okay. So let's put in the 'zoned' attribute the device type:
>> 'host-managed', 'host-aware', or 'device managed'.
>
> I hacked your patches and simply put a "0" or "1" in the sysfs zoned file.
> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
> else. This means that drive managed models are not exposed as zoned block
> devices. For HM vs HA differentiation, an application can look at the
> device type file since it is already present.
>
> We could indeed set the "zoned" file to the device type, but HM drives and
> regular drives will both have "0" in it, so no differentiation possible.
> The other choice could be the "zoned" bits defined by ZBC, but these
> do not define a value for host managed drives, and the drive managed value
> being not "0" could be confusing too. So I settled for a simple 0/1 boolean.

This seems good to me.

>>>> 2) Add ioctls for zone management:
>>>> Report zones (get information from RB tree), reset zone (simple wrapper
>>>> to ioctl for block discard), open zone, close zone and finish zone. That
>>>> will allow mkfs like tools to get zone information without having to parse
>>>> thousands of sysfs files (and can also be integrated in libzbc block backend
>>>> driver for a unified interface with the direct SG_IO path for kernels without
>>>> the ZBC code enabled).
>>>
>>> I can add finish zone ... but I really can't think of a use for it, myself.
>>>
>> Which is not the point. The standard defines this, so clearly someone
>> found it a reasonable addendum. So let's add this for completeness.

Agreed.

> Done: I hacked Shaun ioctl code and added finish zone too. The
> difference with Shaun initial code is that the ioctl are propagated down to
> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for
> BIO request definition for the zone operations. So a lot less code added.

The purpose of the BIO flags is not to enable the ioctls so much as
the other way round. Creating BIO ops is to enable issuing ZBC
commands from device mapper targets and file systems without some
heinous ioctl hacks.
Making the resulting block layer interfaces available via ioctls is just a
reasonable way to exercise the code ... or that was my intent.
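
For example, from a DM target the whole thing reduces to roughly this
(a sketch; REQ_OP_ZONE_OPEN stands for the op name this series adds, and
the bio helpers are the 4.8-era ones):

/* Sketch: open the zone containing 'sector' from kernel code, no ioctl. */
static int example_open_zone(struct block_device *bdev, sector_t sector)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 0);
        int ret;

        if (!bio)
                return -ENOMEM;

        bio->bi_bdev = bdev;
        bio->bi_iter.bi_sector = sector;
        bio_set_op_attrs(bio, REQ_OP_ZONE_OPEN, 0);

        ret = submit_bio_wait(bio);
        bio_put(bio);
        return ret;
}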

> The ioctls do not mimic exactly the ZBC standard. For instance, there is no
> reporting options for report zones, nor is the "all" bit supported for open,
> close or finish zone commands. But the information provided on zones is complete
> and maps to the standard definitions.

For the reporting options I have planned to reuse the stream_id in
struct bio when that is formalized. There are certainly other places in
struct bio to stuff a few extra bits ...

As far as the all bit ... this is being handled by all the zone action
commands. Just pass a sector of ~0ul and it's handled in sd.c by
sd_setup_zone_action_cmnd().

Apologies as apparently my documentation here is lacking :-(
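
The relevant piece of sd_setup_zone_action_cmnd() is roughly this
(a simplified sketch, not the verbatim patch; cmd, rq and sdkp are the
usual sd.c command/request/disk variables and 'zone_action' stands in
for the ZBC OUT service action of the request):

sector_t sector = blk_rq_pos(rq);

cmd->cmd_len = 16;
cmd->cmnd[0] = ZBC_OUT;                 /* 0x94 */
cmd->cmnd[1] = zone_action;             /* open / close / finish / reset */
if (sector == (sector_t)~0ul)
        cmd->cmnd[14] |= 1;             /* ALL bit: act on every zone */
else
        put_unaligned_be64(sectors_to_logical(sdkp->device, sector),
                           &cmd->cmnd[2]);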

> I also added a reset_wp ioctl for completeness, but this one simply calls
> blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD
> ioctl call with a range exactly aligned on a zone.

I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that
creates a REQ_OP_ZONE_RESET .. same as open and close. My
expectation being that BLKDISCARD doesn't really need yet another alias.

[trim]

> Did that too. The blk_zone struct is now exactly 64B. I removed the per zone

Thanks .. being a cache line is harder to whinge about...

> spinlock and replaced it with a flag so that zones can still be locked
> independently using wait_on_bit_lock & friends (I added the functions
> blk_zone_lock, blk_zone_trylock and blk_zone_unlock to do that). This per zone
> locking comes in handy to implement the ioctls code without stalling the command
> queue for read, writes and discard on different zones.
>
> I however kept the zone length in the structure. The reason for doing so is that
> this allows supporting drives with non-constant zone sizes, albeit with a more
> limited interface since in such case the device chunk_sectors is not set (that
> is how a user or application can detect that the zones are not constant size).
> For these drives, the block layer may happily merge BIOs across zone boundaries
> and the discard code will not split and align calls on the zones. But upper layers
> (an FS or a device mapper) can still do all this by themselves if they want/can
> support non-constant zone sizes.
>
> The only exception is drives like the Seagate one with only the last zone of a
> different size. This case is handled exactly as if all zones are the same size
> simply because any operation on the last smaller zone will naturally align as
> the checks of operation size against the drive capacity will do the right things.
>
> The ioctls work for all cases (drive with constant zone size or not). This is again
> to allow supporting eventual weird drives at application level. I integrated all
> these ioctl into libzbc block device backend driver and everything is fine. Can't
> tell the difference with direct-to-drive SG_IO accesses. But unlike these, the zone
> ioctls keep the zone information RB-tree cache up to date.
>
>>
>> I will be updating my patchset accordingly.
>
> I need to cleanup my code and rebase on top of 4.8-rc1. Let me do this and I will send
> everything for review. If you have any comment on the above, please let me know and
> I will be happy to incorporate changes.
>
> Best regards.
>
>
> ------------------------
> Damien Le Moal, Ph.D.
> Sr. Manager, System Software Group, HGST Research,
> HGST, a Western Digital brand
> Damien.LeMoal@hgst.com
> (+81) 0466-98-3593 (ext. 513593)
> 1 kirihara-cho, Fujisawa,
> Kanagawa, 252-0888 Japan
> www.hgst.com
>



-- 
Shaun Tancheff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-10  4:38                   ` Damien Le Moal
  0 siblings, 0 replies; 26+ messages in thread
From: Damien Le Moal @ 2016-08-10  4:38 UTC (permalink / raw)
  To: Shaun Tancheff
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

Shaun,

On 8/10/16 12:58, Shaun Tancheff wrote:
> On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>>> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote:
>
> [trim]
>
>>>> Since disk type == 0 for everything that isn't HM so I would prefer the
>>>> sysfs 'zoned' file just report if the drive is HA or HM.
>>>>
>>> Okay. So let's put in the 'zoned' attribute the device type:
>>> 'host-managed', 'host-aware', or 'device managed'.
>>
>> I hacked your patches and simply put a "0" or "1" in the sysfs zoned file.
>> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
>> else. This means that drive managed models are not exposed as zoned block
>> devices. For HM vs HA differentiation, an application can look at the
>> device type file since it is already present.
>>
>> We could indeed set the "zoned" file to the device type, but HM drives and
>> regular drives will both have "0" in it, so no differentiation possible.
>> The other choice could be the "zoned" bits defined by ZBC, but these
>> do not define a value for host managed drives, and the drive managed value
>> being not "0" could be confusing too. So I settled for a simple 0/1 boolean.
>
> This seems good to me.

Another option I forgot is for the "zoned" file to indicate the total 
number of zones of the device, and 0 for a non zoned regular block 
device. That would work as well.
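
Something along these lines in blk-sysfs.c would be enough (a sketch;
the nr_zones field is an assumption about where the zone count would
live in struct request_queue):

/* Sketch: "zoned" reports the number of zones, 0 for regular devices. */
static ssize_t queue_zoned_show(struct request_queue *q, char *page)
{
        return sprintf(page, "%u\n", q->nr_zones);  /* hypothetical field */
}

static struct queue_sysfs_entry queue_zoned_entry = {
        .attr = { .name = "zoned", .mode = S_IRUGO },
        .show = queue_zoned_show,
};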

[...]
>> Done: I hacked Shaun ioctl code and added finish zone too. The
>> difference with Shaun initial code is that the ioctl are propagated down to
>> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for
>> BIO request definition for the zone operations. So a lot less code added.
>
> The purpose of the BIO flags is not to enable the ioctls so much as
> the other way round. Creating BIO op's is to enable issuing ZBC
> commands from device mapper targets and file systems without some
> heinous ioctl hacks.
> Making the resulting block layer interfaces available via ioctls is just a
> reasonable way to exercise the code ... or that was my intent.

Yes, I understood your code. However, since (or if) we keep the zone 
information in the RB-tree cache, there is no need for the report zone 
operation BIO interface. Same for reset write pointer by keeping the 
mapping to discard. blk_lookup_zone can be used in kernel as a report 
zone BIO replacement and works as well for the report zone ioctl 
implementation. For reset, there is blkdev_issue_discard in kernel, and 
the reset zone ioctl becomes equivalent to the BLKDISCARD ioctl. These are 
simple. Open, close and finish zone remain. For these, adding the BIO 
interface seemed overkill. Hence my choice of propagating the ioctl 
to the driver.
This is debatable of course, and adding an in-kernel interface is not 
hard: we can implement blk_open_zone, blk_close_zone and blk_finish_zone 
using __blkdev_driver_ioctl. That looks clean to me.
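
These wrappers are indeed tiny, something like this (a sketch; BLKOPENZONE
is a placeholder ioctl number and the zone start sector is assumed to be
passed by value as the ioctl argument):

static inline int blk_open_zone(struct block_device *bdev, sector_t sector)
{
        return __blkdev_driver_ioctl(bdev, FMODE_READ | FMODE_WRITE,
                                     BLKOPENZONE, (unsigned long)sector);
}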

Overall, my concern with the BIO based interface for the ZBC commands is 
that it adds one flag for each command, which is not really the 
philosophy of the interface and potentially opens the door for more such 
implementations in the future with new standards and new commands coming 
up. Clearly that is not a sustainable path. So I think that a more 
specific interface for these zone operations is a better choice. That is 
consistent with what happens with the tons of ATA and SCSI commands not 
actually doing data I/O (mode sense, log pages, SMART, etc). All these 
do not use BIOs and are processed as REQ_TYPE_BLOCK_PC requests.

>> The ioctls do not mimic exactly the ZBC standard. For instance, there is no
>> reporting options for report zones, nor is the "all" bit supported for open,
>> close or finish zone commands. But the information provided on zones is complete
>> and maps to the standard definitions.
>
> For the reporting options I have planned to reuse the stream_id in
> struct bio when that is formalized. There are certainly other places in
> struct bio to stuff a few extra bits ...

We could add reporting options to blk_lookup_zones to filter the result 
and use that in the ioctl implementation as well. This can be added 
without any problem.

> As far as the all bit ... this is being handled by all the zone action
> commands. Just pass a sector of ~0ul and it's handled in sd.c by
> sd_setup_zone_action_cmnd().
>
> Apologies as apparently my documentation here is lacking :-(

Yes, I got it (libzbc does the same actually). I did not add it, for 
simplicity. But indeed maybe it should be. The same trick can be used 
with the ioctl-to-driver interface. No problem.

>> I also added a reset_wp ioctl for completeness, but this one simply calls
>> blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD
>> ioctl call with a range exactly aligned on a zone.
>
> I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that
> creates a REQ_OP_ZONE_RESET .. same as open and close. My
> expectation being that BLKDISCARD doesn't really need yet another alias.

Yes, we could remove the BLKRESETZONE ioctl and have applications use 
the BLKDISCARD ioctl instead. In the kernel, both generate 
blkdev_issue_discard calls and are thus identical. Only the ioctl 
interface differs (sector vs range argument). Again, I added it to have 
the full set of 5 ZBC/ZAC operations mapping to one BLKxxxZONE ioctl. 
But granted, reset is optional.

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.hgst.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-14  0:09               ` Shaun Tancheff
  0 siblings, 0 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-08-14  0:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Damien Le Moal, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke <hare@suse.de> wrote:
> On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
>> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>>> Hannes, Shaun,
>>>
>>> Let me add some more comments.
>>>
>>>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>>>>>
>>>>>> Can you please integrate this with Hannes series so that it uses
>>>>>> his cache of the zone information?
>>>>>
>>>>> Adding Hannes and Damien to Cc.
>>>>>
>>>>> Christoph,
>>>>>
>>>>> I can make a patch the marshal Hannes' RB-Tree into to a block report, that is
>>>>> quite simple. I can even have the open/close/reset zone commands update the
>>>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>>>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>>>>> not need the RB-Tree to function with zoned media.
>>
>> I have posted patches to integrate with the zone cache, hopefully they
>> make sense.
>>
> [ .. ]
>>>> I have thought about condensing the RB tree information, but then I
>>>> figured that for 'real' SMR handling we cannot assume all zones are of
>>>> fixed size, and hence we need all the information there.
>>>> Any condensing method would assume a given structure of the zones, which
>>>> the standard just doesn't provide.
>>>> Or am I missing something here?

Of course you can condense the zone cache without losing any
information. Here is the layout I used ... I haven't updated the patch
to the latest posted patches but this is the basic idea.

[It was originally done as a follow on of making your zone cache work
 with Seagate's HA drive. I did not include the wp-in-arrays patch
 along with the HA drive support that I sent you in May as you were
 quite terse about RB trees when I tried to discuss this approach with
 you at Vault]

struct blk_zone {
        unsigned type:4;
        unsigned state:5;
        unsigned extra:7;
        unsigned wp:40;
        void *private_data;
};

struct contiguous_wps {
        u64 start_lba;
        u64 last_lba; /* or # of blocks */
        u64 zone_size; /* size in blocks */
        unsigned is_zoned:1;
        u32 zone_count;
        spinlock_t lock;
        struct blk_zone zones[0];
};

struct zone_wps {
        u32 wps_count;
        struct contiguous_wps **wps;
};

Then in struct request_queue
-    struct rb_root zones;
+   struct zone_wps *zones;

For each contiguous chunk of zones you need a descriptor. In the current
drives you need 1 or 2 descriptors.

Here a conventional drive is encapsulated as zoned media with one
drive sized conventional zone.
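
For example, a drive with a conventional region up front followed by
fixed-size sequential zones needs two descriptors, allocated along these
lines (a sketch; the helper name is made up):

static struct contiguous_wps *alloc_wps(u64 start, u64 last, u64 zone_size,
                                        bool zoned)
{
        u32 count = div64_u64(last - start, zone_size);
        struct contiguous_wps *wp;

        wp = kzalloc(sizeof(*wp) + count * sizeof(struct blk_zone),
                     GFP_KERNEL);
        if (!wp)
                return NULL;

        wp->start_lba = start;
        wp->last_lba = last;
        wp->zone_size = zone_size;
        wp->zone_count = count;
        wp->is_zoned = zoned;
        spin_lock_init(&wp->lock);
        return wp;
}

So zones->wps[0] would describe the conventional space as a single
drive-region-sized zone and zones->wps[1] the sequential zones, one
blk_zone entry each.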

I have not spent time building an ad-hoc LVM comprised of zoned and
conventional media so it's not all ironed out yet.
I think you can see the advantage of being able to put conventional space
anywhere you would like to work around zoned media not being laid out
in the best manner for your setup.

Yes, things start to break down if every other zone is a different size ..

The point being that even with zone descriptors on the order of 48 bytes
in size, this saves a lot of space with no loss of information.
I still kind of prefer pushing blk_zone down to a u32 by reducing
the max zone size and dropping the private_data ... but that may
be going a bit too far.

blk_lookup_zone then has an [unfortunate] signature change:


/**
 * blk_lookup_zone() - Lookup zones
 * @q: Request Queue
 * @sector: Location to lookup
 * @start: Starting location zone (OUT: Required)
 * @len: Length of zone (OUT: Required)
 * @lock: Spinlock of zones (OUT: Required)
 */
struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector,
                                 sector_t *start, sector_t *len,
                                 spinlock_t **lock)
{
        int iter;
        struct blk_zone *bzone = NULL;
        struct zone_wps *zi = q->zones;

        *start = 0;
        *len = 0;
        *lock = NULL;

        if (!q->zones)
                goto out;

        for (iter = 0; iter < zi->wps_count; iter++) {
                if (sector >= zi->wps[iter]->start_lba &&
                    sector <  zi->wps[iter]->last_lba) {
                        struct contiguous_wps *wp = zi->wps[iter];
                        u64 index = (sector - wp->start_lba) / wp->zone_size;

                        BUG_ON(index >= wp->zone_count);

                        bzone = &wp->zones[index];
                        *len = wp->zone_size;
                        *start = wp->start_lba + (index * wp->zone_size);
                        *lock = &wp->lock;
                }
        }

out:
        return bzone;
}
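
A caller then looks something like this (sketch; 'q' and 'sector' are
supplied by the caller):

        sector_t zstart, zlen;
        spinlock_t *lock;
        unsigned long flags;
        struct blk_zone *zone;

        zone = blk_lookup_zone(q, sector, &zstart, &zlen, &lock);
        if (zone) {
                spin_lock_irqsave(lock, flags);
                /* zone->wp, zone->type, zone->state inspected here */
                spin_unlock_irqrestore(lock, flags);
        }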

>>>
>>> Indeed, the standards do not mandate any particular zone configuration,
>>> constant zone size, etc. So writing code so that can be handled is certainly
>>> the right way of doing things. However, if we decide to go forward with
>>> mapping RESET WRITE POINTER command to DISCARD, then at least a constant
>>> zone size (minus the last zone as you said) must be assumed, and that
>>> information can be removed from the entries in the RB tree (as it will be
>>> saved for the sysfs "zone_size" file anyway. Adding a little code to handle
>>> that eventual last runt zone with a different size is not a big problem.
>>
>>>> As for write pointer handling: yes, the write pointer on the zones is
>>>> not really useful for upper-level usage.
>>>> Where we do need it is to detect I/O which is crossing the write pointer
>>>> (eg when doing reads over the entire zone).
>>>> As per spec you will be getting an I/O error here, so we need to split
>>>> the I/O on the write pointer to get valid results back.
>>>
>>> To be precise here, the I/O splitting will be handled by the block layer
>>> thanks to the "chunk_sectors" setting. But that relies on a constant zone
>>> size assumption too.
>>>
>>> The RB-tree here is most useful for reads over or after the write pointer as
>>> this can have different behavior on different drives (URSWRZ bit). The RB-tree
>>> allows us to hide these differences to upper layers and simplify support at
>>> those levels.
>>
>> It is unfortunate that path was chosen rather than returning Made Up Data.
> But how would you return made up data without knowing that you _have_ to
> generate some?
> Of course it's trivial to generate made up data once an I/O error
> occurred, but that's too late as the I/O error already happened.
>
> The general idea behind this is that I _really_ do want to avoid
> triggering an I/O error on initial access. This kind of thing really
> tends to freak out users (connect a new drive and getting I/O errors;
> not the best user experience ...)
>
>> However I don't think this is a file system or device mapper problem as
>> neither of those layers attempt to read blocks they have not written
>> (or discarded).
>> All I can think of something that probes the drive media for a partition table
>> or other identification...isn't the conventional space sufficient to cover
>> those cases?
> Upon device scan the kernel will attempt to read the partition tables,
> which consists of reading the first few megabytes and the last sector
> (for GPT shadow partition tables).
> And depending on the driver manufacturer you might or might not have
> enough conventional space at the right locations.
>
>> Anyway you could just handle the error and sense codes [CHECK CONDITION /
>> ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD
>> when URSWRZ is 0. It would have the same effect as using the zone cache
>> without the possibility of the zone cache being out-of-sync.
>>
> Yes, but then we would already have registered an I/O error.
> And as indicated above, I really do _not_ want to handle all the
> customer calls for supposed faulty new SMR drives.

Fair enough. I just really want to be sure that all this memory locked
to the zone cache is really being made useful.

>>> I too agree that the write pointer value is not very useful to upper layers
>>> (e.g. FS). What matters most of the time for these layers is an "allocation
>>> pointer" or "submission pointer" which indicates the LBA to use for the next
>>> BIO submission. That LBA will most of the time be in advance of the zone WP
>>> because of request queueing in the block scheduler.
>>
>> Which is kind of my point.
>>
>>> Note that from what we have done already, in many cases, upper layers do not
>>> even need this (e.g. F2FS, btrfs) and work fine without needing *any*
>>> zone->private_data because that "allocation pointer" information generally
>>> already exists in a different form (e.g. a bit in a bitmap).
>>
>> Agreed. Log structured and copy on write file systems are a natural fit for
>> these types of drives
>>
>>> For these cases, not having the RB-tree of zones would force a lot
>>> more code to be added to file systems for SMR support.
>>
>> I don't understand how the zone cache (RB-tree) is "helping" F2FS or btrfs.
>> Certainly any of these could trigger the zone cache to be loaded at mkfs
>> and/or mount time?
>>
>>>>> This all tails into domain responsibility. With the RB-Tree doing half of the
>>>>> work and the ‘responsible’ domain handling the active path via private_data
>>>>> why have the split at all? It seems to be a double work to have second object
>>>>> tracking the first so that I/O scheduling can function.
>>>>>
>>>> Tracking the zone states via RB trees is mainly to handle host-managed
>>>> drives seamlessly. Without an RB tree we will be subjected to I/O errors
>>>> during boot, as eg with new drives we inevitably will try to read beyond
>>>> the write pointer, which in turn will generate I/O errors on certain drives.
>>
>> If the zone cache is needed to boot then clearly my objection to reading
>> in the zone report information on drive attach is unfounded. I was under
>> the impression that this was not the case.
>>
>>>> I do agree that there is no strict need to setup an RB tree for
>>>> host-aware drives; I can easily add an attribute/flag to disable RB
>>>> trees for those.
>>>> However, tracking zone information in the RB tree _dramatically_ speeds
>>>> up device initialisation; issuing a blkdev_discard() for the entire
>>>> drive will take _ages_ without it.
>>>
>>> And the RB-tree will also be very useful for speeding up report zones
>>> ioctls too. Unless we want those to force an update of the RB-tree information ?
>>
>> That is debatable. I would tend to argue for both. *If* there must be
>> a zone cache then reading from the zone cache is a must. Forcing an
>> update of the zone cache from the media seems like a reasonable thing
>> to be able to do.
>>
>> Also the zone report is 'slow' in that there is an overhead for the
>> report itself, but the number of zones per query can be quite large,
>> so the 4 or 5 I/Os that run into the hundreds of milliseconds needed to
>> cache the entire drive aren't really unworkable for something that is
>> used infrequently.
>>
> No, surely not.
> But one of the _big_ advantages for the RB tree is blkdev_discard().
> Without the RB tree any mkfs program will issue a 'discard' for every
> sector. We will be able to coalesce those into one discard per zone, but
> we still need to issue one for _every_ zone.

How can you make coalesce work transparently in the
sd layer _without_ keeping some sort of a discard cache along
with the zone cache?

Currently the block layer's blkdev_issue_discard() is breaking
large discards into nice granular and aligned chunks, but it is
not preventing small discards nor coalescing them.
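
For comparison, the per-zone call a mkfs-style tool (or an in-kernel user)
ends up making is just the loop below, once for every zone; nothing in
blkdev_issue_discard() merges neighbouring requests. A minimal sketch, with
the zone geometry variables as assumptions:

        #include <linux/blkdev.h>

        /* Sketch: one zone-aligned discard per zone, as discussed above. */
        static int reset_all_zones(struct block_device *bdev,
                                   sector_t nr_zones, sector_t zone_sectors)
        {
                sector_t z;
                int ret;

                for (z = 0; z < nr_zones; z++) {
                        ret = blkdev_issue_discard(bdev, z * zone_sectors,
                                                   zone_sectors, GFP_KERNEL, 0);
                        if (ret)
                                return ret;
                }
                return 0;
        }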

In the sd layer would there be a way to persist or purge an
overly large discard cache? What about honoring
discard_zeroes_data? Once the discard is completed with
discard_zeroes_data you have to return zeroes whenever
a discarded sector is read. Isn't that a lot more than just
tracking a write pointer? Couldn't a zone have dozens of holes?

> Which is (as indicated) really slow, and easily takes several minutes.
> With the RB tree we can short-circuit discards to empty zones, and speed
> up processing time dramatically.
> Sure we could be moving the logic into mkfs and friends, but that would
> require us to change the programs and agree on a library (libzbc?) which
> should be handling that.

F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
so I'm not sure your argument is valid here.
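
(For context, that userspace path is essentially the SG_IO call below, which
is roughly what mkfs.f2fs / libzbc do; error handling is trimmed, and the
ZBC IN opcode, service action and CDB layout are quoted from the standard
from memory, so treat them as assumptions:)

        #include <stdint.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <scsi/sg.h>

        /* Sketch: issue REPORT ZONES (ZBC IN 0x95 / SA 0x00) via SG_IO. */
        static int report_zones(int fd, uint64_t start_lba,
                                uint8_t *buf, uint32_t buf_len)
        {
                uint8_t cdb[16] = { 0x95, 0x00 };
                uint8_t sense[64];
                struct sg_io_hdr io;
                int i;

                for (i = 0; i < 8; i++)         /* zone start LBA, big-endian */
                        cdb[2 + i] = start_lba >> (56 - 8 * i);
                for (i = 0; i < 4; i++)         /* allocation length, big-endian */
                        cdb[10 + i] = buf_len >> (24 - 8 * i);

                memset(&io, 0, sizeof(io));
                io.interface_id = 'S';
                io.dxfer_direction = SG_DXFER_FROM_DEV;
                io.cmdp = cdb;
                io.cmd_len = sizeof(cdb);
                io.dxferp = buf;
                io.dxfer_len = buf_len;
                io.sbp = sense;
                io.mx_sb_len = sizeof(sense);
                io.timeout = 30000;             /* milliseconds */

                return ioctl(fd, SG_IO, &io);
        }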

[..]

>>> 3) Try to condense the blkzone data structure to save memory:
>>> I think that we can at the very least remove the zone length, and also
>>> may be the per zone spinlock too (a single spinlock and proper state flags can
>>> be used).
>>
>> I have a variant that is an array of descriptors that roughly mimics the
>> api from blk-zoned.c that I did a few months ago as an example.
>> I should be able to update that to the current kernel + patches.
>>
> Okay. If we restrict the in-kernel SMR drive handling to devices with
> identical zone sizes of course we can remove the zone length.
> And we can do away with the per-zone spinlock and use a global one instead.

I don't think dropping the zone length is a reasonable thing to do.

What I propose is an array of _descriptors_; it doesn't drop _any_
of the zone information that you are holding in an RB tree, it is
just a condensed format that _mostly_ plugs into your existing
API.
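
To make the shape of that concrete, a rough sketch of the condensed layout
(field names are placeholders, not taken from any posted patch): one small
fixed-size descriptor per zone, grouped under a header carrying the start
LBA, zone size and lock shared by a run of same-sized zones, which is where
the memory saving over per-zone RB-tree nodes comes from.

        /* Sketch only. */
        struct zone_desc {                      /* a few bytes per zone */
                u64 wp;                         /* cached write pointer */
                u8  type;                       /* conventional / sequential */
                u8  cond;                       /* empty / open / full / ... */
        };

        struct zone_desc_group {                /* one per run of equal-size zones */
                u64 start_lba;
                u64 zone_size;                  /* same for every zone in the run */
                u32 zone_count;
                spinlock_t lock;
                struct zone_desc zones[];       /* flexible array of descriptors */
        };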

-- 
Shaun Tancheff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-16  4:00                 ` Damien Le Moal
  0 siblings, 0 replies; 26+ messages in thread
From: Damien Le Moal @ 2016-08-16  4:00 UTC (permalink / raw)
  To: Shaun Tancheff
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman


Shaun,

> On Aug 14, 2016, at 09:09, Shaun Tancheff <shaun.tancheff@seagate.com> wrote:
[…]
>>> 
>> No, surely not.
>> But one of the _big_ advantages for the RB tree is blkdev_discard().
>> Without the RB tree any mkfs program will issue a 'discard' for every
>> sector. We will be able to coalesce those into one discard per zone, but
>> we still need to issue one for _every_ zone.
> 
> How can you make coalesce work transparently in the
> sd layer _without_ keeping some sort of a discard cache along
> with the zone cache?
> 
> Currently the block layer's blkdev_issue_discard() is breaking
> large discard's into nice granular and aligned chunks but it is
> not preventing small discards nor coalescing them.
> 
> In the sd layer would there be way to persist or purge an
> overly large discard cache? What about honoring
> discard_zeroes_data? Once the discard is completed with
> discard_zeroes_data you have to return zeroes whenever
> a discarded sector is read. Isn't that a lot more than just
> tracking a write pointer? Couldn't a zone have dozens of holes?

My understanding of the standards regarding discard is that it is not
mandatory and that it is a hint to the drive. The drive can completely
ignore it if it thinks that is a better choice. I may be wrong on this
though. Need to check again.
For reset write pointer, the mapping to discard requires that the calls
to blkdev_issue_discard be zone aligned for anything to happen. Specify
less than a zone and nothing will be done. This, I think, preserves the
discard semantics.
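
A minimal sketch of that alignment gate (illustrative only, assuming a
constant zone size in 512-byte sectors; this is not the code in the patch,
and a real implementation would use sector_div() for the 64-bit math):

        /* Sketch: map a discard to RESET WRITE POINTER only when the
         * range covers whole zones; otherwise do nothing, which keeps
         * the "discard is a hint" semantics intact. */
        static bool discard_is_zone_aligned(sector_t sector, sector_t nr_sects,
                                            sector_t zone_sectors)
        {
                return !(sector % zone_sectors) && !(nr_sects % zone_sectors);
        }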

As for the “discard_zeroes_data” thing, I also think that is a drive
feature, not something mandatory. Drives may have it or not, which is consistent
with the ZBC/ZAC standards regarding reading after write pointer (nothing
says that zeros have to be returned). In any case, discard of CMR zones
will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better
choice.

> 
>> Which is (as indicated) really slow, and easily takes several minutes.
>> With the RB tree we can short-circuit discards to empty zones, and speed
>> up processing time dramatically.
>> Sure we could be moving the logic into mkfs and friends, but that would
>> require us to change the programs and agree on a library (libzbc?) which
>> should be handling that.
> 
> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
> so I'm not sure your argument is valid here.

This initial SMR support patch is just that: a first try. Jaegeuk
used SG_IO (in fact copy-paste of parts of libzbc) because the current
ZBC patch-set has no ioctl API for zone information manipulation. We
will fix this in mkfs.f2fs once we agree on an ioctl interface.

> 
> [..]
> 
>>>> 3) Try to condense the blkzone data structure to save memory:
>>>> I think that we can at the very least remove the zone length, and also
>>>> may be the per zone spinlock too (a single spinlock and proper state flags can
>>>> be used).
>>> 
>>> I have a variant that is an array of descriptors that roughly mimics the
>>> api from blk-zoned.c that I did a few months ago as an example.
>>> I should be able to update that to the current kernel + patches.
>>> 
>> Okay. If we restrict the in-kernel SMR drive handling to devices with
>> identical zone sizes of course we can remove the zone length.
>> And we can do away with the per-zone spinlock and use a global one instead.
> 
> I don't think dropping the zone length is a reasonable thing to do.
> 
> What I propose is an array of _descriptors_ it doesn't drop _any_
> of the zone information that you are holding in an RB tree, it is
> just a condensed format that _mostly_ plugs into your existing
> API.

I do not agree. The Seagate drive already has one zone (the last one)
that is not the same length as the other zones. Sure, since it is the
last one, we can add “if (last zone)” all over the place and make it
work. But that is really ugly. Keeping the length field makes the code
generic and following the standard, which has no restriction on the
zone sizes. We could do some memory optimisation using different types
of blk_zone structs, the types mapping to the SAME field value: drives with
a constant zone size can use a blk_zone type without the length field,
others use a different type that includes the field. Accessor functions
can hide the different types in the zone manipulation code.
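
Something like the sketch below is what that could look like (the type and
field names, and how the SAME value is plumbed through, are assumptions for
illustration, not proposed structures):

        /* Sketch: two descriptor flavours chosen from the REPORT ZONES
         * SAME field, with an accessor hiding which one is in use. */
        struct blk_zone_fixed {         /* SAME != 0: constant zone size */
                u64 start;
                u64 wp;
                u8  type, cond;
        };

        struct blk_zone_var {           /* SAME == 0: per-zone length kept */
                u64 start;
                u64 len;
                u64 wp;
                u8  type, cond;
        };

        static inline u64 blk_zone_length(const void *zone, bool fixed_size,
                                          u64 fixed_len)
        {
                return fixed_size ? fixed_len :
                       ((const struct blk_zone_var *)zone)->len;
        }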

Best regards.


-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-16  5:49                   ` Shaun Tancheff
  0 siblings, 0 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-08-16  5:49 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Mon, Aug 15, 2016 at 11:00 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>
> Shaun,
>
>> On Aug 14, 2016, at 09:09, Shaun Tancheff <shaun.tancheff@seagate.com> wrote:
> […]
>>>>
>>> No, surely not.
>>> But one of the _big_ advantages for the RB tree is blkdev_discard().
>>> Without the RB tree any mkfs program will issue a 'discard' for every
>>> sector. We will be able to coalesce those into one discard per zone, but
>>> we still need to issue one for _every_ zone.
>>
>> How can you make coalesce work transparently in the
>> sd layer _without_ keeping some sort of a discard cache along
>> with the zone cache?
>>
>> Currently the block layer's blkdev_issue_discard() is breaking
>> large discard's into nice granular and aligned chunks but it is
>> not preventing small discards nor coalescing them.
>>
>> In the sd layer would there be way to persist or purge an
>> overly large discard cache? What about honoring
>> discard_zeroes_data? Once the discard is completed with
>> discard_zeroes_data you have to return zeroes whenever
>> a discarded sector is read. Isn't that a lot more than just
>> tracking a write pointer? Couldn't a zone have dozens of holes?
>
> My understanding of the standards regarding discard is that it is not
> mandatory and that it is a hint to the drive. The drive can completely
> ignore it if it thinks that is a better choice. I may be wrong on this
> though. Need to check again.

But you are currently setting discard_zeroes_data=1 in your
current patches. I believe that setting discard_zeroes_data=1
effectively promotes discards to being mandatory.

I have a follow on patch to my SCT Write Same series that
handles the CMR zone case in the sd_zbc_setup_discard() handler.

> For reset write pointer, the mapping to discard requires that the calls
> to blkdev_issue_discard be zone aligned for anything to happen. Specify
> less than a zone and nothing will be done. This I think preserve the
> discard semantic.

Oh. If that is the intent then there is just a bug in the handler.
I have pointed out where I believe it to be in my response to
the zone cache patch being posted.

> As for the “discard_zeroes_data” thing, I also think that is a drive
> feature not mandatory. Drives may have it or not, which is consistent
> with the ZBC/ZAC standards regarding reading after write pointer (nothing
> says that zeros have to be returned). In any case, discard of CMR zones
> will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better
> choice.

However I am still curious about discards being coalesced.

>>> Which is (as indicated) really slow, and easily takes several minutes.
>>> With the RB tree we can short-circuit discards to empty zones, and speed
>>> up processing time dramatically.
>>> Sure we could be moving the logic into mkfs and friends, but that would
>>> require us to change the programs and agree on a library (libzbc?) which
>>> should be handling that.
>>
>> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
>> so I'm not sure your argument is valid here.
>
> This initial SMR support patch is just that: a first try. Jaegeuk
> used SG_IO (in fact copy-paste of parts of libzbc) because the current
> ZBC patch-set has no ioctl API for zone information manipulation. We
> will fix this mkfs.f2fs once we agree on an ioctl interface.

Which again is my point. If mkfs.f2fs wants to speed up its
discard pass by _not_ sending unnecessary Reset WP commands
for zones that are already empty, it has all the
information it needs to do so.
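
A sketch of that skip, walking the 64-byte descriptors of a REPORT ZONES
reply (the offsets are the ZBC descriptor layout quoted from memory, so
treat them as assumptions; error handling omitted):

        #include <stdint.h>

        static uint64_t get_be64(const uint8_t *p)
        {
                uint64_t v = 0;
                int i;

                for (i = 0; i < 8; i++)
                        v = (v << 8) | p[i];
                return v;
        }

        /* Sketch: zones whose write pointer still sits at the zone start
         * are empty and need no Reset WP at all. */
        static void reset_nonempty_zones(const uint8_t *rep, unsigned int nr)
        {
                const uint8_t *d = rep + 64;    /* skip the report header */
                unsigned int i;

                for (i = 0; i < nr; i++, d += 64) {
                        uint64_t start = get_be64(d + 16);
                        uint64_t wp    = get_be64(d + 24);

                        if (wp == start)        /* already empty, skip it */
                                continue;
                        /* ... issue Reset WP (or a zone-aligned discard)
                         * for the zone starting at 'start' here ... */
                }
        }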

Here it seems to me that the zone cache is _at_best_
doing double work. At worst the zone cache could be
doing the wrong thing _if_ the zone cache got out of sync.
It is certainly possible (however unlikely) that someone was
doing some raw sg activity that is not seen by the sd path.

All I am trying to do is have a discussion about the reasons for
and against having a zone cache. Where it works and where it breaks
should be an entirely technical discussion, but I understand that we have
all spent a lot of time _not_ discussing this for various non-technical
reasons.

So far the only reason I've been able to ascertain is that
Host Managed drives really don't like being stuck with the
URSWRZ and would like to have a software hack to return
MUD rather than ship drives with some weird out-of-the-box
config where the last zone is marked as FINISH'd, thereby
returning MUD on reads as per spec.

I understand that it would be a strange state to see on first
boot and likely people would just do a ResetWP and have
weird boot errors, which would probably just make matters
worse.

I just would rather the workaround be a bit cleaner and/or
use less memory. I would also like a path available that
does not require SD_ZBC or BLK_ZONED for Host Aware
drives to work, hence this set of patches and me begging
for a single bit in struct bio.

>>
>> [..]
>>
>>>>> 3) Try to condense the blkzone data structure to save memory:
>>>>> I think that we can at the very least remove the zone length, and also
>>>>> may be the per zone spinlock too (a single spinlock and proper state flags can
>>>>> be used).
>>>>
>>>> I have a variant that is an array of descriptors that roughly mimics the
>>>> api from blk-zoned.c that I did a few months ago as an example.
>>>> I should be able to update that to the current kernel + patches.
>>>>
>>> Okay. If we restrict the in-kernel SMR drive handling to devices with
>>> identical zone sizes of course we can remove the zone length.
>>> And we can do away with the per-zone spinlock and use a global one instead.
>>
>> I don't think dropping the zone length is a reasonable thing to do.

REPEAT: I do _not_ think dropping the zone length is a good thing.

>> What I propose is an array of _descriptors_ it doesn't drop _any_
>> of the zone information that you are holding in an RB tree, it is
>> just a condensed format that _mostly_ plugs into your existing
>> API.
>
> I do not agree. The Seagate drive already has one zone (the last one)
> that is not the same length as the other zones. Sure, since it is the
> last one, we can add “if (last zone)” all over the place and make it
> work. But that is really ugly. Keeping the length field makes the code
> generic and following the standard, which has no restriction on the
> zone sizes. We could do some memory optimisation using different types
> of blk_zone structs, the types mapping to the SAME field value: drives with
> a constant zone size can use a blk_zone type without the length field,
> others use a different type that includes the field. Accessor functions
> can hide the different types in the zone manipulation code.

Ah. I just said that dropping the zone length is not a good idea.
Why the antagonistic exposition?

All I am saying is that I can give you the zone cache with 1/7th of
the memory and the same performance with _no_ loss of information,
or features, as compared to the existing zone cache.

All the code is done now. I will post patches once my testing is
done.

I have also reworked all the zone integration so the BIO flags will
pull from and update the zone cache as opposed to the first hack
that only really integrated with some ioctls.

Regards,
Shaun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-10  4:38                   ` Damien Le Moal
  (?)
@ 2016-08-16  6:07                   ` Shaun Tancheff
  -1 siblings, 0 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-08-16  6:07 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Tue, Aug 9, 2016 at 11:38 PM, Damien Le Moal <damien.lemoal@hgst.com> wrote:
> Shaun,
>
> On 8/10/16 12:58, Shaun Tancheff wrote:
>>
>> On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal <Damien.LeMoal@hgst.com>
>> wrote:
>>>>
>>>> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote:
>>
>>
>> [trim]
>>
>>>>> Since disk type == 0 for everything that isn't HM so I would prefer the
>>>>> sysfs 'zoned' file just report if the drive is HA or HM.
>>>>>
>>>> Okay. So let's put in the 'zoned' attribute the device type:
>>>> 'host-managed', 'host-aware', or 'device managed'.
>>>
>>>
>>> I hacked your patches and simply put a "0" or "1" in the sysfs zoned
>>> file.
>>> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
>>> else. This means that drive managed models are not exposed as zoned block
>>> devices. For HM vs HA differentiation, an application can look at the
>>> device type file since it is already present.
>>>
>>> We could indeed set the "zoned" file to the device type, but HM drives
>>> and
>>> regular drives will both have "0" in it, so no differentiation possible.
>>> The other choice could be the "zoned" bits defined by ZBC, but these
>>> do not define a value for host managed drives, and the drive managed
>>> value
>>> being not "0" could be confusing too. So I settled for a simple 0/1
>>> boolean.
>>
>>
>> This seems good to me.
>
>
> Another option I forgot is for the "zoned" file to indicate the total number
> of zones of the device, and 0 for a non zoned regular block device. That
> would work as well.

Clearly either is sufficient.
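
(Either way it amounts to a one-line sysfs show function; a sketch with
assumed names -- the 'zoned' attribute wiring and 'nr_zones' field here are
illustrative, not the fields in the posted patches:)

        /* Sketch: expose either a 0/1 flag or the zone count via sysfs. */
        static ssize_t zoned_show(struct device *dev,
                                  struct device_attribute *attr, char *buf)
        {
                struct scsi_disk *sdkp = to_scsi_disk(dev);

                return sprintf(buf, "%u\n", sdkp->nr_zones);
        }
        static DEVICE_ATTR_RO(zoned);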

> [...]
>>>
>>> Done: I hacked Shaun ioctl code and added finish zone too. The
>>> difference with Shaun initial code is that the ioctl are propagated down
>>> to
>>> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need
>>> for
>>> BIO request definition for the zone operations. So a lot less code added.
>>
>>
>> The purpose of the BIO flags is not to enable the ioctls so much as
>> the other way round. Creating BIO op's is to enable issuing ZBC
>> commands from device mapper targets and file systems without some
>> heinous ioctl hacks.
>> Making the resulting block layer interfaces available via ioctls is just a
>> reasonable way to exercise the code ... or that was my intent.
>
>
> Yes, I understood your code. However, since (or if) we keep the zone
> information in the RB-tree cache, there is no need for the report zone
> operation BIO interface. Same for reset write pointer by keeping the mapping
> to discard. blk_lookup_zone can be used in kernel as a report zone BIO
> replacement and works as well for the report zone ioctl implementation. For
> reset, there is blkdev_issue_discard in kernel, and the reset zone ioctl
> becomes equivalent to BLKDISCARD ioctl. These are simple. Open, close and
> finish zone remains. For these, adding the BIO interface seemed an overkill.
> Hence my choice of propagating the ioctl to the driver.
> This is debatable of course, and adding an in-kernel interface is not hard:
> we can implement blk_open_zone, blk_close_zone and blk_finish_zone using
> __blkdev_driver_ioctl. That looks clean to me.

Uh. I would call that "heinous" ioctl hacks myself. Kernel -> User API
-> Kernel is not really a good design IMO.

> Overall, my concern with the BIO based interface for the ZBC commands is
> that it adds one flag for each command, which is not really the philosophy
> of the interface and potentially opens the door for more such
> implementations in the future with new standards and new commands coming up.
> Clearly that is not a sustainable path. So I think that a more specific
> interface for these zone operations is a better choice. That is consistent
> with what happens with the tons of ATA and SCSI commands not actually doing
> data I/Os (mode sense, log pages, SMART, etc). All these do not use BIOs and
> are processed as request REQ_TYPE_BLOCK_PC.

Part of the reason for following on Mike Christie's bio op/flags cleanup was to
make these ops. The advantage of being added as ops is that there is only
1 extra bit needed (not 4 or 5 bits for flags). The other reason for being
promoted into the block layer as commands is because it seems to me
to make sense that these abstractions could be allowed to be passed through
a DM layer and be handled by a file system.
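
As a sketch of what that enables in-kernel (using the zone-reset op name
from this series; the helper shape and the zero-payload bio are assumptions,
not code from the patches), a DM target or filesystem could reset a zone
without any ioctl round trip:

        #include <linux/bio.h>
        #include <linux/blkdev.h>

        /* Sketch: reset one zone's write pointer from kernel code. */
        static int submit_zone_reset(struct block_device *bdev,
                                     sector_t zone_start)
        {
                struct bio *bio = bio_alloc(GFP_KERNEL, 0);
                int ret;

                if (!bio)
                        return -ENOMEM;
                bio->bi_bdev = bdev;
                bio->bi_iter.bi_sector = zone_start;
                bio_set_op_attrs(bio, REQ_OP_ZONE_RESET, 0);

                ret = submit_bio_wait(bio);
                bio_put(bio);
                return ret;
        }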

>>> The ioctls do not mimic exactly the ZBC standard. For instance, there are
>>> no reporting options for report zones, nor is the "all" bit supported for
>>> open, close or finish zone commands. But the information provided on zones
>>> is complete and maps to the standard definitions.
>>
>>
>> For the reporting options I have planned to reuse the stream_id in
>> struct bio when that is formalized. There are certainly other places in
>> struct bio to stuff a few extra bits ...

Sorry I was confused here. I was under the impression you were talking
about one of my patches when you seem to have been talking about
your hacking thereof.

> We could add reporting options to blk_lookup_zones to filter the result and
> use that in the ioctl implementation as well. This can be added without any
> problem.
>
>> As far as the all bit ... this is being handled by all the zone action
>> commands. Just pass a sector of ~0ul and it's handled in sd.c by
>> sd_setup_zone_action_cmnd().
>>
>> Apologies as apparently my documentation here is lacking :-(
>
>
> Yes, I got it (libzbc does the same actually). I did not add it for
> simplicity. But indeed maybe it should be. The same trick can be used with
> the ioctl to driver interface. No problems.
>
> I also added a reset_wp ioctl for completeness, but this one simply calls
> blkdev_issue_discard internally, so it is in fact equivalent to the
> BLKDISCARD ioctl call with a range exactly aligned on a zone.
>>
>>
>> I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that
>> creates a REQ_OP_ZONE_RESET .. same as open and close. My
>> expectation being that BLKDISCARD doesn't really need yet another alias.

Same as above.

> Yes, we could remove the BLKRESETZONE ioctl and have applications use the
> BLKDISCARD ioctl instead. In the kernel, both generate blkdev_issue_discard
> calls and are thus identical. Only the ioctl interface differs (sector vs
> range argument). Again, I added it to have the full set of 5 ZBC/ZAC
> operations mapping to one BLKxxxZONE ioctl. But granted, reset is optional.

Which is the opposite of what I stated above: I do _not_ think reset is
optional. I think that if the user wants discard they should call discard,
and if they want reset wp they *can* call reset wp and have it behave as
expected.
Why would I add a new ioctl that just calls discard? That makes no sense to me.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-08-16  6:07 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff
2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff
2016-07-29 19:02 ` [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer Shaun Tancheff
2016-08-01  9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig
2016-08-01 17:07   ` Shaun Tancheff
2016-08-01 17:07     ` Shaun Tancheff
2016-08-02 14:35     ` Hannes Reinecke
2016-08-03  1:29       ` Damien Le Moal
2016-08-03  1:29         ` Damien Le Moal
2016-08-05 20:35         ` Shaun Tancheff
2016-08-05 20:35           ` Shaun Tancheff
2016-08-09  6:47           ` Hannes Reinecke
2016-08-09  8:09             ` Damien Le Moal
2016-08-09  8:09               ` Damien Le Moal
2016-08-10  3:58               ` Shaun Tancheff
2016-08-10  3:58                 ` Shaun Tancheff
2016-08-10  4:38                 ` Damien Le Moal
2016-08-10  4:38                   ` Damien Le Moal
2016-08-16  6:07                   ` Shaun Tancheff
2016-08-10  3:26             ` Shaun Tancheff
2016-08-14  0:09             ` Shaun Tancheff
2016-08-14  0:09               ` Shaun Tancheff
2016-08-16  4:00               ` Damien Le Moal
2016-08-16  4:00                 ` Damien Le Moal
2016-08-16  5:49                 ` Shaun Tancheff
2016-08-16  5:49                   ` Shaun Tancheff
