linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-07-29 19:02 Shaun Tancheff
  2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Shaun Tancheff @ 2016-07-29 19:02 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

Hi Jens,

This series is based on Linus' current tip, after the merge of 'for-4.8/core'.

As Host Aware drives are becoming available we would like to be able
to make use of such drives. This series is also intended to be suitable
for use by Host Managed drives.

ZAC/ZBC drives add new commands for discovering and working with Zones.

This extends ZAC/ZBC support up to the block layer, allowing direct control
by file systems or device mapper targets. Deferring zone handling to the
authoritative subsystem also lowers overall memory usage for holding the
active zone information and clarifies which party is responsible for
maintaining the write pointer of each active zone.
By way of example, a DM target may have several writes in progress. The sector
(or LBA) of each of those writes depends on the previous write. While the
drive's write pointer is updated as writes complete, the DM target must track
both where the next write should be scheduled from and where the write pointer
stands based on writes completed without errors (a small sketch of this split
follows below).
Knowing the drive's zone topology enables DM targets and file systems to
extend their block allocation schemes and issue write pointer resets (or 
discards) that are zone aligned.
A perhaps non-obvious detail is that a conventional drive will return a
zone report containing a single large conventional zone.
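
To make the write pointer split concrete, here is a rough sketch (not part
of the series; everything except the blkdev_* helper added by patch 1 is a
made-up name, and linux/blkdev.h plus linux/blkzoned_api.h are assumed to be
included) of how a DM target might track a zone it is filling and issue a
zone aligned reset:

/*
 * Illustrative per-zone bookkeeping for a DM target.  wp_submitted
 * advances when a write bio is issued; wp_completed advances in the
 * bio's bi_end_io when the write finishes without error.  The drive's
 * own write pointer trails wp_submitted until in-flight writes land.
 */
struct ztgt_zone {
	sector_t start;		/* first sector of the zone */
	sector_t len;		/* zone length in sectors */
	sector_t wp_submitted;	/* next sector to hand to a write bio */
	sector_t wp_completed;	/* end of data known written w/o error */
};

/* Pick the sector for the next write and advance the submission pointer. */
static sector_t ztgt_alloc(struct ztgt_zone *z, unsigned int nr_sects)
{
	sector_t s = z->wp_submitted;

	z->wp_submitted += nr_sects;
	return s;
}

/* Zone aligned "discard": reset the write pointer and reuse the zone. */
static int ztgt_reset(struct block_device *bdev, struct ztgt_zone *z)
{
	int ret = blkdev_issue_zone_action(bdev, REQ_OP_ZONE_RESET, 0,
					   z->start, GFP_KERNEL);
	if (!ret)
		z->wp_submitted = z->wp_completed = z->start;
	return ret;
}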

Patches for util-linux can be found here:
https://github.com/Seagate/ZDM-Device-Mapper/tree/master/patches/util-linux

This series is available here:
    https://github.com/stancheff/linux/tree/zbc.bio.v6

    git@github.com:stancheff/linux.git zbc.bio.v6

v6:
 - Fix page alloc to include DMA flag for ioctl.
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
 - Dropped ata16 hackery
V3:
 - Rebase on Mike Christie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
 - Use zoned report reserved bit for ata-passthrough flag.

V2:
 - Changed bi_rw to op_flags to clarify separation of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report caused by a missing bio_put().
 - Documented opt in blkdev_issue_zone_report.
 - Moved include/uapi/linux/fs.h changes to patch 3
 - Fixed commit message for first patch in series.

Shaun Tancheff (2):
  Add bio/request flags to issue ZBC/ZAC commands
  Add ioctl to issue ZBC/ZAC commands via block layer

 MAINTAINERS                       |   9 ++
 block/blk-lib.c                   |  95 ++++++++++++++++
 block/ioctl.c                     | 110 +++++++++++++++++++
 drivers/scsi/sd.c                 | 118 ++++++++++++++++++++
 drivers/scsi/sd.h                 |   1 +
 include/linux/bio.h               |   7 +-
 include/linux/blk_types.h         |   6 +-
 include/linux/blkzoned_api.h      |  25 +++++
 include/uapi/linux/Kbuild         |   1 +
 include/uapi/linux/blkzoned_api.h | 220 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h           |   1 +
 11 files changed, 591 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

-- 
2.8.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands
  2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff
@ 2016-07-29 19:02 ` Shaun Tancheff
  2016-07-29 19:02 ` [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer Shaun Tancheff
  2016-08-01  9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig
  2 siblings, 0 replies; 17+ messages in thread
From: Shaun Tancheff @ 2016-07-29 19:02 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman, Shaun Tancheff

Add op flags to access zone information as well as to open, close
and reset zones:
  - REQ_OP_ZONE_REPORT - Query zone information (Report zones)
  - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing
  - REQ_OP_ZONE_CLOSE - Explicitly close a zone
  - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone

These op flags can be used to create bios to control zoned devices
through the block layer.

This is useful for filesystems and device mappers that need explicit
control of zoned devices such as Host Managed and Host Aware SMR drives.

Report zones is a device read that requires a buffer.

Open, Close and Reset are device commands that have no associated
data transfer. Sending an LBA of ~0 will attempt to operate on all
zones. This is typically used with Reset to wipe a drive, as a Reset
behaves similarly to TRIM in that all data in the zone(s) is deleted.

The Finish zone command is intentionally not implemented as there is no
current use case for that operation.

Report zones currently defaults to reporting on all zones. It is expected
that support for the zone option flag will piggyback on streamid
support. The report option flag is useful as it can reduce the number
of zones in each report, but it is not critical.
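
For illustration only (not part of the patch), callers inside the kernel
might use the new helpers roughly like this; the wrapper names are made up:

/* Reset every zone on the device: ~0 is interpreted as "all zones". */
static int wipe_all_zones(struct block_device *bdev)
{
	return blkdev_issue_zone_action(bdev, REQ_OP_ZONE_RESET, 0, ~0ul,
					GFP_KERNEL);
}

/* Fetch a one-page report starting at sector 0 (opt 0 == all zones). */
static int report_zones_page(struct block_device *bdev, struct page *page)
{
	return blkdev_issue_zone_report(bdev, 0, 0, 0, page, PAGE_SIZE,
					GFP_KERNEL);
}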

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
---
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
V3:
 - Rebase on Mike Christie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
V2:
 - Changed bi_rw to op_flags to clarify separation of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report caused by a missing bio_put().
 - Documented opt in blkdev_issue_zone_report.
 - Removed include/uapi/linux/fs.h from this patch.
---
 MAINTAINERS                       |   9 ++
 block/blk-lib.c                   |  95 +++++++++++++++++
 drivers/scsi/sd.c                 | 118 +++++++++++++++++++++
 drivers/scsi/sd.h                 |   1 +
 include/linux/bio.h               |   7 +-
 include/linux/blk_types.h         |   6 +-
 include/linux/blkzoned_api.h      |  25 +++++
 include/uapi/linux/Kbuild         |   1 +
 include/uapi/linux/blkzoned_api.h | 214 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 474 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 771c31c..32f5598 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12785,6 +12785,15 @@ F:	Documentation/networking/z8530drv.txt
 F:	drivers/net/hamradio/*scc.c
 F:	drivers/net/hamradio/z8530.h
 
+ZAC AND ZBC BLOCK DEVICES
+M:	Shaun Tancheff <shaun.tancheff@seagate.com>
+W:	http://seagate.com
+W:	https://github.com/Seagate/ZDM-Device-Mapper
+L:	linux-block@vger.kernel.org
+S:	Maintained
+F:	include/linux/blkzoned_api.h
+F:	include/uapi/linux/blkzoned_api.h
+
 ZBUD COMPRESSED PAGE ALLOCATOR
 M:	Seth Jennings <sjenning@redhat.com>
 L:	linux-mm@kvack.org
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 083e56f..6dcdcbf 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -6,6 +6,7 @@
 #include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/scatterlist.h>
+#include <linux/blkzoned_api.h>
 
 #include "blk.h"
 
@@ -266,3 +267,97 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_issue_zone_report - queue a report zones operation
+ * @bdev:	target blockdev
+ * @op_flags:	extra bio rw flags. If unsure, use 0.
+ * @sector:	starting sector (report will include this sector).
+ * @opt:	See: zone_report_option, default is 0 (all zones).
+ * @page:	one or more contiguous pages.
+ * @pgsz:	size of the report buffer in bytes (at most the size of @page).
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *    Issue a zone report request for the sectors in question.
+ */
+int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags,
+			     sector_t sector, u8 opt, struct page *page,
+			     size_t pgsz, gfp_t gfp_mask)
+{
+	struct bdev_zone_report *conv = page_address(page);
+	struct bio *bio;
+	unsigned int nr_iovecs = 1;
+	int ret = 0;
+
+	if (pgsz < (sizeof(struct bdev_zone_report) +
+		    sizeof(struct bdev_zone_descriptor)))
+		return -EINVAL;
+
+	bio = bio_alloc(gfp_mask, nr_iovecs);
+	if (!bio)
+		return -ENOMEM;
+
+	conv->descriptor_count = 0;
+	bio->bi_iter.bi_sector = sector;
+	bio->bi_bdev = bdev;
+	bio->bi_vcnt = 0;
+	bio->bi_iter.bi_size = 0;
+
+	bio_add_page(bio, page, pgsz, 0);
+	bio_set_op_attrs(bio, REQ_OP_ZONE_REPORT, op_flags);
+	ret = submit_bio_wait(bio);
+
+	/*
+	 * When our request is nak'd the underlying device may be conventional,
+	 * so report a single conventional zone the size of the device.
+	 */
+	if (ret == -EIO && conv->descriptor_count) {
+		/* Adjust the conventional to the size of the partition ... */
+		__be64 blksz = cpu_to_be64(bdev->bd_part->nr_sects);
+
+		conv->maximum_lba = blksz;
+		conv->descriptors[0].type = ZTYP_CONVENTIONAL;
+		conv->descriptors[0].flags = ZCOND_CONVENTIONAL << 4;
+		conv->descriptors[0].length = blksz;
+		conv->descriptors[0].lba_start = 0;
+		conv->descriptors[0].lba_wptr = blksz;
+		ret = 0;
+	}
+	bio_put(bio);
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_zone_report);
+
+/**
+ * blkdev_issue_zone_action - queue a zone action (open, close or reset)
+ * @bdev:	target blockdev
+ * @op:		One of REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_RESET
+ * @op_flags:	extra bio rw flags. If unsure, use 0.
+ * @sector:	starting lba of sector
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *    Issue a zone action against the zone containing the sector in question.
+ */
+int blkdev_issue_zone_action(struct block_device *bdev, unsigned int op,
+			     unsigned int op_flags, sector_t sector,
+			     gfp_t gfp_mask)
+{
+	int ret;
+	struct bio *bio;
+
+	bio = bio_alloc(gfp_mask, 1);
+	if (!bio)
+		return -ENOMEM;
+
+	bio->bi_iter.bi_sector = sector;
+	bio->bi_bdev = bdev;
+	bio->bi_vcnt = 0;
+	bio->bi_iter.bi_size = 0;
+	bio_set_op_attrs(bio, op, op_flags);
+	ret = submit_bio_wait(bio);
+	bio_put(bio);
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_zone_action);
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d3e852a..b54f359 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -52,6 +52,7 @@
 #include <linux/slab.h>
 #include <linux/pm_runtime.h>
 #include <linux/pr.h>
+#include <linux/blkzoned_api.h>
 #include <asm/uaccess.h>
 #include <asm/unaligned.h>
 
@@ -1134,6 +1135,115 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt)
 	return ret;
 }
 
+static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_device *sdp = cmd->device;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	struct bio *bio = rq->bio;
+	sector_t sector = blk_rq_pos(rq);
+	struct gendisk *disk = rq->rq_disk;
+	unsigned int nr_bytes = blk_rq_bytes(rq);
+	int ret = BLKPREP_KILL;
+
+	WARN_ON(nr_bytes == 0);
+
+	/*
+	 * For conventional drives generate a report that shows a
+	 * large single conventional zone the size of the block device
+	 */
+	if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC) {
+		void *src;
+		struct bdev_zone_report *conv;
+
+		if (nr_bytes < sizeof(struct bdev_zone_report))
+			goto out;
+
+		src = kmap_atomic(bio->bi_io_vec->bv_page);
+		conv = src + bio->bi_io_vec->bv_offset;
+		conv->descriptor_count = cpu_to_be32(1);
+		conv->same_field = ZS_ALL_SAME;
+		conv->maximum_lba = cpu_to_be64(disk->part0.nr_sects);
+		kunmap_atomic(src);
+		goto out;
+	}
+
+	ret = scsi_init_io(cmd);
+	if (ret != BLKPREP_OK)
+		goto out;
+
+	cmd = rq->special;
+	if (sdp->changed) {
+		pr_err("SCSI disk has been changed or is not present.");
+		ret = BLKPREP_KILL;
+		goto out;
+	}
+
+	cmd->cmd_len = 16;
+	memset(cmd->cmnd, 0, cmd->cmd_len);
+	cmd->cmnd[0] = ZBC_IN;
+	cmd->cmnd[1] = ZI_REPORT_ZONES;
+	put_unaligned_be64(sector, &cmd->cmnd[2]);
+	put_unaligned_be32(nr_bytes, &cmd->cmnd[10]);
+	/* FUTURE ... when streamid is available */
+	/* cmd->cmnd[14] = bio_get_streamid(bio); */
+
+	cmd->sc_data_direction = DMA_FROM_DEVICE;
+	cmd->sdb.length = nr_bytes;
+	cmd->transfersize = sdp->sector_size;
+	cmd->underflow = 0;
+	cmd->allowed = SD_MAX_RETRIES;
+	ret = BLKPREP_OK;
+out:
+	return ret;
+}
+
+static int sd_setup_zone_action_cmnd(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	sector_t sector = blk_rq_pos(rq);
+	int ret = BLKPREP_KILL;
+	u8 allbit = 0;
+
+	if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC)
+		goto out;
+
+	if (sector == ~0ul) {
+		allbit = 1;
+		sector = 0;
+	}
+
+	cmd->cmd_len = 16;
+	memset(cmd->cmnd, 0, cmd->cmd_len);
+	memset(&cmd->sdb, 0, sizeof(cmd->sdb));
+	cmd->cmnd[0] = ZBC_OUT;
+	switch (req_op(rq)) {
+	case REQ_OP_ZONE_OPEN:
+		cmd->cmnd[1] = ZO_OPEN_ZONE;
+		break;
+	case REQ_OP_ZONE_CLOSE:
+		cmd->cmnd[1] = ZO_CLOSE_ZONE;
+		break;
+	case REQ_OP_ZONE_RESET:
+		cmd->cmnd[1] = ZO_RESET_WRITE_POINTER;
+		break;
+	default:
+		goto out;
+	}
+	cmd->cmnd[14] = allbit;
+	put_unaligned_be64(sector, &cmd->cmnd[2]);
+
+	cmd->transfersize = 0;
+	cmd->underflow = 0;
+	cmd->allowed = SD_MAX_RETRIES;
+	cmd->sc_data_direction = DMA_NONE;
+
+	ret = BLKPREP_OK;
+out:
+	return ret;
+}
+
 static int sd_init_command(struct scsi_cmnd *cmd)
 {
 	struct request *rq = cmd->request;
@@ -1148,6 +1258,12 @@ static int sd_init_command(struct scsi_cmnd *cmd)
 	case REQ_OP_READ:
 	case REQ_OP_WRITE:
 		return sd_setup_read_write_cmnd(cmd);
+	case REQ_OP_ZONE_REPORT:
+		return sd_setup_zone_report_cmnd(cmd);
+	case REQ_OP_ZONE_OPEN:
+	case REQ_OP_ZONE_CLOSE:
+	case REQ_OP_ZONE_RESET:
+		return sd_setup_zone_action_cmnd(cmd);
 	default:
 		BUG();
 	}
@@ -2737,6 +2853,8 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
 		queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, sdkp->disk->queue);
 	}
 
+	sdkp->zoned = (buffer[8] >> 4) & 3;
+
  out:
 	kfree(buffer);
 }
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 765a6f1..f782990 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -94,6 +94,7 @@ struct scsi_disk {
 	unsigned	lbpvpd : 1;
 	unsigned	ws10 : 1;
 	unsigned	ws16 : 1;
+	unsigned	zoned: 2;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 583c108..ffa3179 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -79,7 +79,12 @@ static inline bool bio_has_data(struct bio *bio)
 
 static inline bool bio_no_advance_iter(struct bio *bio)
 {
-	return bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_WRITE_SAME;
+	return bio_op(bio) == REQ_OP_DISCARD ||
+	       bio_op(bio) == REQ_OP_WRITE_SAME ||
+	       bio_op(bio) == REQ_OP_ZONE_REPORT ||
+	       bio_op(bio) == REQ_OP_ZONE_OPEN ||
+	       bio_op(bio) == REQ_OP_ZONE_CLOSE ||
+	       bio_op(bio) == REQ_OP_ZONE_RESET;
 }
 
 static inline bool bio_is_rw(struct bio *bio)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index f254eb2..35e3813 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -231,13 +231,17 @@ enum rq_flag_bits {
 enum req_op {
 	REQ_OP_READ,
 	REQ_OP_WRITE,
+	REQ_OP_ZONE_REPORT,
+	REQ_OP_ZONE_OPEN,
+	REQ_OP_ZONE_CLOSE,
+	REQ_OP_ZONE_RESET,
 	REQ_OP_DISCARD,		/* request to discard sectors */
 	REQ_OP_SECURE_ERASE,	/* request to securely erase sectors */
 	REQ_OP_WRITE_SAME,	/* write same block many times */
 	REQ_OP_FLUSH,		/* request for cache flush */
 };
 
-#define REQ_OP_BITS 3
+#define REQ_OP_BITS 4
 
 typedef unsigned int blk_qc_t;
 #define BLK_QC_T_NONE	-1U
diff --git a/include/linux/blkzoned_api.h b/include/linux/blkzoned_api.h
new file mode 100644
index 0000000..47c091a
--- /dev/null
+++ b/include/linux/blkzoned_api.h
@@ -0,0 +1,25 @@
+/*
+ * Functions for zone based SMR devices.
+ *
+ * Copyright (C) 2015 Seagate Technology PLC
+ *
+ * Written by:
+ * Shaun Tancheff <shaun.tancheff@seagate.com>
+ *
+ * This file is licensed under  the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#ifndef _BLKZONED_API_H
+#define _BLKZONED_API_H
+
+#include <uapi/linux/blkzoned_api.h>
+
+extern int blkdev_issue_zone_action(struct block_device *, unsigned int op,
+				    unsigned int op_flags, sector_t, gfp_t);
+extern int blkdev_issue_zone_report(struct block_device *, unsigned int op_flgs,
+				    sector_t, u8 opt, struct page *, size_t,
+				    gfp_t);
+
+#endif /* _BLKZONED_API_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index ec10cfe..b94461c 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -70,6 +70,7 @@ header-y += bfs_fs.h
 header-y += binfmts.h
 header-y += blkpg.h
 header-y += blktrace_api.h
+header-y += blkzoned_api.h
 header-y += bpf_common.h
 header-y += bpf.h
 header-y += bpqether.h
diff --git a/include/uapi/linux/blkzoned_api.h b/include/uapi/linux/blkzoned_api.h
new file mode 100644
index 0000000..48c17ad
--- /dev/null
+++ b/include/uapi/linux/blkzoned_api.h
@@ -0,0 +1,214 @@
+/*
+ * Functions for zone based SMR devices.
+ *
+ * Copyright (C) 2015 Seagate Technology PLC
+ *
+ * Written by:
+ * Shaun Tancheff <shaun.tancheff@seagate.com>
+ *
+ * This file is licensed under  the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#ifndef _UAPI_BLKZONED_API_H
+#define _UAPI_BLKZONED_API_H
+
+#include <linux/types.h>
+
+/**
+ * enum zone_report_option - Report Zones types to be included.
+ *
+ * @ZOPT_NON_SEQ_AND_RESET: Default (all zones).
+ * @ZOPT_ZC1_EMPTY: Zones which are empty.
+ * @ZOPT_ZC2_OPEN_IMPLICIT: Zones open but not explicitly opened
+ * @ZOPT_ZC3_OPEN_EXPLICIT: Zones opened explicitly
+ * @ZOPT_ZC4_CLOSED: Zones closed for writing.
+ * @ZOPT_ZC5_FULL: Zones that are full.
+ * @ZOPT_ZC6_READ_ONLY: Zones that are read-only
+ * @ZOPT_ZC7_OFFLINE: Zones that are offline
+ * @ZOPT_RESET: Zones that are empty
+ * @ZOPT_NON_SEQ: Zones that have HA media-cache writes pending
+ * @ZOPT_NON_WP_ZONES: Zones that do not have Write Pointers (conventional)
+ * @ZOPT_PARTIAL_FLAG: Modifies the definition of the Zone List Length field.
+ *
+ * Used by Report Zones in bdev_zone_get_report: report_option
+ */
+enum zone_report_option {
+	ZOPT_NON_SEQ_AND_RESET   = 0x00,
+	ZOPT_ZC1_EMPTY,
+	ZOPT_ZC2_OPEN_IMPLICIT,
+	ZOPT_ZC3_OPEN_EXPLICIT,
+	ZOPT_ZC4_CLOSED,
+	ZOPT_ZC5_FULL,
+	ZOPT_ZC6_READ_ONLY,
+	ZOPT_ZC7_OFFLINE,
+	ZOPT_RESET               = 0x10,
+	ZOPT_NON_SEQ             = 0x11,
+	ZOPT_NON_WP_ZONES        = 0x3f,
+	ZOPT_PARTIAL_FLAG        = 0x80,
+};
+
+/**
+ * enum bdev_zone_type - Type of zone in descriptor
+ *
+ * @ZTYP_RESERVED: Reserved
+ * @ZTYP_CONVENTIONAL: Conventional random write zone (No Write Pointer)
+ * @ZTYP_SEQ_WRITE_REQUIRED: Non-sequential writes are rejected.
+ * @ZTYP_SEQ_WRITE_PREFERRED: Non-sequential writes allowed but discouraged.
+ *
+ * Returned from Report Zones. See bdev_zone_descriptor* type.
+ */
+enum bdev_zone_type {
+	ZTYP_RESERVED            = 0,
+	ZTYP_CONVENTIONAL        = 1,
+	ZTYP_SEQ_WRITE_REQUIRED  = 2,
+	ZTYP_SEQ_WRITE_PREFERRED = 3,
+};
+
+
+/**
+ * enum bdev_zone_condition - Condition of zone in descriptor
+ *
+ * @ZCOND_CONVENTIONAL: N/A
+ * @ZCOND_ZC1_EMPTY: Empty
+ * @ZCOND_ZC2_OPEN_IMPLICIT: Opened via write to zone.
+ * @ZCOND_ZC3_OPEN_EXPLICIT: Opened via open zone command.
+ * @ZCOND_ZC4_CLOSED: Closed
+ * @ZCOND_ZC6_READ_ONLY:
+ * @ZCOND_ZC5_FULL: No remaining space in zone.
+ * @ZCOND_ZC7_OFFLINE: Offline
+ *
+ * Returned from Report Zones. See bdev_zone_descriptor* flags.
+ */
+enum bdev_zone_condition {
+	ZCOND_CONVENTIONAL       = 0,
+	ZCOND_ZC1_EMPTY          = 1,
+	ZCOND_ZC2_OPEN_IMPLICIT  = 2,
+	ZCOND_ZC3_OPEN_EXPLICIT  = 3,
+	ZCOND_ZC4_CLOSED         = 4,
+	/* 0x5 to 0xC are reserved */
+	ZCOND_ZC6_READ_ONLY      = 0xd,
+	ZCOND_ZC5_FULL           = 0xe,
+	ZCOND_ZC7_OFFLINE        = 0xf,
+};
+
+
+/**
+ * enum bdev_zone_same - Report Zones same code.
+ *
+ * @ZS_ALL_DIFFERENT: All zones differ in type and size.
+ * @ZS_ALL_SAME: All zones are the same size and type.
+ * @ZS_LAST_DIFFERS: All zones are the same size and type except the last zone.
+ * @ZS_SAME_LEN_DIFF_TYPES: All zones are the same length but types differ.
+ *
+ * Returned from Report Zones. See bdev_zone_report* same_field.
+ */
+enum bdev_zone_same {
+	ZS_ALL_DIFFERENT        = 0,
+	ZS_ALL_SAME             = 1,
+	ZS_LAST_DIFFERS         = 2,
+	ZS_SAME_LEN_DIFF_TYPES  = 3,
+};
+
+
+/**
+ * struct bdev_zone_get_report - ioctl: Report Zones request
+ *
+ * @zone_locator_lba: starting lba for first [reported] zone
+ * @return_page_count: number of *bytes* allocated for result
+ * @report_option: see: zone_report_option enum
+ *
+ * Used to issue report zones command to connected device
+ */
+struct bdev_zone_get_report {
+	__u64 zone_locator_lba;
+	__u32 return_page_count;
+	__u8  report_option;
+} __packed;
+
+/**
+ * struct bdev_zone_descriptor_le - See: bdev_zone_descriptor
+ */
+struct bdev_zone_descriptor_le {
+	__u8 type;
+	__u8 flags;
+	__u8 reserved1[6];
+	__le64 length;
+	__le64 lba_start;
+	__le64 lba_wptr;
+	__u8 reserved[32];
+} __packed;
+
+
+/**
+ * struct bdev_zone_report_le - See: bdev_zone_report
+ */
+struct bdev_zone_report_le {
+	__le32 descriptor_count;
+	__u8 same_field;
+	__u8 reserved1[3];
+	__le64 maximum_lba;
+	__u8 reserved2[48];
+	struct bdev_zone_descriptor_le descriptors[0];
+} __packed;
+
+
+/**
+ * struct bdev_zone_descriptor - A Zone descriptor entry from report zones
+ *
+ * @type: see zone_type enum
+ * @flags: Bits 0:reset, 1:non-seq, 2-3: resv, 4-7: see zone_condition enum
+ * @reserved1: padding
+ * @length: length of zone in sectors
+ * @lba_start: lba where the zone starts.
+ * @lba_wptr: lba of the current write pointer.
+ * @reserved: padding
+ *
+ */
+struct bdev_zone_descriptor {
+	__u8 type;
+	__u8 flags;
+	__u8  reserved1[6];
+	__be64 length;
+	__be64 lba_start;
+	__be64 lba_wptr;
+	__u8 reserved[32];
+} __packed;
+
+
+/**
+ * struct bdev_zone_report - Report Zones result
+ *
+ * @descriptor_count: Number of descriptor entries that follow
+ * @same_field: bits 0-3: enum zone_same (MASK: 0x0F)
+ * @reserved1: padding
+ * @maximum_lba: LBA of the last logical sector on the device, inclusive
+ *               of all logical sectors in all zones.
+ * @reserved2: padding
+ * @descriptors: array of descriptors follows.
+ */
+struct bdev_zone_report {
+	__be32 descriptor_count;
+	__u8 same_field;
+	__u8 reserved1[3];
+	__be64 maximum_lba;
+	__u8 reserved2[48];
+	struct bdev_zone_descriptor descriptors[0];
+} __packed;
+
+
+/**
+ * struct bdev_zone_report_io - Report Zones ioctl argument.
+ *
+ * @in: Report Zones inputs
+ * @out: Report Zones output
+ */
+struct bdev_zone_report_io {
+	union {
+		struct bdev_zone_get_report in;
+		struct bdev_zone_report out;
+	} data;
+} __packed;
+
+#endif /* _UAPI_BLKZONED_API_H */
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer
  2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff
  2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff
@ 2016-07-29 19:02 ` Shaun Tancheff
  2016-08-01  9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig
  2 siblings, 0 replies; 17+ messages in thread
From: Shaun Tancheff @ 2016-07-29 19:02 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel
  Cc: Shaun Tancheff, Jens Axboe, Jens Axboe, Christoph Hellwig,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman, Shaun Tancheff

Add support for ZBC ioctls (a usage sketch follows the list):
    BLKREPORT    - Issue Report Zones to device.
    BLKOPENZONE  - Issue Zone Action: Open Zone command.
    BLKCLOSEZONE - Issue Zone Action: Close Zone command.
    BLKRESETZONE - Issue Zone Action: Reset Zone command.
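
Not part of the patch, but roughly how a userspace tool might drive these
ioctls (the device path, buffer size and example LBA are arbitrary):

#include <endian.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned_api.h>

int main(int argc, char **argv)
{
	size_t bufsz = 128 * 1024;	/* room for a few thousand descriptors */
	struct bdev_zone_report_io *zio = calloc(1, bufsz);
	int fd;

	if (argc < 2 || !zio)
		return 1;
	fd = open(argv[1], O_RDWR);	/* whole disk, e.g. /dev/sdb */
	if (fd < 0)
		return 1;

	zio->data.in.zone_locator_lba = 0;	/* start at the first zone */
	zio->data.in.return_page_count = bufsz;	/* bytes, despite the name */
	zio->data.in.report_option = 0;		/* 0 == report all zones */

	if (ioctl(fd, BLKREPORT, zio) == 0)
		printf("zones reported: %u\n",
		       be32toh(zio->data.out.descriptor_count));

	/* Reset the write pointer of the zone starting at LBA 0x80000. */
	if (ioctl(fd, BLKRESETZONE, 0x80000ul) != 0)
		perror("BLKRESETZONE");

	close(fd);
	free(zio);
	return 0;
}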

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
---
v6:
 - Added GFP_DMA to gfp mask.
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
---
 block/ioctl.c                     | 110 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/blkzoned_api.h |   6 +++
 include/uapi/linux/fs.h           |   1 +
 3 files changed, 117 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index ed2397f..a2a6c2c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -7,6 +7,7 @@
 #include <linux/backing-dev.h>
 #include <linux/fs.h>
 #include <linux/blktrace_api.h>
+#include <linux/blkzoned_api.h>
 #include <linux/pr.h>
 #include <asm/uaccess.h>
 
@@ -194,6 +195,109 @@ int blkdev_reread_part(struct block_device *bdev)
 }
 EXPORT_SYMBOL(blkdev_reread_part);
 
+static int blk_zoned_report_ioctl(struct block_device *bdev, fmode_t mode,
+		void __user *parg)
+{
+	int error = -EFAULT;
+	gfp_t gfp = GFP_KERNEL | GFP_DMA;
+	struct bdev_zone_report_io *zone_iodata = NULL;
+	int order = 0;
+	struct page *pgs = NULL;
+	u32 alloc_size = PAGE_SIZE;
+	unsigned long op_flags = 0;
+	u8 opt = 0;
+
+	if (!(mode & FMODE_READ))
+		return -EBADF;
+
+	zone_iodata = (void *)get_zeroed_page(gfp);
+	if (!zone_iodata) {
+		error = -ENOMEM;
+		goto report_zones_out;
+	}
+	if (copy_from_user(zone_iodata, parg, sizeof(*zone_iodata))) {
+		error = -EFAULT;
+		goto report_zones_out;
+	}
+	if (zone_iodata->data.in.return_page_count > alloc_size) {
+		int npages;
+
+		alloc_size = zone_iodata->data.in.return_page_count;
+		npages = (alloc_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		pgs = alloc_pages(gfp, ilog2(npages));
+		if (pgs) {
+			void *mem = page_address(pgs);
+
+			if (!mem) {
+				error = -ENOMEM;
+				goto report_zones_out;
+			}
+			order = ilog2(npages);
+			memset(mem, 0, alloc_size);
+			memcpy(mem, zone_iodata, sizeof(*zone_iodata));
+			free_page((unsigned long)zone_iodata);
+			zone_iodata = mem;
+		} else {
+			/* Result requires DMA capable memory */
+			pr_err("Not enough memory available for request.\n");
+			error = -ENOMEM;
+			goto report_zones_out;
+		}
+	}
+	opt = zone_iodata->data.in.report_option;
+	error = blkdev_issue_zone_report(bdev, op_flags,
+			zone_iodata->data.in.zone_locator_lba, opt,
+			pgs ? pgs : virt_to_page(zone_iodata),
+			alloc_size, GFP_KERNEL);
+
+	if (error)
+		goto report_zones_out;
+
+	if (copy_to_user(parg, zone_iodata, alloc_size))
+		error = -EFAULT;
+
+report_zones_out:
+	if (pgs)
+		__free_pages(pgs, order);
+	else if (zone_iodata)
+		free_page((unsigned long)zone_iodata);
+	return error;
+}
+
+static int blk_zoned_action_ioctl(struct block_device *bdev, fmode_t mode,
+				  unsigned int cmd, unsigned long arg)
+{
+	unsigned int op = 0;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	/*
+	 * When acting on zones we explicitly disallow using a partition.
+	 */
+	if (bdev != bdev->bd_contains) {
+		pr_err("%s: All zone operations disallowed on this device\n",
+			__func__);
+		return -EFAULT;
+	}
+
+	switch (cmd) {
+	case BLKOPENZONE:
+		op = REQ_OP_ZONE_OPEN;
+		break;
+	case BLKCLOSEZONE:
+		op = REQ_OP_ZONE_CLOSE;
+		break;
+	case BLKRESETZONE:
+		op = REQ_OP_ZONE_RESET;
+		break;
+	default:
+		pr_err("%s: Unknown action: %u\n", __func__, cmd);
+		return -EINVAL;
+	}
+	return blkdev_issue_zone_action(bdev, op, 0, arg, GFP_KERNEL);
+}
+
 static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 		unsigned long arg, unsigned long flags)
 {
@@ -568,6 +672,12 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	case BLKTRACESETUP:
 	case BLKTRACETEARDOWN:
 		return blk_trace_ioctl(bdev, cmd, argp);
+	case BLKREPORT:
+		return blk_zoned_report_ioctl(bdev, mode, argp);
+	case BLKOPENZONE:
+	case BLKCLOSEZONE:
+	case BLKRESETZONE:
+		return blk_zoned_action_ioctl(bdev, mode, cmd, arg);
 	case IOC_PR_REGISTER:
 		return blkdev_pr_register(bdev, argp);
 	case IOC_PR_RESERVE:
diff --git a/include/uapi/linux/blkzoned_api.h b/include/uapi/linux/blkzoned_api.h
index 48c17ad..3566de0 100644
--- a/include/uapi/linux/blkzoned_api.h
+++ b/include/uapi/linux/blkzoned_api.h
@@ -211,4 +211,10 @@ struct bdev_zone_report_io {
 	} data;
 } __packed;
 
+/* continuing from uapi/linux/fs.h: */
+#define BLKREPORT	_IOWR(0x12, 130, struct bdev_zone_report_io)
+#define BLKOPENZONE	_IO(0x12, 131)
+#define BLKCLOSEZONE	_IO(0x12, 132)
+#define BLKRESETZONE	_IO(0x12, 133)
+
 #endif /* _UAPI_BLKZONED_API_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3b00f7c..c0b565b 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -222,6 +222,7 @@ struct fsxattr {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+/* A jump here: See blkzoned_api.h, Reserving 130 to 133. */
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff
  2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff
  2016-07-29 19:02 ` [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer Shaun Tancheff
@ 2016-08-01  9:41 ` Christoph Hellwig
  2016-08-01 17:07   ` Shaun Tancheff
  2 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2016-08-01  9:41 UTC (permalink / raw)
  To: Shaun Tancheff
  Cc: linux-block, linux-scsi, linux-kernel, Jens Axboe, Jens Axboe,
	Christoph Hellwig, James E . J . Bottomley, Martin K . Petersen,
	Jeff Layton, J . Bruce Fields, Josh Bingaman


Can you please integrate this with Hannes series so that it uses
his cache of the zone information?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-01  9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig
@ 2016-08-01 17:07   ` Shaun Tancheff
  2016-08-02 14:35     ` Hannes Reinecke
  0 siblings, 1 reply; 17+ messages in thread
From: Shaun Tancheff @ 2016-08-01 17:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe,
	Jens Axboe, James E . J . Bottomley, Martin K . Petersen,
	Jeff Layton, J . Bruce Fields, Josh Bingaman, Hannes Reinecke,
	Damien Le Moal

On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>
> Can you please integrate this with Hannes series so that it uses
> his cache of the zone information?

Adding Hannes and Damien to Cc.

Christoph,

I can make a patch that marshals Hannes' RB-Tree into a block report; that is
quite simple. I can even have the open/close/reset zone commands update the
RB-Tree .. the non-private parts anyway. I would prefer to do this around the
CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
not need the RB-Tree to function with zoned media.

I do still have concerns with the approach which I have shared in smaller
forums but perhaps I have to bring them to this group.

First is the memory consumption. This isn't really much of a concern for large
servers with few drives but I think the embedded NAS market will grumble as
well as the large data pods trying to stuff 300+ drives in a chassis.

As of now the RB-Tree needs to hold ~30000 zones.
sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
around 3.5 MB per zoned drive attached.
Which is fine if it is really needed, but most of it is fixed information
and it can be significantly condensed (I have proposed 8 bytes per zone held
in an array as more than adequate). Worse is that the crucial piece of
information, the current wp needed for scheduling the next write, is mostly
out of date because it is updated only after the write completes and zones
being actively written to must work off of the last location / size that was
submitted, not completed. The workaround is for that tracking to be handled
in the private_data member. I am not saying that updating the wp on
completing a write isn’t important, I am saying that the bi_end_io hook is
the existing hook that works just fine.
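
Just to illustrate the 8-bytes-per-zone point (a rough sketch, not a final
layout): with a constant zone size the zone's start LBA is implied by its
array index, so per-zone state could be as small as:

struct condensed_zone {
	u64 wp_offset : 48;	/* sectors written into the zone */
	u64 cond      :  4;	/* enum bdev_zone_condition */
	u64 type      :  2;	/* enum bdev_zone_type */
	u64 non_seq   :  1;
	u64 reset     :  1;
	u64 reserved  :  8;
};	/* ~30000 zones * 8 bytes is roughly 240 KB vs ~3.5 MB today */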

This all tails into domain responsibility. With the RB-Tree doing half of the
work and the ‘responsible’ domain handling the active path via private_data,
why have the split at all? It seems to be double work to have a second object
tracking the first so that I/O scheduling can function.

Finally, there is the error handling path: when the RB-Tree encounters an
error it attempts to requery the drive topology, virtually guaranteeing that
the private_data is now out-of-sync with the RB-Tree. Again, this is something
that can be better encapsulated in bi_end_io, which is informed of the
failed I/O and can schedule the appropriate recovery (including re-querying
the zone information of the affected zone(s)).

Anyway those are my concerns and why I am still reluctant to drop this line of
support. I have incorporated Hannes' changes at various points. Hence the
SCT Write Same to attempt to work around some of the flaws in mapping
discard to reset write pointer.

Thanks and Regards,
Shaun




-- 
Shaun Tancheff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-01 17:07   ` Shaun Tancheff
@ 2016-08-02 14:35     ` Hannes Reinecke
  2016-08-03  1:29       ` Damien Le Moal
  0 siblings, 1 reply; 17+ messages in thread
From: Hannes Reinecke @ 2016-08-02 14:35 UTC (permalink / raw)
  To: Shaun Tancheff, Christoph Hellwig
  Cc: Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe,
	Jens Axboe, James E . J . Bottomley, Martin K . Petersen,
	Jeff Layton, J . Bruce Fields, Josh Bingaman, Damien Le Moal

On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>
>> Can you please integrate this with Hannes series so that it uses
>> his cache of the zone information?
> 
> Adding Hannes and Damien to Cc.
> 
> Christoph,
> 
> I can make a patch the marshal Hannes' RB-Tree into to a block report, that is
> quite simple. I can even have the open/close/reset zone commands update the
> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
> not need the RB-Tree to function with zoned media.
> 
> I do still have concerns with the approach which I have shared in smaller
> forums but perhaps I have to bring them to this group.
> 
> First is the memory consumption. This isn't really much of a concern for large
> servers with few drives but I think the embedded NAS market will grumble as
> well as the large data pods trying to stuff 300+ drives in a chassis.
> 
> As of now the RB-Tree needs to hold ~30000 zones.
> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
> around 3.5 MB per zoned drive attached.
> Which is fine if it is really needed, but most of it is fixed information
> and it can be significantly condensed (I have proposed 8 bytes per zone held
> in an array as more than adequate). Worse is that the crucial piece of
> information, the current wp needed for scheduling the next write, is mostly
> out of date because it is updated only after the write completes and zones
> being actively written to must work off of the last location / size that was
> submitted, not completed. The work around is for that tracking to be handled
> in the private_data member. I am not saying that updating the wp on
> completing a write isn’t important, I am saying that the bi_end_io hook is
> the existing hook that works just fine.
> 
Which _actually_ is not true; with my patches I'll update the write
pointer prior to submitting the I/O (on the reasoning that most of the time
I/O will succeed) and re-read the zone information if an I/O failed.
(Which I'll have to do anyway as after an I/O failure the write pointer
status is not clearly defined.)

I have thought about condensing the RB tree information, but then I
figured that for 'real' SMR handling we cannot assume all zones are of
fixed size, and hence we need all the information there.
Any condensing method would assume a given structure of the zones, which
the standard just doesn't provide.
Or am I missing something here?

As for write pointer handling: yes, the write pointer on the zones is
not really useful for upper-level usage.
Where we do need it is to detect I/O which is crossing the write pointer
(eg when doing reads over the entire zone).
As per spec you will be getting an I/O error here, so we need to split
the I/O on the write pointer to get valid results back.

> This all tails into domain responsibility. With the RB-Tree doing half of the
> work and the ‘responsible’ domain handling the active path via private_data
> why have the split at all? It seems to be a double work to have second object
> tracking the first so that I/O scheduling can function.
> 
Tracking the zone states via RB trees is mainly to handle host-managed
drives seamlessly. Without an RB tree we will be subjected to I/O errors
during boot, as eg with new drives we inevitably will try to read beyond
the write pointer, which in turn will generate I/O errors on certain drives.
I do agree that there is no strict need to setup an RB tree for
host-aware drives; I can easily add an attribute/flag to disable RB
trees for those.
However, tracking zone information in the RB tree _dramatically_ speeds
up device initialisation; issuing a blkdev_discard() for the entire
drive will take _ages_ without it.

> Finally is the error handling path when the RB-Tree encounters and error it
> attempts to requery the drive topology virtually guaranteeing that the
> private_data is now out-of-sync with the RB-Tree. Again this is something
> that can be better encapsulated in the bi_end_io to be informed of the
> failed I/O and schedule the appropriate recovery (including re-querying the
> zone information of the affected zone(s)).
> 
Well, if we have an RB tree with write pointers of course we need to
re-read the zone information on failure (and I thought I did that?).
Plus the error information will be propagated via the usual mechanism,
so they need to take care of updating any information in ->private_data.

I'm perfectly fine with integrating your patches for the various
blkdev_XX callouts and associated ioctls; Jens favours that approach,
too, so we should be combining those efforts.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-02 14:35     ` Hannes Reinecke
@ 2016-08-03  1:29       ` Damien Le Moal
  2016-08-05 20:35         ` Shaun Tancheff
  0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2016-08-03  1:29 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Shaun Tancheff, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

Hannes, Shaun,

Let me add some more comments.

> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> 
>>> Can you please integrate this with Hannes series so that it uses
>>> his cache of the zone information?
>> 
>> Adding Hannes and Damien to Cc.
>> 
>> Christoph,
>> 
>> I can make a patch the marshal Hannes' RB-Tree into to a block report, that is
>> quite simple. I can even have the open/close/reset zone commands update the
>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>> not need the RB-Tree to function with zoned media.
>> 
>> I do still have concerns with the approach which I have shared in smaller
>> forums but perhaps I have to bring them to this group.
>> 
>> First is the memory consumption. This isn't really much of a concern for large
>> servers with few drives but I think the embedded NAS market will grumble as
>> well as the large data pods trying to stuff 300+ drives in a chassis.
>> 
>> As of now the RB-Tree needs to hold ~30000 zones.
>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
>> around 3.5 MB per zoned drive attached.
>> Which is fine if it is really needed, but most of it is fixed information
>> and it can be significantly condensed (I have proposed 8 bytes per zone held
>> in an array as more than adequate). Worse is that the crucial piece of
>> information, the current wp needed for scheduling the next write, is mostly
>> out of date because it is updated only after the write completes and zones
>> being actively written to must work off of the last location / size that was
>> submitted, not completed. The work around is for that tracking to be handled
>> in the private_data member. I am not saying that updating the wp on
>> completing a write isn’t important, I am saying that the bi_end_io hook is
>> the existing hook that works just fine.
>> 
> Which _actually_ is not true; with my patches I'll update the write
> pointer prior to submit the I/O (on the reasoning that most of the time
> I/O will succeed) and re-read the zone information if an I/O failed.
> (Which I'll have to do anyway as after an I/O failure the write pointer
> status is not clearly defined.)
> 
> I have thought about condensing the RB tree information, but then I
> figured that for 'real' SMR handling we cannot assume all zones are of
> fixed size, and hence we need all the information there.
> Any condensing method would assume a given structure of the zones, which
> the standard just doesn't provide.
> Or am I missing something here?

Indeed, the standards do not mandate any particular zone configuration,
constant zone size, etc. So writing code so that this can be handled is certainly
the right way of doing things. However, if we decide to go forward with
mapping RESET WRITE POINTER command to DISCARD, then at least a constant
zone size (minus the last zone as you said) must be assumed, and that
information can be removed from the entries in the RB tree (as it will be
saved for the sysfs "zone_size" file anyway). Adding a little code to handle
the eventual last runt zone with a different size is not a big problem.

> 
> As for write pointer handling: yes, the write pointer on the zones is
> not really useful for upper-level usage.
> Where we do need it is to detect I/O which is crossing the write pointer
> (eg when doing reads over the entire zone).
> As per spec you will be getting an I/O error here, so we need to split
> the I/O on the write pointer to get valid results back.

To be precise here, the I/O splitting will be handled by the block layer
thanks to the "chunk_sectors" setting. But that relies on a constant zone
size assumption too.
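
(Roughly, and only as an illustration assuming a constant, power-of-two zone
size is known, something like the following in sd would be enough:

	/* Have the block layer split bios so no request crosses a zone. */
	if (zone_size_sectors)
		blk_queue_chunk_sectors(sdkp->disk->queue, zone_size_sectors);

chunk_sectors must currently be a power of two, which fits the common 256MB
zone size.)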

The RB-tree here is most useful for reads over or after the write pointer as
this can have different behavior on different drives (URSWRZ bit). The RB-tree
allows us to hide these differences from upper layers and simplify support at
those levels.

I too agree that the write pointer value is not very useful to upper layers
(e.g. FS). What matters most of the times for these layers is an "allocation
pointer" or "submission pointer" which indicates the LBA to use for the next
BIO submission. That LBA will most of the time be in advance of the zone WP
because of request queing in the block scheduler.
Note that from what we have done already, in many cases, upper layers do not
even need this (e.g. F2FS, btrfs) and work fine without needing *any* 
zone->private_data because that "allocation pointer" information generally
already exists in a different form (e.g. a bit in a bitmap).
For these cases, not having the RB-tree of zones would force a lot
more code to be added to file systems for SMR support.

> 
>> This all tails into domain responsibility. With the RB-Tree doing half of the
>> work and the ‘responsible’ domain handling the active path via private_data
>> why have the split at all? It seems to be a double work to have second object
>> tracking the first so that I/O scheduling can function.
>> 
> Tracking the zone states via RB trees is mainly to handly host-managed
> drives seamlessly. Without an RB tree we will be subjected to I/O errors
> during boot, as eg with new drives we inevitably will try to read beyond
> the write pointer, which in turn will generate I/O errors on certain drives.
> I do agree that there is no strict need to setup an RB tree for
> host-aware drives; I can easily add an attribute/flag to disable RB
> trees for those.
> However, tracking zone information in the RB tree _dramatically_ speeds
> up device initialisation; issuing a blkdev_discard() for the entire
> drive will take _ages_ without it.

And the RB-tree will also be very useful for speeding up report zones
ioctls. Unless we want those to force an update of the RB-tree information?

> 
>> Finally is the error handling path when the RB-Tree encounters and error it
>> attempts to requery the drive topology virtually guaranteeing that the
>> private_data is now out-of-sync with the RB-Tree. Again this is something
>> that can be better encapsulated in the bi_end_io to be informed of the
>> failed I/O and schedule the appropriate recovery (including re-querying the
>> zone information of the affected zone(s)).
>> 
> Well, if we have an RB tree with write pointers of course we need to
> re-read the zone information on failure (and I thought I did that?).
> Plus the error information will be propagated via the usual mechanism,
> so they need to take care of updating any information in ->private_data.
> 
> I'm perfectly fine with integrating your patches for the various
> blkdev_XX callouts and associated ioctls; Jens favours that approach,
> too, so we should be combining those efforts.

Can we agree on an interface ?
Summarizing all the discussions we had, I am all in favor of the following:

1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
The already existing device type file can tell us if this is an host
managed disk (type==20) or a host aware disk (type==0). Drive managed
disks could also be detected, but since no information on their zone
configuration can be obtained, lets treat those as regular drives.

2) Add ioctls for zone management:
Report zones (get information from RB tree), reset zone (simple wrapper
to ioctl for block discard), open zone, close zone and finish zone. That
will allow mkfs like tools to get zone information without having to parse
thousands of sysfs files (and can also be integrated in libzbc block backend
driver for a unified interface with the direct SG_IO path for kernels without
the ZBC code enabled).

3) Try to condense the blkzone data structure to save memory:
I think that we can at the very least remove the zone length, and maybe
also the per-zone spinlock (a single spinlock and proper state flags can
be used).

Best regards.

------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-03  1:29       ` Damien Le Moal
@ 2016-08-05 20:35         ` Shaun Tancheff
  2016-08-09  6:47           ` Hannes Reinecke
  0 siblings, 1 reply; 17+ messages in thread
From: Shaun Tancheff @ 2016-08-05 20:35 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
> Hannes, Shaun,
>
> Let me add some more comments.
>
>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
>>
>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>>>
>>>> Can you please integrate this with Hannes series so that it uses
>>>> his cache of the zone information?
>>>
>>> Adding Hannes and Damien to Cc.
>>>
>>> Christoph,
>>>
>>> I can make a patch the marshal Hannes' RB-Tree into to a block report, that is
>>> quite simple. I can even have the open/close/reset zone commands update the
>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>>> not need the RB-Tree to function with zoned media.

I have posted patches to integrate with the zone cache, hopefully they
make sense.

>>>
>>> I do still have concerns with the approach which I have shared in smaller
>>> forums but perhaps I have to bring them to this group.
>>>
>>> First is the memory consumption. This isn't really much of a concern for large
>>> servers with few drives but I think the embedded NAS market will grumble as
>>> well as the large data pods trying to stuff 300+ drives in a chassis.
>>>
>>> As of now the RB-Tree needs to hold ~30000 zones.
>>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
>>> around 3.5 MB per zoned drive attached.
>>> Which is fine if it is really needed, but most of it is fixed information
>>> and it can be significantly condensed (I have proposed 8 bytes per zone held
>>> in an array as more than adequate). Worse is that the crucial piece of
>>> information, the current wp needed for scheduling the next write, is mostly
>>> out of date because it is updated only after the write completes and zones
>>> being actively written to must work off of the last location / size that was
>>> submitted, not completed. The work around is for that tracking to be handled
>>> in the private_data member. I am not saying that updating the wp on
>>> completing a write isn’t important, I am saying that the bi_end_io hook is
>>> the existing hook that works just fine.
>>>
>> Which _actually_ is not true; with my patches I'll update the write
>> pointer prior to submit the I/O (on the reasoning that most of the time
>> I/O will succeed) and re-read the zone information if an I/O failed.
>> (Which I'll have to do anyway as after an I/O failure the write pointer
>> status is not clearly defined.)

Apologies for my mis-characterization.

>> I have thought about condensing the RB tree information, but then I
>> figured that for 'real' SMR handling we cannot assume all zones are of
>> fixed size, and hence we need all the information there.
>> Any condensing method would assume a given structure of the zones, which
>> the standard just doesn't provide.
>> Or am I missing something here?
>
> Indeed, the standards do not mandate any particular zone configuration,
> constant zone size, etc. So writing code so that can be handled is certainly
> the right way of doing things. However, if we decide to go forward with
> mapping RESET WRITE POINTER command to DISCARD, then at least a constant
> zone size (minus the last zone as you said) must be assumed, and that
> information can be removed from the entries in the RB tree (as it will be
> saved for the sysfs "zone_size" file anyway. Adding a little code to handle
> that eventual last runt zone with a different size is not a big problem.

>> As for write pointer handling: yes, the write pointer on the zones is
>> not really useful for upper-level usage.
>> Where we do need it is to detect I/O which is crossing the write pointer
>> (eg when doing reads over the entire zone).
>> As per spec you will be getting an I/O error here, so we need to split
>> the I/O on the write pointer to get valid results back.
>
> To be precise here, the I/O splitting will be handled by the block layer
> thanks to the "chunk_sectors" setting. But that relies on a constant zone
> size assumption too.
>
> The RB-tree here is most useful for reads over or after the write pointer as
> this can have different behavior on different drives (URSWRZ bit). The RB-tree
> allows us to hide these differences to upper layers and simplify support at
> those levels.

It is unfortunate that path was chosen rather than returning Made Up Data.
However I don't think this is a file system or device mapper problem as
neither of those layers attempt to read blocks they have not written
(or discarded).
All I can think of is something that probes the drive media for a partition
table or other identification... isn't the conventional space sufficient to cover
those cases?
Anyway you could just handle the error and sense codes [CHECK CONDITION /
ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD
when URSWRZ is 0. It would have the same effect as using the zone cache
without the possibility of the zone cache being out-of-sync.

> I too agree that the write pointer value is not very useful to upper layers
> (e.g. FS). What matters most of the times for these layers is an "allocation
> pointer" or "submission pointer" which indicates the LBA to use for the next
> BIO submission. That LBA will most of the time be in advance of the zone WP
> because of request queing in the block scheduler.

Which is kind of my point.

> Note that from what we have done already, in many cases, upper layers do not
> even need this (e.g. F2FS, btrfs) and work fine without needing *any*
> zone->private_data because that "allocation pointer" information generally
> already exists in a different form (e.g. a bit in a bitmap).

Agreed. Log structured and copy on write file systems are a natural fit for
these types of drives.

> For these cases, not having the RB-tree of zones would force a lot
> more code to be added to file systems for SMR support.

I don't understand how the zone cache (RB-tree) is "helping" F2FS or btrfs.
Certainly any of these could trigger the zone cache to be loaded at mkfs
and/or mount time?

>>> This all tails into domain responsibility. With the RB-Tree doing half of the
>>> work and the ‘responsible’ domain handling the active path via private_data
>>> why have the split at all? It seems to be a double work to have second object
>>> tracking the first so that I/O scheduling can function.
>>>
>> Tracking the zone states via RB trees is mainly to handly host-managed
>> drives seamlessly. Without an RB tree we will be subjected to I/O errors
>> during boot, as eg with new drives we inevitably will try to read beyond
>> the write pointer, which in turn will generate I/O errors on certain drives.

If the zone cache is needed to boot then clearly my objection to reading
in the zone report information on drive attach is unfounded. I was under
the impression that this was not the case.

>> I do agree that there is no strict need to setup an RB tree for
>> host-aware drives; I can easily add an attribute/flag to disable RB
>> trees for those.
>> However, tracking zone information in the RB tree _dramatically_ speeds
>> up device initialisation; issuing a blkdev_discard() for the entire
>> drive will take _ages_ without it.
>
> And the RB-tree will also be very useful for speeding up report zones
> ioctls too. Unless we want those to force an update of the RB-tree information ?

That is debatable. I would tend to argue for both. *If* there must be
a zone cache then reading from the zone cache is a must. Forcing an
update of the zone cache from the media seems like a reasonable thing
to be able to do.

Also the zone report is 'slow' in that there is an overhead for the
report itself, but the number of zones per query can be quite large, so
4 or 5 I/Os that run into the hundreds of milliseconds to cache the
entire drive isn't really unworkable for something that is used
infrequently.

>>> Finally, there is the error handling path: when the RB-Tree encounters an error
>>> it attempts to requery the drive topology, virtually guaranteeing that the
>>> private_data is now out-of-sync with the RB-Tree. Again this is something
>>> that can be better encapsulated in the bi_end_io to be informed of the
>>> failed I/O and schedule the appropriate recovery (including re-querying the
>>> zone information of the affected zone(s)).
>>>
>> Well, if we have an RB tree with write pointers of course we need to
>> re-read the zone information on failure (and I thought I did that?).
>> Plus the error information will be propagated via the usual mechanism,
>> so they need to take care of updating any information in ->private_data.
>>
>> I'm perfectly fine with integrating your patches for the various
>> blkdev_XX callouts and associated ioctls; Jens favours that approach,
>> too, so we should be combining those efforts.
>
> Can we agree on an interface ?
> Summarizing all the discussions we had, I am all in favor of the following:
>
> 1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
> The already existing device type file can tell us if this is a host
> managed disk (type==20) or a host aware disk (type==0). Drive managed
> disks could also be detected, but since no information on their zone
> configuration can be obtained, let's treat those as regular drives.

Since disk type == 0 for everything that isn't HM, I would prefer the
sysfs 'zoned' file just report if the drive is HA or HM.

> 2) Add ioctls for zone management:
> Report zones (get information from RB tree), reset zone (simple wrapper
> to ioctl for block discard), open zone, close zone and finish zone. That
> will allow mkfs like tools to get zone information without having to parse
> thousands of sysfs files (and can also be integrated in libzbc block backend
> driver for a unified interface with the direct SG_IO path for kernels without
> the ZBC code enabled).

I can add finish zone ... but I really can't think of a use for it, myself.

> 3) Try to condense the blkzone data structure to save memory:
> I think that we can at the very least remove the zone length, and also
> maybe the per zone spinlock too (a single spinlock and proper state flags can
> be used).

I have a variant that is an array of descriptors that roughly mimics the
api from blk-zoned.c that I did a few months ago as an example.
I should be able to update that to the current kernel + patches.

-- 
Shaun Tancheff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-05 20:35         ` Shaun Tancheff
@ 2016-08-09  6:47           ` Hannes Reinecke
  2016-08-09  8:09             ` Damien Le Moal
                               ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Hannes Reinecke @ 2016-08-09  6:47 UTC (permalink / raw)
  To: Shaun Tancheff, Damien Le Moal
  Cc: Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML,
	Jens Axboe, Jens Axboe, James E . J . Bottomley,
	Martin K . Petersen, Jeff Layton, J . Bruce Fields,
	Josh Bingaman

On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>> Hannes, Shaun,
>>
>> Let me add some more comments.
>>
>>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
>>>
>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>>>>
>>>>> Can you please integrate this with Hannes series so that it uses
>>>>> his cache of the zone information?
>>>>
>>>> Adding Hannes and Damien to Cc.
>>>>
>>>> Christoph,
>>>>
>>>> I can make a patch to marshal Hannes' RB-Tree into a block report, that is
>>>> quite simple. I can even have the open/close/reset zone commands update the
>>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>>>> not need the RB-Tree to function with zoned media.
> 
> I have posted patches to integrate with the zone cache, hopefully they
> make sense.
> 
[ .. ]
>>> I have thought about condensing the RB tree information, but then I
>>> figured that for 'real' SMR handling we cannot assume all zones are of
>>> fixed size, and hence we need all the information there.
>>> Any condensing method would assume a given structure of the zones, which
>>> the standard just doesn't provide.
>>> Or am I missing something here?
>>
>> Indeed, the standards do not mandate any particular zone configuration,
>> constant zone size, etc. So writing code so that this can be handled is certainly
>> the right way of doing things. However, if we decide to go forward with
>> mapping RESET WRITE POINTER command to DISCARD, then at least a constant
>> zone size (minus the last zone as you said) must be assumed, and that
>> information can be removed from the entries in the RB tree (as it will be
>> saved for the sysfs "zone_size" file anyway). Adding a little code to handle
>> that eventual last runt zone with a different size is not a big problem.
> 
>>> As for write pointer handling: yes, the write pointer on the zones is
>>> not really useful for upper-level usage.
>>> Where we do need it is to detect I/O which is crossing the write pointer
>>> (eg when doing reads over the entire zone).
>>> As per spec you will be getting an I/O error here, so we need to split
>>> the I/O on the write pointer to get valid results back.
>>
>> To be precise here, the I/O splitting will be handled by the block layer
>> thanks to the "chunk_sectors" setting. But that relies on a constant zone
>> size assumption too.
>>
>> The RB-tree here is most useful for reads over or after the write pointer as
>> this can have different behavior on different drives (URSWRZ bit). The RB-tree
>> allows us to hide these differences to upper layers and simplify support at
>> those levels.
> 
> It is unfortunate that path was chosen rather than returning Made Up Data.
But how would you return made up data without knowing that you _have_ to
generate some?
Of course it's trivial to generate made up data once an I/O error
occurred, but that's too late as the I/O error already happened.

The general idea behind this is that I _really_ do want to avoid
triggering an I/O error on initial access. This kind of thing really
tends to freak out users (connecting a new drive and getting I/O errors
is not the best user experience ...)

> However I don't think this is a file system or device mapper problem as
> neither of those layers attempt to read blocks they have not written
> (or discarded).
> All I can think of is something that probes the drive media for a partition table
> or other identification...isn't the conventional space sufficient to cover
> those cases?
Upon device scan the kernel will attempt to read the partition tables,
which consists of reading the first few megabytes and the last sector
(for GPT shadow partition tables).
And depending on the drive manufacturer you might or might not have
enough conventional space at the right locations.

> Anyway you could just handle the error and sense codes [CHECK CONDITION /
> ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD
> when URSWRZ is 0. It would have the same effect as using the zone cache
> without the possibility of the zone cache being out-of-sync.
> 
Yes, but then we would already have registered an I/O error.
And as indicated above, I really do _not_ want to handle all the
customer calls for supposedly faulty new SMR drives.

>> I too agree that the write pointer value is not very useful to upper layers
>> (e.g. FS). What matters most of the time for these layers is an "allocation
>> pointer" or "submission pointer" which indicates the LBA to use for the next
>> BIO submission. That LBA will most of the time be in advance of the zone WP
>> because of request queuing in the block scheduler.
> 
> Which is kind of my point.
> 
>> Note that from what we have done already, in many cases, upper layers do not
>> even need this (e.g. F2FS, btrfs) and work fine without needing *any*
>> zone->private_data because that "allocation pointer" information generally
>> already exists in a different form (e.g. a bit in a bitmap).
> 
> Agreed. Log structured and copy on write file systems are a natural fit for
> these types of drives
> 
>> For these cases, not having the RB-tree of zones would force a lot
>> more code to be added to file systems for SMR support.
> 
> I don't understand how the zone cache (RB-tree) is "helping" F2FS or btrfs.
> Certainly any of these could trigger the zone cache to be loaded at mkfs
> and/or mount time?
> 
>>>> This all tails into domain responsibility. With the RB-Tree doing half of the
>>>> work and the ‘responsible’ domain handling the active path via private_data
>>>> why have the split at all? It seems to be a double work to have second object
>>>> tracking the first so that I/O scheduling can function.
>>>>
>>> Tracking the zone states via RB trees is mainly to handle host-managed
>>> drives seamlessly. Without an RB tree we will be subjected to I/O errors
>>> during boot, as eg with new drives we inevitably will try to read beyond
>>> the write pointer, which in turn will generate I/O errors on certain drives.
> 
> If the zone cache is needed to boot then clearly my objection to reading
> in the zone report information on drive attach is unfounded. I was under
> the impression that this was not the case.
> 
>>> I do agree that there is no strict need to setup an RB tree for
>>> host-aware drives; I can easily add an attribute/flag to disable RB
>>> trees for those.
>>> However, tracking zone information in the RB tree _dramatically_ speeds
>>> up device initialisation; issuing a blkdev_discard() for the entire
>>> drive will take _ages_ without it.
>>
>> And the RB-tree will also be very useful for speeding up report zones
>> ioctls too. Unless we want those to force an update of the RB-tree information ?
> 
> That is debatable. I would tend to argue for both. *If* there must be
> a zone cache then reading from the zone cache is a must. Forcing an
> update of the zone cache from the media seems like a reasonable thing
> to be able to do.
> 
> Also the zone report is 'slow' in that there is an overhead for the
> report itself but
> the number of zones per query can be quite large so 4 or 5 I/Os that
> run into the
> hundreds of milliseconds to cache the entire drive isn't really unworkable for
> something that is used infrequently.
> 
No, surely not.
But one of the _big_ advantages for the RB tree is blkdev_discard().
Without the RB tree any mkfs program will issue a 'discard' for every
sector. We will be able to coalesce those into one discard per zone, but
we still need to issue one for _every_ zone.
Which is (as indicated) really slow, and easily takes several minutes.
With the RB tree we can short-circuit discards to empty zones, and speed
up processing time dramatically.
Sure, we could move the logic into mkfs and friends, but that would
require us to change the programs and agree on a library (libzbc?) which
should be handling that.

>>>> Finally, there is the error handling path: when the RB-Tree encounters an error
>>>> it attempts to requery the drive topology, virtually guaranteeing that the
>>>> private_data is now out-of-sync with the RB-Tree. Again this is something
>>>> that can be better encapsulated in the bi_end_io to be informed of the
>>>> failed I/O and schedule the appropriate recovery (including re-querying the
>>>> zone information of the affected zone(s)).
>>>>
>>> Well, if we have an RB tree with write pointers of course we need to
>>> re-read the zone information on failure (and I thought I did that?).
>>> Plus the error information will be propagated via the usual mechanism,
>>> so they need to take care of updating any information in ->private_data.
>>>
>>> I'm perfectly fine with integrating your patches for the various
>>> blkdev_XX callouts and associated ioctls; Jens favours that approach,
>>> too, so we should be combining those efforts.
>>
>> Can we agree on an interface ?
>> Summarizing all the discussions we had, I am all in favor of the following:
>>
>> 1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
>> The already existing device type file can tell us if this is an host
>> managed disk (type==20) or a host aware disk (type==0). Drive managed
>> disks could also be detected, but since no information on their zone
>> configuration can be obtained, lets treat those as regular drives.
> 
> Since disk type == 0 for everything that isn't HM so I would prefer the
> sysfs 'zoned' file just report if the drive is HA or HM.
> 
Okay. So let's put in the 'zoned' attribute the device type:
'host-managed', 'host-aware', or 'device managed'.

>> 2) Add ioctls for zone management:
>> Report zones (get information from RB tree), reset zone (simple wrapper
>> to ioctl for block discard), open zone, close zone and finish zone. That
>> will allow mkfs like tools to get zone information without having to parse
>> thousands of sysfs files (and can also be integrated in libzbc block backend
>> driver for a unified interface with the direct SG_IO path for kernels without
>> the ZBC code enabled).
> 
> I can add finish zone ... but I really can't think of a use for it, myself.
> 
Which is not the point. The standard defines this, so clearly someone
found it a reasonable addendum. So let's add this for completeness.

>> 3) Try to condense the blkzone data structure to save memory:
>> I think that we can at the very least remove the zone length, and also
>> may be the per zone spinlock too (a single spinlock and proper state flags can
>> be used).
> 
> I have a variant that is an array of descriptors that roughly mimics the
> api from blk-zoned.c that I did a few months ago as an example.
> I should be able to update that to the current kernel + patches.
> 
Okay. If we restrict the in-kernel SMR drive handling to devices with
identical zone sizes of course we can remove the zone length.
And we can do away with the per-zone spinlock and use a global one instead.

I will be updating my patchset accordingly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-09  6:47           ` Hannes Reinecke
@ 2016-08-09  8:09             ` Damien Le Moal
  2016-08-10  3:58               ` Shaun Tancheff
  2016-08-10  3:26             ` Shaun Tancheff
  2016-08-14  0:09             ` Shaun Tancheff
  2 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2016-08-09  8:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Shaun Tancheff, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

Hannes,

> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote:
[...]
>>> 
>>> Can we agree on an interface ?
>>> Summarizing all the discussions we had, I am all in favor of the following:
>>> 
>>> 1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
>>> The already existing device type file can tell us if this is an host
>>> managed disk (type==20) or a host aware disk (type==0). Drive managed
>>> disks could also be detected, but since no information on their zone
>>> configuration can be obtained, lets treat those as regular drives.
>> 
>> Since disk type == 0 for everything that isn't HM so I would prefer the
>> sysfs 'zoned' file just report if the drive is HA or HM.
>> 
> Okay. So let's put in the 'zoned' attribute the device type:
> 'host-managed', 'host-aware', or 'device managed'.

I hacked your patches and simply put a "0" or "1" in the sysfs zoned file.
Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
else. This means that drive managed models are not exposed as zoned block
devices. For HM vs HA differentiation, an application can look at the
device type file since it is already present.

We could indeed set the "zoned" file to the device type, but HA drives and
regular drives will both have "0" in it, so no differentiation is possible.
The other choice could be the "zoned" bits defined by ZBC, but these
do not define a value for host managed drives, and the drive managed value
being not "0" could be confusing too. So I settled for a simple 0/1 boolean.

> 
>>> 2) Add ioctls for zone management:
>>> Report zones (get information from RB tree), reset zone (simple wrapper
>>> to ioctl for block discard), open zone, close zone and finish zone. That
>>> will allow mkfs like tools to get zone information without having to parse
>>> thousands of sysfs files (and can also be integrated in libzbc block backend
>>> driver for a unified interface with the direct SG_IO path for kernels without
>>> the ZBC code enabled).
>> 
>> I can add finish zone ... but I really can't think of a use for it, myself.
>> 
> Which is not the point. The standard defines this, so clearly someone
> found it a reasonable addendum. So let's add this for completeness.

Done: I hacked Shaun's ioctl code and added finish zone too. The
difference with Shaun's initial code is that the ioctls are propagated down to
the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for a
BIO request definition for the zone operations. So a lot less code is added.
The ioctls do not mimic exactly the ZBC standard. For instance, there is no
reporting options field for report zones, nor is the "all" bit supported for open,
close or finish zone commands. But the information provided on zones is complete
and maps to the standard definitions.

I also added a reset_wp ioctl for completeness, but this one simply calls
blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD
ioctl call with a range exactly aligned on a zone.
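To give an idea of the user side, a report zones call looks roughly like
this (the ioctl number and the report structure below are placeholders for
illustration, not the actual definitions from the patches):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Placeholder layout and ioctl number, for illustration only. */
struct zone_desc_sketch {
        uint64_t start;         /* zone start sector */
        uint64_t len;           /* zone length in sectors */
        uint64_t wp;            /* write pointer */
        uint8_t  type;          /* conventional / sequential */
        uint8_t  cond;          /* zone condition */
};

struct zone_report_sketch {
        uint64_t sector;        /* IN: report starting from this sector */
        uint32_t nr_zones;      /* IN: zones[] entries, OUT: entries filled */
        struct zone_desc_sketch zones[];
};

#define BLKREPORTZONE_SKETCH _IOWR(0x12, 130, struct zone_report_sketch)

int main(int argc, char **argv)
{
        struct zone_report_sketch *rep;
        unsigned int i, n = 256;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;

        rep = calloc(1, sizeof(*rep) + n * sizeof(rep->zones[0]));
        if (!rep)
                return 1;
        rep->sector = 0;
        rep->nr_zones = n;

        if (ioctl(fd, BLKREPORTZONE_SKETCH, rep) == 0)
                for (i = 0; i < rep->nr_zones; i++)
                        printf("zone @%llu len %llu wp %llu\n",
                               (unsigned long long)rep->zones[i].start,
                               (unsigned long long)rep->zones[i].len,
                               (unsigned long long)rep->zones[i].wp);

        free(rep);
        close(fd);
        return 0;
}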

> 
>>> 3) Try to condense the blkzone data structure to save memory:
>>> I think that we can at the very least remove the zone length, and also
>>> may be the per zone spinlock too (a single spinlock and proper state flags can
>>> be used).
>> 
>> I have a variant that is an array of descriptors that roughly mimics the
>> api from blk-zoned.c that I did a few months ago as an example.
>> I should be able to update that to the current kernel + patches.
>> 
> Okay. If we restrict the in-kernel SMR drive handling to devices with
> identical zone sizes of course we can remove the zone length.
> And we can do away with the per-zone spinlock and use a global one instead.

Did that too. The blk_zone struct is now exactly 64B. I removed the per zone
spinlock and replaced it with a flag so that zones can still be locked
independently using wait_on_bit_lock & friends (I added the functions
blk_zone_lock, blk_zone_trylock and blk_zone_unlock to do that). This per zone
locking comes in handy to implement the ioctl code without stalling the command
queue for reads, writes and discards on different zones.

I however kept the zone length in the structure. The reason for doing so is that
this allows supporting drives with non-constant zone sizes, albeit with a more
limited interface since in such case the device chunk_sectors is not set (that
is how a user or application can detect that the zones are not constant size).
For these drives, the block layer may happily merge BIOs across zone boundaries
and the discard code will not split and align calls on the zones. But upper layers
(an FS or a device mapper) can still do all this by themselves if they want/can
support non-constant zone sizes.

The only exception is drives like the Seagate one with only the last zone of a
different size. This case is handled exactly as if all zones are the same size
simply because any operation on the last smaller zone will naturally align as
the checks of operation size against the drive capacity will do the right things.

The ioctls work for all cases (drive with constant zone size or not). This is again
to allow supporting eventual weird drives at application level. I integrated all
these ioctl into libzbc block device backend driver and everything is fine. Can't
tell the difference with direct-to-drive SG_IO accesses. But unlike these, the zone
ioctls keep the zone information RB-tree cache up to date.

> 
> I will be updating my patchset accordingly.

I need to clean up my code and rebase on top of 4.8-rc1. Let me do this and I will send
everything for review. If you have any comment on the above, please let me know and
I will be happy to incorporate changes.

Best regards.


------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-09  6:47           ` Hannes Reinecke
  2016-08-09  8:09             ` Damien Le Moal
@ 2016-08-10  3:26             ` Shaun Tancheff
  2016-08-14  0:09             ` Shaun Tancheff
  2 siblings, 0 replies; 17+ messages in thread
From: Shaun Tancheff @ 2016-08-10  3:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Damien Le Moal, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke <hare@suse.de> wrote:
> On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
>> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>>>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
>>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:

[trim]
>> Also the zone report is 'slow' in that there is an overhead for the
>> report itself but
>> the number of zones per query can be quite large so 4 or 5 I/Os that
>> run into the
>> hundreds if milliseconds to cache the entire drive isn't really unworkable for
>> something that is used infrequently.
>>
> No, surely not.
> But one of the _big_ advantages for the RB tree is blkdev_discard().
> Without the RB tree any mkfs program will issue a 'discard' for every
> sector. We will be able to coalesce those into one discard per zone, but
> we still need to issue one for _every_ zone.
> Which is (as indicated) really slow, and easily takes several minutes.
> With the RB tree we can short-circuit discards to empty zones, and speed
> up processing time dramatically.
> Sure we could be moving the logic into mkfs and friends, but that would
> require us to change the programs and agree on a library (libzbc?) which
> should be handling that.

Adding an additional library dependency seems overkill for a program
that is already doing ioctls and raw block I/O ... but I would leave that
up to each file system. As it sits, issuing the ioctl and walking the array
of data returned [see blkreport.c] is already quite trivial.

I believe the goal here is for F2FS, and perhaps NILFS?, to "just
work" with the DISCARD-to-Reset-WP mapping and zone cache in place.

I am still quite skeptical about other common file systems
"just working" without their respective mkfs et al. being
zone aware and handling the topology of the media at mkfs time.
Perhaps there is something I am unaware of?

[trim]

>> I can add finish zone ... but I really can't think of a use for it, myself.
>>
> Which is not the point. The standard defines this, so clearly someone
> found it a reasonable addendum. So let's add this for completeness.

Agreed and queued for the next version.

Regards,
Shaun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-09  8:09             ` Damien Le Moal
@ 2016-08-10  3:58               ` Shaun Tancheff
  2016-08-10  4:38                 ` Damien Le Moal
  0 siblings, 1 reply; 17+ messages in thread
From: Shaun Tancheff @ 2016-08-10  3:58 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote:

[trim]

>>> Since disk type == 0 for everything that isn't HM so I would prefer the
>>> sysfs 'zoned' file just report if the drive is HA or HM.
>>>
>> Okay. So let's put in the 'zoned' attribute the device type:
>> 'host-managed', 'host-aware', or 'device managed'.
>
> I hacked your patches and simply put a "0" or "1" in the sysfs zoned file.
> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
> else. This means that drive managed models are not exposed as zoned block
> devices. For HM vs HA differentiation, an application can look at the
> device type file since it is already present.
>
> We could indeed set the "zoned" file to the device type, but HM drives and
> regular drives will both have "0" in it, so no differentiation possible.
> The other choice could be the "zoned" bits defined by ZBC, but these
> do not define a value for host managed drives, and the drive managed value
> being not "0" could be confusing too. So I settled for a simple 0/1 boolean.

This seems good to me.

>>>> 2) Add ioctls for zone management:
>>>> Report zones (get information from RB tree), reset zone (simple wrapper
>>>> to ioctl for block discard), open zone, close zone and finish zone. That
>>>> will allow mkfs like tools to get zone information without having to parse
>>>> thousands of sysfs files (and can also be integrated in libzbc block backend
>>>> driver for a unified interface with the direct SG_IO path for kernels without
>>>> the ZBC code enabled).
>>>
>>> I can add finish zone ... but I really can't think of a use for it, myself.
>>>
>> Which is not the point. The standard defines this, so clearly someone
>> found it a reasonable addendum. So let's add this for completeness.

Agreed.

> Done: I hacked Shaun ioctl code and added finish zone too. The
> difference with Shaun initial code is that the ioctl are propagated down to
> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for
> BIO request definition for the zone operations. So a lot less code added.

The purpose of the BIO flags is not to enable the ioctls so much as
the other way round. Creating BIO op's is to enable issuing ZBC
commands from device mapper targets and file systems without some
heinous ioctl hacks.
Making the resulting block layer interfaces available via ioctls is just a
reasonable way to exercise the code ... or that was my intent.
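For example, with the BIO op in place, resetting a zone from a DM target is
just a normal BIO submission, roughly like this (sketch only, error handling
trimmed; it uses the REQ_OP_ZONE_RESET op from my series with the stock 4.8
bio helpers, so treat the details as illustrative):

static int example_reset_zone(struct block_device *bdev, sector_t zone_start)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 0);
        int ret;

        if (!bio)
                return -ENOMEM;

        /* Zero-payload bio aimed at the zone start, carrying the reset op. */
        bio->bi_bdev = bdev;
        bio->bi_iter.bi_sector = zone_start;
        bio_set_op_attrs(bio, REQ_OP_ZONE_RESET, 0);

        ret = submit_bio_wait(bio);
        bio_put(bio);
        return ret;
}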

> The ioctls do not mimic exactly the ZBC standard. For instance, there is no
> reporting options for report zones, nor is the "all" bit supported for open,
> close or finish zone commands. But the information provided on zones is complete
> and maps to the standard definitions.

For the reporting options I have planned to reuse the stream_id in
struct bio when that is formalized. There are certainly other places in
struct bio to stuff a few extra bits ...

As far as the all bit ... this is being handled by all the zone action
commands. Just pass a sector of ~0ul and it's handled in sd.c by
sd_setup_zone_action_cmnd().

Apologies as apparently my documentation here is lacking :-(

> I also added a reset_wp ioctl for completeness, but this one simply calls
> blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD
> ioctl call with a range exactly aligned on a zone.

I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that
creates a REQ_OP_ZONE_RESET .. same as open and close. My
expectation being that BLKDISCARD doesn't really need yet another alias.

[trim]

> Did that too. The blk_zone struct is now exactly 64B. I removed the per zone

Thanks .. being a cache line is harder to whinge about...

> spinlock and replaced it with a flag so that zones can still be locked
> independently using wait_on_bit_lock & friends (I added the functions
> blk_zone_lock, blk_zone_trylock and blk_zone_unlock to do that). This per zone
> locking comes in handy to implement the ioctls code without stalling the command
> queue for read, writes and discard on different zones.
>
> I however kept the zone length in the structure. The reason for doing so is that
> this allows supporting drives with non-constant zone sizes, albeit with a more
> limited interface since in such case the device chunk_sectors is not set (that
> is how a user or application can detect that the zones are not constant size).
> For these drives, the block layer may happily merge BIOs across zone boundaries
> and the discard code will not split and align calls on the zones. But upper layers
> (an FS or a device mapper) can still do all this by themselves if they want/can
> support non-constant zone sizes.
>
> The only exception is drives like the Seagate one with only the last zone of a
> different size. This case is handled exactly as if all zones are the same size
> simply because any operation on the last smaller zone will naturally align as
> the checks of operation size against the drive capacity will do the right things.
>
> The ioctls work for all cases (drive with constant zone size or not). This is again
> to allow supporting eventual weird drives at application level. I integrated all
> these ioctl into libzbc block device backend driver and everything is fine. Can't
> tell the difference with direct-to-drive SG_IO accesses. But unlike these, the zone
> ioctls keep the zone information RB-tree cache up to date.
>
>>
>> I will be updating my patchset accordingly.
>
> I need to cleanup my code and rebase on top of 4.8-rc1. Let me do this and I will send
> everything for review. If you have any comment on the above, please let me know and
> I will be happy to incorporate changes.
>
> Best regards.
>
>
> ------------------------
> Damien Le Moal, Ph.D.
> Sr. Manager, System Software Group, HGST Research,
> HGST, a Western Digital brand
> Damien.LeMoal@hgst.com
> (+81) 0466-98-3593 (ext. 513593)
> 1 kirihara-cho, Fujisawa,
> Kanagawa, 252-0888 Japan
> www.hgst.com
>



-- 
Shaun Tancheff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-10  3:58               ` Shaun Tancheff
@ 2016-08-10  4:38                 ` Damien Le Moal
  2016-08-16  6:07                   ` Shaun Tancheff
  0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2016-08-10  4:38 UTC (permalink / raw)
  To: Shaun Tancheff
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

Shaun,

On 8/10/16 12:58, Shaun Tancheff wrote:
> On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>>> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote:
>
> [trim]
>
>>>> Since disk type == 0 for everything that isn't HM so I would prefer the
>>>> sysfs 'zoned' file just report if the drive is HA or HM.
>>>>
>>> Okay. So let's put in the 'zoned' attribute the device type:
>>> 'host-managed', 'host-aware', or 'device managed'.
>>
>> I hacked your patches and simply put a "0" or "1" in the sysfs zoned file.
>> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
>> else. This means that drive managed models are not exposed as zoned block
>> devices. For HM vs HA differentiation, an application can look at the
>> device type file since it is already present.
>>
>> We could indeed set the "zoned" file to the device type, but HM drives and
>> regular drives will both have "0" in it, so no differentiation possible.
>> The other choice could be the "zoned" bits defined by ZBC, but these
>> do not define a value for host managed drives, and the drive managed value
>> being not "0" could be confusing too. So I settled for a simple 0/1 boolean.
>
> This seems good to me.

Another option I forgot is for the "zoned" file to indicate the total 
number of zones of the device, and 0 for a non zoned regular block 
device. That would work as well.

[...]
>> Done: I hacked Shaun ioctl code and added finish zone too. The
>> difference with Shaun initial code is that the ioctl are propagated down to
>> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for
>> BIO request definition for the zone operations. So a lot less code added.
>
> The purpose of the BIO flags is not to enable the ioctls so much as
> the other way round. Creating BIO op's is to enable issuing ZBC
> commands from device mapper targets and file systems without some
> heinous ioctl hacks.
> Making the resulting block layer interfaces available via ioctls is just a
> reasonable way to exercise the code ... or that was my intent.

Yes, I understood your code. However, since (or if) we keep the zone 
information in the RB-tree cache, there is no need for the report zone 
operation BIO interface. Same for reset write pointer by keeping the 
mapping to discard. blk_lookup_zone can be used in kernel as a report 
zone BIO replacement and works as well for the report zone ioctl 
implementation. For reset, there is blkdev_issue_discard in kernel, and
the reset zone ioctl becomes equivalent to the BLKDISCARD ioctl. These are
simple. Open, close and finish zone remain. For these, adding the BIO
interface seemed overkill. Hence my choice of propagating the ioctl
to the driver.
This is debatable of course, and adding an in-kernel interface is not 
hard: we can implement blk_open_zone, blk_close_zone and blk_finish_zone 
using __blkdev_driver_ioctl. That looks clean to me.
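Something like this is what I have in mind (the ioctl command name is a
placeholder and the argument marshalling is simplified for the sketch):

static inline int blk_open_zone(struct block_device *bdev, sector_t sector)
{
        /* BLKOPENZONE_SKETCH stands in for the real ioctl command number. */
        return __blkdev_driver_ioctl(bdev, FMODE_READ | FMODE_WRITE,
                                     BLKOPENZONE_SKETCH,
                                     (unsigned long)sector);
}

blk_close_zone() and blk_finish_zone() would then be the same one-liner with
a different command number.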

Overall, my concern with the BIO based interface for the ZBC commands is 
that it adds one flag for each command, which is not really the 
philosophy of the interface and potentially opens the door for more such 
implementations in the future with new standards and new commands coming 
up. Clearly that is not a sustainable path. So I think that a more 
specific interface for these zone operations is a better choice. That is 
consistent with what happens with the tons of ATA and SCSI commands not 
actually doing data I/Os (mode sense, log pages, SMART, etc). All these 
do not use BIOs and are processed as request REQ_TYPE_BLOCK_PC.

>> The ioctls do not mimic exactly the ZBC standard. For instance, there is no
>> reporting options for report zones, nor is the "all" bit supported for open,
>> close or finish zone commands. But the information provided on zones is complete
>> and maps to the standard definitions.
>
> For the reporting options I have planned to reuse the stream_id in
> struct bio when that is formalized. There are certainly other places in
> struct bio to stuff a few extra bits ...

We could add reporting options to blk_lookup_zones to filter the result 
and use that in the ioctl implementation as well. This can be added 
without any problem.

> As far as the all bit ... this is being handled by all the zone action
> commands. Just pass a sector of ~0ul and it's handled in sd.c by
> sd_setup_zone_action_cmnd().
>
> Apologies as apparently my documentation here is lacking :-(

Yes, I got it (libzbc does the same actually). I did not add it for 
simplicity. But indeed maybe it should be. The same trick can be used 
with the ioctl-to-driver interface. No problem.

>> I also added a reset_wp ioctl for completeness, but this one simply calls
>> blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD
>> ioctl call with a range exactly aligned on a zone.
>
> I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that
> creates a REQ_OP_ZONE_RESET .. same as open and close. My
> expectation being that BLKDISCARD doesn't really need yet another alias.

Yes, we could remove the BLKRESETZONE ioctl and have applications use 
the BLKDISCARD ioctl instead. In the kernel, both generate 
blkdev_issue_discard calls and are thus identical. Only the ioctl 
interface differs (sector vs range argument). Again, I added it to have 
the full set of 5 ZBC/ZAC operations mapping to one BLKxxxZONE ioctl. 
But granted, reset is optional.

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.hgst.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-09  6:47           ` Hannes Reinecke
  2016-08-09  8:09             ` Damien Le Moal
  2016-08-10  3:26             ` Shaun Tancheff
@ 2016-08-14  0:09             ` Shaun Tancheff
  2016-08-16  4:00               ` Damien Le Moal
  2 siblings, 1 reply; 17+ messages in thread
From: Shaun Tancheff @ 2016-08-14  0:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Damien Le Moal, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke <hare@suse.de> wrote:
> On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
>> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>>> Hannes, Shaun,
>>>
>>> Let me add some more comments.
>>>
>>>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>>>>>
>>>>>> Can you please integrate this with Hannes series so that it uses
>>>>>> his cache of the zone information?
>>>>>
>>>>> Adding Hannes and Damien to Cc.
>>>>>
>>>>> Christoph,
>>>>>
>>>>> I can make a patch to marshal Hannes' RB-Tree into a block report, that is
>>>>> quite simple. I can even have the open/close/reset zone commands update the
>>>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>>>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>>>>> not need the RB-Tree to function with zoned media.
>>
>> I have posted patches to integrate with the zone cache, hopefully they
>> make sense.
>>
> [ .. ]
>>>> I have thought about condensing the RB tree information, but then I
>>>> figured that for 'real' SMR handling we cannot assume all zones are of
>>>> fixed size, and hence we need all the information there.
>>>> Any condensing method would assume a given structure of the zones, which
>>>> the standard just doesn't provide.
>>>> Or am I missing something here?

Of course you can condense the zone cache without losing any
information. Here is the layout I used ... I haven't updated the patch
to the latest posted patches but this is the basic idea.

[It was originally done as a follow-on to making your zone cache work
 with Seagate's HA drive. I did not include the wp-in-arrays patch
 along with the HA drive support that I sent you in May as you were
 quite terse about RB trees when I tried to discuss this approach with
 you at Vault]

struct blk_zone {
        u64 type:4;             /* zone type */
        u64 state:5;            /* zone condition */
        u64 extra:7;
        u64 wp:40;              /* write pointer */
        void *private_data;
};

struct contiguous_wps {
        u64 start_lba;
        u64 last_lba; /* or # of blocks */
        u64 zone_size; /* size in blocks */
        unsigned is_zoned:1;
        u32 zone_count;
        spinlock_t lock;
        struct blk_zone zones[0];
};

struct zone_wps {
        u32 wps_count;
        struct contiguous_wps **wps;
};

Then in struct request_queue
-    struct rb_root zones;
+    struct zone_wps *zones;

For each contiguous chunk of zones you need a descriptor. In the current
drives you need 1 or 2 descriptors.

Here a conventional drive is encapsulated as zoned media with one
drive sized conventional zone.
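For a plain conventional drive that means a single descriptor, set up
roughly like this (sketch, error handling omitted):

static struct zone_wps *alloc_conventional_wps(sector_t capacity)
{
        struct zone_wps *zi = kzalloc(sizeof(*zi), GFP_KERNEL);
        struct contiguous_wps *wp;

        /* One descriptor covering the whole drive as a conventional zone. */
        wp = kzalloc(sizeof(*wp) + sizeof(struct blk_zone), GFP_KERNEL);
        wp->start_lba = 0;
        wp->last_lba = capacity;
        wp->zone_size = capacity;
        wp->zone_count = 1;
        wp->is_zoned = 0;
        spin_lock_init(&wp->lock);
        wp->zones[0].type = 0x1;        /* ZBC zone type 1h: conventional */

        zi->wps = kzalloc(sizeof(*zi->wps), GFP_KERNEL);
        zi->wps_count = 1;
        zi->wps[0] = wp;
        return zi;
}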

I have not spent time building an ad-hoc LVM comprised of zoned and
conventional media so it's not all ironed out yet.
I think you can see the advantage of being able to put conventional space
anywhere you would like to work around zoned media not being laid out
in the best manner for your setup.

Yes, things start to break down if every other zone is a different size ...

The point being that even with per-zone entries on the order of 48 bytes
in size, this saves a lot of space with no loss of information.
I still kind of prefer pushing blk_zone down to a u32 by reducing
the max zone size and dropping the private_data ... but that may
be going a bit too far.

blk_lookup_zone then has an [unfortunate] signature change:


/**
 * blk_lookup_zone() - Lookup zones
 * @q: Request Queue
 * @sector: Location to lookup
 * @start: Starting location zone (OUT: Required)
 * @len: Length of zone (OUT: Required)
 * @lock: Spinlock of zones (OUT: Required)
 */
struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector,
                                 sector_t *start, sector_t *len,
                                 spinlock_t **lock)
{
        int iter;
        struct blk_zone *bzone = NULL;
        struct zone_wps *zi = q->zones;

        *start = 0;
        *len = 0;
        *lock = NULL;

        if (!q->zones)
                goto out;

        for (iter = 0; iter < zi->wps_count; iter++) {
                if (sector >= zi->wps[iter]->start_lba &&
                    sector <  zi->wps[iter]->last_lba) {
                        struct contiguous_wps *wp = zi->wps[iter];
                        u64 index = (sector - wp->start_lba) / wp->zone_size;

                        BUG_ON(index >= wp->zone_count);

                        bzone = &wp->zones[index];
                        *len = wp->zone_size;
                        *start = wp->start_lba + (index * wp->zone_size);
                        *lock = &wp->lock;
                }
        }

out:
        return bzone;
}
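A caller then looks something like this (sketch; it assumes wp is kept
relative to the zone start, adjust if it is absolute):

static bool sector_is_past_wp(struct request_queue *q, sector_t sector)
{
        sector_t start, len;
        spinlock_t *lock;
        struct blk_zone *zone;
        bool past_wp = false;

        zone = blk_lookup_zone(q, sector, &start, &len, &lock);
        if (zone) {
                spin_lock(lock);
                /* Assumes wp is kept relative to the zone start. */
                past_wp = (sector - start) >= zone->wp;
                spin_unlock(lock);
        }
        return past_wp;
}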

>>>
>>> Indeed, the standards do not mandate any particular zone configuration,
>>> constant zone size, etc. So writing code so that this can be handled is certainly
>>> the right way of doing things. However, if we decide to go forward with
>>> mapping RESET WRITE POINTER command to DISCARD, then at least a constant
>>> zone size (minus the last zone as you said) must be assumed, and that
>>> information can be removed from the entries in the RB tree (as it will be
>>> saved for the sysfs "zone_size" file anyway). Adding a little code to handle
>>> that eventual last runt zone with a different size is not a big problem.
>>
>>>> As for write pointer handling: yes, the write pointer on the zones is
>>>> not really useful for upper-level usage.
>>>> Where we do need it is to detect I/O which is crossing the write pointer
>>>> (eg when doing reads over the entire zone).
>>>> As per spec you will be getting an I/O error here, so we need to split
>>>> the I/O on the write pointer to get valid results back.
>>>
>>> To be precise here, the I/O splitting will be handled by the block layer
>>> thanks to the "chunk_sectors" setting. But that relies on a constant zone
>>> size assumption too.
>>>
>>> The RB-tree here is most useful for reads over or after the write pointer as
>>> this can have different behavior on different drives (URSWRZ bit). The RB-tree
>>> allows us to hide these differences to upper layers and simplify support at
>>> those levels.
>>
>> It is unfortunate that path was chosen rather than returning Made Up Data.
> But how would you return made up data without knowing that you _have_ to
> generate some?
> Of course it's trivial to generate made up data once an I/O error
> occurred, but that's too late as the I/O error already happened.
>
> The general idea behind this is that I _really_ do want to avoid
> triggering an I/O error on initial access. This kind of thing really
> tends to freak out users (connect a new drive and getting I/O errors;
> not the best user experience ...)
>
>> However I don't think this is a file system or device mapper problem as
>> neither of those layers attempt to read blocks they have not written
>> (or discarded).
>> All I can think of is something that probes the drive media for a partition table
>> or other identification...isn't the conventional space sufficient to cover
>> those cases?
> Upon device scan the kernel will attempt to read the partition tables,
> which consists of reading the first few megabytes and the last sector
> (for GPT shadow partition tables).
> And depending on the driver manufacturer you might or might not have
> enough conventional space at the right locations.
>
>> Anyway you could just handle the error and sense codes [CHECK CONDITION /
>> ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD
>> when URSWRZ is 0. It would have the same effect as using the zone cache
>> without the possibility of the zone cache being out-of-sync.
>>
> Yes, but then we would already have registered an I/O error.
> And as indicated above, I really do _not_ want to handle all the
> customer calls for supposed faulty new SMR drives.

Fair enough. I just really want to be sure that all this memory locked
to the zone cache is really being made useful.

>>> I too agree that the write pointer value is not very useful to upper layers
>>> (e.g. FS). What matters most of the time for these layers is an "allocation
>>> pointer" or "submission pointer" which indicates the LBA to use for the next
>>> BIO submission. That LBA will most of the time be in advance of the zone WP
>>> because of request queuing in the block scheduler.
>>
>> Which is kind of my point.
>>
>>> Note that from what we have done already, in many cases, upper layers do not
>>> even need this (e.g. F2FS, btrfs) and work fine without needing *any*
>>> zone->private_data because that "allocation pointer" information generally
>>> already exists in a different form (e.g. a bit in a bitmap).
>>
>> Agreed. Log structured and copy on write file systems are a natural fit for
>> these types of drives
>>
>>> For these cases, not having the RB-tree of zones would force a lot
>>> more code to be added to file systems for SMR support.
>>
>> I don't understand how the zone cache (RB-tree) is "helping" F2FS or btrfs.
>> Certainly any of these could trigger the zone cache to be loaded at mkfs
>> and/or mount time?
>>
>>>>> This all tails into domain responsibility. With the RB-Tree doing half of the
>>>>> work and the ‘responsible’ domain handling the active path via private_data
>>>>> why have the split at all? It seems to be a double work to have second object
>>>>> tracking the first so that I/O scheduling can function.
>>>>>
>>>> Tracking the zone states via RB trees is mainly to handle host-managed
>>>> drives seamlessly. Without an RB tree we will be subjected to I/O errors
>>>> during boot, as eg with new drives we inevitably will try to read beyond
>>>> the write pointer, which in turn will generate I/O errors on certain drives.
>>
>> If the zone cache is needed to boot then clearly my objection to reading
>> in the zone report information on drive attach is unfounded. I was under
>> the impression that this was not the case.
>>
>>>> I do agree that there is no strict need to setup an RB tree for
>>>> host-aware drives; I can easily add an attribute/flag to disable RB
>>>> trees for those.
>>>> However, tracking zone information in the RB tree _dramatically_ speeds
>>>> up device initialisation; issuing a blkdev_discard() for the entire
>>>> drive will take _ages_ without it.
>>>
>>> And the RB-tree will also be very useful for speeding up report zones
>>> ioctls too. Unless we want those to force an update of the RB-tree information ?
>>
>> That is debatable. I would tend to argue for both. *If* there must be
>> a zone cache then reading from the zone cache is a must. Forcing an
>> update of the zone cache from the media seems like a reasonable thing
>> to be able to do.
>>
>> Also the zone report is 'slow' in that there is an overhead for the
>> report itself but
>> the number of zones per query can be quite large so 4 or 5 I/Os that
>> run into the
>> hundreds of milliseconds to cache the entire drive isn't really unworkable for
>> something that is used infrequently.
>>
> No, surely not.
> But one of the _big_ advantages for the RB tree is blkdev_discard().
> Without the RB tree any mkfs program will issue a 'discard' for every
> sector. We will be able to coalesce those into one discard per zone, but
> we still need to issue one for _every_ zone.

How can you make coalescing work transparently in the
sd layer _without_ keeping some sort of a discard cache along
with the zone cache?

Currently the block layer's blkdev_issue_discard() is breaking
large discards into nice granular and aligned chunks but it is
not preventing small discards nor coalescing them.

In the sd layer would there be a way to persist or purge an
overly large discard cache? What about honoring
discard_zeroes_data? Once the discard is completed with
discard_zeroes_data you have to return zeroes whenever
a discarded sector is read. Isn't that a lot more than just
tracking a write pointer? Couldn't a zone have dozens of holes?

> Which is (as indicated) really slow, and easily takes several minutes.
> With the RB tree we can short-circuit discards to empty zones, and speed
> up processing time dramatically.
> Sure we could be moving the logic into mkfs and friends, but that would
> require us to change the programs and agree on a library (libzbc?) which
> should be handling that.

F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
so I'm not sure your argument is valid here.

[..]

>>> 3) Try to condense the blkzone data structure to save memory:
>>> I think that we can at the very least remove the zone length, and also
>>> may be the per zone spinlock too (a single spinlock and proper state flags can
>>> be used).
>>
>> I have a variant that is an array of descriptors that roughly mimics the
>> api from blk-zoned.c that I did a few months ago as an example.
>> I should be able to update that to the current kernel + patches.
>>
> Okay. If we restrict the in-kernel SMR drive handling to devices with
> identical zone sizes of course we can remove the zone length.
> And we can do away with the per-zone spinlock and use a global one instead.

I don't think dropping the zone length is a reasonable thing to do.

What I propose is an array of _descriptors_; it doesn't drop _any_
of the zone information that you are holding in an RB tree, it is
just a condensed format that _mostly_ plugs into your existing
API.

-- 
Shaun Tancheff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-14  0:09             ` Shaun Tancheff
@ 2016-08-16  4:00               ` Damien Le Moal
  2016-08-16  5:49                 ` Shaun Tancheff
  0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2016-08-16  4:00 UTC (permalink / raw)
  To: Shaun Tancheff
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman


Shaun,

> On Aug 14, 2016, at 09:09, Shaun Tancheff <shaun.tancheff@seagate.com> wrote:
[…]
>>> 
>> No, surely not.
>> But one of the _big_ advantages for the RB tree is blkdev_discard().
>> Without the RB tree any mkfs program will issue a 'discard' for every
>> sector. We will be able to coalesce those into one discard per zone, but
>> we still need to issue one for _every_ zone.
> 
> How can you make coalesce work transparently in the
> sd layer _without_ keeping some sort of a discard cache along
> with the zone cache?
> 
> Currently the block layer's blkdev_issue_discard() is breaking
> large discard's into nice granular and aligned chunks but it is
> not preventing small discards nor coalescing them.
> 
> In the sd layer would there be way to persist or purge an
> overly large discard cache? What about honoring
> discard_zeroes_data? Once the discard is completed with
> discard_zeroes_data you have to return zeroes whenever
> a discarded sector is read. Isn't that a log more than just
> tracking a write pointer? Couldn't a zone have dozens of holes?

My understanding of the standards regarding discard is that it is not
mandatory and that it is a hint to the drive. The drive can completely
ignore it if it thinks that is a better choice. I may be wrong on this
though. Need to check again.
For reset write pointer, the mapping to discard requires that the calls
to blkdev_issue_discard be zone aligned for anything to happen. Specify
less than a zone and nothing will be done. This, I think, preserves the
discard semantics.
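So an in-kernel reset of the zone containing a given sector boils down to
something like this (sketch; zone_sectors would come from chunk_sectors on
drives with a constant zone size, and sector_div() would be needed for the
modulo on 32-bit builds):

static int reset_zone_by_discard(struct block_device *bdev, sector_t sector,
                                 sector_t zone_sectors)
{
        /* Align down to the zone start so the discard maps to a WP reset. */
        sector_t zone_start = sector - (sector % zone_sectors);

        return blkdev_issue_discard(bdev, zone_start, zone_sectors,
                                    GFP_KERNEL, 0);
}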

As for the “discard_zeroes_data” thing, I also think that is a drive
feature, not something mandatory. Drives may have it or not, which is consistent
with the ZBC/ZAC standards regarding reading after the write pointer (nothing
says that zeros have to be returned). In any case, discard of CMR zones
will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better
choice.
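
On the sd side that would simply mean not advertising the guarantee, e.g.
something like this (sketch only; the function name is hypothetical,
discard_zeroes_data is the existing queue limit field):

    /* sketch: do not promise zeroed reads after discard on SMR */
    static void sd_zbc_config_discard(struct request_queue *q)
    {
            q->limits.discard_zeroes_data = 0;
    }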

> 
>> Which is (as indicated) really slow, and easily takes several minutes.
>> With the RB tree we can short-circuit discards to empty zones, and speed
>> up processing time dramatically.
>> Sure we could be moving the logic into mkfs and friends, but that would
>> require us to change the programs and agree on a library (libzbc?) which
>> should be handling that.
> 
> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
> so I'm not sure your argument is valid here.

This initial SMR support patch is just that: a first try. Jaegeuk
used SG_IO (in fact copy-paste of parts of libzbc) because the current
ZBC patch-set has no ioctl API for zone information manipulation. We
will fix this in mkfs.f2fs once we agree on an ioctl interface.

> 
> [..]
> 
>>>> 3) Try to condense the blkzone data structure to save memory:
>>>> I think that we can at the very least remove the zone length, and also
>>>> maybe the per zone spinlock too (a single spinlock and proper state flags can
>>>> be used).
>>> 
>>> I have a variant that is an array of descriptors that roughly mimics the
>>> api from blk-zoned.c that I did a few months ago as an example.
>>> I should be able to update that to the current kernel + patches.
>>> 
>> Okay. If we restrict the in-kernel SMR drive handling to devices with
>> identical zone sizes of course we can remove the zone length.
>> And we can do away with the per-zone spinlock and use a global one instead.
> 
> I don't think dropping the zone length is a reasonable thing to do.
> 
> What I propose is an array of _descriptors_; it doesn't drop _any_
> of the zone information that you are holding in an RB tree, it is
> just a condensed format that _mostly_ plugs into your existing
> API.

I do not agree. The Seagate drive already has one zone (the last one)
that is not the same length as the other zones. Sure, since it is the
last one, we can add “if (last zone)” all over the place and make it
work. But that is really ugly. Keeping the length field makes the code
generic and follows the standard, which has no restriction on the
zone sizes. We could do some memory optimisation using different types
of blk_zone structs mapping to the same zone information: drives with a
constant zone size can use a blk_zone type without the length field,
others use a different type that includes the field. Accessor functions
can hide the different types in the zone manipulation code.
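
Something like this, very roughly (illustration only; the queue fields used
below are made up for the example):

    /* drives with a constant zone size: no per-zone length stored */
    struct blk_zone_f {
            sector_t wp;
            unsigned int flags;
    };

    /* drives with variable zone sizes keep the length per zone */
    struct blk_zone_v {
            sector_t wp;
            sector_t len;
            unsigned int flags;
    };

    /* accessor hides which representation the device uses */
    static inline sector_t blk_zone_length(struct request_queue *q,
                                           unsigned int zno)
    {
            if (q->fixed_zone_len)          /* hypothetical queue field */
                    return q->fixed_zone_len;
            return q->zones_v[zno].len;     /* hypothetical queue field */
    }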

Best regards.


-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-16  4:00               ` Damien Le Moal
@ 2016-08-16  5:49                 ` Shaun Tancheff
  0 siblings, 0 replies; 17+ messages in thread
From: Shaun Tancheff @ 2016-08-16  5:49 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Mon, Aug 15, 2016 at 11:00 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote:
>
> Shaun,
>
>> On Aug 14, 2016, at 09:09, Shaun Tancheff <shaun.tancheff@seagate.com> wrote:
> […]
>>>>
>>> No, surely not.
>>> But one of the _big_ advantages for the RB tree is blkdev_discard().
>>> Without the RB tree any mkfs program will issue a 'discard' for every
>>> sector. We will be able to coalesce those into one discard per zone, but
>>> we still need to issue one for _every_ zone.
>>
>> How can you make coalesce work transparently in the
>> sd layer _without_ keeping some sort of a discard cache along
>> with the zone cache?
>>
>> Currently the block layer's blkdev_issue_discard() is breaking
>> large discard's into nice granular and aligned chunks but it is
>> not preventing small discards nor coalescing them.
>>
>> In the sd layer would there be a way to persist or purge an
>> overly large discard cache? What about honoring
>> discard_zeroes_data? Once the discard is completed with
>> discard_zeroes_data you have to return zeroes whenever
>> a discarded sector is read. Isn't that a lot more than just
>> tracking a write pointer? Couldn't a zone have dozens of holes?
>
> My understanding of the standards regarding discard is that it is not
> mandatory and that it is a hint to the drive. The drive can completely
> ignore it if it thinks that is a better choice. I may be wrong on this
> though. Need to check again.

But you are currently setting discard_zeroes_data=1 in your
current patches. I believe that setting discard_zeroes_data=1
effectively promotes discards to being mandatory.

I have a follow on patch to my SCT Write Same series that
handles the CMR zone case in the sd_zbc_setup_discard() handler.
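
The basic shape of that handling is something like the following (a sketch
only; the helper names are hypothetical and assume the zone type is known
in the discard setup path):

    /* sketch: CMR zones have no write pointer to reset, so satisfy
     * discard_zeroes_data with a zero-fill Write Same instead */
    if (zone_is_conventional(zone))
            return setup_write_same_zeros(cmd);

    /* otherwise fall through to building the Reset WP command */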

> For reset write pointer, the mapping to discard requires that the calls
> to blkdev_issue_discard be zone aligned for anything to happen. Specify
> less than a zone and nothing will be done. This, I think, preserves the
> discard semantics.

Oh. If that is the intent then there is just a bug in the handler.
I have pointed out where I believe it to be in my response to
the zone cache patch being posted.

> As for the “discard_zeroes_data” thing, I also think that is a drive
> feature, not a mandatory one. Drives may have it or not, which is consistent
> with the ZBC/ZAC standards regarding reading after the write pointer (nothing
> says that zeros have to be returned). In any case, discard of CMR zones
> will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better
> choice.

However I am still curious about discards being coalesced.

>>> Which is (as indicated) really slow, and easily takes several minutes.
>>> With the RB tree we can short-circuit discards to empty zones, and speed
>>> up processing time dramatically.
>>> Sure we could be moving the logic into mkfs and friends, but that would
>>> require us to change the programs and agree on a library (libzbc?) which
>>> should be handling that.
>>
>> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
>> so I'm not sure your argument is valid here.
>
> This initial SMR support patch is just that: a first try. Jaegeuk
> used SG_IO (in fact copy-paste of parts of libzbc) because the current
> ZBC patch-set has no ioctl API for zone information manipulation. We
> will fix this in mkfs.f2fs once we agree on an ioctl interface.

Which again is my point. If mkfs.f2fs wants to speed up its
discard pass by _not_ sending unnecessary Reset WP commands
for zones that are already empty, it has all the information
it needs to do so.
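
That pass would look roughly like this (pseudo-code sketch; it assumes mkfs
keeps the zone report it already fetched via SG_IO, and the reset helper
name is made up):

    /* sketch: skip zones that are already empty */
    for (i = 0; i < nr_zones; i++) {
            if (zones[i].wp == zones[i].start)
                    continue;       /* nothing written, no Reset WP needed */
            reset_write_pointer(fd, zones[i].start);
    }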

Here it seems to me that the zone cache is _at_best_
doing double work. At worst the zone cache could be
doing the wrong thing _if_ the zone cache got out of sync.
It is certainly possible (however unlikely) that someone was
doing some raw sg activity that is not seen by the sd path.

All I am trying to do is have a discussion about the reasons for
and against having a zone cache. Where it works and where it breaks
should be an entirely technical discussion, but I understand that we
have all spent a lot of time _not_ discussing this for various
non-technical reasons.

So far the only reason I've been able to ascertain is that
Host Managed drives really don't like being stuck with the
URSWRZ and would like to have a software hack to return
MUD rather than ship drives with some weird out-of-the-box
config where the last zone is marked as FINISH'd, thereby
returning MUD on reads as per spec.

I understand that it would be a strange state to see on first
boot and likely people would just do a Reset WP and have
weird boot errors, which would probably just make matters
worse.

I just would rather the workaround be a bit cleaner and/or
use less memory. I would also like a path available that
does not require SD_ZBC or BLK_ZONED for Host Aware
drives to work, hence this set of patches and me begging
for a single bit in struct bio.

>>
>> [..]
>>
>>>>> 3) Try to condense the blkzone data structure to save memory:
>>>>> I think that we can at the very least remove the zone length, and also
>>>>> maybe the per zone spinlock too (a single spinlock and proper state flags can
>>>>> be used).
>>>>
>>>> I have a variant that is an array of descriptors that roughly mimics the
>>>> api from blk-zoned.c that I did a few months ago as an example.
>>>> I should be able to update that to the current kernel + patches.
>>>>
>>> Okay. If we restrict the in-kernel SMR drive handling to devices with
>>> identical zone sizes of course we can remove the zone length.
>>> And we can do away with the per-zone spinlock and use a global one instead.
>>
>> I don't think dropping the zone length is a reasonable thing to do.

REPEAT: I do _not_ think dropping the zone length is a good thing.

>> What I propose is an array of _descriptors_; it doesn't drop _any_
>> of the zone information that you are holding in an RB tree, it is
>> just a condensed format that _mostly_ plugs into your existing
>> API.
>
> I do not agree. The Seagate drive already has one zone (the last one)
> that is not the same length as the other zones. Sure, since it is the
> last one, we can add “if (last zone)” all over the place and make it
> work. But that is really ugly. Keeping the length field makes the code
> generic and follows the standard, which has no restriction on the
> zone sizes. We could do some memory optimisation using different types
> of blk_zone structs mapping to the same zone information: drives with a
> constant zone size can use a blk_zone type without the length field,
> others use a different type that includes the field. Accessor functions
> can hide the different types in the zone manipulation code.

Ah. I just said that dropping the zone length is not a good idea.
Why the antagonistic exposition?

All I am saying is that I can give you the zone cache with 1/7th of
the memory and the same performance with _no_ loss of information,
or features, as compared to the existing zone cache.

All the code is done now. I will post patches once my testing is
done.

I have also reworked all the zone integration so the BIO flags will
pull from and update the zone cache as opposed to the first hack
that only really integrated with some ioctls.

Regards,
Shaun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
  2016-08-10  4:38                 ` Damien Le Moal
@ 2016-08-16  6:07                   ` Shaun Tancheff
  0 siblings, 0 replies; 17+ messages in thread
From: Shaun Tancheff @ 2016-08-16  6:07 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block,
	linux-scsi, LKML, Jens Axboe, Jens Axboe,
	James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
	J . Bruce Fields, Josh Bingaman

On Tue, Aug 9, 2016 at 11:38 PM, Damien Le Moal <damien.lemoal@hgst.com> wrote:
> Shaun,
>
> On 8/10/16 12:58, Shaun Tancheff wrote:
>>
>> On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal <Damien.LeMoal@hgst.com>
>> wrote:
>>>>
>>>> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote:
>>
>>
>> [trim]
>>
>>>>> Since disk type == 0 for everything that isn't HM so I would prefer the
>>>>> sysfs 'zoned' file just report if the drive is HA or HM.
>>>>>
>>>> Okay. So let's put in the 'zoned' attribute the device type:
>>>> 'host-managed', 'host-aware', or 'device managed'.
>>>
>>>
>>> I hacked your patches and simply put a "0" or "1" in the sysfs zoned
>>> file.
>>> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
>>> else. This means that drive managed models are not exposed as zoned block
>>> devices. For HM vs HA differentiation, an application can look at the
>>> device type file since it is already present.
>>>
>>> We could indeed set the "zoned" file to the device type, but HM drives
>>> and
>>> regular drives will both have "0" in it, so no differentiation possible.
>>> The other choice could be the "zoned" bits defined by ZBC, but these
>>> do not define a value for host managed drives, and the drive managed
>>> value
>>> being not "0" could be confusing too. So I settled for a simple 0/1
>>> boolean.
>>
>>
>> This seems good to me.
>
>
> Another option I forgot is for the "zoned" file to indicate the total number
> of zones of the device, and 0 for a non zoned regular block device. That
> would work as well.

Clearly either is sufficient.

> [...]
>>>
>>> Done: I hacked Shaun ioctl code and added finish zone too. The
>>> difference with Shaun initial code is that the ioctl are propagated down
>>> to
>>> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need
>>> for
>>> BIO request definition for the zone operations. So a lot less code added.
>>
>>
>> The purpose of the BIO flags is not to enable the ioctls so much as
>> the other way round. Creating BIO op's is to enable issuing ZBC
>> commands from device mapper targets and file systems without some
>> heinous ioctl hacks.
>> Making the resulting block layer interfaces available via ioctls is just a
>> reasonable way to exercise the code ... or that was my intent.
>
>
> Yes, I understood your code. However, since (or if) we keep the zone
> information in the RB-tree cache, there is no need for the report zone
> operation BIO interface. Same for reset write pointer by keeping the mapping
> to discard. blk_lookup_zone can be used in kernel as a report zone BIO
> replacement and works as well for the report zone ioctl implementation. For
> reset, there is blkdev_issue_discard in kernel, and the reset zone ioctl
> becomes equivalent to BLKDISCARD ioctl. These are simple. Open, close and
> finish zone remains. For these, adding the BIO interface seemed an overkill.
> Hence my choice of propagating the ioctl to the driver.
> This is debatable of course, and adding an in-kernel interface is not hard:
> we can implement blk_open_zone, blk_close_zone and blk_finish_zone using
> __blkdev_driver_ioctl. That looks clean to me.

Uh. I would call that a "heinous" ioctl hack myself. Kernel -> User API
-> Kernel is not really a good design IMO.

> Overall, my concern with the BIO based interface for the ZBC commands is
> that it adds one flag for each command, which is not really the philosophy
> of the interface and potentially opens the door for more such
> implementations in the future with new standards and new commands coming up.
> Clearly that is not a sustainable path. So I think that a more specific
> interface for these zone operations is a better choice. That is consistent
> with what happens with the tons of ATA and SCSI commands not actually doing
> data I/Os (mode sense, log pages, SMART, etc). All these do not use BIOs and
> are processed as request REQ_TYPE_BLOCK_PC.

Part of the reason for following on Mike Christie's bio op/flags cleanup was to
make these ops. The advantage of being added as ops is that there is only
1 extra bit needed (not 4 or 5 bits for flags). The other reason for being
promoted into the block layer as commands is because it seems to me
to make sense that these abstractions could be allowed to be passed through
a DM layer and be handled by a file system.
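
As a concrete example, with the ops a DM target or a file system can issue
a Reset WP like any other bio, roughly like this (error handling omitted;
the op name is from the patch set, bdev and zone_start are whatever the
caller already has, the rest is the usual bio plumbing):

    struct bio *bio = bio_alloc(GFP_KERNEL, 0);

    bio->bi_bdev = bdev;
    bio->bi_iter.bi_sector = zone_start;
    bio_set_op_attrs(bio, REQ_OP_ZONE_RESET, 0);
    submit_bio_wait(bio);
    bio_put(bio);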

>>> The ioctls do not mimic exactly the ZBC standard. For instance, there is
>>> no
>>> reporting options for report zones, nor is the "all" bit supported for
>>> open,
>>> close or finish zone commands. But the information provided on zones is
>>> complete
>>> and maps to the standard definitions.
>>
>>
>> For the reporting options I have planned to reuse the stream_id in
>> struct bio when that is formalized. There are certainly other places in
>> struct bio to stuff a few extra bits ...

Sorry, I was confused here. I was under the impression you were talking
about one of my patches when it seems you were actually talking about
your own hacked version of them.

> We could add reporting options to blk_lookup_zones to filter the result and
> use that in the ioctl implementation as well. This can be added without any
> problem.
>
>> As far as the all bit ... this is being handled by all the zone action
>> commands. Just pass a sector of ~0ul and it's handled in sd.c by
>> sd_setup_zone_action_cmnd().
>>
>> Apologies as apparently my documentation here is lacking :-(
>
>
> Yes, I got it (libzbc does the same actually). I did not add it for
> simplicity. But indeed may be it should be. The same trick can be used with
> the ioctl to driver interface. No problems.
>
>>> I also added a reset_wp ioctl for completeness, but this one simply calls
>>> blkdev_issue_discard internally, so it is in fact equivalent to the
>>> BLKDISCARD
>>> ioctl call with a range exactly aligned on a zone.
>>
>>
>> I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that
>> creates a REQ_OP_ZONE_RESET .. same as open and close. My
>> expectation being that BLKDISCARD doesn't really need yet another alias.

Same as above.

> Yes, we could remove the BLKRESETZONE ioctl and have applications use the
> BLKDISCARD ioctl instead. In the kernel, both generate blkdev_issue_discard
> calls and are thus identical. Only the ioctl interface differs (sector vs
> range argument). Again, I added it to have the full set of 5 ZBC/ZAC
> operations mapping to one BLKxxxZONE ioctl. But granted, reset is optional.

Which is the opposite of what I stated above, which is that I do _not_
think reset is optional. I think that if the user wants discard they should
call discard, and if they want reset wp they *can* call reset wp and have
it behave as expected.
Why would I add a new ioctl that just calls discard? That makes no sense to me.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-08-16  6:07 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff
2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff
2016-07-29 19:02 ` [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer Shaun Tancheff
2016-08-01  9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig
2016-08-01 17:07   ` Shaun Tancheff
2016-08-02 14:35     ` Hannes Reinecke
2016-08-03  1:29       ` Damien Le Moal
2016-08-05 20:35         ` Shaun Tancheff
2016-08-09  6:47           ` Hannes Reinecke
2016-08-09  8:09             ` Damien Le Moal
2016-08-10  3:58               ` Shaun Tancheff
2016-08-10  4:38                 ` Damien Le Moal
2016-08-16  6:07                   ` Shaun Tancheff
2016-08-10  3:26             ` Shaun Tancheff
2016-08-14  0:09             ` Shaun Tancheff
2016-08-16  4:00               ` Damien Le Moal
2016-08-16  5:49                 ` Shaun Tancheff

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).