* [PATCH v6 0/2] Block layer support ZAC/ZBC commands @ 2016-07-29 19:02 Shaun Tancheff 2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-07-29 19:02 UTC (permalink / raw) To: linux-block, linux-scsi, linux-kernel Cc: Shaun Tancheff, Jens Axboe, Jens Axboe, Christoph Hellwig, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman Hi Jens, This series is based on Linus' current tip after the merge of 'for-4.8/core'. As Host Aware drives are becoming available we would like to be able to make use of them. This series is also intended to be suitable for use by Host Managed drives. ZAC/ZBC drives add new commands for discovering and working with Zones. This extends the ZAC/ZBC support up to the block layer, allowing direct control by file systems or device mapper targets. By deferring the zone handling to the authoritative subsystem we also get lower overall memory usage for holding the active zone information, and the party responsible for maintaining the write pointer of each active zone is clarified. By way of example, a DM target may have several writes in progress. The sector (or LBA) of each of those writes depends on the previous write. While the drive's write pointer is updated as writes complete, the DM target must maintain both where the next write should be scheduled from and where the write pointer is, based on the writes completed without errors. Knowing the drive's zone topology enables DM targets and file systems to extend their block allocation schemes and issue write pointer resets (or discards) that are zone aligned. A perhaps non-obvious approach is that a conventional drive will return a zone report with a single large conventional zone descriptor. 
Patches for util-linux can be found here: https://github.com/Seagate/ZDM-Device-Mapper/tree/master/patches/util-linux This patch is available here: https://github.com/stancheff/linux/tree/zbc.bio.v6 git@github.com:stancheff/linux.git zbc.bio.v6 v6: - Fix page alloc to include DMA flag for ioctl. v5: - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent - In blk-lib fix documentation v4: - Rebase on linux-next tag next-20160617. - Change bio flags to bio op's - Dropped ata16 hackery V3: - Rebase on Mike Christie's separate bio operations - Update blkzoned_api.h to include report zones PARTIAL bit. - Use zoned report reserved bit for ata-passthrough flag. V2: - Changed bi_rw to op_flags to clarify separation of bio op from flags. - Fixed memory leak in blkdev_issue_zone_report failing to put_bio(). - Documented opt in blkdev_issue_zone_report. - Moved include/uapi/linux/fs.h changes to patch 3 - Fixed commit message for first patch in series. Shaun Tancheff (2): Add bio/request flags to issue ZBC/ZAC commands Add ioctl to issue ZBC/ZAC commands via block layer MAINTAINERS | 9 ++ block/blk-lib.c | 95 ++++++++++++++++ block/ioctl.c | 110 +++++++++++++++++++ drivers/scsi/sd.c | 118 ++++++++++++++++++++ drivers/scsi/sd.h | 1 + include/linux/bio.h | 7 +- include/linux/blk_types.h | 6 +- include/linux/blkzoned_api.h | 25 +++++ include/uapi/linux/Kbuild | 1 + include/uapi/linux/blkzoned_api.h | 220 ++++++++++++++++++++++++++++++++++++++ include/uapi/linux/fs.h | 1 + 11 files changed, 591 insertions(+), 2 deletions(-) create mode 100644 include/linux/blkzoned_api.h create mode 100644 include/uapi/linux/blkzoned_api.h -- 2.8.1 ^ permalink raw reply [flat|nested] 26+ messages in thread
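The cover letter's point about write-pointer bookkeeping — scheduling the next write from the last *submitted* sector while tracking the drive's pointer from *completed* writes — can be sketched as a small userspace model. The struct and function names below are purely illustrative, not part of the series:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-zone state a DM target might keep: the drive's
 * write pointer advances only on completion, while new writes must be
 * scheduled from the highest sector already submitted. */
struct zone_state {
	uint64_t wp_completed;	/* sectors confirmed written without error */
	uint64_t wp_submitted;	/* next sector to hand out for writing */
};

static uint64_t zone_alloc(struct zone_state *z, uint64_t nr_sects)
{
	/* Schedule from the submitted pointer, not the completed one. */
	uint64_t lba = z->wp_submitted;

	z->wp_submitted += nr_sects;
	return lba;
}

static void zone_complete(struct zone_state *z, uint64_t lba, uint64_t nr_sects)
{
	if (lba == z->wp_completed)	/* in-order completion */
		z->wp_completed += nr_sects;
}
```

While writes are in flight the two pointers diverge, which is exactly why the cover letter argues the submitting subsystem is the natural owner of this state.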
* [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands 2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff @ 2016-07-29 19:02 ` Shaun Tancheff 2016-07-29 19:02 ` [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer Shaun Tancheff 2016-08-01 9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig 2 siblings, 0 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-07-29 19:02 UTC (permalink / raw) To: linux-block, linux-scsi, linux-kernel Cc: Shaun Tancheff, Jens Axboe, Jens Axboe, Christoph Hellwig, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman, Shaun Tancheff Add op flags to access zone information as well as open, close and reset zones: - REQ_OP_ZONE_REPORT - Query zone information (Report zones) - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing - REQ_OP_ZONE_CLOSE - Explicitly close a zone - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone These op flags can be used to create bios to control zoned devices through the block layer. This is useful for filesystems and device mappers that need explicit control of zoned devices such as Host Managed and Host Aware SMR drives. Report zones is a device read that requires a buffer. Open, Close and Reset are device commands that have no associated data transfer. Sending an LBA of ~0 will attempt to operate on all zones. This is typically used with Reset to wipe a drive, as a Reset behaves similarly to TRIM in that all data in the zone(s) is deleted. The Finish zone command is intentionally not implemented as there is no current use case for that operation. Report zones currently defaults to reporting on all zones. It is expected that support for the zone option flag will piggyback on streamid support. The report option flag is useful as it can reduce the number of zones in each report, but it is not critical. 
Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com> --- v5: - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent - In blk-lib fix documentation v4: - Rebase on linux-next tag next-20160617. - Change bio flags to bio op's V3: - Rebase on Mike Christie's separate bio operations - Update blkzoned_api.h to include report zones PARTIAL bit. V2: - Changed bi_rw to op_flags to clarify separation of bio op from flags. - Fixed memory leak in blkdev_issue_zone_report failing to put_bio(). - Documented opt in blkdev_issue_zone_report. - Removed include/uapi/linux/fs.h from this patch. --- MAINTAINERS | 9 ++ block/blk-lib.c | 95 +++++++++++++++++ drivers/scsi/sd.c | 118 +++++++++++++++++++++ drivers/scsi/sd.h | 1 + include/linux/bio.h | 7 +- include/linux/blk_types.h | 6 +- include/linux/blkzoned_api.h | 25 +++++ include/uapi/linux/Kbuild | 1 + include/uapi/linux/blkzoned_api.h | 214 ++++++++++++++++++++++++++++++++++++++ 9 files changed, 474 insertions(+), 2 deletions(-) create mode 100644 include/linux/blkzoned_api.h create mode 100644 include/uapi/linux/blkzoned_api.h diff --git a/MAINTAINERS b/MAINTAINERS index 771c31c..32f5598 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12785,6 +12785,15 @@ F: Documentation/networking/z8530drv.txt F: drivers/net/hamradio/*scc.c F: drivers/net/hamradio/z8530.h +ZAC AND ZBC BLOCK DEVICES +M: Shaun Tancheff <shaun.tancheff@seagate.com> +W: http://seagate.com +W: https://github.com/Seagate/ZDM-Device-Mapper +L: linux-block@vger.kernel.org +S: Maintained +F: include/linux/blkzoned_api.h +F: include/uapi/linux/blkzoned_api.h + ZBUD COMPRESSED PAGE ALLOCATOR M: Seth Jennings <sjenning@redhat.com> L: linux-mm@kvack.org diff --git a/block/blk-lib.c b/block/blk-lib.c index 083e56f..6dcdcbf 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -6,6 +6,7 @@ #include <linux/bio.h> #include <linux/blkdev.h> #include <linux/scatterlist.h> +#include <linux/blkzoned_api.h> #include "blk.h" @@ -266,3 +267,97 @@ int 
blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask); } EXPORT_SYMBOL(blkdev_issue_zeroout); + +/** + * blkdev_issue_zone_report - queue a report zones operation + * @bdev: target blockdev + * @op_flags: extra bio rw flags. If unsure, use 0. + * @sector: starting sector (report will include this sector). + * @opt: See: zone_report_option, default is 0 (all zones). + * @page: one or more contiguous pages. + * @pgsz: size of the report buffer in bytes (at most the size of @page). + * @gfp_mask: memory allocation flags (for bio_alloc) + * + * Description: + * Issue a zone report request for the sectors in question. + */ +int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags, + sector_t sector, u8 opt, struct page *page, + size_t pgsz, gfp_t gfp_mask) +{ + struct bdev_zone_report *conv = page_address(page); + struct bio *bio; + unsigned int nr_iovecs = 1; + int ret = 0; + + if (pgsz < (sizeof(struct bdev_zone_report) + + sizeof(struct bdev_zone_descriptor))) + return -EINVAL; + + bio = bio_alloc(gfp_mask, nr_iovecs); + if (!bio) + return -ENOMEM; + + conv->descriptor_count = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = bdev; + bio->bi_vcnt = 0; + bio->bi_iter.bi_size = 0; + + bio_add_page(bio, page, pgsz, 0); + bio_set_op_attrs(bio, REQ_OP_ZONE_REPORT, op_flags); + ret = submit_bio_wait(bio); + + /* + * When our request is nak'd the underlying device may be conventional, + * so report a single conventional zone the size of the device. + */ + if (ret == -EIO && conv->descriptor_count) { + /* Adjust the conventional to the size of the partition ... 
*/ + __be64 blksz = cpu_to_be64(bdev->bd_part->nr_sects); + + conv->maximum_lba = blksz; + conv->descriptors[0].type = ZTYP_CONVENTIONAL; + conv->descriptors[0].flags = ZCOND_CONVENTIONAL << 4; + conv->descriptors[0].length = blksz; + conv->descriptors[0].lba_start = 0; + conv->descriptors[0].lba_wptr = blksz; + ret = 0; + } + bio_put(bio); + return ret; +} +EXPORT_SYMBOL(blkdev_issue_zone_report); + +/** + * blkdev_issue_zone_action - queue a zone action operation + * @bdev: target blockdev + * @op: One of REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_RESET + * @op_flags: extra bio rw flags. If unsure, use 0. + * @sector: starting lba of the zone to act on + * @gfp_mask: memory allocation flags (for bio_alloc) + * + * Description: + * Issue a zone action request for the sector in question. + */ +int blkdev_issue_zone_action(struct block_device *bdev, unsigned int op, + unsigned int op_flags, sector_t sector, + gfp_t gfp_mask) +{ + int ret; + struct bio *bio; + + bio = bio_alloc(gfp_mask, 1); + if (!bio) + return -ENOMEM; + + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = bdev; + bio->bi_vcnt = 0; + bio->bi_iter.bi_size = 0; + bio_set_op_attrs(bio, op, op_flags); + ret = submit_bio_wait(bio); + bio_put(bio); + return ret; +} +EXPORT_SYMBOL(blkdev_issue_zone_action); diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index d3e852a..b54f359 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -52,6 +52,7 @@ #include <linux/slab.h> #include <linux/pm_runtime.h> #include <linux/pr.h> +#include <linux/blkzoned_api.h> #include <asm/uaccess.h> #include <asm/unaligned.h> @@ -1134,6 +1135,115 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt) return ret; } +static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd) +{ + struct request *rq = cmd->request; + struct scsi_device *sdp = cmd->device; + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); + struct bio *bio = rq->bio; + sector_t sector = blk_rq_pos(rq); + struct gendisk *disk = rq->rq_disk; + unsigned 
int nr_bytes = blk_rq_bytes(rq); + int ret = BLKPREP_KILL; + + WARN_ON(nr_bytes == 0); + + /* + * For conventional drives generate a report that shows a + * large single conventional zone the size of the block device + */ + if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC) { + void *src; + struct bdev_zone_report *conv; + + if (nr_bytes < sizeof(struct bdev_zone_report)) + goto out; + + src = kmap_atomic(bio->bi_io_vec->bv_page); + conv = src + bio->bi_io_vec->bv_offset; + conv->descriptor_count = cpu_to_be32(1); + conv->same_field = ZS_ALL_SAME; + conv->maximum_lba = cpu_to_be64(disk->part0.nr_sects); + kunmap_atomic(src); + goto out; + } + + ret = scsi_init_io(cmd); + if (ret != BLKPREP_OK) + goto out; + + cmd = rq->special; + if (sdp->changed) { + pr_err("SCSI disk has been changed or is not present.\n"); + ret = BLKPREP_KILL; + goto out; + } + + cmd->cmd_len = 16; + memset(cmd->cmnd, 0, cmd->cmd_len); + cmd->cmnd[0] = ZBC_IN; + cmd->cmnd[1] = ZI_REPORT_ZONES; + put_unaligned_be64(sector, &cmd->cmnd[2]); + put_unaligned_be32(nr_bytes, &cmd->cmnd[10]); + /* FUTURE ... 
when streamid is available */ + /* cmd->cmnd[14] = bio_get_streamid(bio); */ + + cmd->sc_data_direction = DMA_FROM_DEVICE; + cmd->sdb.length = nr_bytes; + cmd->transfersize = sdp->sector_size; + cmd->underflow = 0; + cmd->allowed = SD_MAX_RETRIES; + ret = BLKPREP_OK; +out: + return ret; +} + +static int sd_setup_zone_action_cmnd(struct scsi_cmnd *cmd) +{ + struct request *rq = cmd->request; + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); + sector_t sector = blk_rq_pos(rq); + int ret = BLKPREP_KILL; + u8 allbit = 0; + + if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC) + goto out; + + if (sector == ~0ul) { + allbit = 1; + sector = 0; + } + + cmd->cmd_len = 16; + memset(cmd->cmnd, 0, cmd->cmd_len); + memset(&cmd->sdb, 0, sizeof(cmd->sdb)); + cmd->cmnd[0] = ZBC_OUT; + switch (req_op(rq)) { + case REQ_OP_ZONE_OPEN: + cmd->cmnd[1] = ZO_OPEN_ZONE; + break; + case REQ_OP_ZONE_CLOSE: + cmd->cmnd[1] = ZO_CLOSE_ZONE; + break; + case REQ_OP_ZONE_RESET: + cmd->cmnd[1] = ZO_RESET_WRITE_POINTER; + break; + default: + goto out; + } + cmd->cmnd[14] = allbit; + put_unaligned_be64(sector, &cmd->cmnd[2]); + + cmd->transfersize = 0; + cmd->underflow = 0; + cmd->allowed = SD_MAX_RETRIES; + cmd->sc_data_direction = DMA_NONE; + + ret = BLKPREP_OK; +out: + return ret; +} + static int sd_init_command(struct scsi_cmnd *cmd) { struct request *rq = cmd->request; @@ -1148,6 +1258,12 @@ static int sd_init_command(struct scsi_cmnd *cmd) case REQ_OP_READ: case REQ_OP_WRITE: return sd_setup_read_write_cmnd(cmd); + case REQ_OP_ZONE_REPORT: + return sd_setup_zone_report_cmnd(cmd); + case REQ_OP_ZONE_OPEN: + case REQ_OP_ZONE_CLOSE: + case REQ_OP_ZONE_RESET: + return sd_setup_zone_action_cmnd(cmd); default: BUG(); } @@ -2737,6 +2853,8 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp) queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, sdkp->disk->queue); } + sdkp->zoned = (buffer[8] >> 4) & 3; + out: kfree(buffer); } diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 
765a6f1..f782990 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -94,6 +94,7 @@ struct scsi_disk { unsigned lbpvpd : 1; unsigned ws10 : 1; unsigned ws16 : 1; + unsigned zoned: 2; }; #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev) diff --git a/include/linux/bio.h b/include/linux/bio.h index 583c108..ffa3179 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -79,7 +79,12 @@ static inline bool bio_has_data(struct bio *bio) static inline bool bio_no_advance_iter(struct bio *bio) { - return bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_WRITE_SAME; + return bio_op(bio) == REQ_OP_DISCARD || + bio_op(bio) == REQ_OP_WRITE_SAME || + bio_op(bio) == REQ_OP_ZONE_REPORT || + bio_op(bio) == REQ_OP_ZONE_OPEN || + bio_op(bio) == REQ_OP_ZONE_CLOSE || + bio_op(bio) == REQ_OP_ZONE_RESET; } static inline bool bio_is_rw(struct bio *bio) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index f254eb2..35e3813 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -231,13 +231,17 @@ enum rq_flag_bits { enum req_op { REQ_OP_READ, REQ_OP_WRITE, + REQ_OP_ZONE_REPORT, + REQ_OP_ZONE_OPEN, + REQ_OP_ZONE_CLOSE, + REQ_OP_ZONE_RESET, REQ_OP_DISCARD, /* request to discard sectors */ REQ_OP_SECURE_ERASE, /* request to securely erase sectors */ REQ_OP_WRITE_SAME, /* write same block many times */ REQ_OP_FLUSH, /* request for cache flush */ }; -#define REQ_OP_BITS 3 +#define REQ_OP_BITS 4 typedef unsigned int blk_qc_t; #define BLK_QC_T_NONE -1U diff --git a/include/linux/blkzoned_api.h b/include/linux/blkzoned_api.h new file mode 100644 index 0000000..47c091a --- /dev/null +++ b/include/linux/blkzoned_api.h @@ -0,0 +1,25 @@ +/* + * Functions for zone based SMR devices. + * + * Copyright (C) 2015 Seagate Technology PLC + * + * Written by: + * Shaun Tancheff <shaun.tancheff@seagate.com> + * + * This file is licensed under the terms of the GNU General Public + * License version 2. 
This program is licensed "as is" without any + * warranty of any kind, whether express or implied. + */ + +#ifndef _BLKZONED_API_H +#define _BLKZONED_API_H + +#include <uapi/linux/blkzoned_api.h> + +extern int blkdev_issue_zone_action(struct block_device *, unsigned int op, + unsigned int op_flags, sector_t, gfp_t); +extern int blkdev_issue_zone_report(struct block_device *, unsigned int op_flgs, + sector_t, u8 opt, struct page *, size_t, + gfp_t); + +#endif /* _BLKZONED_API_H */ diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild index ec10cfe..b94461c 100644 --- a/include/uapi/linux/Kbuild +++ b/include/uapi/linux/Kbuild @@ -70,6 +70,7 @@ header-y += bfs_fs.h header-y += binfmts.h header-y += blkpg.h header-y += blktrace_api.h +header-y += blkzoned_api.h header-y += bpf_common.h header-y += bpf.h header-y += bpqether.h diff --git a/include/uapi/linux/blkzoned_api.h b/include/uapi/linux/blkzoned_api.h new file mode 100644 index 0000000..48c17ad --- /dev/null +++ b/include/uapi/linux/blkzoned_api.h @@ -0,0 +1,214 @@ +/* + * Functions for zone based SMR devices. + * + * Copyright (C) 2015 Seagate Technology PLC + * + * Written by: + * Shaun Tancheff <shaun.tancheff@seagate.com> + * + * This file is licensed under the terms of the GNU General Public + * License version 2. This program is licensed "as is" without any + * warranty of any kind, whether express or implied. + */ + +#ifndef _UAPI_BLKZONED_API_H +#define _UAPI_BLKZONED_API_H + +#include <linux/types.h> + +/** + * enum zone_report_option - Report Zones types to be included. + * + * @ZOPT_NON_SEQ_AND_RESET: Default (all zones). + * @ZOPT_ZC1_EMPTY: Zones which are empty. + * @ZOPT_ZC2_OPEN_IMPLICIT: Zones open but not explicitly opened + * @ZOPT_ZC3_OPEN_EXPLICIT: Zones opened explicitly + * @ZOPT_ZC4_CLOSED: Zones closed for writing. + * @ZOPT_ZC5_FULL: Zones that are full. 
+ * @ZOPT_ZC6_READ_ONLY: Zones that are read-only + * @ZOPT_ZC7_OFFLINE: Zones that are offline + * @ZOPT_RESET: Zones with a reset write pointer recommended + * @ZOPT_NON_SEQ: Zones that have HA media-cache writes pending + * @ZOPT_NON_WP_ZONES: Zones that do not have Write Pointers (conventional) + * @ZOPT_PARTIAL_FLAG: Modifies the definition of the Zone List Length field. + * + * Used by Report Zones in bdev_zone_get_report: report_option + */ +enum zone_report_option { + ZOPT_NON_SEQ_AND_RESET = 0x00, + ZOPT_ZC1_EMPTY, + ZOPT_ZC2_OPEN_IMPLICIT, + ZOPT_ZC3_OPEN_EXPLICIT, + ZOPT_ZC4_CLOSED, + ZOPT_ZC5_FULL, + ZOPT_ZC6_READ_ONLY, + ZOPT_ZC7_OFFLINE, + ZOPT_RESET = 0x10, + ZOPT_NON_SEQ = 0x11, + ZOPT_NON_WP_ZONES = 0x3f, + ZOPT_PARTIAL_FLAG = 0x80, +}; + +/** + * enum bdev_zone_type - Type of zone in descriptor + * + * @ZTYP_RESERVED: Reserved + * @ZTYP_CONVENTIONAL: Conventional random write zone (No Write Pointer) + * @ZTYP_SEQ_WRITE_REQUIRED: Non-sequential writes are rejected. + * @ZTYP_SEQ_WRITE_PREFERRED: Non-sequential writes allowed but discouraged. + * + * Returned from Report Zones. See bdev_zone_descriptor* type. + */ +enum bdev_zone_type { + ZTYP_RESERVED = 0, + ZTYP_CONVENTIONAL = 1, + ZTYP_SEQ_WRITE_REQUIRED = 2, + ZTYP_SEQ_WRITE_PREFERRED = 3, +}; + + +/** + * enum bdev_zone_condition - Condition of zone in descriptor + * + * @ZCOND_CONVENTIONAL: N/A + * @ZCOND_ZC1_EMPTY: Empty + * @ZCOND_ZC2_OPEN_IMPLICIT: Opened via write to zone. + * @ZCOND_ZC3_OPEN_EXPLICIT: Opened via open zone command. + * @ZCOND_ZC4_CLOSED: Closed + * @ZCOND_ZC6_READ_ONLY: Read only. + * @ZCOND_ZC5_FULL: No remaining space in zone. + * @ZCOND_ZC7_OFFLINE: Offline + * + * Returned from Report Zones. See bdev_zone_descriptor* flags. 
+ */ +enum bdev_zone_condition { + ZCOND_CONVENTIONAL = 0, + ZCOND_ZC1_EMPTY = 1, + ZCOND_ZC2_OPEN_IMPLICIT = 2, + ZCOND_ZC3_OPEN_EXPLICIT = 3, + ZCOND_ZC4_CLOSED = 4, + /* 0x5 to 0xC are reserved */ + ZCOND_ZC6_READ_ONLY = 0xd, + ZCOND_ZC5_FULL = 0xe, + ZCOND_ZC7_OFFLINE = 0xf, +}; + + +/** + * enum bdev_zone_same - Report Zones same code. + * + * @ZS_ALL_DIFFERENT: All zones differ in type and size. + * @ZS_ALL_SAME: All zones are the same size and type. + * @ZS_LAST_DIFFERS: All zones are the same size and type except the last zone. + * @ZS_SAME_LEN_DIFF_TYPES: All zones are the same length but types differ. + * + * Returned from Report Zones. See bdev_zone_report* same_field. + */ +enum bdev_zone_same { + ZS_ALL_DIFFERENT = 0, + ZS_ALL_SAME = 1, + ZS_LAST_DIFFERS = 2, + ZS_SAME_LEN_DIFF_TYPES = 3, +}; + + +/** + * struct bdev_zone_get_report - ioctl: Report Zones request + * + * @zone_locator_lba: starting lba for first [reported] zone + * @return_page_count: number of *bytes* allocated for result + * @report_option: see: zone_report_option enum + * + * Used to issue report zones command to connected device + */ +struct bdev_zone_get_report { + __u64 zone_locator_lba; + __u32 return_page_count; + __u8 report_option; +} __packed; + +/** + * struct bdev_zone_descriptor_le - See: bdev_zone_descriptor + */ +struct bdev_zone_descriptor_le { + __u8 type; + __u8 flags; + __u8 reserved1[6]; + __le64 length; + __le64 lba_start; + __le64 lba_wptr; + __u8 reserved[32]; +} __packed; + + +/** + * struct bdev_zone_report_le - See: bdev_zone_report + */ +struct bdev_zone_report_le { + __le32 descriptor_count; + __u8 same_field; + __u8 reserved1[3]; + __le64 maximum_lba; + __u8 reserved2[48]; + struct bdev_zone_descriptor_le descriptors[0]; +} __packed; + + +/** + * struct bdev_zone_descriptor - A Zone descriptor entry from report zones + * + * @type: see zone_type enum + * @flags: Bits 0:reset, 1:non-seq, 2-3: resv, 4-7: see zone_condition enum + * @reserved1: padding + * 
@length: length of zone in sectors + * @lba_start: lba where the zone starts. + * @lba_wptr: lba of the current write pointer. + * @reserved: padding + * + */ +struct bdev_zone_descriptor { + __u8 type; + __u8 flags; + __u8 reserved1[6]; + __be64 length; + __be64 lba_start; + __be64 lba_wptr; + __u8 reserved[32]; +} __packed; + + +/** + * struct bdev_zone_report - Report Zones result + * + * @descriptor_count: Number of descriptor entries that follow + * @same_field: bits 0-3: enum zone_same (MASK: 0x0F) + * @reserved1: padding + * @maximum_lba: LBA of the last logical sector on the device, inclusive + * of all logical sectors in all zones. + * @reserved2: padding + * @descriptors: array of descriptors follows. + */ +struct bdev_zone_report { + __be32 descriptor_count; + __u8 same_field; + __u8 reserved1[3]; + __be64 maximum_lba; + __u8 reserved2[48]; + struct bdev_zone_descriptor descriptors[0]; +} __packed; + + +/** + * struct bdev_zone_report_io - Report Zones ioctl argument. + * + * @in: Report Zones inputs + * @out: Report Zones output + */ +struct bdev_zone_report_io { + union { + struct bdev_zone_get_report in; + struct bdev_zone_report out; + } data; +} __packed; + +#endif /* _UAPI_BLKZONED_API_H */ -- 2.8.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer 2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff 2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff @ 2016-07-29 19:02 ` Shaun Tancheff 2016-08-01 9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig 2 siblings, 0 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-07-29 19:02 UTC (permalink / raw) To: linux-block, linux-scsi, linux-kernel Cc: Shaun Tancheff, Jens Axboe, Jens Axboe, Christoph Hellwig, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman, Shaun Tancheff Add support for ZBC ioctls: BLKREPORT - Issue Report Zones to device. BLKOPENZONE - Issue Zone Action: Open Zone command. BLKCLOSEZONE - Issue Zone Action: Close Zone command. BLKRESETZONE - Issue Zone Action: Reset Zone command. Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com> --- v6: - Added GFP_DMA to gfp mask. v4: - Rebase on linux-next tag next-20160617. 
- Change bio flags to bio op's --- block/ioctl.c | 110 ++++++++++++++++++++++++++++++++++++++ include/uapi/linux/blkzoned_api.h | 6 +++ include/uapi/linux/fs.h | 1 + 3 files changed, 117 insertions(+) diff --git a/block/ioctl.c b/block/ioctl.c index ed2397f..a2a6c2c 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -7,6 +7,7 @@ #include <linux/backing-dev.h> #include <linux/fs.h> #include <linux/blktrace_api.h> +#include <linux/blkzoned_api.h> #include <linux/pr.h> #include <asm/uaccess.h> @@ -194,6 +195,109 @@ int blkdev_reread_part(struct block_device *bdev) } EXPORT_SYMBOL(blkdev_reread_part); +static int blk_zoned_report_ioctl(struct block_device *bdev, fmode_t mode, + void __user *parg) +{ + int error = -EFAULT; + gfp_t gfp = GFP_KERNEL | GFP_DMA; + struct bdev_zone_report_io *zone_iodata = NULL; + int order = 0; + struct page *pgs = NULL; + u32 alloc_size = PAGE_SIZE; + unsigned long op_flags = 0; + u8 opt = 0; + + if (!(mode & FMODE_READ)) + return -EBADF; + + zone_iodata = (void *)get_zeroed_page(gfp); + if (!zone_iodata) { + error = -ENOMEM; + goto report_zones_out; + } + if (copy_from_user(zone_iodata, parg, sizeof(*zone_iodata))) { + error = -EFAULT; + goto report_zones_out; + } + if (zone_iodata->data.in.return_page_count > alloc_size) { + int npages; + + alloc_size = zone_iodata->data.in.return_page_count; + npages = (alloc_size + PAGE_SIZE - 1) >> PAGE_SHIFT; + pgs = alloc_pages(gfp, ilog2(npages)); + if (pgs) { + void *mem = page_address(pgs); + + if (!mem) { + error = -ENOMEM; + goto report_zones_out; + } + order = ilog2(npages); + memset(mem, 0, alloc_size); + memcpy(mem, zone_iodata, sizeof(*zone_iodata)); + free_page((unsigned long)zone_iodata); + zone_iodata = mem; + } else { + /* Result requires DMA capable memory */ + pr_err("Not enough memory available for request.\n"); + error = -ENOMEM; + goto report_zones_out; + } + } + opt = zone_iodata->data.in.report_option; + error = blkdev_issue_zone_report(bdev, op_flags, + 
zone_iodata->data.in.zone_locator_lba, opt, + pgs ? pgs : virt_to_page(zone_iodata), + alloc_size, GFP_KERNEL); + + if (error) + goto report_zones_out; + + if (copy_to_user(parg, zone_iodata, alloc_size)) + error = -EFAULT; + +report_zones_out: + if (pgs) + __free_pages(pgs, order); + else if (zone_iodata) + free_page((unsigned long)zone_iodata); + return error; +} + +static int blk_zoned_action_ioctl(struct block_device *bdev, fmode_t mode, + unsigned int cmd, unsigned long arg) +{ + unsigned int op = 0; + + if (!(mode & FMODE_WRITE)) + return -EBADF; + + /* + * When acting on zones we explicitly disallow using a partition. + */ + if (bdev != bdev->bd_contains) { + pr_err("%s: All zone operations disallowed on this device\n", + __func__); + return -EFAULT; + } + + switch (cmd) { + case BLKOPENZONE: + op = REQ_OP_ZONE_OPEN; + break; + case BLKCLOSEZONE: + op = REQ_OP_ZONE_CLOSE; + break; + case BLKRESETZONE: + op = REQ_OP_ZONE_RESET; + break; + default: + pr_err("%s: Unknown action: %u\n", __func__, cmd); + return -EINVAL; + } + return blkdev_issue_zone_action(bdev, op, 0, arg, GFP_KERNEL); +} + static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode, unsigned long arg, unsigned long flags) { @@ -568,6 +672,12 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd, case BLKTRACESETUP: case BLKTRACETEARDOWN: return blk_trace_ioctl(bdev, cmd, argp); + case BLKREPORT: + return blk_zoned_report_ioctl(bdev, mode, argp); + case BLKOPENZONE: + case BLKCLOSEZONE: + case BLKRESETZONE: + return blk_zoned_action_ioctl(bdev, mode, cmd, arg); case IOC_PR_REGISTER: return blkdev_pr_register(bdev, argp); case IOC_PR_RESERVE: diff --git a/include/uapi/linux/blkzoned_api.h b/include/uapi/linux/blkzoned_api.h index 48c17ad..3566de0 100644 --- a/include/uapi/linux/blkzoned_api.h +++ b/include/uapi/linux/blkzoned_api.h @@ -211,4 +211,10 @@ struct bdev_zone_report_io { } data; } __packed; +/* continuing from uapi/linux/fs.h: */ +#define BLKREPORT 
_IOWR(0x12, 130, struct bdev_zone_report_io) +#define BLKOPENZONE _IO(0x12, 131) +#define BLKCLOSEZONE _IO(0x12, 132) +#define BLKRESETZONE _IO(0x12, 133) + #endif /* _UAPI_BLKZONED_API_H */ diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 3b00f7c..c0b565b 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -222,6 +222,7 @@ struct fsxattr { #define BLKSECDISCARD _IO(0x12,125) #define BLKROTATIONAL _IO(0x12,126) #define BLKZEROOUT _IO(0x12,127) +/* A jump here: See blkzoned_api.h, Reserving 130 to 133. */ #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */ #define FIBMAP _IO(0x00,1) /* bmap access */ -- 2.8.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
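The zone-action ioctls above take the target sector directly as the argument. A hedged userspace sketch follows; the ioctl numbers are copied from the reservations in blkzoned_api.h, while `reset_zone` and the device path are illustrative:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Ioctl numbers as reserved above in blkzoned_api.h; redeclared here
 * so the sketch compiles without the kernel headers from this series. */
#define BLKOPENZONE  _IO(0x12, 131)
#define BLKCLOSEZONE _IO(0x12, 132)
#define BLKRESETZONE _IO(0x12, 133)

/* Reset the zone containing `lba`; per the series, an LBA of ~0ul
 * requests the action on all zones. Device path is caller-supplied. */
static int reset_zone(const char *dev, unsigned long lba)
{
	int fd = open(dev, O_WRONLY);	/* action ioctls need FMODE_WRITE */
	int rc;

	if (fd < 0)
		return -1;
	rc = ioctl(fd, BLKRESETZONE, lba);
	close(fd);
	return rc;
}
```

Note that the kernel side rejects partitions outright, so this must be called on the whole-disk node (e.g. `/dev/sdX`, not `/dev/sdX1`).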
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff 2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff 2016-07-29 19:02 ` [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer Shaun Tancheff @ 2016-08-01 9:41 ` Christoph Hellwig 2016-08-01 17:07 ` Shaun Tancheff 2 siblings, 1 reply; 26+ messages in thread From: Christoph Hellwig @ 2016-08-01 9:41 UTC (permalink / raw) To: Shaun Tancheff Cc: linux-block, linux-scsi, linux-kernel, Jens Axboe, Jens Axboe, Christoph Hellwig, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman Can you please integrate this with Hannes series so that it uses his cache of the zone information? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-08-01 9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig @ 2016-08-01 17:07 ` Shaun Tancheff 0 siblings, 0 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-08-01 17:07 UTC (permalink / raw) To: Christoph Hellwig Cc: Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman, Hannes Reinecke, Damien Le Moal On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote: > > Can you please integrate this with Hannes series so that it uses > his cache of the zone information? Adding Hannes and Damien to Cc. Christoph, I can make a patch to marshal Hannes' RB-Tree into a block report; that is quite simple. I can even have the open/close/reset zone commands update the RB-Tree .. the non-private parts anyway. I would prefer to do this around the CONFIG_SD_ZBC support, offering the existing type of patch for setups that do not need the RB-Tree to function with zoned media. I do still have concerns with the approach, which I have shared in smaller forums, but perhaps I have to bring them to this group. First is the memory consumption. This isn't really much of a concern for large servers with few drives, but I think the embedded NAS market will grumble, as well as the large data pods trying to stuff 300+ drives in a chassis. As of now the RB-Tree needs to hold ~30000 zones. sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields around 3.5 MB per zoned drive attached. Which is fine if it is really needed, but most of it is fixed information and it can be significantly condensed (I have proposed 8 bytes per zone held in an array as more than adequate). 
Worse is that the crucial piece of information, the current wp needed for scheduling the next write, is mostly out of date because it is updated only after the write completes; zones being actively written to must work off of the last location / size that was submitted, not completed. The workaround is for that tracking to be handled in the private_data member. I am not saying that updating the wp on completing a write isn't important, I am saying that the bi_end_io hook is the existing hook that works just fine. This all tails into domain responsibility. With the RB-Tree doing half of the work and the 'responsible' domain handling the active path via private_data, why have the split at all? It seems to be double work to have a second object tracking the first so that I/O scheduling can function. Finally there is the error handling path: when the RB-Tree encounters an error it attempts to requery the drive topology, virtually guaranteeing that the private_data is now out of sync with the RB-Tree. Again, this is something that can be better encapsulated in the bi_end_io to be informed of the failed I/O and schedule the appropriate recovery (including re-querying the zone information of the affected zone(s)). Anyway, those are my concerns and why I am still reluctant to drop this line of support. I have incorporated Hannes' changes at various points, hence the SCT Write Same support to attempt to work around some of the flaws in mapping discard to reset write pointer. 
Thanks and Regards, Shaun > -- > To unsubscribe from this list: send the line "unsubscribe linux-block" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at https://urldefense.proofpoint.com/v2/url?u=3Dhttp= -3A__vger.kernel.org_majordomo-2Dinfo.html&d=3DCwIBAg&c=3DIGDlg0lD0b-nebmJJ= 0Kp8A&r=3DWg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=3D0ZPyN4vfYZXSmuCmI= m3wpExF1K28PYO9KmgcqDsfQBg&s=3Daiguzw5_op7woZCZ5Qi7c36b16SxiWTJXshN0dG3Xyo&= e=3D --=20 Shaun Tancheff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-01 17:07 ` Shaun Tancheff
  0 siblings, 0 replies; 26+ messages in thread
From: Shaun Tancheff @ 2016-08-01 17:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe,
    James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
    J . Bruce Fields, Josh Bingaman, Hannes Reinecke, Damien Le Moal

On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>
> Can you please integrate this with Hannes series so that it uses
> his cache of the zone information?

Adding Hannes and Damien to Cc.

Christoph,

I can make a patch to marshal Hannes' RB-Tree into a block report; that is
quite simple. I can even have the open/close/reset zone commands update the
RB-Tree .. the non-private parts anyway. I would prefer to do this around
the CONFIG_SD_ZBC support, offering the existing type of patch for setups
that do not need the RB-Tree to function with zoned media.

I do still have concerns with the approach, which I have shared in smaller
forums, but perhaps I have to bring them to this group.

First is the memory consumption. This isn't really much of a concern for
large servers with few drives, but I think the embedded NAS market will
grumble, as will the large data pods trying to stuff 300+ drives in a
chassis.

As of now the RB-Tree needs to hold ~30000 zones. sizeof() reports struct
blk_zone to use 120 bytes on x86_64. This yields around 3.5 MB per zoned
drive attached. Which is fine if it is really needed, but most of it is
fixed information that can be significantly condensed (I have proposed
8 bytes per zone held in an array as more than adequate).

Worse is that the crucial piece of information, the current wp needed for
scheduling the next write, is mostly out of date because it is updated
only after the write completes; zones being actively written to must work
off of the last location / size that was submitted, not completed. The
workaround is for that tracking to be handled in the private_data member.
I am not saying that updating the wp on completing a write isn’t
important; I am saying that the bi_end_io hook is the existing hook that
works just fine.

This all tails into domain responsibility. With the RB-Tree doing half of
the work and the ‘responsible’ domain handling the active path via
private_data, why have the split at all? It seems to be double work to
have a second object tracking the first so that I/O scheduling can
function.

Finally, there is the error handling path: when the RB-Tree encounters an
error it attempts to re-query the drive topology, virtually guaranteeing
that the private_data is now out-of-sync with the RB-Tree. Again, this is
something that can be better encapsulated in bi_end_io, which is informed
of the failed I/O and can schedule the appropriate recovery (including
re-querying the zone information of the affected zone(s)).

Anyway, those are my concerns and why I am still reluctant to drop this
line of support. I have incorporated Hannes' changes at various points;
hence the SCT Write Same to attempt to work around some of the flaws in
mapping discard to reset write pointer.

Thanks and Regards,
Shaun

> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Shaun Tancheff

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
2016-08-01 17:07 ` Shaun Tancheff (?)
@ 2016-08-02 14:35 ` Hannes Reinecke
  2016-08-03 1:29 ` Damien Le Moal
  -1 siblings, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2016-08-02 14:35 UTC (permalink / raw)
To: Shaun Tancheff, Christoph Hellwig
Cc: Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe,
    James E . J . Bottomley, Martin K . Petersen, Jeff Layton,
    J . Bruce Fields, Josh Bingaman, Damien Le Moal

On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>
>> Can you please integrate this with Hannes series so that it uses
>> his cache of the zone information?
>
> Adding Hannes and Damien to Cc.
>
> Christoph,
>
> I can make a patch to marshal Hannes' RB-Tree into a block report; that
> is quite simple. I can even have the open/close/reset zone commands
> update the RB-Tree .. the non-private parts anyway. I would prefer to do
> this around the CONFIG_SD_ZBC support, offering the existing type of
> patch for setups that do not need the RB-Tree to function with zoned
> media.
>
> I do still have concerns with the approach, which I have shared in
> smaller forums, but perhaps I have to bring them to this group.
>
> First is the memory consumption. This isn't really much of a concern for
> large servers with few drives, but I think the embedded NAS market will
> grumble, as will the large data pods trying to stuff 300+ drives in a
> chassis.
>
> As of now the RB-Tree needs to hold ~30000 zones. sizeof() reports
> struct blk_zone to use 120 bytes on x86_64. This yields around 3.5 MB
> per zoned drive attached. Which is fine if it is really needed, but most
> of it is fixed information that can be significantly condensed (I have
> proposed 8 bytes per zone held in an array as more than adequate).
>
> Worse is that the crucial piece of information, the current wp needed
> for scheduling the next write, is mostly out of date because it is
> updated only after the write completes; zones being actively written to
> must work off of the last location / size that was submitted, not
> completed. The workaround is for that tracking to be handled in the
> private_data member. I am not saying that updating the wp on completing
> a write isn’t important; I am saying that the bi_end_io hook is the
> existing hook that works just fine.
>
Which _actually_ is not true; with my patches I'll update the write
pointer prior to submitting the I/O (on the reasoning that most of the
time I/O will succeed) and re-read the zone information if an I/O failed.
(Which I'll have to do anyway, as after an I/O failure the write pointer
status is not clearly defined.)

I have thought about condensing the RB tree information, but then I
figured that for 'real' SMR handling we cannot assume all zones are of
fixed size, and hence we need all the information there.
Any condensing method would assume a given structure of the zones, which
the standard just doesn't provide.
Or am I missing something here?

As for write pointer handling: yes, the write pointer on the zones is
not really useful for upper-level usage.
Where we do need it is to detect I/O which is crossing the write pointer
(eg when doing reads over the entire zone).
As per spec you will be getting an I/O error here, so we need to split
the I/O on the write pointer to get valid results back.

> This all tails into domain responsibility. With the RB-Tree doing half
> of the work and the ‘responsible’ domain handling the active path via
> private_data, why have the split at all? It seems to be double work to
> have a second object tracking the first so that I/O scheduling can
> function.
>
Tracking the zone states via RB trees is mainly to handle host-managed
drives seamlessly. Without an RB tree we will be subjected to I/O errors
during boot, as eg with new drives we inevitably will try to read beyond
the write pointer, which in turn will generate I/O errors on certain
drives.
I do agree that there is no strict need to set up an RB tree for
host-aware drives; I can easily add an attribute/flag to disable RB
trees for those.
However, tracking zone information in the RB tree _dramatically_ speeds
up device initialisation; issuing a blkdev_discard() for the entire
drive will take _ages_ without it.

> Finally, there is the error handling path: when the RB-Tree encounters
> an error it attempts to re-query the drive topology, virtually
> guaranteeing that the private_data is now out-of-sync with the RB-Tree.
> Again, this is something that can be better encapsulated in bi_end_io,
> which is informed of the failed I/O and can schedule the appropriate
> recovery (including re-querying the zone information of the affected
> zone(s)).
>
Well, if we have an RB tree with write pointers, of course we need to
re-read the zone information on failure (and I thought I did that?).
Plus the error information will be propagated via the usual mechanism,
so callers need to take care of updating any information in
->private_data.

I'm perfectly fine with integrating your patches for the various
blkdev_XX callouts and associated ioctls; Jens favours that approach,
too, so we should be combining those efforts.

Cheers,

Hannes
--
Dr. Hannes Reinecke	Teamlead Storage & Networking
hare@suse.de	+49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
@ 2016-08-03 1:29 ` Damien Le Moal
  0 siblings, 0 replies; 26+ messages in thread
From: Damien Le Moal @ 2016-08-03 1:29 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Shaun Tancheff, Christoph Hellwig, Shaun Tancheff, linux-block,
    linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley,
    Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman

Hannes, Shaun,

Let me add some more comments.

> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote:
>
> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote:
>>>
>>> Can you please integrate this with Hannes series so that it uses
>>> his cache of the zone information?
>>
>> Adding Hannes and Damien to Cc.
>>
>> Christoph,
>>
>> I can make a patch to marshal Hannes' RB-Tree into a block report; that
>> is quite simple. I can even have the open/close/reset zone commands
>> update the RB-Tree .. the non-private parts anyway. I would prefer to
>> do this around the CONFIG_SD_ZBC support, offering the existing type of
>> patch for setups that do not need the RB-Tree to function with zoned
>> media.
>>
>> I do still have concerns with the approach, which I have shared in
>> smaller forums, but perhaps I have to bring them to this group.
>>
>> First is the memory consumption. This isn't really much of a concern
>> for large servers with few drives, but I think the embedded NAS market
>> will grumble, as will the large data pods trying to stuff 300+ drives
>> in a chassis.
>>
>> As of now the RB-Tree needs to hold ~30000 zones. sizeof() reports
>> struct blk_zone to use 120 bytes on x86_64. This yields around 3.5 MB
>> per zoned drive attached. Which is fine if it is really needed, but
>> most of it is fixed information that can be significantly condensed (I
>> have proposed 8 bytes per zone held in an array as more than adequate).
>> Worse is that the crucial piece of information, the current wp needed
>> for scheduling the next write, is mostly out of date because it is
>> updated only after the write completes; zones being actively written to
>> must work off of the last location / size that was submitted, not
>> completed. The workaround is for that tracking to be handled in the
>> private_data member. I am not saying that updating the wp on completing
>> a write isn’t important; I am saying that the bi_end_io hook is the
>> existing hook that works just fine.
>>
> Which _actually_ is not true; with my patches I'll update the write
> pointer prior to submitting the I/O (on the reasoning that most of the
> time I/O will succeed) and re-read the zone information if an I/O
> failed. (Which I'll have to do anyway, as after an I/O failure the
> write pointer status is not clearly defined.)
>
> I have thought about condensing the RB tree information, but then I
> figured that for 'real' SMR handling we cannot assume all zones are of
> fixed size, and hence we need all the information there.
> Any condensing method would assume a given structure of the zones, which
> the standard just doesn't provide.
> Or am I missing something here?

Indeed, the standards do not mandate any particular zone configuration,
constant zone size, etc., so writing code that can handle all of that is
certainly the right way of doing things. However, if we decide to go
forward with mapping the RESET WRITE POINTER command to DISCARD, then at
least a constant zone size (minus the last zone, as you said) must be
assumed, and that information can be removed from the entries in the RB
tree (as it will be saved for the sysfs "zone_size" file anyway). Adding
a little code to handle that eventual last runt zone with a different
size is not a big problem.

> As for write pointer handling: yes, the write pointer on the zones is
> not really useful for upper-level usage.
> Where we do need it is to detect I/O which is crossing the write pointer
> (eg when doing reads over the entire zone).
> As per spec you will be getting an I/O error here, so we need to split
> the I/O on the write pointer to get valid results back.

To be precise here, the I/O splitting will be handled by the block layer
thanks to the "chunk_sectors" setting. But that relies on a constant zone
size assumption too.

The RB-tree here is most useful for reads over or after the write
pointer, as this can have different behavior on different drives (URSWRZ
bit). The RB-tree allows us to hide these differences from upper layers
and simplify support at those levels.

I too agree that the write pointer value is not very useful to upper
layers (e.g. FS). What matters most of the time for these layers is an
"allocation pointer" or "submission pointer" which indicates the LBA to
use for the next BIO submission. That LBA will most of the time be in
advance of the zone WP because of request queuing in the block scheduler.
Note that from what we have done already, in many cases upper layers do
not even need this (e.g. F2FS, btrfs) and work fine without needing *any*
zone->private_data, because that "allocation pointer" information
generally already exists in a different form (e.g. a bit in a bitmap).
For these cases, not having the RB-tree of zones would force a lot more
code to be added to file systems for SMR support.

>> This all tails into domain responsibility. With the RB-Tree doing half
>> of the work and the ‘responsible’ domain handling the active path via
>> private_data, why have the split at all? It seems to be double work to
>> have a second object tracking the first so that I/O scheduling can
>> function.
>>
> Tracking the zone states via RB trees is mainly to handle host-managed
> drives seamlessly. Without an RB tree we will be subjected to I/O errors
> during boot, as eg with new drives we inevitably will try to read beyond
> the write pointer, which in turn will generate I/O errors on certain
> drives.
> I do agree that there is no strict need to set up an RB tree for
> host-aware drives; I can easily add an attribute/flag to disable RB
> trees for those.
> However, tracking zone information in the RB tree _dramatically_ speeds
> up device initialisation; issuing a blkdev_discard() for the entire
> drive will take _ages_ without it.

And the RB-tree will also be very useful for speeding up report zones
ioctls too. Unless we want those to force an update of the RB-tree
information?

>> Finally, there is the error handling path: when the RB-Tree encounters
>> an error it attempts to re-query the drive topology, virtually
>> guaranteeing that the private_data is now out-of-sync with the RB-Tree.
>> Again, this is something that can be better encapsulated in bi_end_io,
>> which is informed of the failed I/O and can schedule the appropriate
>> recovery (including re-querying the zone information of the affected
>> zone(s)).
>>
> Well, if we have an RB tree with write pointers, of course we need to
> re-read the zone information on failure (and I thought I did that?).
> Plus the error information will be propagated via the usual mechanism,
> so callers need to take care of updating any information in
> ->private_data.
>
> I'm perfectly fine with integrating your patches for the various
> blkdev_XX callouts and associated ioctls; Jens favours that approach,
> too, so we should be combining those efforts.

Can we agree on an interface?

Summarizing all the discussions we had, I am all in favor of the
following:

1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
The already existing device type file can tell us if this is a
host-managed disk (type==20) or a host-aware disk (type==0).
Drive-managed disks could also be detected, but since no information on
their zone configuration can be obtained, let's treat those as regular
drives.

2) Add ioctls for zone management:
Report zones (get information from the RB tree), reset zone (simple
wrapper to the ioctl for block discard), open zone, close zone and finish
zone. That will allow mkfs-like tools to get zone information without
having to parse thousands of sysfs files (and can also be integrated in
the libzbc block backend driver for a unified interface with the direct
SG_IO path for kernels without the ZBC code enabled).

3) Try to condense the blkzone data structure to save memory:
I think that we can at the very least remove the zone length, and maybe
the per-zone spinlock too (a single spinlock and proper state flags can
be used).

Best regards.

------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.hgst.com

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-08-03 1:29 ` Damien Le Moal @ 2016-08-05 20:35 ` Shaun Tancheff -1 siblings, 0 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-08-05 20:35 UTC (permalink / raw) To: Damien Le Moal Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman

On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote: > Hannes, Shaun, > > Let me add some more comments. > >> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote: >> >> On 08/01/2016 07:07 PM, Shaun Tancheff wrote: >>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote: >>>> >>>> Can you please integrate this with Hannes series so that it uses >>>> his cache of the zone information? >>> >>> Adding Hannes and Damien to Cc. >>> >>> Christoph, >>> >>> I can make a patch the marshal Hannes' RB-Tree into to a block report, that is >>> quite simple. I can even have the open/close/reset zone commands update the >>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the >>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do >>> not need the RB-Tree to function with zoned media. I have posted patches to integrate with the zone cache, hopefully they make sense. >>> >>> I do still have concerns with the approach which I have shared in smaller >>> forums but perhaps I have to bring them to this group. >>> >>> First is the memory consumption. This isn't really much of a concern for large >>> servers with few drives but I think the embedded NAS market will grumble as >>> well as the large data pods trying to stuff 300+ drives in a chassis. >>> >>> As of now the RB-Tree needs to hold ~30000 zones. >>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields >>> around 3.5 MB per zoned drive attached.

>>> Which is fine if it is really needed, but most of it is fixed information >>> and it can be significantly condensed (I have proposed 8 bytes per zone held >>> in an array as more than adequate). Worse is that the crucial piece of >>> information, the current wp needed for scheduling the next write, is mostly >>> out of date because it is updated only after the write completes and zones >>> being actively written to must work off of the last location / size that was >>> submitted, not completed. The work around is for that tracking to be handled >>> in the private_data member. I am not saying that updating the wp on >>> completing a write isn’t important, I am saying that the bi_end_io hook is >>> the existing hook that works just fine. >>> >> Which _actually_ is not true; with my patches I'll update the write >> pointer prior to submit the I/O (on the reasoning that most of the time >> I/O will succeed) and re-read the zone information if an I/O failed. >> (Which I'll have to do anyway as after an I/O failure the write pointer >> status is not clearly defined.) Apologies for my mis-characterization. >> I have thought about condensing the RB tree information, but then I >> figured that for 'real' SMR handling we cannot assume all zones are of >> fixed size, and hence we need all the information there. >> Any condensing method would assume a given structure of the zones, which >> the standard just doesn't provide. >> Or am I missing something here? > > Indeed, the standards do not mandate any particular zone configuration, > constant zone size, etc. So writing code so that can be handled is certainly > the right way of doing things. However, if we decide to go forward with > mapping RESET WRITE POINTER command to DISCARD, then at least a constant > zone size (minus the last zone as you said) must be assumed, and that > information can be removed from the entries in the RB tree (as it will be > saved for the sysfs "zone_size" file anyway. Adding a little code to handle > that eventual last runt zone with a different size is not a big problem.

>> As for write pointer handling: yes, the write pointer on the zones is >> not really useful for upper-level usage. >> Where we do need it is to detect I/O which is crossing the write pointer >> (eg when doing reads over the entire zone). >> As per spec you will be getting an I/O error here, so we need to split >> the I/O on the write pointer to get valid results back. > > To be precise here, the I/O splitting will be handled by the block layer > thanks to the "chunk_sectors" setting. But that relies on a constant zone > size assumption too. > > The RB-tree here is most useful for reads over or after the write pointer as > this can have different behavior on different drives (URSWRZ bit). The RB-tree > allows us to hide these differences to upper layers and simplify support at > those levels. It is unfortunate that path was chosen rather than returning Made Up Data. However I don't think this is a file system or device mapper problem as neither of those layers attempt to read blocks they have not written (or discarded). All I can think of something that probes the drive media for a partition table or other identification...isn't the conventional space sufficient to cover those cases? Anyway you could just handle the error and sense codes [CHECK CONDITION / ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD when URSWRZ is 0. It would have the same effect as using the zone cache without the possibility of the zone cache being out-of-sync. > I too agree that the write pointer value is not very useful to upper layers > (e.g. FS). What matters most of the times for these layers is an "allocation > pointer" or "submission pointer" which indicates the LBA to use for the next > BIO submission. That LBA will most of the time be in advance of the zone WP > because of request queing in the block scheduler. Which is kind of my point.

> Note that from what we have done already, in many cases, upper layers do not > even need this (e.g. F2FS, btrfs) and work fine without needing *any* > zone->private_data because that "allocation pointer" information generally > already exists in a different form (e.g. a bit in a bitmap). Agreed. Log structured and copy on write file systems are a natural fit for these types of drives > For these cases, not having the RB-tree of zones would force a lot > more code to be added to file systems for SMR support. I don't understand how the zone cache (RB-tree) is "helping" F2FS or btrfs. Certainly any of these could trigger the zone cache to be loaded at mkfs and/or mount time? >>> This all tails into domain responsibility. With the RB-Tree doing half of the >>> work and the ‘responsible’ domain handling the active path via private_data >>> why have the split at all? It seems to be a double work to have second object >>> tracking the first so that I/O scheduling can function. >>> >> Tracking the zone states via RB trees is mainly to handly host-managed >> drives seamlessly. Without an RB tree we will be subjected to I/O errors >> during boot, as eg with new drives we inevitably will try to read beyond >> the write pointer, which in turn will generate I/O errors on certain drives. If the zone cache is needed to boot then clearly my objection to reading in the zone report information on drive attach is unfounded. I was under the impression that this was not the case. >> I do agree that there is no strict need to setup an RB tree for >> host-aware drives; I can easily add an attribute/flag to disable RB >> trees for those. >> However, tracking zone information in the RB tree _dramatically_ speeds >> up device initialisation; issuing a blkdev_discard() for the entire >> drive will take _ages_ without it. > > And the RB-tree will also be very useful for speeding up report zones > ioctls too. Unless we want those to force an update of the RB-tree information ?

That is debatable. I would tend to argue for both. *If* there must be a zone cache then reading from the zone cache is a must. Forcing and update of the zone cache from the media seems like a reasonable thing to be able to do. Also the zone report is 'slow' in that there is an overhead for the report itself but the number of zones per query can be quite large so 4 or 5 I/Os that run into the hundreds if milliseconds to cache the entire drive isn't really unworkable for something that is used infrequently. >>> Finally is the error handling path when the RB-Tree encounters and error it >>> attempts to requery the drive topology virtually guaranteeing that the >>> private_data is now out-of-sync with the RB-Tree. Again this is something >>> that can be better encapsulated in the bi_end_io to be informed of the >>> failed I/O and schedule the appropriate recovery (including re-querying the >>> zone information of the affected zone(s)). >>> >> Well, if we have an RB tree with write pointers of course we need to >> re-read the zone information on failure (and I thought I did that?). >> Plus the error information will be propagated via the usual mechanism, >> so they need to take care of updating any information in ->private_data. >> >> I'm perfectly fine with integrating your patches for the various >> blkdev_XX callouts and associated ioctls; Jens favours that approach, >> too, so we should be combining those efforts. > > Can we agree on an interface ? > Summarizing all the discussions we had, I am all in favor of the following: > > 1) A "zone_size" sysfs attribute to indicate that a drive is zoned: > The already existing device type file can tell us if this is an host > managed disk (type==20) or a host aware disk (type==0). Drive managed > disks could also be detected, but since no information on their zone > configuration can be obtained, lets treat those as regular drives.

Since disk type == 0 for everything that isn't HM so I would prefer the sysfs 'zoned' file just report if the drive is HA or HM. > 2) Add ioctls for zone management: > Report zones (get information from RB tree), reset zone (simple wrapper > to ioctl for block discard), open zone, close zone and finish zone. That > will allow mkfs like tools to get zone information without having to parse > thousands of sysfs files (and can also be integrated in libzbc block backend > driver for a unified interface with the direct SG_IO path for kernels without > the ZBC code enabled). I can add finish zone ... but I really can't think of a use for it, myself. > 3) Try to condense the blkzone data structure to save memory: > I think that we can at the very least remove the zone length, and also > may be the per zone spinlock too (a single spinlock and proper state flags can > be used). I have a variant that is an array of descriptors that roughly mimics the api from blk-zoned.c that I did a few months ago as an example. I should be able to update that to the current kernel + patches. -- Shaun Tancheff ^ permalink raw reply [flat|nested] 26+ messages in thread
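The memory-consumption argument in the message above can be checked with simple arithmetic: ~30000 zones per drive, 120 bytes per RB-tree entry (the reported sizeof(struct blk_zone) on x86_64), versus the proposed packed 8-byte per-zone descriptor in a flat array.

```python
# Back-of-the-envelope check of the figures quoted in the thread. The zone
# count is the thread's estimate for a typical SMR drive; nothing here is
# taken from the patch code itself.

ZONES = 30000
rb_tree_bytes = ZONES * 120   # sizeof(struct blk_zone) == 120 on x86_64
packed_bytes = ZONES * 8      # proposed condensed 8-byte descriptor

print(rb_tree_bytes)                   # 3600000 bytes, ~3.4 MB per drive
print(packed_bytes)                    # 240000 bytes, ~234 KB per drive
print(rb_tree_bytes // packed_bytes)   # 15x reduction
```

At one drive the difference hardly matters; at the 300+ drives per chassis mentioned for data pods, it is roughly 1 GB versus 70 MB of kernel memory, which is the crux of the objection.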
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-08-05 20:35 ` Shaun Tancheff (?) @ 2016-08-09 6:47 ` Hannes Reinecke 2016-08-09 8:09 ` Damien Le Moal ` (2 more replies) -1 siblings, 3 replies; 26+ messages in thread From: Hannes Reinecke @ 2016-08-09 6:47 UTC (permalink / raw) To: Shaun Tancheff, Damien Le Moal Cc: Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman On 08/05/2016 10:35 PM, Shaun Tancheff wrote: > On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote: >> Hannes, Shaun, >> >> Let me add some more comments. >> >>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote: >>> >>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote: >>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote: >>>>> >>>>> Can you please integrate this with Hannes series so that it uses >>>>> his cache of the zone information? >>>> >>>> Adding Hannes and Damien to Cc. >>>> >>>> Christoph, >>>> >>>> I can make a patch the marshal Hannes' RB-Tree into to a block report, that is >>>> quite simple. I can even have the open/close/reset zone commands update the >>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the >>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do >>>> not need the RB-Tree to function with zoned media. > > I have posted patches to integrate with the zone cache, hopefully they > make sense. > [ .. ] >>> I have thought about condensing the RB tree information, but then I >>> figured that for 'real' SMR handling we cannot assume all zones are of >>> fixed size, and hence we need all the information there. >>> Any condensing method would assume a given structure of the zones, which >>> the standard just doesn't provide. >>> Or am I missing something here? 
>> >> Indeed, the standards do not mandate any particular zone configuration, >> constant zone size, etc. So writing code so that can be handled is certainly >> the right way of doing things. However, if we decide to go forward with >> mapping RESET WRITE POINTER command to DISCARD, then at least a constant >> zone size (minus the last zone as you said) must be assumed, and that >> information can be removed from the entries in the RB tree (as it will be >> saved for the sysfs "zone_size" file anyway. Adding a little code to handle >> that eventual last runt zone with a different size is not a big problem. > >>> As for write pointer handling: yes, the write pointer on the zones is >>> not really useful for upper-level usage. >>> Where we do need it is to detect I/O which is crossing the write pointer >>> (eg when doing reads over the entire zone). >>> As per spec you will be getting an I/O error here, so we need to split >>> the I/O on the write pointer to get valid results back. >> >> To be precise here, the I/O splitting will be handled by the block layer >> thanks to the "chunk_sectors" setting. But that relies on a constant zone >> size assumption too. >> >> The RB-tree here is most useful for reads over or after the write pointer as >> this can have different behavior on different drives (URSWRZ bit). The RB-tree >> allows us to hide these differences to upper layers and simplify support at >> those levels. > > It is unfortunate that path was chosen rather than returning Made Up Data. But how would you return made up data without knowing that you _have_ to generate some? Of course it's trivial to generate made up data once an I/O error occurred, but that's too late as the I/O error already happened. The general idea behind this is that I _really_ do want to avoid triggering an I/O error on initial access. This kind of thing really tends to freak out users (connect a new drive and getting I/O errors; not the best user experience ...) 
> However I don't think this is a file system or device mapper problem as > neither of those layers attempt to read blocks they have not written > (or discarded). > All I can think of something that probes the drive media for a partition table > or other identification...isn't the conventional space sufficient to cover > those cases? Upon device scan the kernel will attempt to read the partition tables, which consists of reading the first few megabytes and the last sector (for GPT shadow partition tables). And depending on the driver manufacturer you might or might not have enough conventional space at the right locations. > Anyway you could just handle the error and sense codes [CHECK CONDITION / > ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD > when URSWRZ is 0. It would have the same effect as using the zone cache > without the possibility of the zone cache being out-of-sync. > Yes, but then we would already have registered an I/O error. And as indicated above, I really do _not_ want to handle all the customer calls for supposed faulty new SMR drives. >> I too agree that the write pointer value is not very useful to upper layers >> (e.g. FS). What matters most of the times for these layers is an "allocation >> pointer" or "submission pointer" which indicates the LBA to use for the next >> BIO submission. That LBA will most of the time be in advance of the zone WP >> because of request queing in the block scheduler. > > Which is kind of my point. > >> Note that from what we have done already, in many cases, upper layers do not >> even need this (e.g. F2FS, btrfs) and work fine without needing *any* >> zone->private_data because that "allocation pointer" information generally >> already exists in a different form (e.g. a bit in a bitmap). > > Agreed. 
Log structured and copy on write file systems are a natural fit for > these types of drives > >> For these cases, not having the RB-tree of zones would force a lot >> more code to be added to file systems for SMR support. > > I don't understand how the zone cache (RB-tree) is "helping" F2FS or btrfs. > Certainly any of these could trigger the zone cache to be loaded at mkfs > and/or mount time? > >>>> This all tails into domain responsibility. With the RB-Tree doing half of the >>>> work and the ‘responsible’ domain handling the active path via private_data >>>> why have the split at all? It seems to be a double work to have second object >>>> tracking the first so that I/O scheduling can function. >>>> >>> Tracking the zone states via RB trees is mainly to handly host-managed >>> drives seamlessly. Without an RB tree we will be subjected to I/O errors >>> during boot, as eg with new drives we inevitably will try to read beyond >>> the write pointer, which in turn will generate I/O errors on certain drives. > > If the zone cache is needed to boot then clearly my objection to reading > in the zone report information on drive attach is unfounded. I was under > the impression that this was not the case. > >>> I do agree that there is no strict need to setup an RB tree for >>> host-aware drives; I can easily add an attribute/flag to disable RB >>> trees for those. >>> However, tracking zone information in the RB tree _dramatically_ speeds >>> up device initialisation; issuing a blkdev_discard() for the entire >>> drive will take _ages_ without it. >> >> And the RB-tree will also be very useful for speeding up report zones >> ioctls too. Unless we want those to force an update of the RB-tree information ? > > That is debatable. I would tend to argue for both. *If* there must be > a zone cache then reading from the zone cache is a must. Forcing and > update of the zone cache from the media seems like a reasonable thing > to be able to do. 
> > Also the zone report is 'slow' in that there is an overhead for the > report itself but > the number of zones per query can be quite large so 4 or 5 I/Os that > run into the > hundreds if milliseconds to cache the entire drive isn't really unworkable for > something that is used infrequently. > No, surely not. But one of the _big_ advantages for the RB tree is blkdev_discard(). Without the RB tree any mkfs program will issue a 'discard' for every sector. We will be able to coalesce those into one discard per zone, but we still need to issue one for _every_ zone. Which is (as indicated) really slow, and easily takes several minutes. With the RB tree we can short-circuit discards to empty zones, and speed up processing time dramatically. Sure we could be moving the logic into mkfs and friends, but that would require us to change the programs and agree on a library (libzbc?) which should be handling that. >>>> Finally is the error handling path when the RB-Tree encounters and error it >>>> attempts to requery the drive topology virtually guaranteeing that the >>>> private_data is now out-of-sync with the RB-Tree. Again this is something >>>> that can be better encapsulated in the bi_end_io to be informed of the >>>> failed I/O and schedule the appropriate recovery (including re-querying the >>>> zone information of the affected zone(s)). >>>> >>> Well, if we have an RB tree with write pointers of course we need to >>> re-read the zone information on failure (and I thought I did that?). >>> Plus the error information will be propagated via the usual mechanism, >>> so they need to take care of updating any information in ->private_data. >>> >>> I'm perfectly fine with integrating your patches for the various >>> blkdev_XX callouts and associated ioctls; Jens favours that approach, >>> too, so we should be combining those efforts. >> >> Can we agree on an interface ? 
>> Summarizing all the discussions we had, I am all in favor of the following: >> >> 1) A "zone_size" sysfs attribute to indicate that a drive is zoned: >> The already existing device type file can tell us if this is an host >> managed disk (type==20) or a host aware disk (type==0). Drive managed >> disks could also be detected, but since no information on their zone >> configuration can be obtained, lets treat those as regular drives. > > Since disk type == 0 for everything that isn't HM so I would prefer the > sysfs 'zoned' file just report if the drive is HA or HM. > Okay. So let's put in the 'zoned' attribute the device type: 'host-managed', 'host-aware', or 'device managed'. >> 2) Add ioctls for zone management: >> Report zones (get information from RB tree), reset zone (simple wrapper >> to ioctl for block discard), open zone, close zone and finish zone. That >> will allow mkfs like tools to get zone information without having to parse >> thousands of sysfs files (and can also be integrated in libzbc block backend >> driver for a unified interface with the direct SG_IO path for kernels without >> the ZBC code enabled). > > I can add finish zone ... but I really can't think of a use for it, myself. > Which is not the point. The standard defines this, so clearly someone found it a reasonable addendum. So let's add this for completeness. >> 3) Try to condense the blkzone data structure to save memory: >> I think that we can at the very least remove the zone length, and also >> may be the per zone spinlock too (a single spinlock and proper state flags can >> be used). > > I have a variant that is an array of descriptors that roughly mimics the > api from blk-zoned.c that I did a few months ago as an example. > I should be able to update that to the current kernel + patches. > Okay. If we restrict the in-kernel SMR drive handling to devices with identical zone sizes of course we can remove the zone length. 
And we can do away with the per-zone spinlock and use a global one instead. I will be updating my patchset accordingly. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 26+ messages in thread
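The blkdev_discard() speed-up Hannes argues for above, short-circuiting resets for zones the cache knows are empty, can be sketched as follows. The zone tuples and the reset callback are invented for illustration; the real code walks the RB-tree inside the SCSI disk driver.

```python
# Hedged sketch (not kernel code): with cached write pointers, a
# whole-device reset can skip zones whose write pointer still sits at the
# zone start, instead of issuing one discard per zone unconditionally.

def reset_dirty_zones(zones, reset_wp):
    """zones: iterable of (start_lba, write_pointer) pairs.
    reset_wp: callback issuing a RESET WRITE POINTER for one zone."""
    issued = 0
    for start, wp in zones:
        if wp == start:      # empty zone: nothing to reset, skip the I/O
            continue
        reset_wp(start)
        issued += 1
    return issued

# Three zones of 524288 sectors; only the second holds data, so a
# full-device reset issues a single command instead of three.
zones = [(0, 0), (524288, 528384), (1048576, 1048576)]
print(reset_dirty_zones(zones, lambda lba: None))  # 1
```

On a freshly formatted drive with ~30000 zones, nearly every zone is empty, which is why mkfs-style full-device discards go from minutes to near-instant with the cache.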
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-08-09 6:47 ` Hannes Reinecke @ 2016-08-09 8:09 ` Damien Le Moal 2016-08-10 3:26 ` Shaun Tancheff 2016-08-14 0:09 ` Shaun Tancheff 2 siblings, 0 replies; 26+ messages in thread From: Damien Le Moal @ 2016-08-09 8:09 UTC (permalink / raw) To: Hannes Reinecke Cc: Shaun Tancheff, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman Hannes, > On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote: [...] >>> >>> Can we agree on an interface ? >>> Summarizing all the discussions we had, I am all in favor of the following: >>> >>> 1) A "zone_size" sysfs attribute to indicate that a drive is zoned: >>> The already existing device type file can tell us if this is an host >>> managed disk (type==20) or a host aware disk (type==0). Drive managed >>> disks could also be detected, but since no information on their zone >>> configuration can be obtained, lets treat those as regular drives. >> >> Since disk type == 0 for everything that isn't HM so I would prefer the >> sysfs 'zoned' file just report if the drive is HA or HM. >> > Okay. So let's put in the 'zoned' attribute the device type: > 'host-managed', 'host-aware', or 'device managed'. I hacked your patches and simply put a "0" or "1" in the sysfs zoned file. Any drive that has ZBC/ZAC command support gets a "1", "0" for everything else. This means that drive managed models are not exposed as zoned block devices. For HM vs HA differentiation, an application can look at the device type file since it is already present. We could indeed set the "zoned" file to the device type, but HM drives and regular drives will both have "0" in it, so no differentiation possible.
The other choice could be the "zoned" bits defined by ZBC, but these do not define a value for host managed drives, and the drive managed value being not "0" could be confusing too. So I settled for a simple 0/1 boolean. > >>> 2) Add ioctls for zone management: >>> Report zones (get information from RB tree), reset zone (simple wrapper >>> to ioctl for block discard), open zone, close zone and finish zone. That >>> will allow mkfs like tools to get zone information without having to parse >>> thousands of sysfs files (and can also be integrated in libzbc block backend >>> driver for a unified interface with the direct SG_IO path for kernels without >>> the ZBC code enabled). >> >> I can add finish zone ... but I really can't think of a use for it, myself. >> > Which is not the point. The standard defines this, so clearly someone > found it a reasonable addendum. So let's add this for completeness. Done: I hacked Shaun ioctl code and added finish zone too. The difference with Shaun initial code is that the ioctl are propagated down to the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for BIO request definition for the zone operations. So a lot less code added. The ioctls do not mimic exactly the ZBC standard. For instance, there is no reporting options for report zones, nor is the "all" bit supported for open, close or finish zone commands. But the information provided on zones is complete and maps to the standard definitions. I also added a reset_wp ioctl for completeness, but this one simply calls blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD ioctl call with a range exactly aligned on a zone. > >>> 3) Try to condense the blkzone data structure to save memory: >>> I think that we can at the very least remove the zone length, and also >>> may be the per zone spinlock too (a single spinlock and proper state flags can >>> be used).
>> >> I have a variant that is an array of descriptors that roughly mimics the >> api from blk-zoned.c that I did a few months ago as an example. >> I should be able to update that to the current kernel + patches. >> > Okay. If we restrict the in-kernel SMR drive handling to devices with > identical zone sizes of course we can remove the zone length. > And we can do away with the per-zone spinlock and use a global one instead. Did that too. The blk_zone struct is now exactly 64B. I removed the per zone spinlock and replaced it with a flag so that zones can still be locked independently using wait_on_bit_lock & friends (I added the functions blk_zone_lock, blk_zone_trylock and blk_zone_unlock to do that). This per zone locking comes in handy to implement the ioctls code without stalling the command queue for read, writes and discard on different zones. I however kept the zone length in the structure. The reason for doing so is that this allows supporting drives with non-constant zone sizes, albeit with a more limited interface since in such case the device chunk_sectors is not set (that is how a user or application can detect that the zones are not constant size). For these drives, the block layer may happily merge BIOs across zone boundaries and the discard code will not split and align calls on the zones. But upper layers (an FS or a device mapper) can still do all this by themselves if they want/can support non-constant zone sizes. The only exception is drives like the Seagate one with only the last zone of a different size. This case is handled exactly as if all zones are the same size simply because any operation on the last smaller zone will naturally align as the checks of operation size against the drive capacity will do the right things. The ioctls work for all cases (drive with constant zone size or not). This is again to allow supporting eventual weird drives at application level.
I integrated all these ioctl into libzbc block device backend driver and everything is fine. Can't tell the difference with direct-to-drive SG_IO accesses. But unlike these, the zone ioctls keep the zone information RB-tree cache up to date. > > I will be updating my patchset accordingly. I need to cleanup my code and rebase on top of 4.8-rc1. Let me do this and I will send everything for review. If you have any comment on the above, please let me know and I will be happy to incorporate changes. Best regards. ------------------------ Damien Le Moal, Ph.D. Sr. Manager, System Software Group, HGST Research, HGST, a Western Digital brand Damien.LeMoal@hgst.com (+81) 0466-98-3593 (ext. 513593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.hgst.com Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer: This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-08-09 8:09 ` Damien Le Moal @ 2016-08-10 3:58 ` Shaun Tancheff 0 siblings, 0 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-08-10 3:58 UTC (permalink / raw) To: Damien Le Moal Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote: >> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote: [trim] >>> Since disk type == 0 for everything that isn't HM so I would prefer the >>> sysfs 'zoned' file just report if the drive is HA or HM. >>> >> Okay. So let's put in the 'zoned' attribute the device type: >> 'host-managed', 'host-aware', or 'device managed'. > > I hacked your patches and simply put a "0" or "1" in the sysfs zoned file. > Any drive that has ZBC/ZAC command support gets a "1", "0" for everything > else. This means that drive managed models are not exposed as zoned block > devices. For HM vs HA differentiation, an application can look at the > device type file since it is already present. > > We could indeed set the "zoned" file to the device type, but HM drives and > regular drives will both have "0" in it, so no differentiation possible. > The other choice could be the "zoned" bits defined by ZBC, but these > do not define a value for host managed drives, and the drive managed value > being not "0" could be confusing too. So I settled for a simple 0/1 boolean. This seems good to me. >>>> 2) Add ioctls for zone management: >>>> Report zones (get information from RB tree), reset zone (simple wrapper >>>> to ioctl for block discard), open zone, close zone and finish zone. That >>>> will allow mkfs like tools to get zone information without having to parse >>>> thousands of sysfs files (and can also be integrated in libzbc block backend >>>> driver for a unified interface with the direct SG_IO path for kernels without >>>> the ZBC code enabled). >>> >>> I can add finish zone ... but I really can't think of a use for it, myself. >>> >> Which is not the point. The standard defines this, so clearly someone >> found it a reasonable addendum. So let's add this for completeness. Agreed.
> Done: I hacked Shaun ioctl code and added finish zone too. The > difference with Shaun initial code is that the ioctl are propagated down to > the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for > BIO request definition for the zone operations. So a lot less code added. The purpose of the BIO flags is not to enable the ioctls so much as the other way round. Creating BIO op's is to enable issuing ZBC commands from device mapper targets and file systems without some heinous ioctl hacks. Making the resulting block layer interfaces available via ioctls is just a reasonable way to exercise the code ... or that was my intent. > The ioctls do not mimic exactly the ZBC standard. For instance, there is no > reporting options for report zones, nor is the "all" bit supported for open, > close or finish zone commands. But the information provided on zones is complete > and maps to the standard definitions. For the reporting options I have planned to reuse the stream_id in struct bio when that is formalized. There are certainly other places in struct bio to stuff a few extra bits ... As far as the all bit ... this is being handled by all the zone action commands. Just pass a sector of ~0ul and it's handled in sd.c by sd_setup_zone_action_cmnd(). Apologies as apparently my documentation here is lacking :-(
> I also added a reset_wp ioctl for completeness, but this one simply calls > blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD > ioctl call with a range exactly aligned on a zone. I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that creates a REQ_OP_ZONE_RESET .. same as open and close. My expectation being that BLKDISCARD doesn't really need yet another alias. [trim] > Did that too. The blk_zone struct is now exactly 64B. I removed the per zone Thanks .. being a cache line is harder to whinge about... > spinlock and replaced it with a flag so that zones can still be locked > independently using wait_on_bit_lock & friends (I added the functions > blk_zone_lock, blk_zone_trylock and blk_zone_unlock to do that). This per zone > locking comes in handy to implement the ioctls code without stalling the command > queue for read, writes and discard on different zones. > > I however kept the zone length in the structure. The reason for doing so is that > this allows supporting drives with non-constant zone sizes, albeit with a more > limited interface since in such case the device chunk_sectors is not set (that > is how a user or application can detect that the zones are not constant size). > For these drives, the block layer may happily merge BIOs across zone boundaries > and the discard code will not split and align calls on the zones. But upper layers > (an FS or a device mapper) can still do all this by themselves if they want/can > support non-constant zone sizes. > > The only exception is drives like the Seagate one with only the last zone of a > different size. This case is handled exactly as if all zones are the same size > simply because any operation on the last smaller zone will naturally align as > the checks of operation size against the drive capacity will do the right things.
> > The ioctls work for all cases (drive with constant zone size or not). This is again > to allow supporting eventual weird drives at application level. I integrated all > these ioctl into libzbc block device backend driver and everything is fine. Can't > tell the difference with direct-to-drive SG_IO accesses. But unlike these, the zone > ioctls keep the zone information RB-tree cache up to date. > >> >> I will be updating my patchset accordingly. > > I need to cleanup my code and rebase on top of 4.8-rc1. Let me do this and I will send > everything for review. If you have any comment on the above, please let me know and > I will be happy to incorporate changes. > > Best regards. > > ------------------------ > Damien Le Moal, Ph.D. > Sr. Manager, System Software Group, HGST Research, > HGST, a Western Digital brand > Damien.LeMoal@hgst.com > (+81) 0466-98-3593 (ext. 513593) > 1 kirihara-cho, Fujisawa, > Kanagawa, 252-0888 Japan > www.hgst.com > > Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer: > > This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system. > -- Shaun Tancheff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-08-10 3:58 ` Shaun Tancheff @ 2016-08-10 4:38 ` Damien Le Moal 0 siblings, 0 replies; 26+ messages in thread From: Damien Le Moal @ 2016-08-10 4:38 UTC (permalink / raw) To: Shaun Tancheff Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman Shaun, On 8/10/16 12:58, Shaun Tancheff wrote: > On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote: >>> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote: > > [trim] > >>>> Since disk type == 0 for everything that isn't HM so I would prefer the >>>> sysfs 'zoned' file just report if the drive is HA or HM. >>>> >>> Okay. So let's put in the 'zoned' attribute the device type: >>> 'host-managed', 'host-aware', or 'device managed'. >> >> I hacked your patches and simply put a "0" or "1" in the sysfs zoned file. >> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything >> else. This means that drive managed models are not exposed as zoned block >> devices. For HM vs HA differentiation, an application can look at the >> device type file since it is already present. >> >> We could indeed set the "zoned" file to the device type, but HM drives and >> regular drives will both have "0" in it, so no differentiation possible. >> The other choice could be the "zoned" bits defined by ZBC, but these >> do not define a value for host managed drives, and the drive managed value >> being not "0" could be confusing too. So I settled for a simple 0/1 boolean. > > This seems good to me. Another option I forgot is for the "zoned" file to indicate the total number of zones of the device, and 0 for a non zoned regular block device. That would work as well. [...]
>> Done: I hacked Shaun ioctl code and added finish zone too. The >> difference with Shaun initial code is that the ioctl are propagated down to >> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for >> BIO request definition for the zone operations. So a lot less code added. > > The purpose of the BIO flags is not to enable the ioctls so much as > the other way round. Creating BIO op's is to enable issuing ZBC > commands from device mapper targets and file systems without some > heinous ioctl hacks. > Making the resulting block layer interfaces available via ioctls is just a > reasonable way to exercise the code ... or that was my intent. Yes, I understood your code. However, since (or if) we keep the zone information in the RB-tree cache, there is no need for the report zone operation BIO interface. Same for reset write pointer by keeping the mapping to discard. blk_lookup_zone can be used in kernel as a report zone BIO replacement and works as well for the report zone ioctl implementation. For reset, there is blkdev_issue_discard in kernel, and the reset zone ioctl becomes equivalent to BLKDISCARD ioctl. These are simple. Open, close and finish zone remains. For these, adding the BIO interface seemed an overkill. Hence my choice of propagating the ioctl to the driver. This is debatable of course, and adding an in-kernel interface is not hard: we can implement blk_open_zone, blk_close_zone and blk_finish_zone using __blkdev_driver_ioctl. That looks clean to me. Overall, my concern with the BIO based interface for the ZBC commands is that it adds one flag for each command, which is not really the philosophy of the interface and potentially opens the door for more such implementations in the future with new standards and new commands coming up. Clearly that is not a sustainable path. So I think that a more specific interface for these zone operations is a better choice. That is consistent with what happens with the tons of ATA and SCSI commands not actually doing data I/Os (mode sense, log pages, SMART, etc). All these do not use BIOs and are processed as request REQ_TYPE_BLOCK_PC.
>> The ioctls do not mimic exactly the ZBC standard. For instance, there is no >> reporting options for report zones, nor is the "all" bit supported for open, >> close or finish zone commands. But the information provided on zones is complete >> and maps to the standard definitions. > > For the reporting options I have planned to reuse the stream_id in > struct bio when that is formalized. There are certainly other places in > struct bio to stuff a few extra bits ... We could add reporting options to blk_lookup_zones to filter the result and use that in the ioctl implementation as well. This can be added without any problem. > As far as the all bit ... this is being handled by all the zone action > commands. Just pass a sector of ~0ul and it's handled in sd.c by > sd_setup_zone_action_cmnd(). > > Apologies as apparently my documentation here is lacking :-( Yes, I got it (libzbc does the same actually). I did not add it for simplicity. But indeed may be it should be. The same trick can be used with the ioctl to driver interface. No problems.
>> I also added a reset_wp ioctl for completeness, but this one simply calls >> blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD >> ioctl call with a range exactly aligned on a zone. > > I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that > creates a REQ_OP_ZONE_RESET .. same as open and close. My > expectation being that BLKDISCARD doesn't really need yet another alias. Yes, we could remove the BLKRESETZONE ioctl and have applications use the BLKDISCARD ioctl instead. In the kernel, both generate blkdev_issue_discard calls and are thus identical. Only the ioctl interface differs (sector vs range argument). Again, I added it to have the full set of 5 ZBC/ZAC operations mapping to one BLKxxxZONE ioctl
bC4gCkJ1dCBncmFudGVkLCByZXNldCBpcyBvcHRpb25hbC4KCkJlc3QgcmVnYXJkcy4KCi0tIApE YW1pZW4gTGUgTW9hbCwgUGguRC4KU3IuIE1hbmFnZXIsIFN5c3RlbSBTb2Z0d2FyZSBHcm91cCwg SEdTVCBSZXNlYXJjaCwKSEdTVCwgYSBXZXN0ZXJuIERpZ2l0YWwgYnJhbmQKRGFtaWVuLkxlTW9h bEBoZ3N0LmNvbQooKzgxKSAwNDY2LTk4LTM1OTMgKGV4dC4gNTEzNTkzKQoxIGtpcmloYXJhLWNo bywgRnVqaXNhd2EsCkthbmFnYXdhLCAyNTItMDg4OCBKYXBhbgp3d3cuaGdzdC5jb20KV2VzdGVy biBEaWdpdGFsIENvcnBvcmF0aW9uIChhbmQgaXRzIHN1YnNpZGlhcmllcykgRS1tYWlsIENvbmZp ZGVudGlhbGl0eSBOb3RpY2UgJiBEaXNjbGFpbWVyOgoKVGhpcyBlLW1haWwgYW5kIGFueSBmaWxl cyB0cmFuc21pdHRlZCB3aXRoIGl0IG1heSBjb250YWluIGNvbmZpZGVudGlhbCBvciBsZWdhbGx5 IHByaXZpbGVnZWQgaW5mb3JtYXRpb24gb2YgV0RDIGFuZC9vciBpdHMgYWZmaWxpYXRlcywgYW5k IGFyZSBpbnRlbmRlZCBzb2xlbHkgZm9yIHRoZSB1c2Ugb2YgdGhlIGluZGl2aWR1YWwgb3IgZW50 aXR5IHRvIHdoaWNoIHRoZXkgYXJlIGFkZHJlc3NlZC4gSWYgeW91IGFyZSBub3QgdGhlIGludGVu ZGVkIHJlY2lwaWVudCwgYW55IGRpc2Nsb3N1cmUsIGNvcHlpbmcsIGRpc3RyaWJ1dGlvbiBvciBh bnkgYWN0aW9uIHRha2VuIG9yIG9taXR0ZWQgdG8gYmUgdGFrZW4gaW4gcmVsaWFuY2Ugb24gaXQs IGlzIHByb2hpYml0ZWQuIElmIHlvdSBoYXZlIHJlY2VpdmVkIHRoaXMgZS1tYWlsIGluIGVycm9y LCBwbGVhc2Ugbm90aWZ5IHRoZSBzZW5kZXIgaW1tZWRpYXRlbHkgYW5kIGRlbGV0ZSB0aGUgZS1t YWlsIGluIGl0cyBlbnRpcmV0eSBmcm9tIHlvdXIgc3lzdGVtLgo= ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands @ 2016-08-10 4:38 ` Damien Le Moal 0 siblings, 0 replies; 26+ messages in thread From: Damien Le Moal @ 2016-08-10 4:38 UTC (permalink / raw) To: Shaun Tancheff Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman Shaun, On 8/10/16 12:58, Shaun Tancheff wrote: > On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote: >>> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote: > > [trim] > >>>> Since disk type == 0 for everything that isn't HM so I would prefer the >>>> sysfs 'zoned' file just report if the drive is HA or HM. >>>> >>> Okay. So let's put in the 'zoned' attribute the device type: >>> 'host-managed', 'host-aware', or 'device managed'. >> >> I hacked your patches and simply put a "0" or "1" in the sysfs zoned file. >> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything >> else. This means that drive managed models are not exposed as zoned block >> devices. For HM vs HA differentiation, an application can look at the >> device type file since it is already present. >> >> We could indeed set the "zoned" file to the device type, but HM drives and >> regular drives will both have "0" in it, so no differentiation possible. >> The other choice could be the "zoned" bits defined by ZBC, but these >> do not define a value for host managed drives, and the drive managed value >> being not "0" could be confusing too. So I settled for a simple 0/1 boolean. > > This seems good to me. Another option I forgot is for the "zoned" file to indicate the total number of zones of the device, and 0 for a non zoned regular block device. That would work as well. [...] >> Done: I hacked Shaun ioctl code and added finish zone too. 
The >> difference with Shaun's initial code is that the ioctls are propagated down to >> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for >> a BIO request definition for the zone operations. So a lot less code added. > The purpose of the BIO flags is not to enable the ioctls so much as > the other way round. Creating BIO op's is to enable issuing ZBC > commands from device mapper targets and file systems without some > heinous ioctl hacks. > Making the resulting block layer interfaces available via ioctls is just a > reasonable way to exercise the code ... or that was my intent. Yes, I understood your code. However, since (or if) we keep the zone information in the RB-tree cache, there is no need for the report zone operation BIO interface. Same for reset write pointer by keeping the mapping to discard. blk_lookup_zone can be used in kernel as a report zone BIO replacement and works as well for the report zone ioctl implementation. For reset, there is blkdev_issue_discard in kernel, and the reset zone ioctl becomes equivalent to the BLKDISCARD ioctl. These are simple. Open, close and finish zone remain. For these, adding the BIO interface seemed overkill. Hence my choice of propagating the ioctl to the driver. This is debatable of course, and adding an in-kernel interface is not hard: we can implement blk_open_zone, blk_close_zone and blk_finish_zone using __blkdev_driver_ioctl. That looks clean to me. Overall, my concern with the BIO based interface for the ZBC commands is that it adds one flag for each command, which is not really the philosophy of the interface and potentially opens the door for more such implementations in the future with new standards and new commands coming up. Clearly that is not a sustainable path. So I think that a more specific interface for these zone operations is a better choice. 
That is consistent with what happens with the tons of ATA and SCSI commands not actually doing data I/Os (mode sense, log pages, SMART, etc). All these do not use BIOs and are processed as REQ_TYPE_BLOCK_PC requests. >> The ioctls do not mimic exactly the ZBC standard. For instance, there are no >> reporting options for report zones, nor is the "all" bit supported for open, >> close or finish zone commands. But the information provided on zones is complete >> and maps to the standard definitions. > For the reporting options I have planned to reuse the stream_id in > struct bio when that is formalized. There are certainly other places in > struct bio to stuff a few extra bits ... We could add reporting options to blk_lookup_zones to filter the result and use that in the ioctl implementation as well. This can be added without any problem. > As far as the all bit ... this is being handled by all the zone action > commands. Just pass a sector of ~0ul and it's handled in sd.c by > sd_setup_zone_action_cmnd(). > > Apologies as apparently my documentation here is lacking :-( Yes, I got it (libzbc does the same actually). I did not add it for simplicity. But indeed maybe it should be. The same trick can be used with the ioctl to driver interface. No problems. >> I also added a reset_wp ioctl for completeness, but this one simply calls >> blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD >> ioctl call with a range exactly aligned on a zone. > I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that > creates a REQ_OP_ZONE_RESET .. same as open and close. My > expectation being that BLKDISCARD doesn't really need yet another alias. Yes, we could remove the BLKRESETZONE ioctl and have applications use the BLKDISCARD ioctl instead. In the kernel, both generate blkdev_issue_discard calls and are thus identical. Only the ioctl interface differs (sector vs range argument). 
Again, I added it to have the full set of 5 ZBC/ZAC operations mapping to one BLKxxxZONE ioctl. But granted, reset is optional. Best regards. -- Damien Le Moal, Ph.D. Sr. Manager, System Software Group, HGST Research, HGST, a Western Digital brand Damien.LeMoal@hgst.com (+81) 0466-98-3593 (ext. 513593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.hgst.com ^ permalink raw reply [flat|nested] 26+ messages in thread
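The "range exactly aligned on a zone" equivalence discussed above is easy to get wrong from userspace. A minimal sketch of the alignment arithmetic, assuming a fixed zone size in sectors; the helper name and the fixed-size assumption are illustrative, not from the patch set:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clamp an arbitrary sector range to whole zones so
 * that a BLKDISCARD covering it is equivalent to resetting those zones'
 * write pointers. Assumes all zones share one size (zone_size sectors). */
static int zone_aligned_range(uint64_t sector, uint64_t nr_sects,
                              uint64_t zone_size,
                              uint64_t *out_start, uint64_t *out_len)
{
    uint64_t first = (sector + zone_size - 1) / zone_size; /* round up   */
    uint64_t last  = (sector + nr_sects) / zone_size;      /* round down */

    if (last <= first)
        return -1; /* range does not cover a single whole zone */
    *out_start = first * zone_size;
    *out_len   = (last - first) * zone_size;
    return 0;
}
```

With zone_size = 256 sectors, a request covering sectors 10..1023 clamps to the whole zones spanning 256..1023; a range smaller than one full zone is rejected rather than silently resetting a partially covered zone.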
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-08-10 4:38 ` Damien Le Moal (?) @ 2016-08-16 6:07 ` Shaun Tancheff -1 siblings, 0 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-08-16 6:07 UTC (permalink / raw) To: Damien Le Moal Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman On Tue, Aug 9, 2016 at 11:38 PM, Damien Le Moal <damien.lemoal@hgst.com> wrote: > Shaun, > > On 8/10/16 12:58, Shaun Tancheff wrote: >> >> On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal <Damien.LeMoal@hgst.com> >> wrote: >>>> >>>> On Aug 9, 2016, at 15:47, Hannes Reinecke <hare@suse.de> wrote: >> >> >> [trim] >> >>>>> Since disk type == 0 for everything that isn't HM so I would prefer the >>>>> sysfs 'zoned' file just report if the drive is HA or HM. >>>>> >>>> Okay. So let's put in the 'zoned' attribute the device type: >>>> 'host-managed', 'host-aware', or 'device managed'. >>> >>> >>> I hacked your patches and simply put a "0" or "1" in the sysfs zoned >>> file. >>> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything >>> else. This means that drive managed models are not exposed as zoned block >>> devices. For HM vs HA differentiation, an application can look at the >>> device type file since it is already present. >>> >>> We could indeed set the "zoned" file to the device type, but HM drives >>> and >>> regular drives will both have "0" in it, so no differentiation possible. >>> The other choice could be the "zoned" bits defined by ZBC, but these >>> do not define a value for host managed drives, and the drive managed >>> value >>> being not "0" could be confusing too. So I settled for a simple 0/1 >>> boolean. >> >> >> This seems good to me. 
> > Another option I forgot is for the "zoned" file to indicate the total number > of zones of the device, and 0 for a non zoned regular block device. That > would work as well. Clearly either is sufficient. > [...] >>> >>> Done: I hacked Shaun's ioctl code and added finish zone too. The >>> difference with Shaun's initial code is that the ioctls are propagated down >>> to >>> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need >>> for >>> a BIO request definition for the zone operations. So a lot less code added. >> >> >> The purpose of the BIO flags is not to enable the ioctls so much as >> the other way round. Creating BIO op's is to enable issuing ZBC >> commands from device mapper targets and file systems without some >> heinous ioctl hacks. >> Making the resulting block layer interfaces available via ioctls is just a >> reasonable way to exercise the code ... or that was my intent. > > > Yes, I understood your code. However, since (or if) we keep the zone > information in the RB-tree cache, there is no need for the report zone > operation BIO interface. Same for reset write pointer by keeping the mapping > to discard. blk_lookup_zone can be used in kernel as a report zone BIO > replacement and works as well for the report zone ioctl implementation. For > reset, there is blkdev_issue_discard in kernel, and the reset zone ioctl > becomes equivalent to the BLKDISCARD ioctl. These are simple. Open, close and > finish zone remain. For these, adding the BIO interface seemed overkill. > Hence my choice of propagating the ioctl to the driver. > This is debatable of course, and adding an in-kernel interface is not hard: > we can implement blk_open_zone, blk_close_zone and blk_finish_zone using > __blkdev_driver_ioctl. That looks clean to me. Uh. I would call that "heinous" ioctl hacks myself. Kernel -> User API -> Kernel is not really a good design, IMO. 
> Overall, my concern with the BIO based interface for the ZBC commands is > that it adds one flag for each command, which is not really the philosophy > of the interface and potentially opens the door for more such > implementations in the future with new standards and new commands coming up. > Clearly that is not a sustainable path. So I think that a more specific > interface for these zone operations is a better choice. That is consistent > with what happens with the tons of ATA and SCSI commands not actually doing > data I/Os (mode sense, log pages, SMART, etc). All these do not use BIOs and > are processed as REQ_TYPE_BLOCK_PC requests. Part of the reason for following on Mike Christie's bio op/flags cleanup was to make these op's. The advantage of being added as ops is that there is only 1 extra bit needed (not 4 or 5 bits for flags). The other reason for being promoted into the block layer as commands is because it seems to me to make sense that these abstractions could be allowed to be passed through a DM layer and be handled by a file system. >>> The ioctls do not mimic exactly the ZBC standard. For instance, there are >>> no >>> reporting options for report zones, nor is the "all" bit supported for >>> open, >>> close or finish zone commands. But the information provided on zones is >>> complete >>> and maps to the standard definitions. >> >> >> For the reporting options I have planned to reuse the stream_id in >> struct bio when that is formalized. There are certainly other places in >> struct bio to stuff a few extra bits ... Sorry I was confused here. I was under the impression you were talking about one of my patches when you seem to have been talking about your hacking thereof. > We could add reporting options to blk_lookup_zones to filter the result and > use that in the ioctl implementation as well. This can be added without any > problem. > >> As far as the all bit ... this is being handled by all the zone action >> commands. 
Just pass a sector of ~0ul and it's handled in sd.c by >> sd_setup_zone_action_cmnd(). >> >> Apologies as apparently my documentation here is lacking :-( > > > Yes, I got it (libzbc does the same actually). I did not add it for > simplicity. But indeed maybe it should be. The same trick can be used with > the ioctl to driver interface. No problems. > >>> I also added a reset_wp ioctl for completeness, but this one simply calls >>> blkdev_issue_discard internally, so it is in fact equivalent to the >>> BLKDISCARD >>> ioctl call with a range exactly aligned on a zone. >> >> >> I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that >> creates a REQ_OP_ZONE_RESET .. same as open and close. My >> expectation being that BLKDISCARD doesn't really need yet another alias. Same as above. > Yes, we could remove the BLKRESETZONE ioctl and have applications use the > BLKDISCARD ioctl instead. In the kernel, both generate blkdev_issue_discard > calls and are thus identical. Only the ioctl interface differs (sector vs > range argument). Again, I added it to have the full set of 5 ZBC/ZAC > operations mapping to one BLKxxxZONE ioctl. But granted, reset is optional. Which is the opposite of what I stated above: I do _not_ think reset is optional. I think that if the user wants discard they should call discard and if they want reset wp they *can* call reset wp and have it behave as expected. Why would I add a new ioctl that just calls discard? That makes no sense to me. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-08-09 6:47 ` Hannes Reinecke 2016-08-09 8:09 ` Damien Le Moal @ 2016-08-10 3:26 ` Shaun Tancheff 2016-08-14 0:09 ` Shaun Tancheff 2 siblings, 0 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-08-10 3:26 UTC (permalink / raw) To: Hannes Reinecke Cc: Damien Le Moal, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke <hare@suse.de> wrote: > On 08/05/2016 10:35 PM, Shaun Tancheff wrote: >> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote: >>>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote: >>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote: >>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote: [trim] >> Also the zone report is 'slow' in that there is an overhead for the >> report itself but >> the number of zones per query can be quite large so 4 or 5 I/Os that >> run into the >> hundreds of milliseconds to cache the entire drive isn't really unworkable for >> something that is used infrequently. >> > No, surely not. > But one of the _big_ advantages for the RB tree is blkdev_discard(). > Without the RB tree any mkfs program will issue a 'discard' for every > sector. We will be able to coalesce those into one discard per zone, but > we still need to issue one for _every_ zone. > Which is (as indicated) really slow, and easily takes several minutes. > With the RB tree we can short-circuit discards to empty zones, and speed > up processing time dramatically. > Sure we could be moving the logic into mkfs and friends, but that would > require us to change the programs and agree on a library (libzbc?) which > should be handling that. Adding an additional library dependency seems overkill for a program that is already doing ioctls and raw block I/O ... 
but I would leave that up to each file system. As it sits, issuing the ioctl and walking the array of data returned [see blkreport.c] is already quite trivial. I believe the goal here is for F2FS (and perhaps NILFS?) to "just work" with the DISCARD to Reset WP and zone cache in place. Still quite skeptical about other common file systems "just working" without their respective mkfs et al. being zone aware and handling the topology of the media at mkfs time. Perhaps there is something I am unaware of? [trim] >> I can add finish zone ... but I really can't think of a use for it, myself. >> > Which is not the point. The standard defines this, so clearly someone > found it a reasonable addendum. So let's add this for completeness. Agreed and queued for the next version. Regards, Shaun ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-08-09 6:47 ` Hannes Reinecke @ 2016-08-14 0:09 ` Shaun Tancheff 2016-08-10 3:26 ` Shaun Tancheff 2016-08-14 0:09 ` Shaun Tancheff 2 siblings, 0 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-08-14 0:09 UTC (permalink / raw) To: Hannes Reinecke Cc: Damien Le Moal, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke <hare@suse.de> wrote: > On 08/05/2016 10:35 PM, Shaun Tancheff wrote: >> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote: >>> Hannes, Shaun, >>> >>> Let me add some more comments. >>> >>>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote: >>>> >>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote: >>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote: >>>>>> >>>>>> Can you please integrate this with Hannes series so that it uses >>>>>> his cache of the zone information? >>>>> >>>>> Adding Hannes and Damien to Cc. >>>>> >>>>> Christoph, >>>>> >>>>> I can make a patch to marshal Hannes' RB-Tree into a block report, that is >>>>> quite simple. I can even have the open/close/reset zone commands update the >>>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the >>>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do >>>>> not need the RB-Tree to function with zoned media. >> >> I have posted patches to integrate with the zone cache, hopefully they >> make sense. >> > [ .. ] >>>> I have thought about condensing the RB tree information, but then I >>>> figured that for 'real' SMR handling we cannot assume all zones are of >>>> fixed size, and hence we need all the information there. 
>>>> Any condensing method would assume a given structure of the zones, which >>>> the standard just doesn't provide. >>>> Or am I missing something here? Of course you can condense the zone cache without losing any information. Here is the layout I used ... I haven't updated the patch to the latest posted patches but this is the basic idea. [It was originally done as a follow on of making your zone cache work with Seagate's HA drive. I did not include the wp-in-arrays patch along with the HA drive support that I sent you in May as you were quite terse about RB trees when I tried to discuss this approach with you at Vault]

struct blk_zone {
	unsigned type:4;
	unsigned state:5;
	unsigned extra:7;
	unsigned wp:40;
	void *private_data;
};

struct contiguous_wps {
	u64 start_lba;
	u64 last_lba;		/* or # of blocks */
	u64 zone_size;		/* size in blocks */
	unsigned is_zoned:1;
	u32 zone_count;
	spinlock_t lock;
	struct blk_zone zones[0];
};

struct zone_wps {
	u32 wps_count;
	struct contiguous_wps **wps;
};

Then in struct request_queue:

-	struct rb_root zones;
+	struct zone_wps *zones;

For each contiguous chunk of zones you need a descriptor. In the current drives you need 1 or 2 descriptors. Here a conventional drive is encapsulated as zoned media with one drive sized conventional zone. I have not spent time building an ad-hoc LVM comprised of zoned and conventional media so it's not all ironed out yet. I think you can see the advantage of being able to put conventional space anywhere you would like to work around zoned media not being laid out in the best manner for your setup. Yes things start to break down if every other zone is a different size .. The point being that even with zone descriptors on the order of 48 bytes in size this saves a lot of space with no loss of information. I still kind of prefer pushing blk_zone down to a u32 by reducing the max zone size and dropping the private_data ... but that may be going a bit too far. 
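The zone index math in this layout is simple enough to check in userspace. A small model, keeping the field names from the sketch above but dropping the kernel-only pieces (spinlock, private_data, the flexible zones[] array); the helper name lookup_zone is illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of one descriptor per contiguous run of equal-sized
 * zones, as in the condensed layout described in the mail. */
struct contiguous_wps {
    uint64_t start_lba;
    uint64_t last_lba;
    uint64_t zone_size;   /* in sectors */
    uint32_t zone_count;
};

/* Map a sector to its (descriptor, zone index) pair by scanning the
 * descriptor array, the same walk blk_lookup_zone performs. */
static int lookup_zone(const struct contiguous_wps *wps, size_t n,
                       uint64_t sector, size_t *desc, uint64_t *index)
{
    for (size_t i = 0; i < n; i++) {
        if (sector >= wps[i].start_lba && sector < wps[i].last_lba) {
            *desc = i;
            *index = (sector - wps[i].start_lba) / wps[i].zone_size;
            return 0;
        }
    }
    return -1; /* sector beyond all described zones */
}
```

Two descriptors cover the common case: one run of conventional zones followed by one run of equal-sized sequential zones.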
blk_lookup_zone then has an [unfortunate] signature change:

/**
 * blk_lookup_zone() - Lookup zones
 * @q: Request Queue
 * @sector: Location to lookup
 * @start: Starting location zone (OUT: Required)
 * @len: Length of zone (OUT: Required)
 * @lock: Spinlock of zones (OUT: Required)
 */
struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector,
				 sector_t *start, sector_t *len,
				 spinlock_t **lock)
{
	int iter;
	struct blk_zone *bzone = NULL;
	struct zone_wps *zi = q->zones;

	*start = 0;
	*len = 0;
	*lock = NULL;
	if (!q->zones)
		goto out;

	for (iter = 0; iter < zi->wps_count; iter++) {
		if (sector >= zi->wps[iter]->start_lba &&
		    sector < zi->wps[iter]->last_lba) {
			struct contiguous_wps *wp = zi->wps[iter];
			u64 index = (sector - wp->start_lba) / wp->zone_size;

			BUG_ON(index >= wp->zone_count);
			bzone = &wp->zones[index];
			*len = wp->zone_size;
			*start = wp->start_lba + (index * wp->zone_size);
			*lock = &wp->lock;
		}
	}
out:
	return bzone;
}

>>> Indeed, the standards do not mandate any particular zone configuration, >>> constant zone size, etc. So writing code so that this can be handled is certainly >>> the right way of doing things. However, if we decide to go forward with >>> mapping RESET WRITE POINTER command to DISCARD, then at least a constant >>> zone size (minus the last zone as you said) must be assumed, and that >>> information can be removed from the entries in the RB tree (as it will be >>> saved for the sysfs "zone_size" file anyway). Adding a little code to handle >>> that eventual last runt zone with a different size is not a big problem. >>>> As for write pointer handling: yes, the write pointer on the zones is >>>> not really useful for upper-level usage. >>>> Where we do need it is to detect I/O which is crossing the write pointer >>>> (eg when doing reads over the entire zone). >>>> As per spec you will be getting an I/O error here, so we need to split >>>> the I/O on the write pointer to get valid results back. 
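The split at the write pointer that Hannes describes is pure range arithmetic once the WP is known. A minimal userspace sketch (the helper name is hypothetical and the zero-fill policy for the above-WP part is one possible choice, not what the patch set implements):

```c
#include <assert.h>
#include <stdint.h>

/* Split a read crossing a zone's write pointer: the part below the WP is
 * issued to the drive, the remainder is filled with synthesized (made-up)
 * data, hiding the URSWRZ=0 read-error behavior from upper layers.
 * All values are sector counts; wp is an absolute sector. */
static void split_at_wp(uint64_t sector, uint64_t nr, uint64_t wp,
                        uint64_t *read_nr, uint64_t *fill_nr)
{
    if (sector >= wp) {             /* entirely above WP: all made-up data */
        *read_nr = 0;
        *fill_nr = nr;
    } else if (sector + nr <= wp) { /* entirely below WP: normal read */
        *read_nr = nr;
        *fill_nr = 0;
    } else {                        /* crosses WP: split into two parts */
        *read_nr = wp - sector;
        *fill_nr = nr - *read_nr;
    }
}
```

For example, a 50-sector read starting at sector 100 against a WP at 120 becomes a 20-sector real read plus 30 synthesized sectors.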
>>> To be precise here, the I/O splitting will be handled by the block layer >>> thanks to the "chunk_sectors" setting. But that relies on a constant zone >>> size assumption too. >>> >>> The RB-tree here is most useful for reads over or after the write pointer as >>> this can have different behavior on different drives (URSWRZ bit). The RB-tree >>> allows us to hide these differences to upper layers and simplify support at >>> those levels. >> >> It is unfortunate that path was chosen rather than returning Made Up Data. > But how would you return made up data without knowing that you _have_ to > generate some? > Of course it's trivial to generate made up data once an I/O error > occurred, but that's too late as the I/O error already happened. > > The general idea behind this is that I _really_ do want to avoid > triggering an I/O error on initial access. This kind of thing really > tends to freak out users (connect a new drive and getting I/O errors; > not the best user experience ...) > >> However I don't think this is a file system or device mapper problem as >> neither of those layers attempt to read blocks they have not written >> (or discarded). >> All I can think of is something that probes the drive media for a partition table >> or other identification... isn't the conventional space sufficient to cover >> those cases? > Upon device scan the kernel will attempt to read the partition tables, > which consists of reading the first few megabytes and the last sector > (for GPT shadow partition tables). > And depending on the drive manufacturer you might or might not have > enough conventional space at the right locations. > >> Anyway you could just handle the error and sense codes [CHECK CONDITION / >> ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD >> when URSWRZ is 0. It would have the same effect as using the zone cache >> without the possibility of the zone cache being out-of-sync. 
>> > Yes, but then we would already have registered an I/O error. > And as indicated above, I really do _not_ want to handle all the > customer calls for supposed faulty new SMR drives. Fair enough. I just really want to be sure that all this memory locked to the zone cache is really being made useful. >>> I too agree that the write pointer value is not very useful to upper layers >>> (e.g. FS). What matters most of the time for these layers is an "allocation >>> pointer" or "submission pointer" which indicates the LBA to use for the next >>> BIO submission. That LBA will most of the time be in advance of the zone WP >>> because of request queuing in the block scheduler. >> >> Which is kind of my point. >> >>> Note that from what we have done already, in many cases, upper layers do not >>> even need this (e.g. F2FS, btrfs) and work fine without needing *any* >>> zone->private_data because that "allocation pointer" information generally >>> already exists in a different form (e.g. a bit in a bitmap). >> >> Agreed. Log structured and copy on write file systems are a natural fit for >> these types of drives. >> >>> For these cases, not having the RB-tree of zones would force a lot >>> more code to be added to file systems for SMR support. >> >> I don't understand how the zone cache (RB-tree) is "helping" F2FS or btrfs. >> Certainly any of these could trigger the zone cache to be loaded at mkfs >> and/or mount time? >> >>>>> This all tails into domain responsibility. With the RB-Tree doing half of the >>>>> work and the 'responsible' domain handling the active path via private_data >>>>> why have the split at all? It seems to be double work to have a second object >>>>> tracking the first so that I/O scheduling can function. >>>>> >>>> Tracking the zone states via RB trees is mainly to handle host-managed >>>> drives seamlessly. 
Without an RB tree we will be subjected to I/O errors >>>> during boot, as eg with new drives we inevitably will try to read beyond >>>> the write pointer, which in turn will generate I/O errors on certain drives. >> >> If the zone cache is needed to boot then clearly my objection to reading >> in the zone report information on drive attach is unfounded. I was under >> the impression that this was not the case. >> >>>> I do agree that there is no strict need to setup an RB tree for >>>> host-aware drives; I can easily add an attribute/flag to disable RB >>>> trees for those. >>>> However, tracking zone information in the RB tree _dramatically_ speeds >>>> up device initialisation; issuing a blkdev_discard() for the entire >>>> drive will take _ages_ without it. >>> >>> And the RB-tree will also be very useful for speeding up report zones >>> ioctls too. Unless we want those to force an update of the RB-tree information ? >> >> That is debatable. I would tend to argue for both. *If* there must be >> a zone cache then reading from the zone cache is a must. Forcing an >> update of the zone cache from the media seems like a reasonable thing >> to be able to do. >> >> Also the zone report is 'slow' in that there is an overhead for the >> report itself but >> the number of zones per query can be quite large so 4 or 5 I/Os that >> run into the >> hundreds of milliseconds to cache the entire drive isn't really unworkable for >> something that is used infrequently. >> > No, surely not. > But one of the _big_ advantages for the RB tree is blkdev_discard(). > Without the RB tree any mkfs program will issue a 'discard' for every > sector. We will be able to coalesce those into one discard per zone, but > we still need to issue one for _every_ zone. How can you make coalesce work transparently in the sd layer _without_ keeping some sort of a discard cache along with the zone cache? 
Currently the block layer's blkdev_issue_discard() is breaking large discard's into nice granular and aligned chunks but it is not preventing small discards nor coalescing them. In the sd layer would there be way to persist or purge an overly large discard cache? What about honoring discard_zeroes_data? Once the discard is completed with discard_zeroes_data you have to return zeroes whenever a discarded sector is read. Isn't that a log more than just tracking a write pointer? Couldn't a zone have dozens of holes? > Which is (as indicated) really slow, and easily takes several minutes. > With the RB tree we can short-circuit discards to empty zones, and speed > up processing time dramatically. > Sure we could be moving the logic into mkfs and friends, but that would > require us to change the programs and agree on a library (libzbc?) which > should be handling that. F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ... so I'm not sure your argument is valid here. [..] >>> 3) Try to condense the blkzone data structure to save memory: >>> I think that we can at the very least remove the zone length, and also >>> may be the per zone spinlock too (a single spinlock and proper state fl= ags can >>> be used). >> >> I have a variant that is an array of descriptors that roughly mimics the >> api from blk-zoned.c that I did a few months ago as an example. >> I should be able to update that to the current kernel + patches. >> > Okay. If we restrict the in-kernel SMR drive handling to devices with > identical zone sizes of course we can remove the zone length. > And we can do away with the per-zone spinlock and use a global one instea= d. I don't think dropping the zone length is a reasonable thing to do. What I propose is an array of _descriptors_ it doesn't drop _any_ of the zone information that you are holding in an RB tree, it is just a condensed format that _mostly_ plugs into your existing API. 
-- Shaun Tancheff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands @ 2016-08-14 0:09 ` Shaun Tancheff 0 siblings, 0 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-08-14 0:09 UTC (permalink / raw) To: Hannes Reinecke Cc: Damien Le Moal, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke <hare@suse.de> wrote: > On 08/05/2016 10:35 PM, Shaun Tancheff wrote: >> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal <Damien.LeMoal@hgst.com> wrote: >>> Hannes, Shaun, >>> >>> Let me add some more comments. >>> >>>> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@suse.de> wrote: >>>> >>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote: >>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@lst.de> wrote: >>>>>> >>>>>> Can you please integrate this with Hannes' series so that it uses >>>>>> his cache of the zone information? >>>>> >>>>> Adding Hannes and Damien to Cc. >>>>> >>>>> Christoph, >>>>> >>>>> I can make a patch that marshals Hannes' RB-Tree into a block report, that is >>>>> quite simple. I can even have the open/close/reset zone commands update the >>>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the >>>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do >>>>> not need the RB-Tree to function with zoned media. >> >> I have posted patches to integrate with the zone cache, hopefully they >> make sense. >> > [ .. ] >>>> I have thought about condensing the RB tree information, but then I >>>> figured that for 'real' SMR handling we cannot assume all zones are of >>>> fixed size, and hence we need all the information there. >>>> Any condensing method would assume a given structure of the zones, which >>>> the standard just doesn't provide. >>>> Or am I missing something here?
Of course you can condense the zone cache without losing any information. Here is the layout I used ... I haven't updated the patch to the latest posted patches but this is the basic idea. [It was originally done as a follow-on to making your zone cache work with Seagate's HA drive. I did not include the wp-in-arrays patch along with the HA drive support that I sent you in May as you were quite terse about RB trees when I tried to discuss this approach with you at Vault]

struct blk_zone {
	unsigned type:4;
	unsigned state:5;
	unsigned extra:7;
	unsigned wp:40;
	void *private_data;
};

struct contiguous_wps {
	u64 start_lba;
	u64 last_lba;		/* or # of blocks */
	u64 zone_size;		/* size in blocks */
	unsigned is_zoned:1;
	u32 zone_count;
	spinlock_t lock;
	struct blk_zone zones[0];
};

struct zone_wps {
	u32 wps_count;
	struct contiguous_wps **wps;
};

Then in struct request_queue:

-	struct rb_root zones;
+	struct zone_wps *zones;

For each contiguous chunk of zones you need a descriptor. In the current drives you need 1 or 2 descriptors. Here a conventional drive is encapsulated as zoned media with one drive-sized conventional zone. I have not spent time building an ad-hoc LVM comprised of zoned and conventional media so it's not all ironed out yet. I think you can see the advantage of being able to put conventional space anywhere you would like to work around zoned media not being laid out in the best manner for your setup. Yes, things start to break down if every other zone is a different size .. The point being that even with zone descriptors on the order of 48 bytes in size this saves a lot of space with no loss of information. I still kind of prefer pushing blk_zone down to a u32 by reducing the max zone size and dropping the private_data ... but that may be going a bit too far.
blk_lookup_zone then has an [unfortunate] signature change:

/**
 * blk_lookup_zone() - Lookup zones
 * @q: Request Queue
 * @sector: Location to lookup
 * @start: Starting location of zone (OUT: Required)
 * @len: Length of zone (OUT: Required)
 * @lock: Spinlock of zones (OUT: Required)
 */
struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector,
				 sector_t *start, sector_t *len,
				 spinlock_t **lock)
{
	int iter;
	struct blk_zone *bzone = NULL;
	struct zone_wps *zi = q->zones;

	*start = 0;
	*len = 0;
	*lock = NULL;
	if (!q->zones)
		goto out;
	for (iter = 0; iter < zi->wps_count; iter++) {
		if (sector >= zi->wps[iter]->start_lba &&
		    sector < zi->wps[iter]->last_lba) {
			struct contiguous_wps *wp = zi->wps[iter];
			u64 index = (sector - wp->start_lba) / wp->zone_size;

			BUG_ON(index >= wp->zone_count);
			bzone = &wp->zones[index];
			*len = wp->zone_size;
			*start = wp->start_lba + (index * wp->zone_size);
			*lock = &wp->lock;
		}
	}
out:
	return bzone;
}

>>> >>> Indeed, the standards do not mandate any particular zone configuration, >>> constant zone size, etc. So writing code so that this can be handled is certainly >>> the right way of doing things. However, if we decide to go forward with >>> mapping RESET WRITE POINTER command to DISCARD, then at least a constant >>> zone size (minus the last zone as you said) must be assumed, and that >>> information can be removed from the entries in the RB tree (as it will be >>> saved for the sysfs "zone_size" file anyway). Adding a little code to handle >>> that eventual last runt zone with a different size is not a big problem. >> >>>> As for write pointer handling: yes, the write pointer on the zones is >>>> not really useful for upper-level usage. >>>> Where we do need it is to detect I/O which is crossing the write pointer >>>> (eg when doing reads over the entire zone). >>>> As per spec you will be getting an I/O error here, so we need to split >>>> the I/O on the write pointer to get valid results back.
>>> >>> To be precise here, the I/O splitting will be handled by the block layer >>> thanks to the "chunk_sectors" setting. But that relies on a constant zone >>> size assumption too. >>> >>> The RB-tree here is most useful for reads over or after the write pointer as >>> this can have different behavior on different drives (URSWRZ bit). The RB-tree >>> allows us to hide these differences from upper layers and simplify support at >>> those levels. >> >> It is unfortunate that path was chosen rather than returning Made Up Data. > But how would you return made up data without knowing that you _have_ to > generate some? > Of course it's trivial to generate made up data once an I/O error > occurred, but that's too late as the I/O error already happened. > > The general idea behind this is that I _really_ do want to avoid > triggering an I/O error on initial access. This kind of thing really > tends to freak out users (connect a new drive and getting I/O errors; > not the best user experience ...) > >> However I don't think this is a file system or device mapper problem as >> neither of those layers attempt to read blocks they have not written >> (or discarded). >> All I can think of is something that probes the drive media for a partition table >> or other identification...isn't the conventional space sufficient to cover >> those cases? > Upon device scan the kernel will attempt to read the partition tables, > which consists of reading the first few megabytes and the last sector > (for GPT shadow partition tables). > And depending on the drive manufacturer you might or might not have > enough conventional space at the right locations. > >> Anyway you could just handle the error and sense codes [CHECK CONDITION / >> ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD >> when URSWRZ is 0. It would have the same effect as using the zone cache >> without the possibility of the zone cache being out-of-sync.
>> > Yes, but then we would already have registered an I/O error. > And as indicated above, I really do _not_ want to handle all the > customer calls for supposed faulty new SMR drives. Fair enough. I just really want to be sure that all this memory locked to the zone cache is really being made useful. >>> I too agree that the write pointer value is not very useful to upper layers >>> (e.g. FS). What matters most of the times for these layers is an "allocation >>> pointer" or "submission pointer" which indicates the LBA to use for the next >>> BIO submission. That LBA will most of the time be in advance of the zone WP >>> because of request queuing in the block scheduler. >> >> Which is kind of my point. >> >>> Note that from what we have done already, in many cases, upper layers do not >>> even need this (e.g. F2FS, btrfs) and work fine without needing *any* >>> zone->private_data because that "allocation pointer" information generally >>> already exists in a different form (e.g. a bit in a bitmap). >> >> Agreed. Log structured and copy on write file systems are a natural fit for >> these types of drives >> >>> For these cases, not having the RB-tree of zones would force a lot >>> more code to be added to file systems for SMR support. >> >> I don't understand how the zone cache (RB-tree) is "helping" F2FS or btrfs. >> Certainly any of these could trigger the zone cache to be loaded at mkfs >> and/or mount time? >> >>>>> This all tails into domain responsibility. With the RB-Tree doing half of the >>>>> work and the ‘responsible’ domain handling the active path via private_data >>>>> why have the split at all? It seems to be a double work to have second object >>>>> tracking the first so that I/O scheduling can function. >>>>> >>>> Tracking the zone states via RB trees is mainly to handle host-managed >>>> drives seamlessly.
Without an RB tree we will be subjected to I/O errors >>>> during boot, as eg with new drives we inevitably will try to read beyond >>>> the write pointer, which in turn will generate I/O errors on certain drives. >> >> If the zone cache is needed to boot then clearly my objection to reading >> in the zone report information on drive attach is unfounded. I was under >> the impression that this was not the case. >> >>>> I do agree that there is no strict need to setup an RB tree for >>>> host-aware drives; I can easily add an attribute/flag to disable RB >>>> trees for those. >>>> However, tracking zone information in the RB tree _dramatically_ speeds >>>> up device initialisation; issuing a blkdev_discard() for the entire >>>> drive will take _ages_ without it. >>> >>> And the RB-tree will also be very useful for speeding up report zones >>> ioctls too. Unless we want those to force an update of the RB-tree information ? >> >> That is debatable. I would tend to argue for both. *If* there must be >> a zone cache then reading from the zone cache is a must. Forcing an >> update of the zone cache from the media seems like a reasonable thing >> to be able to do. >> >> Also the zone report is 'slow' in that there is an overhead for the >> report itself but >> the number of zones per query can be quite large so 4 or 5 I/Os that >> run into the >> hundreds of milliseconds to cache the entire drive isn't really unworkable for >> something that is used infrequently. >> > No, surely not. > But one of the _big_ advantages for the RB tree is blkdev_discard(). > Without the RB tree any mkfs program will issue a 'discard' for every > sector. We will be able to coalesce those into one discard per zone, but > we still need to issue one for _every_ zone. How can you make coalesce work transparently in the sd layer _without_ keeping some sort of a discard cache along with the zone cache?
Currently the block layer's blkdev_issue_discard() is breaking large discards into nice granular and aligned chunks but it is not preventing small discards nor coalescing them. In the sd layer would there be a way to persist or purge an overly large discard cache? What about honoring discard_zeroes_data? Once the discard is completed with discard_zeroes_data you have to return zeroes whenever a discarded sector is read. Isn't that a lot more than just tracking a write pointer? Couldn't a zone have dozens of holes? > Which is (as indicated) really slow, and easily takes several minutes. > With the RB tree we can short-circuit discards to empty zones, and speed > up processing time dramatically. > Sure we could be moving the logic into mkfs and friends, but that would > require us to change the programs and agree on a library (libzbc?) which > should be handling that. F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ... so I'm not sure your argument is valid here. [..] >>> 3) Try to condense the blkzone data structure to save memory: >>> I think that we can at the very least remove the zone length, and also >>> may be the per zone spinlock too (a single spinlock and proper state flags can >>> be used). >> >> I have a variant that is an array of descriptors that roughly mimics the >> api from blk-zoned.c that I did a few months ago as an example. >> I should be able to update that to the current kernel + patches. >> > Okay. If we restrict the in-kernel SMR drive handling to devices with > identical zone sizes of course we can remove the zone length. > And we can do away with the per-zone spinlock and use a global one instead. I don't think dropping the zone length is a reasonable thing to do. What I propose is an array of _descriptors_ it doesn't drop _any_ of the zone information that you are holding in an RB tree, it is just a condensed format that _mostly_ plugs into your existing API.
-- Shaun Tancheff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands @ 2016-08-16 4:00 ` Damien Le Moal 0 siblings, 0 replies; 26+ messages in thread From: Damien Le Moal @ 2016-08-16 4:00 UTC (permalink / raw) To: Shaun Tancheff Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman Shaun, > On Aug 14, 2016, at 09:09, Shaun Tancheff <shaun.tancheff@seagate.com> wrote: […] >>> >> No, surely not. >> But one of the _big_ advantages for the RB tree is blkdev_discard(). >> Without the RB tree any mkfs program will issue a 'discard' for every >> sector. We will be able to coalesce those into one discard per zone, but >> we still need to issue one for _every_ zone. > > How can you make coalesce work transparently in the > sd layer _without_ keeping some sort of a discard cache along > with the zone cache? > > Currently the block layer's blkdev_issue_discard() is breaking > large discards into nice granular and aligned chunks but it is > not preventing small discards nor coalescing them. > > In the sd layer would there be a way to persist or purge an > overly large discard cache? What about honoring > discard_zeroes_data? Once the discard is completed with > discard_zeroes_data you have to return zeroes whenever > a discarded sector is read. Isn't that a lot more than just > tracking a write pointer? Couldn't a zone have dozens of holes? My understanding of the standards regarding discard is that it is not mandatory and that it is a hint to the drive. The drive can completely ignore it if it thinks that is a better choice. I may be wrong on this though. Need to check again. For reset write pointer, the mapping to discard requires that the calls to blkdev_issue_discard be zone aligned for anything to happen. Specify less than a zone and nothing will be done. This I think preserves the discard semantic.
As for the “discard_zeroes_data” thing, I also think that is a drive feature, not mandatory. Drives may have it or not, which is consistent with the ZBC/ZAC standards regarding reading after write pointer (nothing says that zeros have to be returned). In any case, discard of CMR zones will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better choice. > >> Which is (as indicated) really slow, and easily takes several minutes. >> With the RB tree we can short-circuit discards to empty zones, and speed >> up processing time dramatically. >> Sure we could be moving the logic into mkfs and friends, but that would >> require us to change the programs and agree on a library (libzbc?) which >> should be handling that. > > F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ... > so I'm not sure your argument is valid here. This initial SMR support patch is just that: a first try. Jaegeuk used SG_IO (in fact copy-paste of parts of libzbc) because the current ZBC patch-set has no ioctl API for zone information manipulation. We will fix this in mkfs.f2fs once we agree on an ioctl interface. > > [..] > >>>> 3) Try to condense the blkzone data structure to save memory: >>>> I think that we can at the very least remove the zone length, and also >>>> may be the per zone spinlock too (a single spinlock and proper state flags can >>>> be used). >>> >>> I have a variant that is an array of descriptors that roughly mimics the >>> api from blk-zoned.c that I did a few months ago as an example. >>> I should be able to update that to the current kernel + patches. >>> >> Okay. If we restrict the in-kernel SMR drive handling to devices with >> identical zone sizes of course we can remove the zone length. >> And we can do away with the per-zone spinlock and use a global one instead. > > I don't think dropping the zone length is a reasonable thing to do.
> > What I propose is an array of _descriptors_ it doesn't drop _any_ > of the zone information that you are holding in an RB tree, it is > just a condensed format that _mostly_ plugs into your existing > API. I do not agree. The Seagate drive already has one zone (the last one) that is not the same length as the other zones. Sure, since it is the last one, we can add “if (last zone)” all over the place and make it work. But that is really ugly. Keeping the length field makes the code generic and following the standard, which has no restriction on the zone sizes. We could do some memory optimisation using different types of blk_zone structs, the types mapping to the SAME value: drives with constant zone size can use a blk_zone type without the length field, others use a different type that includes the field. Accessor functions can hide the different types in the zone manipulation code. Best regards. -- Damien Le Moal, Ph.D. Sr. Manager, System Software Group, HGST Research, HGST, a Western Digital brand Damien.LeMoal@hgst.com (+81) 0466-98-3593 (ext. 513593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.hgst.com ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands 2016-08-16 4:00 ` Damien Le Moal @ 2016-08-16 5:49 ` Shaun Tancheff -1 siblings, 0 replies; 26+ messages in thread From: Shaun Tancheff @ 2016-08-16 5:49 UTC (permalink / raw) To: Damien Le Moal Cc: Hannes Reinecke, Christoph Hellwig, Shaun Tancheff, linux-block, linux-scsi, LKML, Jens Axboe, Jens Axboe, James E . J . Bottomley, Martin K . Petersen, Jeff Layton, J . Bruce Fields, Josh Bingaman On Mon, Aug 15, 2016 at 11:00 PM, Damien Le Moal <Damien.LeMoal@hgst.com> w= rote: > > Shaun, > >> On Aug 14, 2016, at 09:09, Shaun Tancheff <shaun.tancheff@seagate.com> w= rote: > [=E2=80=A6] >>>> >>> No, surely not. >>> But one of the _big_ advantages for the RB tree is blkdev_discard(). >>> Without the RB tree any mkfs program will issue a 'discard' for every >>> sector. We will be able to coalesce those into one discard per zone, bu= t >>> we still need to issue one for _every_ zone. >> >> How can you make coalesce work transparently in the >> sd layer _without_ keeping some sort of a discard cache along >> with the zone cache? >> >> Currently the block layer's blkdev_issue_discard() is breaking >> large discard's into nice granular and aligned chunks but it is >> not preventing small discards nor coalescing them. >> >> In the sd layer would there be way to persist or purge an >> overly large discard cache? What about honoring >> discard_zeroes_data? Once the discard is completed with >> discard_zeroes_data you have to return zeroes whenever >> a discarded sector is read. Isn't that a log more than just >> tracking a write pointer? Couldn't a zone have dozens of holes? > > My understanding of the standards regarding discard is that it is not > mandatory and that it is a hint to the drive. The drive can completely > ignore it if it thinks that is a better choice. I may be wrong on this > though. Need to check again. But you are currently setting discard_zeroes_data=3D1 in your current patches. 
I believe that setting discard_zeroes_data=3D1 effectively promotes discards to being mandatory. I have a follow on patch to my SCT Write Same series that handles the CMR zone case in the sd_zbc_setup_discard() handler. > For reset write pointer, the mapping to discard requires that the calls > to blkdev_issue_discard be zone aligned for anything to happen. Specify > less than a zone and nothing will be done. This I think preserve the > discard semantic. Oh. If that is the intent then there is just a bug in the handler. I have pointed out where I believe it to be in my response to the zone cache patch being posted. > As for the =E2=80=9Cdiscard_zeroes_data=E2=80=9D thing, I also think that= is a drive > feature not mandatory. Drives may have it or not, which is consistent > with the ZBC/ZAC standards regarding reading after write pointer (nothing > says that zeros have to be returned). In any case, discard of CMR zones > will be a nop, so for SMR drives, discard_zeroes_data=3D0 may be a better > choice. However I am still curious about discard's being coalesced. >>> Which is (as indicated) really slow, and easily takes several minutes. >>> With the RB tree we can short-circuit discards to empty zones, and spee= d >>> up processing time dramatically. >>> Sure we could be moving the logic into mkfs and friends, but that would >>> require us to change the programs and agree on a library (libzbc?) whic= h >>> should be handling that. >> >> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ... >> so I'm not sure your argument is valid here. > > This initial SMR support patch is just that: a first try. Jaegeuk > used SG_IO (in fact copy-paste of parts of libzbc) because the current > ZBC patch-set has no ioctl API for zone information manipulation. We > will fix this mkfs.f2fs once we agree on an ioctl interface. Which again is my point. 
If mkfs.f2fs wants to speed up its discard pass by _not_ sending unnecessary Reset WP commands for zones that are already empty, it has all the information it needs to do so. Here it seems to me that the zone cache is _at_best_ doing double work. At worst the zone cache could be doing the wrong thing _if_ the zone cache got out of sync. It is certainly possible (however unlikely) that someone was doing some raw sg activity that is not seen by the sd path. All I am trying to do is have a discussion about the reasons for and against having a zone cache. Where it works and where it breaks should be entirely technical, but I understand that we have all spent a lot of time _not_ discussing this for various non-technical reasons. So far the only reason I've been able to ascertain is that Host Managed drives really don't like being stuck with the URSWRZ and would like to have a software hack to return MUD rather than ship drives with some weird out-of-the-box config where the last zone is marked as FINISH'd, thereby returning MUD on reads as per spec. I understand that it would be a strange state to see on first boot, and likely people would just do a Reset WP and have weird boot errors, which would probably just make matters worse. I just would rather the workaround be a bit cleaner and/or use less memory. I would also like a path available that does not require SD_ZBC or BLK_ZONED for Host Aware drives to work, hence this set of patches and me begging for a single bit in struct bio. >> >> [..] >> >>>>> 3) Try to condense the blkzone data structure to save memory: >>>>> I think that we can at the very least remove the zone length, and also >>>>> maybe the per zone spinlock too (a single spinlock and proper state flags can >>>>> be used). >>>> >>>> I have a variant that is an array of descriptors that roughly mimics the >>>> api from blk-zoned.c that I did a few months ago as an example. >>>> I should be able to update that to the current kernel + patches. 
>>>> >>> Okay. If we restrict the in-kernel SMR drive handling to devices with >>> identical zone sizes of course we can remove the zone length. >>> And we can do away with the per-zone spinlock and use a global one instead. >> >> I don't think dropping the zone length is a reasonable thing to do. REPEAT: I do _not_ think dropping the zone length is a good thing. >> What I propose is an array of _descriptors_; it doesn't drop _any_ >> of the zone information that you are holding in an RB tree, it is >> just a condensed format that _mostly_ plugs into your existing >> API. > > I do not agree. The Seagate drive already has one zone (the last one) > that is not the same length as the other zones. Sure, since it is the > last one, we can add “if (last zone)” all over the place and make it > work. But that is really ugly. Keeping the length field makes the code > generic and following the standard, which has no restriction on the > zone sizes. We could do some memory optimisation using different types > of blk_zone structs, the types mapping to the SAME value: drives with > constant zone size can use a blk_zone type without the length field, > others use a different type that includes the field. Accessor functions > can hide the different types in the zone manipulation code. Ah. I just said that dropping the zone length is not a good idea. Why the antagonistic exposition? All I am saying is that I can give you the zone cache with 1/7th of the memory and the same performance with _no_ loss of information, or features, as compared to the existing zone cache. All the code is done now. I will post patches once my testing is done. I have also reworked all the zone integration so the BIO flags will pull from and update the zone cache as opposed to the first hack that only really integrated with some ioctls. Regards, Shaun ^ permalink raw reply [flat|nested] 26+ messages in thread
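[Editorial aside: the shape of the memory argument in the message above can be illustrated with two stand-in layouts. An RB-tree-cached zone pays for three tree pointers and a per-zone lock on top of the zone data, while a packed array entry carries only the zone fields — including the length, which both sides agree must be kept. The field widths here (e.g. the 40-bit write-pointer offset) are illustrative guesses for this sketch, not taken from Shaun's actual patches, and real kmalloc overhead per tree node would widen the gap further.]

```c
#include <stdint.h>

/* Hypothetical RB-tree cached zone: one allocation per zone. */
struct zone_rb_node {
	uintptr_t rb_parent_color;	/* stand-ins for struct rb_node */
	void *rb_right;
	void *rb_left;
	uint64_t start;
	uint64_t len;
	uint64_t wp;
	uint32_t type_and_cond;
	uint32_t lock;			/* stand-in for a per-zone spinlock */
};

/* Hypothetical packed descriptor: same information, no tree overhead.
 * The write pointer is stored as an offset from the zone start, so the
 * full length field survives and no fixed-zone-size assumption is made.
 */
struct zone_descriptor {
	uint64_t start;
	uint64_t len;
	uint64_t wp_offset : 40;	/* wp = start + wp_offset */
	uint64_t type      :  4;
	uint64_t cond      :  4;
	uint64_t flags     : 16;
};

/* Recover the absolute write pointer from a packed descriptor. */
static inline uint64_t zone_wp(const struct zone_descriptor *z)
{
	return z->start + z->wp_offset;
}
```

On LP64 the packed entry is 24 bytes against 56 for the tree node, before allocator overhead — the same direction as the savings claimed above, even if the exact ratio depends on the real structs.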
end of thread, other threads:[~2016-08-16 6:07 UTC | newest] Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-07-29 19:02 [PATCH v6 0/2] Block layer support ZAC/ZBC commands Shaun Tancheff 2016-07-29 19:02 ` [PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands Shaun Tancheff 2016-07-29 19:02 ` [PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer Shaun Tancheff 2016-08-01 9:41 ` [PATCH v6 0/2] Block layer support ZAC/ZBC commands Christoph Hellwig 2016-08-01 17:07 ` Shaun Tancheff 2016-08-01 17:07 ` Shaun Tancheff 2016-08-02 14:35 ` Hannes Reinecke 2016-08-03 1:29 ` Damien Le Moal 2016-08-03 1:29 ` Damien Le Moal 2016-08-05 20:35 ` Shaun Tancheff 2016-08-05 20:35 ` Shaun Tancheff 2016-08-09 6:47 ` Hannes Reinecke 2016-08-09 8:09 ` Damien Le Moal 2016-08-09 8:09 ` Damien Le Moal 2016-08-10 3:58 ` Shaun Tancheff 2016-08-10 3:58 ` Shaun Tancheff 2016-08-10 4:38 ` Damien Le Moal 2016-08-10 4:38 ` Damien Le Moal 2016-08-16 6:07 ` Shaun Tancheff 2016-08-10 3:26 ` Shaun Tancheff 2016-08-14 0:09 ` Shaun Tancheff 2016-08-14 0:09 ` Shaun Tancheff 2016-08-16 4:00 ` Damien Le Moal 2016-08-16 4:00 ` Damien Le Moal 2016-08-16 5:49 ` Shaun Tancheff 2016-08-16 5:49 ` Shaun Tancheff