From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: "Judy Brock" <judy.brock@samsung.com>,
	"Javier González" <javier@javigon.com>,
	"Matias Bjørling" <mb@lightnvm.io>
Cc: Jens Axboe <axboe@kernel.dk>,
	Niklas Cassel <Niklas.Cassel@wdc.com>,
	Ajay Joshi <Ajay.Joshi@wdc.com>, Sagi Grimberg <sagi@grimberg.me>,
	Keith Busch <Keith.Busch@wdc.com>,
	Dmitry Fomichev <Dmitry.Fomichev@wdc.com>,
	Aravind Ramesh <Aravind.Ramesh@wdc.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Hans Holmberg <Hans.Holmberg@wdc.com>,
	Christoph Hellwig <hch@lst.de>,
	Matias Bjorling <Matias.Bjorling@wdc.com>
Subject: Re: [PATCH 5/5] nvme: support for zoned namespaces
Date: Tue, 16 Jun 2020 12:37:38 +0000	[thread overview]
Message-ID: <CY4PR04MB375103860D44AB2DF9040655E79D0@CY4PR04MB3751.namprd04.prod.outlook.com> (raw)
In-Reply-To: <3ab4032c18a8411389d3879026cde2a2@samsung.com>

On 2020/06/16 21:35, Judy Brock wrote:
> >>> A namespace that does not support append is not supported by the driver.
> 
> I am not that knowledgeable about Linux kernel drivers, so maybe this is a
> dumb question, but won't the driver in question be the default kernel driver
> for ZNS devices? If so, why would that driver deliberately reject a device
> which is 100% compliant with the ZNS spec? That would seem to favor specific
> implementations, which seems like an inappropriate thing for a community
> driver to do unless there is an actual technical reason the driver is unable
> to function w/o append. Is there any such reason, and if so, what is it?
> Thanks, and sorry if I've misunderstood.

Judy,

Please see my answer to Javier for the rationale behind this software design
decision.

> 
> 
> Judy
> 
> -----Original Message-----
> From: linux-nvme [mailto:linux-nvme-bounces@lists.infradead.org] On Behalf Of Javier González
> Sent: Tuesday, June 16, 2020 5:00 AM
> To: Matias Bjørling
> Cc: Jens Axboe; Niklas Cassel; Damien Le Moal; Ajay Joshi; Sagi Grimberg; Keith Busch; Dmitry Fomichev; Aravind Ramesh; linux-nvme@lists.infradead.org; linux-block@vger.kernel.org; Hans Holmberg; Christoph Hellwig; Matias Bjørling
> Subject: Re: [PATCH 5/5] nvme: support for zoned namespaces
> 
> On 16.06.2020 13:18, Matias Bjørling wrote:
>> On 16/06/2020 12.41, Javier González wrote:
>>> On 16.06.2020 08:34, Keith Busch wrote:
>>>> Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined
>>>> in NVM Express TP4053. Zoned namespaces are discovered based on their
>>>> Command Set Identifier reported in the namespace's Namespace
>>>> Identification Descriptor list. A successfully discovered Zoned
>>>> Namespace will be registered with the block layer as a host managed
>>>> zoned block device with Zone Append command support. A namespace that
>>>> does not support append is not supported by the driver.
>>> 
>>> Why are we enforcing the append command? Append is optional in the
>>> current ZNS specification, so we should not make this mandatory in the
>>> implementation. See specifics below.
> 
>> 
>> There is already general support in the kernel for the zone append command.
>> Feel free to submit patches to emulate the support. It is outside the scope
>> of this patchset.
>> 
> 
> It is fine that the kernel supports append, but the ZNS specification does
> not mandate implementing append, so the driver should not mandate it
> either.
> 
> ZNS SSDs that choose to leave append as a non-implemented optional command
> should not have to rely on emulated SW support, especially when traditional
> writes work just fine for a large part of current ZNS use cases.
> 
> Please, remove this virtual constraint.
> 
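To make the emulation tradeoff concrete, here is a rough userspace sketch of
what emulating zone append on top of regular writes entails: a per-zone lock
plus a cached write pointer, with the append returning the offset actually
written, as the real command does. All names here are hypothetical; this is
an illustration of the idea, not the kernel code under discussion.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct zone_state {
    pthread_mutex_t lock;   /* serializes appends to this zone */
    uint64_t wp;            /* cached write pointer (bytes) */
    uint64_t start;         /* zone start offset (bytes) */
    uint64_t capacity;      /* writable bytes in the zone */
    char *backing;          /* stand-in for the device */
};

/* Like the real command, the append reports where the data landed. */
static int64_t zone_append(struct zone_state *z, const void *buf, size_t len)
{
    int64_t written_at;

    pthread_mutex_lock(&z->lock);
    if (z->wp + len > z->start + z->capacity) {
        pthread_mutex_unlock(&z->lock);
        return -1;          /* zone full */
    }
    written_at = z->wp;
    memcpy(z->backing + z->wp, buf, len);   /* regular write at wp */
    z->wp += len;                           /* advance under the lock */
    pthread_mutex_unlock(&z->lock);
    return written_at;
}

int main(void)
{
    static char dev[4096];
    struct zone_state z = {
        .lock = PTHREAD_MUTEX_INITIALIZER,
        .wp = 0, .start = 0, .capacity = sizeof(dev),
        .backing = dev,
    };

    printf("appended at offset %lld\n",
           (long long)zone_append(&z, "hello", 5));
    return 0;
}

Note that the per-zone lock serializes all in-flight appends to a zone,
which is exactly the serialization a native Zone Append command avoids.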
>>> 
>>>> 
>>>> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
>>>> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
>>>> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
>>>> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
>>>> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
>>>> Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
>>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>>> Signed-off-by: Keith Busch <keith.busch@wdc.com>
>>>> ---
>>>>  drivers/nvme/host/Makefile |   1 +
>>>>  drivers/nvme/host/core.c   |  91 ++++++++++++--
>>>>  drivers/nvme/host/nvme.h   |  39 ++++++
>>>>  drivers/nvme/host/zns.c    | 238 +++++++++++++++++++++++++++++++++++++
>>>>  include/linux/nvme.h       | 111 +++++++++++++++++
>>>>  5 files changed, 468 insertions(+), 12 deletions(-)
>>>>  create mode 100644 drivers/nvme/host/zns.c
>>>> 
>>>> diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
>>>> index fc7b26be692d..d7f6a87687b8 100644
>>>> --- a/drivers/nvme/host/Makefile
>>>> +++ b/drivers/nvme/host/Makefile
>>>> @@ -13,6 +13,7 @@ nvme-core-y                := core.o
>>>>  nvme-core-$(CONFIG_TRACING)        += trace.o
>>>>  nvme-core-$(CONFIG_NVME_MULTIPATH)    += multipath.o
>>>>  nvme-core-$(CONFIG_NVM)            += lightnvm.o
>>>> +nvme-core-$(CONFIG_BLK_DEV_ZONED)    += zns.o
>>>>  nvme-core-$(CONFIG_FAULT_INJECTION_DEBUG_FS)    += fault_inject.o
>>>>  nvme-core-$(CONFIG_NVME_HWMON)        += hwmon.o
>>>> 
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 58f137b9f2c5..e961910da4ac 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -89,7 +89,7 @@ static dev_t nvme_chr_devt;
>>>>  static struct class *nvme_class;
>>>>  static struct class *nvme_subsys_class;
>>>>  
>>>> -static int nvme_revalidate_disk(struct gendisk *disk);
>>>> +static int _nvme_revalidate_disk(struct gendisk *disk);
>>>>  static void nvme_put_subsystem(struct nvme_subsystem *subsys);
>>>>  static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
>>>>                         unsigned nsid);
>>>> @@ -287,6 +287,10 @@ void nvme_complete_rq(struct request *req)
>>>>              nvme_retry_req(req);
>>>>              return;
>>>>          }
>>>> +    } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
>>>> +           req_op(req) == REQ_OP_ZONE_APPEND) {
>>>> +        req->__sector = nvme_lba_to_sect(req->q->queuedata,
>>>> +            le64_to_cpu(nvme_req(req)->result.u64));
>>>>      }
>>>>  
>>>>      nvme_trace_bio_complete(req, status);
>>>> @@ -673,7 +677,8 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
>>>>  }
>>>>  
>>>>  static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>>>> -        struct request *req, struct nvme_command *cmnd)
>>>> +        struct request *req, struct nvme_command *cmnd,
>>>> +        enum nvme_opcode op)
>>>>  {
>>>>      struct nvme_ctrl *ctrl = ns->ctrl;
>>>>      u16 control = 0;
>>>> @@ -687,7 +692,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>>>>      if (req->cmd_flags & REQ_RAHEAD)
>>>>          dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>>>>  
>>>> -    cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
>>>> +    cmnd->rw.opcode = op;
>>>>      cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
>>>>      cmnd->rw.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
>>>>      cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
>>>> @@ -716,6 +721,8 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>>>>          case NVME_NS_DPS_PI_TYPE2:
>>>>              control |= NVME_RW_PRINFO_PRCHK_GUARD |
>>>>                      NVME_RW_PRINFO_PRCHK_REF;
>>>> +            if (op == nvme_cmd_zone_append)
>>>> +                control |= NVME_RW_APPEND_PIREMAP;
>>>>              cmnd->rw.reftag = cpu_to_le32(t10_pi_ref_tag(req));
>>>>              break;
>>>>          }
>>>> @@ -756,6 +763,19 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>>>>      case REQ_OP_FLUSH:
>>>>          nvme_setup_flush(ns, cmd);
>>>>          break;
>>>> +    case REQ_OP_ZONE_RESET_ALL:
>>>> +    case REQ_OP_ZONE_RESET:
>>>> +        ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_RESET);
>>>> +        break;
>>>> +    case REQ_OP_ZONE_OPEN:
>>>> +        ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OPEN);
>>>> +        break;
>>>> +    case REQ_OP_ZONE_CLOSE:
>>>> +        ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_CLOSE);
>>>> +        break;
>>>> +    case REQ_OP_ZONE_FINISH:
>>>> +        ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>>>> +        break;
>>>>      case REQ_OP_WRITE_ZEROES:
>>>>          ret = nvme_setup_write_zeroes(ns, req, cmd);
>>>>          break;
>>>> @@ -763,8 +783,13 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>>>>          ret = nvme_setup_discard(ns, req, cmd);
>>>>          break;
>>>>      case REQ_OP_READ:
>>>> +        ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
>>>> +        break;
>>>>      case REQ_OP_WRITE:
>>>> -        ret = nvme_setup_rw(ns, req, cmd);
>>>> +        ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
>>>> +        break;
>>>> +    case REQ_OP_ZONE_APPEND:
>>>> +        ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
>>>>          break;
>>>>      default:
>>>>          WARN_ON_ONCE(1);
>>>> @@ -1392,14 +1417,23 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>>>>      return effects;
>>>>  }
>>>>  
>>>> -static void nvme_update_formats(struct nvme_ctrl *ctrl)
>>>> +static void nvme_update_formats(struct nvme_ctrl *ctrl, u32 *effects)
>>>>  {
>>>>      struct nvme_ns *ns;
>>>>  
>>>>      down_read(&ctrl->namespaces_rwsem);
>>>>      list_for_each_entry(ns, &ctrl->namespaces, list)
>>>> -        if (ns->disk && nvme_revalidate_disk(ns->disk))
>>>> +        if (ns->disk && _nvme_revalidate_disk(ns->disk))
>>>>              nvme_set_queue_dying(ns);
>>>> +        else if (blk_queue_is_zoned(ns->disk->queue)) {
>>>> +            /*
>>>> +             * IO commands are required to fully revalidate a zoned
>>>> +             * device. Force the command effects to trigger rescan
>>>> +             * work so report zones can run in a context with
>>>> +             * unfrozen IO queues.
>>>> +             */
>>>> +            *effects |= NVME_CMD_EFFECTS_NCC;
>>>> +        }
>>>>      up_read(&ctrl->namespaces_rwsem);
>>>>  }
>>>>  
>>>> @@ -1411,7 +1445,7 @@ static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
>>>>       * this command.
>>>>       */
>>>>      if (effects & NVME_CMD_EFFECTS_LBCC)
>>>> -        nvme_update_formats(ctrl);
>>>> +        nvme_update_formats(ctrl, &effects);
>>>>      if (effects & (NVME_CMD_EFFECTS_LBCC | NVME_CMD_EFFECTS_CSE_MASK)) {
>>>>          nvme_unfreeze(ctrl);
>>>>          nvme_mpath_unfreeze(ctrl->subsys);
>>>> @@ -1526,7 +1560,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>>>>   * Issue ioctl requests on the first available path.  Note that unlike normal
>>>>   * block layer requests we will not retry failed request on another controller.
>>>>   */
>>>> -static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
>>>> +struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
>>>>          struct nvme_ns_head **head, int *srcu_idx)
>>>>  {
>>>>  #ifdef CONFIG_NVME_MULTIPATH
>>>> @@ -1546,7 +1580,7 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
>>>>      return disk->private_data;
>>>>  }
>>>>  
>>>> -static void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx)
>>>> +void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx)
>>>>  {
>>>>      if (head)
>>>>          srcu_read_unlock(&head->srcu, idx);
>>>> @@ -1939,21 +1973,28 @@ static void nvme_update_disk_info(struct gendisk *disk,
>>>>  
>>>>  static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
>>>>  {
>>>> +    unsigned lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
>>>>      struct nvme_ns *ns = disk->private_data;
>>>>      struct nvme_ctrl *ctrl = ns->ctrl;
>>>> +    int ret;
>>>>      u32 iob;
>>>>  
>>>>      /*
>>>>       * If identify namespace failed, use default 512 byte block size so
>>>>       * block layer can use before failing read/write for 0 capacity.
>>>>       */
>>>> -    ns->lba_shift = id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ds;
>>>> +    ns->lba_shift = id->lbaf[lbaf].ds;
>>>>      if (ns->lba_shift == 0)
>>>>          ns->lba_shift = 9;
>>>>  
>>>>      switch (ns->head->ids.csi) {
>>>>      case NVME_CSI_NVM:
>>>>          break;
>>>> +    case NVME_CSI_ZNS:
>>>> +        ret = nvme_update_zone_info(disk, ns, lbaf);
>>>> +        if (ret)
>>>> +            return ret;
>>>> +        break;
>>>>      default:
>>>>          dev_warn(ctrl->device, "unknown csi:%d ns:%d\n",
>>>>              ns->head->ids.csi, ns->head->ns_id);
>>>> @@ -1967,7 +2008,7 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
>>>>      iob = nvme_lba_to_sect(ns, le16_to_cpu(id->noiob));
>>>>  
>>>>      ns->features = 0;
>>>> -    ns->ms = le16_to_cpu(id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ms);
>>>> +    ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
>>>>      /* the PI implementation requires metadata equal t10 pi tuple size */
>>>>      if (ns->ms == sizeof(struct t10_pi_tuple))
>>>>          ns->pi_type = id->dps & NVME_NS_DPS_PI_MASK;
>>>> @@ -2010,7 +2051,7 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
>>>>      return 0;
>>>>  }
>>>>  
>>>> -static int nvme_revalidate_disk(struct gendisk *disk)
>>>> +static int _nvme_revalidate_disk(struct gendisk *disk)
>>>>  {
>>>>      struct nvme_ns *ns = disk->private_data;
>>>>      struct nvme_ctrl *ctrl = ns->ctrl;
>>>> @@ -2058,6 +2099,28 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>>>>      return ret;
>>>>  }
>>>>  
>>>> +static int nvme_revalidate_disk(struct gendisk *disk)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ret = _nvme_revalidate_disk(disk);
>>>> +    if (ret)
>>>> +        return ret;
>>>> +
>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>> +    if (blk_queue_is_zoned(disk->queue)) {
>>>> +        struct nvme_ns *ns = disk->private_data;
>>>> +        struct nvme_ctrl *ctrl = ns->ctrl;
>>>> +
>>>> +        ret = blk_revalidate_disk_zones(disk, NULL);
>>>> +        if (!ret)
>>>> +            blk_queue_max_zone_append_sectors(disk->queue,
>>>> +                              ctrl->max_zone_append);
>>>> +    }
>>>> +#endif
>>>> +    return ret;
>>>> +}
>>>> +
>>>>  static char nvme_pr_type(enum pr_type type)
>>>>  {
>>>>      switch (type) {
>>>> @@ -2188,6 +2251,7 @@ static const struct block_device_operations nvme_fops = {
>>>>      .release    = nvme_release,
>>>>      .getgeo     = nvme_getgeo,
>>>>      .revalidate_disk= nvme_revalidate_disk,
>>>> +    .report_zones   = nvme_report_zones,
>>>>      .pr_ops     = &nvme_pr_ops,
>>>>  };
>>>>  
>>>> @@ -2213,6 +2277,7 @@ const struct block_device_operations nvme_ns_head_ops = {
>>>>      .ioctl      = nvme_ioctl,
>>>>      .compat_ioctl   = nvme_compat_ioctl,
>>>>      .getgeo     = nvme_getgeo,
>>>> +    .report_zones   = nvme_report_zones,
>>>>      .pr_ops     = &nvme_pr_ops,
>>>>  };
>>>>  #endif /* CONFIG_NVME_MULTIPATH */
>>>> @@ -4439,6 +4504,8 @@ static inline void _nvme_check_size(void)
>>>>      BUILD_BUG_ON(sizeof(struct nvme_command) != 64);
>>>>      BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != NVME_IDENTIFY_DATA_SIZE);
>>>>      BUILD_BUG_ON(sizeof(struct nvme_id_ns) != NVME_IDENTIFY_DATA_SIZE);
>>>> +    BUILD_BUG_ON(sizeof(struct nvme_id_ns_zns) != NVME_IDENTIFY_DATA_SIZE);
>>>> +    BUILD_BUG_ON(sizeof(struct nvme_id_ctrl_zns) != NVME_IDENTIFY_DATA_SIZE);
>>>>      BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
>>>>      BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
>>>>      BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
>>>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>>> index 58428e3a590e..662f95fbd909 100644
>>>> --- a/drivers/nvme/host/nvme.h
>>>> +++ b/drivers/nvme/host/nvme.h
>>>> @@ -239,6 +239,9 @@ struct nvme_ctrl {
>>>>      u32 max_hw_sectors;
>>>>      u32 max_segments;
>>>>      u32 max_integrity_segments;
>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>> +    u32 max_zone_append;
>>>> +#endif
>>>>      u16 crdt[3];
>>>>      u16 oncs;
>>>>      u16 oacs;
>>>> @@ -403,6 +406,9 @@ struct nvme_ns {
>>>>      u16 sgs;
>>>>      u32 sws;
>>>>      u8 pi_type;
>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>> +    u64 zsze;
>>>> +#endif
>>>>      unsigned long features;
>>>>      unsigned long flags;
>>>>  #define NVME_NS_REMOVING    0
>>>> @@ -568,6 +574,9 @@ int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
>>>>  
>>>>  int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,
>>>>           u8 csi, void *log, size_t size, u64 offset);
>>>> +struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
>>>> +        struct nvme_ns_head **head, int *srcu_idx);
>>>> +void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx);
>>>>  
>>>>  extern const struct attribute_group *nvme_ns_id_attr_groups[];
>>>>  extern const struct block_device_operations nvme_ns_head_ops;
>>>> @@ -689,6 +698,36 @@ static inline void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
>>>>  }
>>>>  #endif /* CONFIG_NVME_MULTIPATH */
>>>>  
>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>> +int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>>> +              unsigned lbaf);
>>>> +
>>>> +int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>>> +              unsigned int nr_zones, report_zones_cb cb, void *data);
>>>> +
>>>> +blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>>> +                       struct nvme_command *cmnd,
>>>> +                       enum nvme_zone_mgmt_action action);
>>>> +#else
>>>> +#define nvme_report_zones NULL
>>>> +
>>>> +static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>>>> +        struct request *req, struct nvme_command *cmnd,
>>>> +        enum nvme_zone_mgmt_action action)
>>>> +{
>>>> +    return BLK_STS_NOTSUPP;
>>>> +}
>>>> +
>>>> +static inline int nvme_update_zone_info(struct gendisk *disk,
>>>> +                    struct nvme_ns *ns,
>>>> +                    unsigned lbaf)
>>>> +{
>>>> +    dev_warn(ns->ctrl->device,
>>>> +         "Please enable CONFIG_BLK_DEV_ZONED to support ZNS devices\n");
>>>> +    return -EPROTONOSUPPORT;
>>>> +}
>>>> +#endif
>>>> +
>>>>  #ifdef CONFIG_NVM
>>>>  int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node);
>>>>  void nvme_nvm_unregister(struct nvme_ns *ns);
>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>> new file mode 100644
>>>> index 000000000000..c08f6281b614
>>>> --- /dev/null
>>>> +++ b/drivers/nvme/host/zns.c
>>>> @@ -0,0 +1,238 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Copyright (C) 2020 Western Digital Corporation or its affiliates.
>>>> + */
>>>> +
>>>> +#include <linux/blkdev.h>
>>>> +#include <linux/vmalloc.h>
>>>> +#include "nvme.h"
>>>> +
>>>> +static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>>>> +{
>>>> +    struct nvme_command c = { };
>>>> +    struct nvme_id_ctrl_zns *id;
>>>> +    int status;
>>>> +
>>>> +    id = kzalloc(sizeof(*id), GFP_KERNEL);
>>>> +    if (!id)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    c.identify.opcode = nvme_admin_identify;
>>>> +    c.identify.cns = NVME_ID_CNS_CS_CTRL;
>>>> +    c.identify.csi = NVME_CSI_ZNS;
>>>> +
>>>> +    status = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id));
>>>> +    if (status) {
>>>> +        kfree(id);
>>>> +        return status;
>>>> +    }
>>>> +
>>>> +    ctrl->max_zone_append = 1 << (id->zamds + 3);
>>>> +    kfree(id);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>>> +              unsigned lbaf)
>>>> +{
>>>> +    struct nvme_effects_log *log = ns->head->effects;
>>>> +    struct request_queue *q = disk->queue;
>>>> +    struct nvme_command c = { };
>>>> +    struct nvme_id_ns_zns *id;
>>>> +    int status;
>>>> +
>>>> +    /* Driver requires zone append support */
>>>> +    if (!(log->iocs[nvme_cmd_zone_append] & NVME_CMD_EFFECTS_CSUPP))
>>>> +        return -ENODEV;
>>> 
>>> Following up on the initial comment, this check should go.
>> 
>> See first comment.
> 
> See above and please remove.
> 
>> 
>>> 
>>>> +
>>>> +    /* Lazily query controller append limit for the first zoned namespace */
>>>> +    if (!ns->ctrl->max_zone_append) {
>>>> +        status = nvme_set_max_append(ns->ctrl);
>>>> +        if (status)
>>>> +            return status;
>>>> +    }
>>> 
>>> This should only be applied if append is supported.
>> 
>> See first comment.
>> 
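As an aside on the max_zone_append computation quoted above: zamds is an
exponent, so 1 << (zamds + 3) converts the controller's 2^zamds units into
512-byte sectors under the assumption of a 4 KiB minimum memory page size
(2^zamds * 4096 bytes equals 2^(zamds + 3) sectors). A small standalone
check of that arithmetic, with the 4 KiB page size as a stated assumption:

#include <stdio.h>

int main(void)
{
    unsigned int zamds = 2;                 /* example value from Identify */
    unsigned long bytes = 4096UL << zamds;  /* 2^zamds * assumed 4 KiB page */
    unsigned long sectors = bytes >> 9;     /* 512-byte sectors */

    /* prints: zamds=2 -> 16384 bytes = 32 sectors, 1 << (zamds + 3) = 32 */
    printf("zamds=%u -> %lu bytes = %lu sectors, 1 << (zamds + 3) = %u\n",
           zamds, bytes, sectors, 1u << (zamds + 3));
    return 0;
}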
>>> 
>>>> +
>>>> +    id = kzalloc(sizeof(*id), GFP_KERNEL);
>>>> +    if (!id)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    c.identify.opcode = nvme_admin_identify;
>>>> +    c.identify.nsid = cpu_to_le32(ns->head->ns_id);
>>>> +    c.identify.cns = NVME_ID_CNS_CS_NS;
>>>> +    c.identify.csi = NVME_CSI_ZNS;
>>>> +
>>>> +    status = nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, id, sizeof(*id));
>>>> +    if (status)
>>>> +        goto free_data;
>>>> +
>>>> +    /*
>>>> +     * We currently do not handle devices requiring any of the zoned
>>>> +     * operation characteristics.
>>>> +     */
>>>> +    if (id->zoc) {
>>>> +        status = -EINVAL;
>>>> +        goto free_data;
>>>> +    }
>>> 
>>> I understand that "Variable Zone Capacity" is not supported as it 
>>> requires major changes at this moment, but we should support controllers 
>>> that enable "Zone Active Excursions", even when the AER event is not 
>>> implemented in this patchset.
>> 
>> 
>> NAK. Similarly to VZC, this allows an unsuspecting user to have major data
>> loss when a zone is suddenly moved to Full.
> 
> I buy that.
> 
>> 
>> 
>>> 
>>>> +
>>>> +    ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
>>>> +    if (!ns->zsze) {
>>>> +        status = -EINVAL;
>>>> +        goto free_data;
>>>> +    }
>>>> +
>>>> +    q->limits.zoned = BLK_ZONED_HM;
>>>> +    blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>>>> +free_data:
>>>> +    kfree(id);
>>>> +    return status;
>>>> +}
>>>> +
>>>> +static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>>>> +                      unsigned int nr_zones, size_t *buflen)
>>>> +{
>>>> +    struct request_queue *q = ns->disk->queue;
>>>> +    size_t bufsize;
>>>> +    void *buf;
>>>> +
>>>> +    const size_t min_bufsize = sizeof(struct nvme_zone_report) +
>>>> +                   sizeof(struct nvme_zone_descriptor);
>>>> +
>>>> +    nr_zones = min_t(unsigned int, nr_zones,
>>>> +             get_capacity(ns->disk) >> ilog2(ns->zsze));
>>>> +
>>>> +    bufsize = sizeof(struct nvme_zone_report) +
>>>> +        nr_zones * sizeof(struct nvme_zone_descriptor);
>>>> +    bufsize = min_t(size_t, bufsize,
>>>> +            queue_max_hw_sectors(q) << SECTOR_SHIFT);
>>>> +    bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT);
>>>> +
>>>> +    while (bufsize >= min_bufsize) {
>>>> +        buf = __vmalloc(bufsize,
>>>> +                GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY);
>>>> +        if (buf) {
>>>> +            *buflen = bufsize;
>>>> +            return buf;
>>>> +        }
>>>> +        bufsize >>= 1;
>>>> +    }
>>>> +    return NULL;
>>>> +}
>>>> +
>>>> +static int __nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>> +                  struct nvme_zone_report *report,
>>>> +                  size_t buflen)
>>>> +{
>>>> +    struct nvme_command c = { };
>>>> +    int ret;
>>>> +
>>>> +    c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
>>>> +    c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
>>>> +    c.zmr.slba = cpu_to_le64(nvme_sect_to_lba(ns, sector));
>>>> +    c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
>>>> +    c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>>>> +    c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>>> +    c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>>>> +
>>>> +    ret = nvme_submit_sync_cmd(ns->queue, &c, report, buflen);
>>>> +    if (ret)
>>>> +        return ret;
>>>> +
>>>> +    return le64_to_cpu(report->nr_zones);
>>>> +}
>>>> +
>>>> +static int nvme_zone_parse_entry(struct nvme_ns *ns,
>>>> +                 struct nvme_zone_descriptor *entry,
>>>> +                 unsigned int idx, report_zones_cb cb,
>>>> +                 void *data)
>>>> +{
>>>> +    struct blk_zone zone = { };
>>>> +
>>>> +    if ((entry->zt & 0xf) != NVME_ZONE_TYPE_SEQWRITE_REQ) {
>>>> +        dev_err(ns->ctrl->device, "invalid zone type %#x\n",
>>>> +                entry->zt);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    zone.type = BLK_ZONE_TYPE_SEQWRITE_REQ;
>>>> +    zone.cond = entry->zs >> 4;
>>>> +    zone.len = ns->zsze;
>>>> +    zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>>>> +    zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>>>> +    zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
>>>> +
>>>> +    return cb(&zone, idx, data);
>>>> +}
>>>> +
>>>> +static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>> +            unsigned int nr_zones, report_zones_cb cb, void *data)
>>>> +{
>>>> +    struct nvme_zone_report *report;
>>>> +    int ret, zone_idx = 0;
>>>> +    unsigned int nz, i;
>>>> +    size_t buflen;
>>>> +
>>>> +    report = nvme_zns_alloc_report_buffer(ns, nr_zones, &buflen);
>>>> +    if (!report)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    sector &= ~(ns->zsze - 1);
>>>> +    while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>>>> +        memset(report, 0, buflen);
>>>> +        ret = __nvme_ns_report_zones(ns, sector, report, buflen);
>>>> +        if (ret < 0)
>>>> +            goto out_free;
>>>> +
>>>> +        nz = min_t(unsigned int, ret, nr_zones);
>>>> +        if (!nz)
>>>> +            break;
>>>> +
>>>> +        for (i = 0; i < nz && zone_idx < nr_zones; i++) {
>>>> +            ret = nvme_zone_parse_entry(ns, &report->entries[i],
>>>> +                            zone_idx, cb, data);
>>>> +            if (ret)
>>>> +                goto out_free;
>>>> +            zone_idx++;
>>>> +        }
>>>> +
>>>> +        sector += ns->zsze * nz;
>>>> +    }
>>>> +
>>>> +    ret = zone_idx;
>>>> +out_free:
>>>> +    kvfree(report);
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>>> +              unsigned int nr_zones, report_zones_cb cb, void *data)
>>>> +{
>>>> +    struct nvme_ns_head *head = NULL;
>>>> +    struct nvme_ns *ns;
>>>> +    int srcu_idx, ret;
>>>> +
>>>> +    ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
>>>> +    if (unlikely(!ns))
>>>> +        return -EWOULDBLOCK;
>>>> +
>>>> +    if (ns->head->ids.csi == NVME_CSI_ZNS)
>>>> +        ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>>>> +    else
>>>> +        ret = -EINVAL;
>>>> +    nvme_put_ns_from_disk(head, srcu_idx);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>>> +        struct nvme_command *c, enum nvme_zone_mgmt_action action)
>>>> +{
>>>> +    c->zms.opcode = nvme_cmd_zone_mgmt_send;
>>>> +    c->zms.nsid = cpu_to_le32(ns->head->ns_id);
>>>> +    c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
>>>> +    c->zms.action = action;
>>>> +
>>>> +    if (req_op(req) == REQ_OP_ZONE_RESET_ALL)
>>>> +        c->zms.select = 1;
>>>> +
>>>> +    return BLK_STS_OK;
>>>> +}
>>>> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
>>>> index ea25da572eed..7b3fa7de07bd 100644
>>>> --- a/include/linux/nvme.h
>>>> +++ b/include/linux/nvme.h
>>>> @@ -374,6 +374,30 @@ struct nvme_id_ns {
>>>>      __u8            vs[3712];
>>>>  };
>>>>  
>>>> +struct nvme_zns_lbafe {
>>>> +    __le64          zsze;
>>>> +    __u8            zdes;
>>>> +    __u8            rsvd9[7];
>>>> +};
>>>> +
>>>> +struct nvme_id_ns_zns {
>>>> +    __le16          zoc;
>>>> +    __le16          ozcs;
>>>> +    __le32          mar;
>>>> +    __le32          mor;
>>>> +    __le32          rrl;
>>>> +    __le32          frl;
>>>> +    __u8            rsvd20[2796];
>>>> +    struct nvme_zns_lbafe   lbafe[16];
>>>> +    __u8            rsvd3072[768];
>>>> +    __u8            vs[256];
>>>> +};
>>>> +
>>>> +struct nvme_id_ctrl_zns {
>>>> +    __u8    zamds;
>>>> +    __u8    rsvd1[4095];
>>>> +};
>>>> +
>>>>  enum {
>>>>      NVME_ID_CNS_NS          = 0x00,
>>>>      NVME_ID_CNS_CTRL        = 0x01,
>>>> @@ -392,6 +416,7 @@ enum {
>>>>  
>>>>  enum {
>>>>      NVME_CSI_NVM            = 0,
>>>> +    NVME_CSI_ZNS            = 2,
>>>>  };
>>>>  
>>>>  enum {
>>>> @@ -532,6 +557,27 @@ struct nvme_ana_rsp_hdr {
>>>>      __le16  rsvd10[3];
>>>>  };
>>>>  
>>>> +struct nvme_zone_descriptor {
>>>> +    __u8        zt;
>>>> +    __u8        zs;
>>>> +    __u8        za;
>>>> +    __u8        rsvd3[5];
>>>> +    __le64      zcap;
>>>> +    __le64      zslba;
>>>> +    __le64      wp;
>>>> +    __u8        rsvd32[32];
>>>> +};
>>>> +
>>>> +enum {
>>>> +    NVME_ZONE_TYPE_SEQWRITE_REQ    = 0x2,
>>>> +};
>>>> +
>>>> +struct nvme_zone_report {
>>>> +    __le64      nr_zones;
>>>> +    __u8        resv8[56];
>>>> +    struct nvme_zone_descriptor entries[];
>>>> +};
>>>> +
>>>>  enum {
>>>>      NVME_SMART_CRIT_SPARE        = 1 << 0,
>>>>      NVME_SMART_CRIT_TEMPERATURE    = 1 << 1,
>>>> @@ -626,6 +672,9 @@ enum nvme_opcode {
>>>>      nvme_cmd_resv_report    = 0x0e,
>>>>      nvme_cmd_resv_acquire   = 0x11,
>>>>      nvme_cmd_resv_release   = 0x15,
>>>> +    nvme_cmd_zone_mgmt_send = 0x79,
>>>> +    nvme_cmd_zone_mgmt_recv = 0x7a,
>>>> +    nvme_cmd_zone_append    = 0x7d,
>>>>  };
>>>>  
>>>>  #define nvme_opcode_name(opcode)    { opcode, #opcode }
>>>> @@ -764,6 +813,7 @@ struct nvme_rw_command {
>>>>  enum {
>>>>      NVME_RW_LR              = 1 << 15,
>>>>      NVME_RW_FUA             = 1 << 14,
>>>> +    NVME_RW_APPEND_PIREMAP  = 1 << 9,
>>>>      NVME_RW_DSM_FREQ_UNSPEC     = 0,
>>>>      NVME_RW_DSM_FREQ_TYPICAL    = 1,
>>>>      NVME_RW_DSM_FREQ_RARE       = 2,
>>>> @@ -829,6 +879,53 @@ struct nvme_write_zeroes_cmd {
>>>>      __le16          appmask;
>>>>  };
>>>>  
>>>> +enum nvme_zone_mgmt_action {
>>>> +    NVME_ZONE_CLOSE         = 0x1,
>>>> +    NVME_ZONE_FINISH        = 0x2,
>>>> +    NVME_ZONE_OPEN          = 0x3,
>>>> +    NVME_ZONE_RESET         = 0x4,
>>>> +    NVME_ZONE_OFFLINE       = 0x5,
>>>> +    NVME_ZONE_SET_DESC_EXT  = 0x10,
>>>> +};
>>>> +
>>>> +struct nvme_zone_mgmt_send_cmd {
>>>> +    __u8            opcode;
>>>> +    __u8            flags;
>>>> +    __u16           command_id;
>>>> +    __le32          nsid;
>>>> +    __le32          cdw2[2];
>>>> +    __le64          metadata;
>>>> +    union nvme_data_ptr dptr;
>>>> +    __le64          slba;
>>>> +    __le32          cdw12;
>>>> +    __u8            action;
>>> 
>>> Why not zsa to make it easier to match to the spec
>>> 
>>> 
>>>> +    __u8            select;
>>> 
>>> sel_all?
>>> 
>>>> +    __u8            rsvd13[2];
>>>> +    __le32          cdw14[2];
>>>> +};
>>>> +
>>>> +struct nvme_zone_mgmt_recv_cmd {
>>>> +    __u8            opcode;
>>>> +    __u8            flags;
>>>> +    __u16           command_id;
>>>> +    __le32          nsid;
>>>> +    __le64          rsvd2[2];
>>>> +    union nvme_data_ptr dptr;
>>>> +    __le64          slba;
>>>> +    __le32          numd;
>>>> +    __u8            zra;
>>>> +    __u8            zrasf;
>>>> +    __u8            pr;
>>> 
>>> Partial Report is just one bit in the "Zone Receive Action Specific 
>>> Features". What about zrasfe?
>> 
>> There are currently no users of pr, and bits 1-7 are reserved in the spec.
>> Users of the pr variable should shift and mask as necessary.
>> 
>> zrasf looks good to me. It is defined as a byte in the spec.
> 
> I meant for the pr variable name. Agree with the rest.
> 
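To illustrate the shift-and-mask point for the pr byte, a minimal example
(the define matches the patch; the value used here is only illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NVME_REPORT_ZONE_PARTIAL 1  /* bit 0 of the pr byte, per the patch */

int main(void)
{
    uint8_t pr = NVME_REPORT_ZONE_PARTIAL;  /* as set in __nvme_ns_report_zones() */
    bool partial = pr & 0x1;                /* bits 1-7 are reserved */

    printf("partial report: %s\n", partial ? "yes" : "no");
    return 0;
}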
>> 
>>> 
>>>> +    __u8            rsvd13;
>>>> +    __le32          cdw14[2];
>>>> +};
>>>> +
>>>> +enum {
>>>> +    NVME_ZRA_ZONE_REPORT        = 0,
>>>> +    NVME_ZRASF_ZONE_REPORT_ALL  = 0,
>>>> +    NVME_REPORT_ZONE_PARTIAL    = 1,
>>>> +};
>>>> +
>>>>  /* Features */
>>>> 
>>>>  enum {
>>>> @@ -1300,6 +1397,8 @@ struct nvme_command {
>>>>          struct nvme_format_cmd format;
>>>>          struct nvme_dsm_cmd dsm;
>>>>          struct nvme_write_zeroes_cmd write_zeroes;
>>>> +        struct nvme_zone_mgmt_send_cmd zms;
>>>> +        struct nvme_zone_mgmt_recv_cmd zmr;
>>>>          struct nvme_abort_cmd abort;
>>>>          struct nvme_get_log_page_command get_log_page;
>>>>          struct nvmf_common_command fabrics;
>>>> @@ -1433,6 +1532,18 @@ enum {
>>>>      NVME_SC_DISCOVERY_RESTART   = 0x190,
>>>>      NVME_SC_AUTH_REQUIRED       = 0x191,
>>>>  
>>>> +    /*
>>>> +     * I/O Command Set Specific - Zoned commands:
>>>> +     */
>>>> +    NVME_SC_ZONE_BOUNDARY_ERROR     = 0x1b8,
>>>> +    NVME_SC_ZONE_FULL               = 0x1b9,
>>>> +    NVME_SC_ZONE_READ_ONLY          = 0x1ba,
>>>> +    NVME_SC_ZONE_OFFLINE            = 0x1bb,
>>>> +    NVME_SC_ZONE_INVALID_WRITE      = 0x1bc,
>>>> +    NVME_SC_ZONE_TOO_MANY_ACTIVE    = 0x1bd,
>>>> +    NVME_SC_ZONE_TOO_MANY_OPEN      = 0x1be,
>>>> +    NVME_SC_ZONE_INVALID_TRANSITION = 0x1bf,
>>>> +
>>>>      /*
>>>>       * Media and Data Integrity Errors:
>>>>       */
>>>> -- 
>>>> 2.24.1
>>>> 
>> 
> 
> _______________________________________________
> linux-nvme mailing list
> linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 
> 


-- 
Damien Le Moal
Western Digital Research

WARNING: multiple messages have this Message-ID (diff)
From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: "Judy Brock" <judy.brock@samsung.com>,
	"Javier González" <javier@javigon.com>,
	"Matias Bjørling" <mb@lightnvm.io>
Cc: Jens Axboe <axboe@kernel.dk>,
	Niklas Cassel <Niklas.Cassel@wdc.com>,
	Ajay Joshi <Ajay.Joshi@wdc.com>, Sagi Grimberg <sagi@grimberg.me>,
	Keith Busch <Keith.Busch@wdc.com>,
	Dmitry Fomichev <Dmitry.Fomichev@wdc.com>,
	Aravind Ramesh <Aravind.Ramesh@wdc.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Hans Holmberg <Hans.Holmberg@wdc.com>,
	Christoph Hellwig <hch@lst.de>,
	Matias Bjorling <Matias.Bjorling@wdc.com>
Subject: Re: [PATCH 5/5] nvme: support for zoned namespaces
Date: Tue, 16 Jun 2020 12:37:38 +0000	[thread overview]
Message-ID: <CY4PR04MB375103860D44AB2DF9040655E79D0@CY4PR04MB3751.namprd04.prod.outlook.com> (raw)
In-Reply-To: 3ab4032c18a8411389d3879026cde2a2@samsung.com

On 2020/06/16 21:35, Judy Brock wrote:
> >>> A namespace that does not support append is not supported by the driver.
> 
> I am not that knowledgeable about Linux kernel drivers so maybe this is a
> dumb question but won't the driver in question be the default kernel driver
> for ZNS devices? If so, why would that driver deliberately reject a device
> which is 100% compliant with the ZNS spec? That would seem to favor specific
> implementations which seems like an inappropriate thing for a community
> driver to do unless there is an actual technical reason the driver is unable
> to function w/o append. Is there any such reason and if so what is it? Thanks
> and sorry if I've misunderstood.

Judy,

please see my answer to Javier for the rational behind this software design
decision.

> 
> 
> Judy
> 
> -----Original Message----- From: linux-nvme
> [mailto:linux-nvme-bounces@lists.infradead.org] On Behalf Of Javier González 
> Sent: Tuesday, June 16, 2020 5:00 AM To: Matias Bjørling Cc: Jens Axboe;
> Niklas Cassel; Damien Le Moal; Ajay Joshi; Sagi Grimberg; Keith Busch; Dmitry
> Fomichev; Aravind Ramesh; linux-nvme@lists.infradead.org;
> linux-block@vger.kernel.org; Hans Holmberg; Christoph Hellwig; Matias
> Bjørling Subject: Re: [PATCH 5/5] nvme: support for zoned namespaces
> 
> On 16.06.2020 13:18, Matias Bjørling wrote:
>> On 16/06/2020 12.41, Javier González wrote:
>>> On 16.06.2020 08:34, Keith Busch wrote:
>>>> Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined 
>>>> in NVM Express TP4053. Zoned namespaces are discovered based on their 
>>>> Command Set Identifier reported in the namespaces Namespace 
>>>> Identification Descriptor list. A successfully discovered Zoned 
>>>> Namespace will be registered with the block layer as a host managed 
>>>> zoned block device with Zone Append command support. A namespace that 
>>>> does not support append is not supported by the driver.
>>> 
>>> Why are we enforcing the append command? Append is optional on the 
>>> current ZNS specification, so we should not make this mandatory in the 
>>> implementation. See specifics below.
> 
>> 
>> There is already general support in the kernel for the zone append command.
>> Feel free to submit patches to emulate the support. It is outside the scope
>> of this patchset.
>> 
> 
> It is fine that the kernel supports append, but the ZNS specification does
> not impose the implementation for append, so the driver should not do that
> either.
> 
> ZNS SSDs that choose to leave append as a non-implemented optional command
> should not rely on emulated SW support, specially when traditional writes
> work very fine for a large part of current ZNS use cases.
> 
> Please, remove this virtual constraint.
> 
>>> 
>>>> 
>>>> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by:
>>>> Dmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: Ajay Joshi
>>>> <ajay.joshi@wdc.com> Signed-off-by: Aravind Ramesh
>>>> <aravind.ramesh@wdc.com> Signed-off-by: Niklas Cassel
>>>> <niklas.cassel@wdc.com> Signed-off-by: Matias Bjørling
>>>> <matias.bjorling@wdc.com> Signed-off-by: Damien Le Moal
>>>> <damien.lemoal@wdc.com> Signed-off-by: Keith Busch
>>>> <keith.busch@wdc.com> --- drivers/nvme/host/Makefile |   1 + 
>>>> drivers/nvme/host/core.c   |  91 ++++++++++++-- 
>>>> drivers/nvme/host/nvme.h   |  39 ++++++ drivers/nvme/host/zns.c    |
>>>> 238 +++++++++++++++++++++++++++++++++++++ include/linux/nvme.h       |
>>>> 111 +++++++++++++++++ 5 files changed, 468 insertions(+), 12
>>>> deletions(-) create mode 100644 drivers/nvme/host/zns.c
>>>> 
>>>> diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile 
>>>> index fc7b26be692d..d7f6a87687b8 100644 ---
>>>> a/drivers/nvme/host/Makefile +++ b/drivers/nvme/host/Makefile @@ -13,6
>>>> +13,7 @@ nvme-core-y                := core.o 
>>>> nvme-core-$(CONFIG_TRACING)        += trace.o 
>>>> nvme-core-$(CONFIG_NVME_MULTIPATH)    += multipath.o 
>>>> nvme-core-$(CONFIG_NVM)            += lightnvm.o 
>>>> +nvme-core-$(CONFIG_BLK_DEV_ZONED)    += zns.o 
>>>> nvme-core-$(CONFIG_FAULT_INJECTION_DEBUG_FS)    += fault_inject.o 
>>>> nvme-core-$(CONFIG_NVME_HWMON)        += hwmon.o
>>>> 
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index
>>>> 58f137b9f2c5..e961910da4ac 100644 --- a/drivers/nvme/host/core.c +++
>>>> b/drivers/nvme/host/core.c @@ -89,7 +89,7 @@ static dev_t
>>>> nvme_chr_devt; static struct class *nvme_class; static struct class
>>>> *nvme_subsys_class;
>>>> 
>>>> -static int nvme_revalidate_disk(struct gendisk *disk); +static int
>>>> _nvme_revalidate_disk(struct gendisk *disk); static void
>>>> nvme_put_subsystem(struct nvme_subsystem *subsys); static void
>>>> nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl, unsigned nsid); 
>>>> @@ -287,6 +287,10 @@ void nvme_complete_rq(struct request *req) 
>>>> nvme_retry_req(req); return; } +    } else if
>>>> (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && +           req_op(req) ==
>>>> REQ_OP_ZONE_APPEND) { +        req->__sector =
>>>> nvme_lba_to_sect(req->q->queuedata, +
>>>> le64_to_cpu(nvme_req(req)->result.u64)); }
>>>> 
>>>> nvme_trace_bio_complete(req, status); @@ -673,7 +677,8 @@ static inline
>>>> blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns, }
>>>> 
>>>> static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, -
>>>> struct request *req, struct nvme_command *cmnd) +        struct request
>>>> *req, struct nvme_command *cmnd, +        enum nvme_opcode op) { struct
>>>> nvme_ctrl *ctrl = ns->ctrl; u16 control = 0; @@ -687,7 +692,7 @@ static
>>>> inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, if
>>>> (req->cmd_flags & REQ_RAHEAD) dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>>>> 
>>>> -    cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : 
>>>> nvme_cmd_read); +    cmnd->rw.opcode = op; cmnd->rw.nsid =
>>>> cpu_to_le32(ns->head->ns_id); cmnd->rw.slba =
>>>> cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req))); cmnd->rw.length =
>>>> cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1); @@ -716,6 +721,8
>>>> @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, case
>>>> NVME_NS_DPS_PI_TYPE2: control |= NVME_RW_PRINFO_PRCHK_GUARD | 
>>>> NVME_RW_PRINFO_PRCHK_REF; +            if (op == nvme_cmd_zone_append) 
>>>> +                control |= NVME_RW_APPEND_PIREMAP; cmnd->rw.reftag =
>>>> cpu_to_le32(t10_pi_ref_tag(req)); break; } @@ -756,6 +763,19 @@
>>>> blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req, 
>>>> case REQ_OP_FLUSH: nvme_setup_flush(ns, cmd); break; +    case
>>>> REQ_OP_ZONE_RESET_ALL: +    case REQ_OP_ZONE_RESET: +        ret =
>>>> nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_RESET); +
>>>> break; +    case REQ_OP_ZONE_OPEN: +        ret =
>>>> nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OPEN); +
>>>> break; +    case REQ_OP_ZONE_CLOSE: +        ret =
>>>> nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_CLOSE); +
>>>> break; +    case REQ_OP_ZONE_FINISH: +        ret =
>>>> nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH); +
>>>> break; case REQ_OP_WRITE_ZEROES: ret = nvme_setup_write_zeroes(ns, req,
>>>> cmd); break; @@ -763,8 +783,13 @@ blk_status_t nvme_setup_cmd(struct
>>>> nvme_ns *ns, struct request *req, ret = nvme_setup_discard(ns, req,
>>>> cmd); break; case REQ_OP_READ: +        ret = nvme_setup_rw(ns, req,
>>>> cmd, nvme_cmd_read); +        break; case REQ_OP_WRITE: -        ret =
>>>> nvme_setup_rw(ns, req, cmd); +        ret = nvme_setup_rw(ns, req, cmd,
>>>> nvme_cmd_write); +        break; +    case REQ_OP_ZONE_APPEND: +
>>>> ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append); break; 
>>>> default: WARN_ON_ONCE(1); @@ -1392,14 +1417,23 @@ static u32
>>>> nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns, return
>>>> effects; }
>>>> 
>>>> -static void nvme_update_formats(struct nvme_ctrl *ctrl) +static void
>>>> nvme_update_formats(struct nvme_ctrl *ctrl, u32 *effects) { struct
>>>> nvme_ns *ns;
>>>> 
>>>> down_read(&ctrl->namespaces_rwsem); list_for_each_entry(ns,
>>>> &ctrl->namespaces, list) -        if (ns->disk &&
>>>> nvme_revalidate_disk(ns->disk)) +        if (ns->disk &&
>>>> _nvme_revalidate_disk(ns->disk)) nvme_set_queue_dying(ns); +
>>>> else if (blk_queue_is_zoned(ns->disk->queue)) { +            /* +
>>>> * IO commands are required to fully revalidate a zoned +             *
>>>> device. Force the command effects to trigger rescan +             *
>>>> work so report zones can run in a context with +             * unfrozen
>>>> IO queues. +             */ +            *effects |=
>>>> NVME_CMD_EFFECTS_NCC; +        } up_read(&ctrl->namespaces_rwsem); }
>>>> 
>>>> @@ -1411,7 +1445,7 @@ static void nvme_passthru_end(struct nvme_ctrl
>>>> *ctrl, u32 effects) * this command. */ if (effects &
>>>> NVME_CMD_EFFECTS_LBCC) -        nvme_update_formats(ctrl); +
>>>> nvme_update_formats(ctrl, &effects); if (effects &
>>>> (NVME_CMD_EFFECTS_LBCC | NVME_CMD_EFFECTS_CSE_MASK)) { 
>>>> nvme_unfreeze(ctrl); nvme_mpath_unfreeze(ctrl->subsys); @@ -1526,7
>>>> +1560,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct
>>>> nvme_ns *ns, * Issue ioctl requests on the first available path.  Note
>>>> that unlike normal * block layer requests we will not retry failed
>>>> request on another controller. */ -static struct nvme_ns
>>>> *nvme_get_ns_from_disk(struct gendisk *disk, +struct nvme_ns
>>>> *nvme_get_ns_from_disk(struct gendisk *disk, struct nvme_ns_head
>>>> **head, int *srcu_idx) { #ifdef CONFIG_NVME_MULTIPATH @@ -1546,7
>>>> +1580,7 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk
>>>> *disk, return disk->private_data; }
>>>> 
>>>> -static void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx) 
>>>> +void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx) { if
>>>> (head) srcu_read_unlock(&head->srcu, idx); @@ -1939,21 +1973,28 @@
>>>> static void nvme_update_disk_info(struct gendisk *disk,
>>>> 
>>>> static int __nvme_revalidate_disk(struct gendisk *disk, struct 
>>>> nvme_id_ns *id) { +    unsigned lbaf = id->flbas &
>>>> NVME_NS_FLBAS_LBA_MASK; struct nvme_ns *ns = disk->private_data; struct
>>>> nvme_ctrl *ctrl = ns->ctrl; +    int ret; u32 iob;
>>>> 
>>>> /* * If identify namespace failed, use default 512 byte block size so *
>>>> block layer can use before failing read/write for 0 capacity. */ -
>>>> ns->lba_shift = id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ds; +
>>>> ns->lba_shift = id->lbaf[lbaf].ds; if (ns->lba_shift == 0) 
>>>> ns->lba_shift = 9;
>>>> 
>>>> switch (ns->head->ids.csi) { case NVME_CSI_NVM: break; +    case
>>>> NVME_CSI_ZNS: +        ret = nvme_update_zone_info(disk, ns, lbaf); +
>>>> if (ret) +            return ret; +        break; default: 
>>>> dev_warn(ctrl->device, "unknown csi:%d ns:%d\n", ns->head->ids.csi,
>>>> ns->head->ns_id); @@ -1967,7 +2008,7 @@ static int
>>>> __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id) iob
>>>> = nvme_lba_to_sect(ns, le16_to_cpu(id->noiob));
>>>> 
>>>> ns->features = 0; -    ns->ms = le16_to_cpu(id->lbaf[id->flbas & 
>>>> NVME_NS_FLBAS_LBA_MASK].ms); +    ns->ms =
>>>> le16_to_cpu(id->lbaf[lbaf].ms); /* the PI implementation requires
>>>> metadata equal t10 pi tuple size */ if (ns->ms == sizeof(struct
>>>> t10_pi_tuple)) ns->pi_type = id->dps & NVME_NS_DPS_PI_MASK; @@ -2010,7
>>>> +2051,7 @@ static int __nvme_revalidate_disk(struct gendisk *disk,
>>>> struct nvme_id_ns *id) return 0; }
>>>> 
>>>> -static int nvme_revalidate_disk(struct gendisk *disk) +static int
>>>> _nvme_revalidate_disk(struct gendisk *disk) { struct nvme_ns *ns =
>>>> disk->private_data; struct nvme_ctrl *ctrl = ns->ctrl; @@ -2058,6
>>>> +2099,28 @@ static int nvme_revalidate_disk(struct gendisk *disk) 
>>>> return ret; }
>>>> 
>>>> +static int nvme_revalidate_disk(struct gendisk *disk) +{ +    int
>>>> ret; + +    ret = _nvme_revalidate_disk(disk); +    if (ret) +
>>>> return ret; + +#ifdef CONFIG_BLK_DEV_ZONED +    if
>>>> (blk_queue_is_zoned(disk->queue)) { +        struct nvme_ns *ns =
>>>> disk->private_data; +        struct nvme_ctrl *ctrl = ns->ctrl; + +
>>>> ret = blk_revalidate_disk_zones(disk, NULL); +        if (!ret) +
>>>> blk_queue_max_zone_append_sectors(disk->queue, +
>>>> ctrl->max_zone_append); +    } +#endif +    return ret; +} + static
>>>> char nvme_pr_type(enum pr_type type) { switch (type) { @@ -2188,6
>>>> +2251,7 @@ static const struct block_device_operations nvme_fops = { 
>>>> .release    = nvme_release, .getgeo        = nvme_getgeo, 
>>>> .revalidate_disk= nvme_revalidate_disk, +    .report_zones    =
>>>> nvme_report_zones, .pr_ops        = &nvme_pr_ops, };
>>>> 
>>>> @@ -2213,6 +2277,7 @@ const struct block_device_operations 
>>>> nvme_ns_head_ops = { .ioctl        = nvme_ioctl, .compat_ioctl    =
>>>> nvme_compat_ioctl, .getgeo        = nvme_getgeo, +    .report_zones
>>>> = nvme_report_zones, .pr_ops        = &nvme_pr_ops, }; #endif /*
>>>> CONFIG_NVME_MULTIPATH */ @@ -4439,6 +4504,8 @@ static inline void
>>>> _nvme_check_size(void) BUILD_BUG_ON(sizeof(struct nvme_command) !=
>>>> 64); BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != 
>>>> NVME_IDENTIFY_DATA_SIZE); BUILD_BUG_ON(sizeof(struct nvme_id_ns) !=
>>>> NVME_IDENTIFY_DATA_SIZE); +    BUILD_BUG_ON(sizeof(struct
>>>> nvme_id_ns_zns) != NVME_IDENTIFY_DATA_SIZE); +
>>>> BUILD_BUG_ON(sizeof(struct nvme_id_ctrl_zns) != 
>>>> NVME_IDENTIFY_DATA_SIZE); BUILD_BUG_ON(sizeof(struct
>>>> nvme_lba_range_type) != 64); BUILD_BUG_ON(sizeof(struct nvme_smart_log)
>>>> != 512); BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64); diff --git
>>>> a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index
>>>> 58428e3a590e..662f95fbd909 100644 --- a/drivers/nvme/host/nvme.h +++
>>>> b/drivers/nvme/host/nvme.h @@ -239,6 +239,9 @@ struct nvme_ctrl { u32
>>>> max_hw_sectors; u32 max_segments; u32 max_integrity_segments; +#ifdef
>>>> CONFIG_BLK_DEV_ZONED +    u32 max_zone_append; +#endif u16 crdt[3]; u16
>>>> oncs; u16 oacs; @@ -403,6 +406,9 @@ struct nvme_ns { u16 sgs; u32 sws; 
>>>> u8 pi_type; +#ifdef CONFIG_BLK_DEV_ZONED +    u64 zsze; +#endif 
>>>> unsigned long features; unsigned long flags; #define NVME_NS_REMOVING
>>>> 0 @@ -568,6 +574,9 @@ int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
>>>> 
>>>> int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,
>>>> u8 csi, void *log, size_t size, u64 offset); +struct nvme_ns
>>>> *nvme_get_ns_from_disk(struct gendisk *disk, +        struct
>>>> nvme_ns_head **head, int *srcu_idx); +void nvme_put_ns_from_disk(struct
>>>> nvme_ns_head *head, int idx);
>>>> 
>>>> extern const struct attribute_group *nvme_ns_id_attr_groups[]; extern
>>>> const struct block_device_operations nvme_ns_head_ops; @@ -689,6
>>>> +698,36 @@ static inline void nvme_mpath_start_freeze(struct
>>>> nvme_subsystem *subsys) } #endif /* CONFIG_NVME_MULTIPATH */
>>>> 
>>>> +#ifdef CONFIG_BLK_DEV_ZONED +int nvme_update_zone_info(struct gendisk
>>>> *disk, struct nvme_ns *ns, +              unsigned lbaf); + +int
>>>> nvme_report_zones(struct gendisk *disk, sector_t sector, +
>>>> unsigned int nr_zones, report_zones_cb cb, void *data); + +blk_status_t
>>>> nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req, +
>>>> struct nvme_command *cmnd, +                       enum
>>>> nvme_zone_mgmt_action action); +#else +#define nvme_report_zones NULL 
>>>> + +static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns
>>>> *ns, +        struct request *req, struct nvme_command *cmnd, +
>>>> enum nvme_zone_mgmt_action action) +{ +    return BLK_STS_NOTSUPP; +} 
>>>> + +static inline int nvme_update_zone_info(struct gendisk *disk, +
>>>> struct nvme_ns *ns, +                    unsigned lbaf) +{ +
>>>> dev_warn(ns->ctrl->device, +         "Please enable
>>>> CONFIG_BLK_DEV_ZONED to support ZNS devices\n"); +    return
>>>> -EPROTONOSUPPORT; +} +#endif + #ifdef CONFIG_NVM int
>>>> nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node); void
>>>> nvme_nvm_unregister(struct nvme_ns *ns); diff --git
>>>> a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c new file mode
>>>> 100644 index 000000000000..c08f6281b614 --- /dev/null +++
>>>> b/drivers/nvme/host/zns.c @@ -0,0 +1,238 @@ +//
>>>> SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2020 Western
>>>> Digital Corporation or its affiliates. + */ + +#include
>>>> <linux/blkdev.h> +#include <linux/vmalloc.h> +#include "nvme.h" + 
>>>> +static int nvme_set_max_append(struct nvme_ctrl *ctrl) +{ +    struct
>>>> nvme_command c = { }; +    struct nvme_id_ctrl_zns *id; +    int
>>>> status; + +    id = kzalloc(sizeof(*id), GFP_KERNEL); +    if (!id) +
>>>> return -ENOMEM; + +    c.identify.opcode = nvme_admin_identify; +
>>>> c.identify.cns = NVME_ID_CNS_CS_CTRL; +    c.identify.csi =
>>>> NVME_CSI_ZNS; + +    status = nvme_submit_sync_cmd(ctrl->admin_q, &c,
>>>> id, sizeof(*id)); +    if (status) { +        kfree(id); +
>>>> return status; +    } + +    ctrl->max_zone_append = 1 << (id->zamds +
>>>> 3); +    kfree(id); +    return 0; +} + +int
>>>> nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns, +
>>>> unsigned lbaf) +{ +    struct nvme_effects_log *log =
>>>> ns->head->effects; +    struct request_queue *q = disk->queue; +
>>>> struct nvme_command c = { }; +    struct nvme_id_ns_zns *id; +    int
>>>> status; + +    /* Driver requires zone append support */ +    if
>>>> (!(log->iocs[nvme_cmd_zone_append] & NVME_CMD_EFFECTS_CSUPP)) +
>>>> return -ENODEV;
>>> 
>>> Following up on the initial comment, this check should go.
>> 
>> See first comment.
> 
> See above and please remove.
> 
>> 
>>> 
>>>> + +    /* Lazily query controller append limit for the first zoned 
>>>> namespace */ +    if (!ns->ctrl->max_zone_append) { +        status =
>>>> nvme_set_max_append(ns->ctrl); +        if (status) +            return
>>>> status; +    }
>>> 
>>> This should only be applied if append is supported.
>> 
>> See first comment.
>> 
>>> 
>>>> + +    id = kzalloc(sizeof(*id), GFP_KERNEL); +    if (!id) +
>>>> return -ENOMEM; + +    c.identify.opcode = nvme_admin_identify; +
>>>> c.identify.nsid = cpu_to_le32(ns->head->ns_id); +    c.identify.cns =
>>>> NVME_ID_CNS_CS_NS; +    c.identify.csi = NVME_CSI_ZNS; + +    status =
>>>> nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, id, sizeof(*id)); +    if
>>>> (status) +        goto free_data; + +    /* +     * We currently do not
>>>> handle devices requiring any of the zoned +     * operation
>>>> characteristics. +     */ +    if (id->zoc) { +        status =
>>>> -EINVAL; +        goto free_data; +    }
>>> 
>>> I understand that "Variable Zone Capacity" is not supported as it 
>>> requires major changes at this moment, but we should support controllers 
>>> that enable "Zone Active Excursions", even when the AER event is not 
>>> implemented in this patchset.
>> 
>> 
>> NAK. Similarly to VZC, this allows an unsuspecting user to have major data
>> loss when a zone is suddenly moved to Full.
> 
> I buy that.
> 
>> 
>> 
>>> 
>>>> + +    ns->zsze = nvme_lba_to_sect(ns,
>>>> le64_to_cpu(id->lbafe[lbaf].zsze)); +    if (!ns->zsze) { +
>>>> status = -EINVAL; +        goto free_data; +    } + +
>>>> q->limits.zoned = BLK_ZONED_HM; +
>>>> blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q); +free_data: +
>>>> kfree(id); +    return status; +} + +static void
>>>> *nvme_zns_alloc_report_buffer(struct nvme_ns *ns, +
>>>> unsigned int nr_zones, size_t *buflen) +{ +    struct request_queue *q
>>>> = ns->disk->queue; +    size_t bufsize; +    void *buf; + +    const
>>>> size_t min_bufsize = sizeof(struct nvme_zone_report) + +
>>>> sizeof(struct nvme_zone_descriptor); + +    nr_zones = min_t(unsigned
>>>> int, nr_zones, +             get_capacity(ns->disk) >>
>>>> ilog2(ns->zsze)); + +    bufsize = sizeof(struct nvme_zone_report) + +
>>>> nr_zones * sizeof(struct nvme_zone_descriptor); +    bufsize =
>>>> min_t(size_t, bufsize, +            queue_max_hw_sectors(q) <<
>>>> SECTOR_SHIFT); +    bufsize = min_t(size_t, bufsize,
>>>> queue_max_segments(q) << PAGE_SHIFT); + +    while (bufsize >=
>>>> min_bufsize) { +        buf = __vmalloc(bufsize, +
>>>> GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY); +        if (buf) { +
>>>> *buflen = bufsize; +            return buf; +        } +        bufsize
>>>> >>= 1; +    } +    return NULL; +} + +static int
>>>> __nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector, +
>>>> struct nvme_zone_report *report, +                  size_t buflen) +{ +
>>>> struct nvme_command c = { }; +    int ret; + +    c.zmr.opcode =
>>>> nvme_cmd_zone_mgmt_recv; +    c.zmr.nsid =
>>>> cpu_to_le32(ns->head->ns_id); +    c.zmr.slba =
>>>> cpu_to_le64(nvme_sect_to_lba(ns, sector)); +    c.zmr.numd =
>>>> cpu_to_le32(nvme_bytes_to_numd(buflen)); +    c.zmr.zra =
>>>> NVME_ZRA_ZONE_REPORT; +    c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL; +
>>>> c.zmr.pr = NVME_REPORT_ZONE_PARTIAL; + +    ret =
>>>> nvme_submit_sync_cmd(ns->queue, &c, report, buflen); +    if (ret) +
>>>> return ret; + +    return le64_to_cpu(report->nr_zones); +} + +static
>>>> int nvme_zone_parse_entry(struct nvme_ns *ns, +                 struct
>>>> nvme_zone_descriptor *entry, +                 unsigned int idx,
>>>> report_zones_cb cb, +                 void *data) +{ +    struct
>>>> blk_zone zone = { }; + +    if ((entry->zt & 0xf) !=
>>>> NVME_ZONE_TYPE_SEQWRITE_REQ) { +        dev_err(ns->ctrl->device,
>>>> "invalid zone type %#x\n", +                entry->zt); +        return
>>>> -EINVAL; +    } + +    zone.type = BLK_ZONE_TYPE_SEQWRITE_REQ; +
>>>> zone.cond = entry->zs >> 4; +    zone.len = ns->zsze; +
>>>> zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap)); +
>>>> +    zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>>>> +    zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
>>>> +
>>>> +    return cb(&zone, idx, data);
>>>> +}
>>>> +
>>>> +static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>> +            unsigned int nr_zones, report_zones_cb cb, void *data)
>>>> +{
>>>> +    struct nvme_zone_report *report;
>>>> +    int ret, zone_idx = 0;
>>>> +    unsigned int nz, i;
>>>> +    size_t buflen;
>>>> +
>>>> +    report = nvme_zns_alloc_report_buffer(ns, nr_zones, &buflen);
>>>> +    if (!report)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    sector &= ~(ns->zsze - 1);
>>>> +    while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>>>> +        memset(report, 0, buflen);
>>>> +        ret = __nvme_ns_report_zones(ns, sector, report, buflen);
>>>> +        if (ret < 0)
>>>> +            goto out_free;
>>>> +
>>>> +        nz = min_t(unsigned int, ret, nr_zones);
>>>> +        if (!nz)
>>>> +            break;
>>>> +
>>>> +        for (i = 0; i < nz && zone_idx < nr_zones; i++) {
>>>> +            ret = nvme_zone_parse_entry(ns, &report->entries[i],
>>>> +                        zone_idx, cb, data);
>>>> +            if (ret)
>>>> +                goto out_free;
>>>> +            zone_idx++;
>>>> +        }
>>>> +
>>>> +        sector += ns->zsze * nz;
>>>> +    }
>>>> +
>>>> +    ret = zone_idx;
>>>> +out_free:
>>>> +    kvfree(report);
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>>> +              unsigned int nr_zones, report_zones_cb cb, void *data)
>>>> +{
>>>> +    struct nvme_ns_head *head = NULL;
>>>> +    struct nvme_ns *ns;
>>>> +    int srcu_idx, ret;
>>>> +
>>>> +    ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
>>>> +    if (unlikely(!ns))
>>>> +        return -EWOULDBLOCK;
>>>> +
>>>> +    if (ns->head->ids.csi == NVME_CSI_ZNS)
>>>> +        ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>>>> +    else
>>>> +        ret = -EINVAL;
>>>> +    nvme_put_ns_from_disk(head, srcu_idx);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>>> +        struct nvme_command *c, enum nvme_zone_mgmt_action action)
>>>> +{
>>>> +    c->zms.opcode = nvme_cmd_zone_mgmt_send;
>>>> +    c->zms.nsid = cpu_to_le32(ns->head->ns_id);
>>>> +    c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
>>>> +    c->zms.action = action;
>>>> +
>>>> +    if (req_op(req) == REQ_OP_ZONE_RESET_ALL)
>>>> +        c->zms.select = 1;
>>>> +
>>>> +    return BLK_STS_OK;
>>>> +}
>>>> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
>>>> index ea25da572eed..7b3fa7de07bd 100644
>>>> --- a/include/linux/nvme.h
>>>> +++ b/include/linux/nvme.h
>>>> @@ -374,6 +374,30 @@ struct nvme_id_ns {
>>>>      __u8            vs[3712];
>>>>  };
>>>>
>>>> +struct nvme_zns_lbafe {
>>>> +    __le64            zsze;
>>>> +    __u8            zdes;
>>>> +    __u8            rsvd9[7];
>>>> +};
>>>> +
>>>> +struct nvme_id_ns_zns {
>>>> +    __le16            zoc;
>>>> +    __le16            ozcs;
>>>> +    __le32            mar;
>>>> +    __le32            mor;
>>>> +    __le32            rrl;
>>>> +    __le32            frl;
>>>> +    __u8            rsvd20[2796];
>>>> +    struct nvme_zns_lbafe    lbafe[16];
>>>> +    __u8            rsvd3072[768];
>>>> +    __u8            vs[256];
>>>> +};
>>>> +
>>>> +struct nvme_id_ctrl_zns {
>>>> +    __u8    zamds;
>>>> +    __u8    rsvd1[4095];
>>>> +};
>>>> +
>>>>  enum {
>>>>      NVME_ID_CNS_NS            = 0x00,
>>>>      NVME_ID_CNS_CTRL        = 0x01,
>>>> @@ -392,6 +416,7 @@ enum {
>>>>
>>>>  enum {
>>>>      NVME_CSI_NVM            = 0,
>>>> +    NVME_CSI_ZNS            = 2,
>>>>  };
>>>>
>>>>  enum {
>>>> @@ -532,6 +557,27 @@ struct nvme_ana_rsp_hdr {
>>>>      __le16            rsvd10[3];
>>>>  };
>>>>
>>>> +struct nvme_zone_descriptor {
>>>> +    __u8        zt;
>>>> +    __u8        zs;
>>>> +    __u8        za;
>>>> +    __u8        rsvd3[5];
>>>> +    __le64        zcap;
>>>> +    __le64        zslba;
>>>> +    __le64        wp;
>>>> +    __u8        rsvd32[32];
>>>> +};
>>>> +
>>>> +enum {
>>>> +    NVME_ZONE_TYPE_SEQWRITE_REQ    = 0x2,
>>>> +};
>>>> +
>>>> +struct nvme_zone_report {
>>>> +    __le64        nr_zones;
>>>> +    __u8        resv8[56];
>>>> +    struct nvme_zone_descriptor entries[];
>>>> +};
>>>> +
>>>>  enum {
>>>>      NVME_SMART_CRIT_SPARE        = 1 << 0,
>>>>      NVME_SMART_CRIT_TEMPERATURE    = 1 << 1,
>>>> @@ -626,6 +672,9 @@ enum nvme_opcode {
>>>>      nvme_cmd_resv_report    = 0x0e,
>>>>      nvme_cmd_resv_acquire    = 0x11,
>>>>      nvme_cmd_resv_release    = 0x15,
>>>> +    nvme_cmd_zone_mgmt_send    = 0x79,
>>>> +    nvme_cmd_zone_mgmt_recv    = 0x7a,
>>>> +    nvme_cmd_zone_append    = 0x7d,
>>>>  };
>>>>
>>>>  #define nvme_opcode_name(opcode)    { opcode, #opcode }
>>>> @@ -764,6 +813,7 @@ struct nvme_rw_command {
>>>>  enum {
>>>>      NVME_RW_LR            = 1 << 15,
>>>>      NVME_RW_FUA            = 1 << 14,
>>>> +    NVME_RW_APPEND_PIREMAP        = 1 << 9,
>>>>      NVME_RW_DSM_FREQ_UNSPEC        = 0,
>>>>      NVME_RW_DSM_FREQ_TYPICAL    = 1,
>>>>      NVME_RW_DSM_FREQ_RARE        = 2,
>>>> @@ -829,6 +879,53 @@ struct nvme_write_zeroes_cmd {
>>>>      __le16            appmask;
>>>>  };
>>>>
>>>> +enum nvme_zone_mgmt_action {
>>>> +    NVME_ZONE_CLOSE        = 0x1,
>>>> +    NVME_ZONE_FINISH    = 0x2,
>>>> +    NVME_ZONE_OPEN        = 0x3,
>>>> +    NVME_ZONE_RESET        = 0x4,
>>>> +    NVME_ZONE_OFFLINE    = 0x5,
>>>> +    NVME_ZONE_SET_DESC_EXT    = 0x10,
>>>> +};
>>>> +
>>>> +struct nvme_zone_mgmt_send_cmd {
>>>> +    __u8            opcode;
>>>> +    __u8            flags;
>>>> +    __u16            command_id;
>>>> +    __le32            nsid;
>>>> +    __le32            cdw2[2];
>>>> +    __le64            metadata;
>>>> +    union nvme_data_ptr    dptr;
>>>> +    __le64            slba;
>>>> +    __le32            cdw12;
>>>> +    __u8            action;
>>> 
>>> Why not zsa, to make it easier to match to the spec?
>>> 
>>> 
>>>> +    __u8            select;
>>> 
>>> sel_all?
>>> 
>>>> +    __u8            rsvd13[2];
>>>> +    __le32            cdw14[2];
>>>> +};
>>>> +
>>>> +struct nvme_zone_mgmt_recv_cmd {
>>>> +    __u8            opcode;
>>>> +    __u8            flags;
>>>> +    __u16            command_id;
>>>> +    __le32            nsid;
>>>> +    __le64            rsvd2[2];
>>>> +    union nvme_data_ptr    dptr;
>>>> +    __le64            slba;
>>>> +    __le32            numd;
>>>> +    __u8            zra;
>>>> +    __u8            zrasf;
>>>> +    __u8            pr;
>>> 
>>> Partial Report is just one bit in the "Zone Receive Action Specific 
>>> Features". What about zrasfe?
>> 
>> There are currently no users of pr, and bits 1-7 are reserved in the spec.
>> Users of the pr variable should shift and mask as necessary.
>> 
>> zrasf looks good to me. It is defined as a byte in the spec.
> 
> I meant for the pr variable name. Agree with the rest.
> 
>> 
>>> 
>>>> +    __u8            rsvd13;
>>>> +    __le32            cdw14[2];
>>>> +};
>>>> +
>>>> +enum {
>>>> +    NVME_ZRA_ZONE_REPORT        = 0,
>>>> +    NVME_ZRASF_ZONE_REPORT_ALL    = 0,
>>>> +    NVME_REPORT_ZONE_PARTIAL    = 1,
>>>> +};
>>>> +
>>>>  /* Features */
>>>>
>>>>  enum {
>>>> @@ -1300,6 +1397,8 @@ struct nvme_command {
>>>>          struct nvme_format_cmd format;
>>>>          struct nvme_dsm_cmd dsm;
>>>>          struct nvme_write_zeroes_cmd write_zeroes;
>>>> +        struct nvme_zone_mgmt_send_cmd zms;
>>>> +        struct nvme_zone_mgmt_recv_cmd zmr;
>>>>          struct nvme_abort_cmd abort;
>>>>          struct nvme_get_log_page_command get_log_page;
>>>>          struct nvmf_common_command fabrics;
>>>> @@ -1433,6 +1532,18 @@ enum {
>>>>      NVME_SC_DISCOVERY_RESTART    = 0x190,
>>>>      NVME_SC_AUTH_REQUIRED        = 0x191,
>>>>
>>>> +    /*
>>>> +     * I/O Command Set Specific - Zoned commands:
>>>> +     */
>>>> +    NVME_SC_ZONE_BOUNDARY_ERROR    = 0x1b8,
>>>> +    NVME_SC_ZONE_FULL        = 0x1b9,
>>>> +    NVME_SC_ZONE_READ_ONLY        = 0x1ba,
>>>> +    NVME_SC_ZONE_OFFLINE        = 0x1bb,
>>>> +    NVME_SC_ZONE_INVALID_WRITE    = 0x1bc,
>>>> +    NVME_SC_ZONE_TOO_MANY_ACTIVE    = 0x1bd,
>>>> +    NVME_SC_ZONE_TOO_MANY_OPEN    = 0x1be,
>>>> +    NVME_SC_ZONE_INVALID_TRANSITION    = 0x1bf,
>>>> +
>>>>      /*
>>>>       * Media and Data Integrity Errors:
>>>>       */
>>>> --
>>>> 2.24.1
>>>> 
>> 
> 
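
To make the shift-and-mask point above concrete, here is a minimal sketch of
how a caller might handle the pr byte: only bit 0 (Partial Report) is
defined, and bits 1-7 are reserved. Only NVME_REPORT_ZONE_PARTIAL comes from
the quoted nvme.h hunk; the helper names are hypothetical, not part of the
patch.

    #include <linux/types.h>            /* __u8, bool */

    #define NVME_REPORT_ZONE_PARTIAL    1    /* bit 0, from the quoted patch */

    /* Test the Partial Report bit, ignoring reserved bits 1-7. */
    static inline bool zrecv_partial_report(__u8 pr)
    {
            return pr & NVME_REPORT_ZONE_PARTIAL;
    }

    /* Build a pr byte; reserved bits 1-7 are left clear. */
    static inline __u8 zrecv_make_pr(bool partial)
    {
            return partial ? NVME_REPORT_ZONE_PARTIAL : 0;
    }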

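One more note on the report loop in the quoted patch: the statement
sector &= ~(ns->zsze - 1) rounds the starting sector down to a zone boundary,
which is correct only because the zone size in sectors is a power of two. A
standalone sketch of that alignment arithmetic, with hypothetical names:

    #include <linux/types.h>            /* sector_t */

    /*
     * Round a sector down to the start of its zone. Valid only when
     * zone_sectors is a power of two, as the driver assumes for ns->zsze.
     */
    static inline sector_t zone_start_sector(sector_t sector,
                                             sector_t zone_sectors)
    {
            return sector & ~(zone_sectors - 1);
    }

    /*
     * Example: with 524288-sector zones (256 MiB at 512 B per sector),
     * zone_start_sector(600000, 524288) == 524288.
     */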

-- 
Damien Le Moal
Western Digital Research

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

Thread overview: 192+ messages
2020-06-15 23:34 [PATCH 0/5] nvme support for zoned namespace command set Keith Busch
2020-06-15 23:34 ` [PATCH 1/5] block: add capacity field to zone descriptors Keith Busch
2020-06-15 23:49   ` Chaitanya Kulkarni
2020-06-16 10:28   ` Javier González
2020-06-16 13:47   ` Daniel Wagner
2020-06-16 13:54   ` Johannes Thumshirn
2020-06-16 15:41   ` Martin K. Petersen
2020-06-15 23:34 ` [PATCH 2/5] null_blk: introduce zone capacity for zoned device Keith Busch
2020-06-15 23:46   ` Chaitanya Kulkarni
2020-06-16 14:18   ` Daniel Wagner
2020-06-16 15:48   ` Martin K. Petersen
2020-06-15 23:34 ` [PATCH 3/5] nvme: implement I/O Command Sets Command Set support Keith Busch
2020-06-16 10:33   ` Javier González
2020-06-16 17:14     ` Niklas Cassel
2020-06-16 15:58   ` Martin K. Petersen
2020-06-16 17:01     ` Keith Busch
2020-06-17  9:50       ` Niklas Cassel
2020-06-16 17:06     ` Niklas Cassel
2020-06-17  2:01       ` Martin K. Petersen
2020-06-15 23:34 ` [PATCH 4/5] nvme: support for multi-command set effects Keith Busch
2020-06-16 10:34   ` Javier González
2020-06-16 16:03   ` Martin K. Petersen
2020-06-15 23:34 ` [PATCH 5/5] nvme: support for zoned namespaces Keith Busch
2020-06-16 10:41   ` Javier González
2020-06-16 11:18     ` Matias Bjørling
2020-06-16 12:00       ` Javier González
2020-06-16 12:06         ` Matias Bjørling
2020-06-16 12:24           ` Javier González
2020-06-16 12:27             ` Matias Bjørling
2020-06-16 12:35             ` Damien Le Moal
     [not found]               ` <CGME20200616130815uscas1p1be34e5fceaa548eac31fb30790a689d4@uscas1p1.samsung.com>
2020-06-16 13:08                 ` Judy Brock
2020-06-16 13:32                   ` Matias Bjørling
2020-06-16 13:34                   ` Damien Le Moal
2020-06-16 14:16               ` Javier González
2020-06-16 14:42                 ` Damien Le Moal
2020-06-16 15:02                   ` Javier González
2020-06-16 15:20                     ` Matias Bjørling
2020-06-16 16:03                       ` Javier González
2020-06-16 16:07                         ` Matias Bjorling
2020-06-16 16:21                           ` Javier González
2020-06-16 16:25                             ` Matias Bjørling
2020-06-16 15:48                     ` Keith Busch
2020-06-16 15:55                       ` Javier González
2020-06-16 16:04                         ` Matias Bjorling
2020-06-16 16:07                         ` Keith Busch
2020-06-16 16:13                           ` Javier González
2020-06-17  0:38                             ` Damien Le Moal
2020-06-17  6:18                               ` Javier González
2020-06-17  6:54                                 ` Damien Le Moal
2020-06-17  7:11                                   ` Javier González
2020-06-17  7:29                                     ` Damien Le Moal
2020-06-17  7:34                                       ` Javier González
2020-06-17  0:14                     ` Damien Le Moal
2020-06-17  6:09                       ` Javier González
2020-06-17  6:47                         ` Damien Le Moal
2020-06-17  7:02                           ` Javier González
2020-06-17  7:24                             ` Damien Le Moal
2020-06-17  7:29                               ` Javier González
     [not found]         ` <CGME20200616123503uscas1p22ce22054a1b4152a20437b5abdd55119@uscas1p2.samsung.com>
2020-06-16 12:35           ` Judy Brock
2020-06-16 12:37             ` Damien Le Moal [this message]
2020-06-16 12:37             ` Matias Bjørling
2020-06-16 13:12               ` Judy Brock
2020-06-16 13:18                 ` Judy Brock
2020-06-16 13:32                   ` Judy Brock
2020-06-16 13:39                     ` Damien Le Moal
2020-06-17  7:43     ` Christoph Hellwig
2020-06-17 12:01       ` Martin K. Petersen
2020-06-17 15:00         ` Javier González
2020-06-17 14:42       ` Javier González
2020-06-17 17:57         ` Matias Bjørling
2020-06-17 18:28           ` Javier González
2020-06-17 18:55             ` Matias Bjorling
2020-06-17 19:09               ` Javier González
2020-06-17 19:23                 ` Matias Bjørling
2020-06-17 19:40                   ` Javier González
2020-06-17 23:44                     ` Heiner Litz
2020-06-18  1:55                       ` Keith Busch
2020-06-18  4:24                         ` Heiner Litz
2020-06-18  5:15                           ` Damien Le Moal
2020-06-18 20:47                             ` Heiner Litz
2020-06-18 21:04                               ` Matias Bjorling
2020-06-18 21:19                               ` Keith Busch
2020-06-18 22:05                                 ` Heiner Litz
2020-06-19  0:57                                   ` Damien Le Moal
2020-06-19 10:29                                   ` Matias Bjorling
2020-06-19 18:08                                     ` Heiner Litz
2020-06-19 18:10                                       ` Keith Busch
2020-06-19 18:17                                         ` Heiner Litz
2020-06-19 18:22                                           ` Keith Busch
2020-06-19 18:25                                           ` Matias Bjørling
2020-06-19 18:40                                             ` Heiner Litz
2020-06-19 18:18                                       ` Matias Bjørling
2020-06-20  6:33                                       ` Christoph Hellwig
2020-06-20 17:52                                         ` Heiner Litz
2020-06-22 14:01                                           ` Christoph Hellwig
2022-03-02 21:11                   ` Luis Chamberlain
2020-06-17  2:08   ` Martin K. Petersen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the thread as an mbox file, import it into your mail client,
  and reply-to-all from there.

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CY4PR04MB375103860D44AB2DF9040655E79D0@CY4PR04MB3751.namprd04.prod.outlook.com \
    --to=damien.lemoal@wdc.com \
    --cc=Ajay.Joshi@wdc.com \
    --cc=Aravind.Ramesh@wdc.com \
    --cc=Dmitry.Fomichev@wdc.com \
    --cc=Hans.Holmberg@wdc.com \
    --cc=Keith.Busch@wdc.com \
    --cc=Matias.Bjorling@wdc.com \
    --cc=Niklas.Cassel@wdc.com \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=javier@javigon.com \
    --cc=judy.brock@samsung.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=mb@lightnvm.io \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.