On Sep 28 11:35, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
>
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
>
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
>
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
>
> The code to support for Zone Descriptor Extensions is not included in
> this commit and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
>
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
>

I think the fill_pattern feature stands separate, so it would be nice
to extract that to a patch on its own.
> Signed-off-by: Niklas Cassel
> Signed-off-by: Hans Holmberg
> Signed-off-by: Ajay Joshi
> Signed-off-by: Chaitanya Kulkarni
> Signed-off-by: Matias Bjorling
> Signed-off-by: Aravind Ramesh
> Signed-off-by: Shin'ichiro Kawasaki
> Signed-off-by: Adam Manzanares
> Signed-off-by: Dmitry Fomichev
> ---
>  block/nvme.c         |   2 +-
>  hw/block/nvme-ns.c   | 185 ++++++++-
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
>
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 04172f083e..daa13546c4 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -38,7 +38,6 @@ typedef struct NvmeZoneList {
>
>  typedef struct NvmeNamespaceParams {
>      uint32_t nsid;
> -    uint8_t  csi;
>      bool     attached;
>      QemuUUID uuid;
>
> @@ -52,6 +51,7 @@ typedef struct NvmeNamespace {
>      DeviceState  parent_obj;
>      BlockConf    blkconf;
>      int32_t      bootindex;
> +    uint8_t      csi;
>      int64_t      size;
>      NvmeIdNs     id_ns;

This should be squashed into the namespace types patch.

> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 63ad03d6d6..38e25a4d1f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -54,6 +54,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "qemu/error-report.h"
> +#include "crypto/random.h"

I think this is not used until the offline/read-only zones injection
patch, right?
> +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> +                                      bool failed)
> +{
> +    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> +    NvmeZone *zone;
> +    uint64_t slba, start_wp = req->cqe.result64;
> +    uint32_t nlb, zone_idx;
> +    uint8_t zs;
> +
> +    if (rw->opcode != NVME_CMD_WRITE &&
> +        rw->opcode != NVME_CMD_ZONE_APPEND &&
> +        rw->opcode != NVME_CMD_WRITE_ZEROES) {
> +        return false;
> +    }
> +
> +    slba = le64_to_cpu(rw->slba);
> +    nlb = le16_to_cpu(rw->nlb) + 1;
> +    zone_idx = nvme_zone_idx(ns, slba);
> +    assert(zone_idx < ns->num_zones);
> +    zone = &ns->zone_array[zone_idx];
> +
> +    if (!failed && zone->w_ptr < start_wp + nlb) {
> +        /*
> +         * A preceding queued write to the zone has failed,
> +         * now this write is not at the WP, fail it too.
> +         */
> +        failed = true;
> +    }
> +
> +    if (failed) {
> +        if (zone->w_ptr > start_wp) {
> +            zone->w_ptr = start_wp;
> +        }

It is possible (though unlikely) that you already posted the CQE for
the write that moved the WP to w_ptr - and now you are reverting it.
This looks like a recipe for data corruption to me.

Take this example. I use append, because if you have multiple regular
writes in queue you're screwed anyway.

  w_ptr = 0, d.wp = 0
  append 1 lba  -> w_ptr = 1, start_wp = 0, issues aio A
  append 2 lbas -> w_ptr = 3, start_wp = 1, issues aio B
  aio B success -> d.wp = 2 (since you are adding nlb)

Now, I do totally the same. Even though the zone descriptor write
pointer gets "out of sync", it will be reconciled in the absence of
failures, and it's fair to define that the host cannot expect a
consistent view of the write pointer without quiescing I/O.

The problem is if a write then fails:

  aio A fails -> w_ptr > start_wp (3 > 1), so you revert to w_ptr = 1

That looks bad to me. I don't think this is ever reconciled? If
another append then comes in:

  append 1 lba -> w_ptr = 2, start_wp = 1, issues aio C and
                  overwrites the second append from before.
  aio C success -> d.wp = 3 (but it should be 2)

> @@ -1513,11 +2267,16 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>  static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +    NvmeIdCtrlZoned id = {};
>
>      trace_pci_nvme_identify_ctrl_csi(c->csi);
>
>      if (c->csi == NVME_CSI_NVM) {
>          return nvme_rpt_empty_id_struct(n, req);
> +    } else if (c->csi == NVME_CSI_ZONED) {
> +        id.zasl = n->zasl;

I don't think it should overwrite the zasl value specified by the
user. If the user specified 0, then it should return 0 for zasl here.

> @@ -2310,16 +3086,28 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>              continue;
>          }
>          ns->params.attached = false;
> -        switch (ns->params.csi) {
> +        switch (ns->csi) {
>          case NVME_CSI_NVM:
>              if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
>                  NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
>                  ns->params.attached = true;
>              }
>              break;
> +        case NVME_CSI_ZONED:
> +            if (NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> +                ns->params.attached = true;
> +            }
> +            break;
>          }
>      }
>
> +    if (!n->zasl_bs) {
> +        assert(n->params.mdts);

A value of 0 for MDTS is perfectly valid.

> @@ -2382,10 +3170,11 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>          case CSS_NVM_ONLY:
>              trace_pci_nvme_css_nvm_cset_selected_by_host(data &
>                                                           0xffffffff);
> -            break;
> +        break;

Spurious misaligned break here.