From: Klaus Jensen <its@irrelevant.dk>
To: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Cc: "Fam Zheng" <fam@euphon.net>, "Kevin Wolf" <kwolf@redhat.com>,
	"Damien Le Moal" <damien.lemoal@wdc.com>,
	qemu-block@nongnu.org, "Niklas Cassel" <niklas.cassel@wdc.com>,
	"Klaus Jensen" <k.jensen@samsung.com>,
	qemu-devel@nongnu.org, "Maxim Levitsky" <mlevitsk@redhat.com>,
	"Alistair Francis" <alistair.francis@wdc.com>,
	"Keith Busch" <kbusch@kernel.org>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Matias Bjorling" <matias.bjorling@wdc.com>
Subject: Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
Date: Mon, 28 Sep 2020 12:42:58 +0200
Message-ID: <20200928104258.GB33043@apples.localdomain>
In-Reply-To: <20200928023528.15260-10-dmitry.fomichev@wdc.com>

On Sep 28 11:35, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise the NVM Command Set when
> the "zoned" device property is not set (the default) and the Zoned
> Namespace Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support Zone Descriptor Extensions is not included in
> this commit, and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 
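To restate the write-pointer rule described above (an illustrative sketch
only, not the code from this patch; the status constant name is my
assumption):

  /* regular writes must land exactly on the zone's write pointer */
  if (slba != zone->w_ptr) {
      return NVME_ZONE_INVALID_WRITE | NVME_DNR;
  }

  /* a zone append writes at the WP and returns the position it used */
  req->cqe.result64 = cpu_to_le64(zone->w_ptr);
  zone->w_ptr += nlb;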

I think the fill_pattern feature stands on its own, so it would be nice
to extract it into a separate patch.

> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  block/nvme.c         |   2 +-
>  hw/block/nvme-ns.c   | 185 ++++++++-
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 04172f083e..daa13546c4 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -38,7 +38,6 @@ typedef struct NvmeZoneList {
>  
>  typedef struct NvmeNamespaceParams {
>      uint32_t nsid;
> -    uint8_t  csi;
>      bool     attached;
>      QemuUUID uuid;
>  
> @@ -52,6 +51,7 @@ typedef struct NvmeNamespace {
>      DeviceState  parent_obj;
>      BlockConf    blkconf;
>      int32_t      bootindex;
> +    uint8_t      csi;
>      int64_t      size;
>      NvmeIdNs     id_ns;

This should be squashed into the namespace types patch.

> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 63ad03d6d6..38e25a4d1f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -54,6 +54,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "qemu/error-report.h"
> +#include "crypto/random.h"

I think this is not used until the offline/read-only zones injection
patch, right?

> +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> +                                      bool failed)
> +{
> +    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> +    NvmeZone *zone;
> +    uint64_t slba, start_wp = req->cqe.result64;
> +    uint32_t nlb, zone_idx;
> +    uint8_t zs;
> +
> +    if (rw->opcode != NVME_CMD_WRITE &&
> +        rw->opcode != NVME_CMD_ZONE_APPEND &&
> +        rw->opcode != NVME_CMD_WRITE_ZEROES) {
> +        return false;
> +    }
> +
> +    slba = le64_to_cpu(rw->slba);
> +    nlb = le16_to_cpu(rw->nlb) + 1;
> +    zone_idx = nvme_zone_idx(ns, slba);
> +    assert(zone_idx < ns->num_zones);
> +    zone = &ns->zone_array[zone_idx];
> +
> +    if (!failed && zone->w_ptr < start_wp + nlb) {
> +        /*
> +         * A preceding queued write to the zone has failed,
> +         * now this write is not at the WP, fail it too.
> +         */
> +        failed = true;
> +    }
> +
> +    if (failed) {
> +        if (zone->w_ptr > start_wp) {
> +            zone->w_ptr = start_wp;
> +        }

It is possible (though unlikely) that you already posted the CQE for the
write that moved the WP to w_ptr - and now you are reverting it.  This
looks like a recipe for data corruption to me.

Take this example. I use append, because if you have multiple regular
writes in queue you're screwed anyway.

  w_ptr = 0, d.wp = 0
  append 1 lba  -> w_ptr = 1, start_wp = 0, issues aio A
  append 2 lbas -> w_ptr = 3, start_wp = 1, issues aio B

  aio B success -> d.wp = 2 (since you are adding nlb),

Now, I totally do the same. Even though the zone descriptor write
pointer gets "out of sync", it will be reconciled in the absence of
failures, and it's fair to define that the host cannot expect a
consistent view of the write pointer without quiescing I/O.

The problem is if a write then fails:

  aio A fails   -> w_ptr > start_wp (3 > 1), so you revert to w_ptr = 1

That looks bad to me. I don't think this is ever reconciled? If another
append then comes in:

  append 1 lba -> w_ptr = 2, start_wp = 1, issues aio C and overwrites
                                           the second append from before.
  aio C success -> d.wp = 3 (but it should be 2)

> @@ -1513,11 +2267,16 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>  static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +    NvmeIdCtrlZoned id = {};
>  
>      trace_pci_nvme_identify_ctrl_csi(c->csi);
>  
>      if (c->csi == NVME_CSI_NVM) {
>          return nvme_rpt_empty_id_struct(n, req);
> +    } else if (c->csi == NVME_CSI_ZONED) {
> +        id.zasl = n->zasl;

I don't think it should overwrite the zasl value specified by the user.
If the user specified 0, then it should return 0 for zasl here.
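
Something along these lines would keep the configured value intact (just
a sketch against the quoted hunk, assuming n->zasl_bs holds the
user-configured limit):

  } else if (c->csi == NVME_CSI_ZONED) {
      /* keep a user-specified 0 ("MDTS applies") instead of the value
       * derived from MDTS at controller start */
      id.zasl = n->zasl_bs ? n->zasl : 0;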

> @@ -2310,16 +3086,28 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>              continue;
>          }
>          ns->params.attached = false;
> -        switch (ns->params.csi) {
> +        switch (ns->csi) {
>          case NVME_CSI_NVM:
>              if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
>                  NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
>                  ns->params.attached = true;
>              }
>              break;
> +        case NVME_CSI_ZONED:
> +            if (NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> +                ns->params.attached = true;
> +            }
> +            break;
>          }
>      }
>  
> +    if (!n->zasl_bs) {
> +        assert(n->params.mdts);

A value of 0 for MDTS is perfectly valid.
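
E.g. (sketch only), the fallback could just inherit the MDTS semantics
instead of asserting that MDTS is non-zero:

  if (!n->zasl_bs) {
      /* MDTS == 0 means "no limit", and ZASL == 0 means the same
       * ("MDTS applies"), so no assert is needed here */
      n->zasl = n->params.mdts;
  }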

> @@ -2382,10 +3170,11 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>                  case CSS_NVM_ONLY:
>                      trace_pci_nvme_css_nvm_cset_selected_by_host(data &
>                                                                   0xffffffff);
> -                    break;
> +                break;

Spurious misaligned break here.

