* [PATCH v8 0/5] hw/block/nvme: dulbe and dsm support
@ 2020-11-12 19:59 Klaus Jensen
  2020-11-12 19:59 ` [PATCH v8 1/5] hw/block/nvme: remove superfluous NvmeCtrl parameter Klaus Jensen
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Klaus Jensen @ 2020-11-12 19:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz, Keith Busch,
	Klaus Jensen, Philippe Mathieu-Daudé

From: Klaus Jensen <k.jensen@samsung.com>

This adds support for the Deallocated or Unwritten Logical Block error
recovery feature as well as the Dataset Management command.

v8:
  - Move req->opaque clearing to nvme_req_clear.
  - Add two preparation/cleanup patches.

v7:
  - Handle negative return value from bdrv_block_status.
  - bdrv_get_info may not be supported on all block drivers, so do not
    consider it a fatal error.

v6:
  - Skip the allocation of the discards integer and just use the opaque
    value directly (Philippe)
  - Split changes to include/block/nvme.h into a separate patch
    (Philippe)
  - Clean up some convoluted checks on the discards value (Philippe)
  - Use unambiguous units in the commit messages (Philippe)
  - Stack allocate the range array (Keith)

v5:
  - Restore status code from callback (Keith)

v4:
  - Removed mixed declaration and code (Keith)
  - Set NPDG and NPDA and account for the blockdev cluster size.

Klaus Jensen (5):
  hw/block/nvme: remove superfluous NvmeCtrl parameter
  hw/block/nvme: pull aio error handling
  hw/block/nvme: add dulbe support
  nvme: add namespace I/O optimization fields to shared header
  hw/block/nvme: add the dataset management command

 hw/block/nvme-ns.h    |   4 +
 hw/block/nvme.h       |   2 +
 include/block/nvme.h  |  12 +-
 hw/block/nvme-ns.c    |  34 +++++-
 hw/block/nvme.c       | 258 ++++++++++++++++++++++++++++++++++++------
 hw/block/trace-events |   4 +
 6 files changed, 276 insertions(+), 38 deletions(-)

-- 
2.29.2




* [PATCH v8 1/5] hw/block/nvme: remove superfluous NvmeCtrl parameter
  2020-11-12 19:59 [PATCH v8 0/5] hw/block/nvme: dulbe and dsm support Klaus Jensen
@ 2020-11-12 19:59 ` Klaus Jensen
  2020-11-16 11:29   ` Minwoo Im
  2020-11-12 19:59 ` [PATCH v8 2/5] hw/block/nvme: pull aio error handling Klaus Jensen
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2020-11-12 19:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz, Keith Busch,
	Klaus Jensen, Philippe Mathieu-Daudé

From: Klaus Jensen <k.jensen@samsung.com>

nvme_check_bounds has no use of the NvmeCtrl parameter; remove it.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 01b657b1c5e2..51f35e8cf18b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -866,8 +866,8 @@ static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len)
     return NVME_SUCCESS;
 }
 
-static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
-                                         uint64_t slba, uint32_t nlb)
+static inline uint16_t nvme_check_bounds(NvmeNamespace *ns, uint64_t slba,
+                                         uint32_t nlb)
 {
     uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
 
@@ -943,7 +943,7 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
 
     trace_pci_nvme_write_zeroes(nvme_cid(req), nvme_nsid(ns), slba, nlb);
 
-    status = nvme_check_bounds(n, ns, slba, nlb);
+    status = nvme_check_bounds(ns, slba, nlb);
     if (status) {
         trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
         return status;
@@ -979,7 +979,7 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
         goto invalid;
     }
 
-    status = nvme_check_bounds(n, ns, slba, nlb);
+    status = nvme_check_bounds(ns, slba, nlb);
     if (status) {
         trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
         goto invalid;
-- 
2.29.2




* [PATCH v8 2/5] hw/block/nvme: pull aio error handling
  2020-11-12 19:59 [PATCH v8 0/5] hw/block/nvme: dulbe and dsm support Klaus Jensen
  2020-11-12 19:59 ` [PATCH v8 1/5] hw/block/nvme: remove superfluous NvmeCtrl parameter Klaus Jensen
@ 2020-11-12 19:59 ` Klaus Jensen
  2020-11-16 11:36   ` Minwoo Im
  2020-11-16 17:57   ` Keith Busch
  2020-11-12 19:59 ` [PATCH v8 3/5] hw/block/nvme: add dulbe support Klaus Jensen
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Klaus Jensen @ 2020-11-12 19:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz, Keith Busch,
	Klaus Jensen, Philippe Mathieu-Daudé

From: Klaus Jensen <k.jensen@samsung.com>

Add a new function, nvme_aio_err, to handle errors resulting from AIOs
and use this from the callbacks.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c | 61 +++++++++++++++++++++++++++++--------------------
 1 file changed, 36 insertions(+), 25 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 51f35e8cf18b..a96a996ddc0d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -878,6 +878,41 @@ static inline uint16_t nvme_check_bounds(NvmeNamespace *ns, uint64_t slba,
     return NVME_SUCCESS;
 }
 
+static void nvme_aio_err(NvmeRequest *req, int ret)
+{
+    uint16_t status = NVME_SUCCESS;
+    Error *local_err = NULL;
+
+    switch (req->cmd.opcode) {
+    case NVME_CMD_READ:
+        status = NVME_UNRECOVERED_READ;
+        break;
+    case NVME_CMD_FLUSH:
+    case NVME_CMD_WRITE:
+    case NVME_CMD_WRITE_ZEROES:
+        status = NVME_WRITE_FAULT;
+        break;
+    default:
+        status = NVME_INTERNAL_DEV_ERROR;
+        break;
+    }
+
+    trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret), status);
+
+    error_setg_errno(&local_err, -ret, "aio failed");
+    error_report_err(local_err);
+
+    /*
+     * Set the command status code to the first encountered error but allow a
+     * subsequent Internal Device Error to trump it.
+     */
+    if (req->status && status != NVME_INTERNAL_DEV_ERROR) {
+        return;
+    }
+
+    req->status = status;
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
     NvmeRequest *req = opaque;
@@ -887,37 +922,13 @@ static void nvme_rw_cb(void *opaque, int ret)
     BlockAcctCookie *acct = &req->acct;
     BlockAcctStats *stats = blk_get_stats(blk);
 
-    Error *local_err = NULL;
-
     trace_pci_nvme_rw_cb(nvme_cid(req), blk_name(blk));
 
     if (!ret) {
         block_acct_done(stats, acct);
     } else {
-        uint16_t status;
-
         block_acct_failed(stats, acct);
-
-        switch (req->cmd.opcode) {
-        case NVME_CMD_READ:
-            status = NVME_UNRECOVERED_READ;
-            break;
-        case NVME_CMD_FLUSH:
-        case NVME_CMD_WRITE:
-        case NVME_CMD_WRITE_ZEROES:
-            status = NVME_WRITE_FAULT;
-            break;
-        default:
-            status = NVME_INTERNAL_DEV_ERROR;
-            break;
-        }
-
-        trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret), status);
-
-        error_setg_errno(&local_err, -ret, "aio failed");
-        error_report_err(local_err);
-
-        req->status = status;
+        nvme_aio_err(req, ret);
     }
 
     nvme_enqueue_req_completion(nvme_cq(req), req);
-- 
2.29.2




* [PATCH v8 3/5] hw/block/nvme: add dulbe support
  2020-11-12 19:59 [PATCH v8 0/5] hw/block/nvme: dulbe and dsm support Klaus Jensen
  2020-11-12 19:59 ` [PATCH v8 1/5] hw/block/nvme: remove superfluous NvmeCtrl parameter Klaus Jensen
  2020-11-12 19:59 ` [PATCH v8 2/5] hw/block/nvme: pull aio error handling Klaus Jensen
@ 2020-11-12 19:59 ` Klaus Jensen
  2020-11-16 11:43   ` Minwoo Im
  2020-11-16 18:00   ` Keith Busch
  2020-11-12 19:59 ` [PATCH v8 4/5] nvme: add namespace I/O optimization fields to shared header Klaus Jensen
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Klaus Jensen @ 2020-11-12 19:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz, Keith Busch,
	Klaus Jensen, Philippe Mathieu-Daudé

From: Klaus Jensen <k.jensen@samsung.com>

Add support for reporting the Deallocated or Unwritten Logical Block
Error (DULBE).

Rely on the block status flags reported by the block layer and consider
any block with the BDRV_BLOCK_ZERO flag to be deallocated.

Multiple factors affect when a Write Zeroes command results in
deallocation of blocks.

  * the underlying file system block size
  * the blockdev format
  * the 'discard' and 'logical_block_size' parameters

     format | discard | wz (512B)  wz (4KiB)  wz (64KiB)
    -----------------------------------------------------
      qcow2    ignore   n          n          y
      qcow2    unmap    n          n          y
      raw      ignore   n          y          y
      raw      unmap    n          y          y

So, this works best with an image in raw format and 4KiB LBAs, since
holes can then be punched on a per-block basis (this assumes a file
system with a 4KiB block size, YMMV). A qcow2 image uses a cluster size
of 64KiB by default, and blocks will only be marked deallocated if a full
cluster is zeroed or discarded. However, this *is* consistent with the
spec since Write Zeroes "should" deallocate the block if the Deallocate
attribute is set and "may" deallocate if the Deallocate attribute is not
set. Thus, we always try to deallocate (the BDRV_REQ_MAY_UNMAP flag is
always set).
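
For reference, a standalone sketch (not part of the patch, illustration
only) of how a host controls this behavior: the Error Recovery feature
(feature identifier 05h) carries the DULBE enable bit in bit 16 of the
feature value, which is exactly what the NVME_ERR_REC_DULBE and
NVME_ERR_REC_TLER macros added to the shared header decode. The program
below only exercises those two macros:

    #include <stdint.h>
    #include <stdio.h>

    /* copies of the macros added to include/block/nvme.h by this patch */
    #define NVME_ERR_REC_TLER(err_rec)  (err_rec & 0xffff)
    #define NVME_ERR_REC_DULBE(err_rec) (err_rec & 0x10000)

    int main(void)
    {
        uint32_t err_rec = 0x10000; /* DULBE enabled, TLER = 0 */

        printf("TLER:  %u\n", NVME_ERR_REC_TLER(err_rec));
        printf("DULBE: %s\n", NVME_ERR_REC_DULBE(err_rec) ? "on" : "off");

        return 0;
    }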

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme-ns.h    |  4 ++
 include/block/nvme.h  |  5 +++
 hw/block/nvme-ns.c    |  8 ++--
 hw/block/nvme.c       | 91 ++++++++++++++++++++++++++++++++++++++++++-
 hw/block/trace-events |  4 ++
 5 files changed, 107 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 83734f4606e1..44bf6271b744 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -31,6 +31,10 @@ typedef struct NvmeNamespace {
     NvmeIdNs     id_ns;
 
     NvmeNamespaceParams params;
+
+    struct {
+        uint32_t err_rec;
+    } features;
 } NvmeNamespace;
 
 static inline uint32_t nvme_nsid(NvmeNamespace *ns)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 8a46d9cf015f..966c3bb304bd 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -687,6 +687,7 @@ enum NvmeStatusCodes {
     NVME_E2E_REF_ERROR          = 0x0284,
     NVME_CMP_FAILURE            = 0x0285,
     NVME_ACCESS_DENIED          = 0x0286,
+    NVME_DULB                   = 0x0287,
     NVME_MORE                   = 0x2000,
     NVME_DNR                    = 0x4000,
     NVME_NO_COMPLETE            = 0xffff,
@@ -903,6 +904,9 @@ enum NvmeIdCtrlLpa {
 #define NVME_AEC_NS_ATTR(aec)       ((aec >> 8) & 0x1)
 #define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
 
+#define NVME_ERR_REC_TLER(err_rec)  (err_rec & 0xffff)
+#define NVME_ERR_REC_DULBE(err_rec) (err_rec & 0x10000)
+
 enum NvmeFeatureIds {
     NVME_ARBITRATION                = 0x1,
     NVME_POWER_MANAGEMENT           = 0x2,
@@ -1023,6 +1027,7 @@ enum NvmeNsIdentifierType {
 
 
 #define NVME_ID_NS_NSFEAT_THIN(nsfeat)      ((nsfeat & 0x1))
+#define NVME_ID_NS_NSFEAT_DULBE(nsfeat)     ((nsfeat >> 2) & 0x1)
 #define NVME_ID_NS_FLBAS_EXTENDED(flbas)    ((flbas >> 4) & 0x1)
 #define NVME_ID_NS_FLBAS_INDEX(flbas)       ((flbas & 0xf))
 #define NVME_ID_NS_MC_SEPARATE(mc)          ((mc >> 1) & 0x1)
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 31c80cdf5b5f..f1cc734c60f5 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -33,9 +33,7 @@ static void nvme_ns_init(NvmeNamespace *ns)
     NvmeIdNs *id_ns = &ns->id_ns;
     int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
 
-    if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
-        ns->id_ns.dlfeat = 0x9;
-    }
+    ns->id_ns.dlfeat = 0x9;
 
     id_ns->lbaf[lba_index].ds = 31 - clz32(ns->blkconf.logical_block_size);
 
@@ -44,6 +42,9 @@ static void nvme_ns_init(NvmeNamespace *ns)
     /* no thin provisioning */
     id_ns->ncap = id_ns->nsze;
     id_ns->nuse = id_ns->ncap;
+
+    /* support DULBE */
+    id_ns->nsfeat |= 0x4;
 }
 
 static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -92,6 +93,7 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     }
 
     nvme_ns_init(ns);
+
     if (nvme_register_namespace(n, ns, errp)) {
         return -1;
     }
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a96a996ddc0d..8d75a89fd872 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -105,6 +105,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
 
 static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
     [NVME_TEMPERATURE_THRESHOLD]    = NVME_FEAT_CAP_CHANGE,
+    [NVME_ERROR_RECOVERY]           = NVME_FEAT_CAP_CHANGE | NVME_FEAT_CAP_NS,
     [NVME_VOLATILE_WRITE_CACHE]     = NVME_FEAT_CAP_CHANGE,
     [NVME_NUMBER_OF_QUEUES]         = NVME_FEAT_CAP_CHANGE,
     [NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
@@ -878,6 +879,49 @@ static inline uint16_t nvme_check_bounds(NvmeNamespace *ns, uint64_t slba,
     return NVME_SUCCESS;
 }
 
+static uint16_t nvme_check_dulbe(NvmeNamespace *ns, uint64_t slba,
+                                 uint32_t nlb)
+{
+    BlockDriverState *bs = blk_bs(ns->blkconf.blk);
+
+    int64_t pnum = 0, bytes = nvme_l2b(ns, nlb);
+    int64_t offset = nvme_l2b(ns, slba);
+    bool zeroed;
+    int ret;
+
+    Error *local_err = NULL;
+
+    /*
+     * `pnum` holds the number of bytes after offset that shares the same
+     * allocation status as the byte at offset. If `pnum` is different from
+     * `bytes`, we should check the allocation status of the next range and
+     * continue this until all bytes have been checked.
+     */
+    do {
+        bytes -= pnum;
+
+        ret = bdrv_block_status(bs, offset, bytes, &pnum, NULL, NULL);
+        if (ret < 0) {
+            error_setg_errno(&local_err, -ret, "unable to get block status");
+            error_report_err(local_err);
+
+            return NVME_INTERNAL_DEV_ERROR;
+        }
+
+        zeroed = !!(ret & BDRV_BLOCK_ZERO);
+
+        trace_pci_nvme_block_status(offset, bytes, pnum, ret, zeroed);
+
+        if (zeroed) {
+            return NVME_DULB;
+        }
+
+        offset += pnum;
+    } while (pnum != bytes);
+
+    return NVME_SUCCESS;
+}
+
 static void nvme_aio_err(NvmeRequest *req, int ret)
 {
     uint16_t status = NVME_SUCCESS;
@@ -996,6 +1040,15 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
         goto invalid;
     }
 
+    if (acct == BLOCK_ACCT_READ) {
+        if (NVME_ERR_REC_DULBE(ns->features.err_rec)) {
+            status = nvme_check_dulbe(ns, slba, nlb);
+            if (status) {
+                goto invalid;
+            }
+        }
+    }
+
     status = nvme_map_dptr(n, data_size, req);
     if (status) {
         goto invalid;
@@ -1643,6 +1696,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
     uint8_t fid = NVME_GETSETFEAT_FID(dw10);
     NvmeGetFeatureSelect sel = NVME_GETFEAT_SELECT(dw10);
     uint16_t iv;
+    NvmeNamespace *ns;
 
     static const uint32_t nvme_feature_default[NVME_FID_MAX] = {
         [NVME_ARBITRATION] = NVME_ARB_AB_NOLIMIT,
@@ -1705,6 +1759,18 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
         }
 
         return NVME_INVALID_FIELD | NVME_DNR;
+    case NVME_ERROR_RECOVERY:
+        if (!nvme_nsid_valid(n, nsid)) {
+            return NVME_INVALID_NSID | NVME_DNR;
+        }
+
+        ns = nvme_ns(n, nsid);
+        if (unlikely(!ns)) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+
+        result = ns->features.err_rec;
+        goto out;
     case NVME_VOLATILE_WRITE_CACHE:
         result = n->features.vwc;
         trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -1777,7 +1843,7 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
 
 static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
 {
-    NvmeNamespace *ns;
+    NvmeNamespace *ns = NULL;
 
     NvmeCmd *cmd = &req->cmd;
     uint32_t dw10 = le32_to_cpu(cmd->cdw10);
@@ -1785,6 +1851,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
     uint32_t nsid = le32_to_cpu(cmd->nsid);
     uint8_t fid = NVME_GETSETFEAT_FID(dw10);
     uint8_t save = NVME_SETFEAT_SAVE(dw10);
+    int i;
 
     trace_pci_nvme_setfeat(nvme_cid(req), nsid, fid, save, dw11);
 
@@ -1844,11 +1911,31 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
                                NVME_LOG_SMART_INFO);
         }
 
+        break;
+    case NVME_ERROR_RECOVERY:
+        if (nsid == NVME_NSID_BROADCAST) {
+            for (i = 1; i <= n->num_namespaces; i++) {
+                ns = nvme_ns(n, i);
+
+                if (!ns) {
+                    continue;
+                }
+
+                if (NVME_ID_NS_NSFEAT_DULBE(ns->id_ns.nsfeat)) {
+                    ns->features.err_rec = dw11;
+                }
+            }
+
+            break;
+        }
+
+        assert(ns);
+        ns->features.err_rec = dw11;
         break;
     case NVME_VOLATILE_WRITE_CACHE:
         n->features.vwc = dw11 & 0x1;
 
-        for (int i = 1; i <= n->num_namespaces; i++) {
+        for (i = 1; i <= n->num_namespaces; i++) {
             ns = nvme_ns(n, i);
             if (!ns) {
                 continue;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index c1537e3ac0b0..1ffe0b3f76b5 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -43,6 +43,10 @@ pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opna
 pci_nvme_rw(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
 pci_nvme_write_zeroes(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
+pci_nvme_block_status(int64_t offset, int64_t bytes, int64_t pnum, int ret, bool zeroed) "offset %"PRId64" bytes %"PRId64" pnum %"PRId64" ret 0x%x zeroed %d"
+pci_nvme_dsm(uint16_t cid, uint32_t nsid, uint32_t nr, uint32_t attr) "cid %"PRIu16" nsid %"PRIu32" nr %"PRIu32" attr 0x%"PRIx32""
+pci_nvme_dsm_deallocate(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
+pci_nvme_aio_discard_cb(uint16_t cid) "cid %"PRIu16""
 pci_nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
 pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
 pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
-- 
2.29.2




* [PATCH v8 4/5] nvme: add namespace I/O optimization fields to shared header
  2020-11-12 19:59 [PATCH v8 0/5] hw/block/nvme: dulbe and dsm support Klaus Jensen
                   ` (2 preceding siblings ...)
  2020-11-12 19:59 ` [PATCH v8 3/5] hw/block/nvme: add dulbe support Klaus Jensen
@ 2020-11-12 19:59 ` Klaus Jensen
  2020-11-16 11:47   ` Minwoo Im
  2020-11-12 19:59 ` [PATCH v8 5/5] hw/block/nvme: add the dataset management command Klaus Jensen
  2020-11-23 20:45 ` [PATCH v8 0/5] hw/block/nvme: dulbe and dsm support Klaus Jensen
  5 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2020-11-12 19:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Fam Zheng, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Stefan Hajnoczi, Klaus Jensen,
	Philippe Mathieu-Daudé

From: Klaus Jensen <k.jensen@samsung.com>

This adds the NPWG, NPWA, NPDG, NPDA and NOWS family of fields to the
shared nvme.h header for use by later patches.
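
The new fields occupy the first ten bytes of the old rsvd64[40] area, so
the reserved tail shrinks to 30 bytes and NGUID must keep starting at byte
104 of the Identify Namespace data structure. A standalone sketch of that
arithmetic (illustration only, not QEMU code):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        /* five new 16-bit fields plus the shrunken reserved area must
         * still cover bytes 64..103 of Identify Namespace */
        _Static_assert(5 * sizeof(uint16_t) + 30 == 40,
                       "npwg/npwa/npdg/npda/nows + rsvd74 must span 40 bytes");
        assert(64 + 40 == 104); /* nguid keeps its offset */

        return 0;
    }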

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <fam@euphon.net>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/nvme.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 966c3bb304bd..e95ff6ca9b37 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
     uint16_t    nabspf;
     uint16_t    noiob;
     uint8_t     nvmcap[16];
-    uint8_t     rsvd64[40];
+    uint16_t    npwg;
+    uint16_t    npwa;
+    uint16_t    npdg;
+    uint16_t    npda;
+    uint16_t    nows;
+    uint8_t     rsvd74[30];
     uint8_t     nguid[16];
     uint64_t    eui64;
     NvmeLBAF    lbaf[16];
-- 
2.29.2




* [PATCH v8 5/5] hw/block/nvme: add the dataset management command
  2020-11-12 19:59 [PATCH v8 0/5] hw/block/nvme: dulbe and dsm support Klaus Jensen
                   ` (3 preceding siblings ...)
  2020-11-12 19:59 ` [PATCH v8 4/5] nvme: add namespace I/O optimization fields to shared header Klaus Jensen
@ 2020-11-12 19:59 ` Klaus Jensen
  2020-11-16 18:01   ` Keith Busch
  2020-11-23 20:45 ` [PATCH v8 0/5] hw/block/nvme: dulbe and dsm support Klaus Jensen
  5 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2020-11-12 19:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz, Keith Busch,
	Klaus Jensen, Philippe Mathieu-Daudé

From: Klaus Jensen <k.jensen@samsung.com>

Add support for the Dataset Management command and the Deallocate
attribute. Deallocation results in discards being sent to the underlying
block device. Whether or not the blocks are actually deallocated is
affected by the same factors as Write Zeroes (see previous commit).

     format | discard | dsm (512B)  dsm (4KiB)  dsm (64KiB)
    --------------------------------------------------------
      qcow2    ignore   n           n           n
      qcow2    unmap    n           n           y
      raw      ignore   n           n           n
      raw      unmap    n           y           y

Again, a raw format and 4KiB LBAs are preferable.
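
For context, each range entry handed to the Dataset Management command is
16 bytes: context attributes, a length in logical blocks and a starting
LBA (the shared header's existing NvmeDsmRange structure). A host-side
sketch (illustration only, not part of the patch; fields are little-endian
on the wire, byte swapping omitted) describing two extents to deallocate:

    #include <stdint.h>
    #include <string.h>

    struct dsm_range {      /* mirrors NvmeDsmRange: 16 bytes per entry */
        uint32_t cattr;     /* context attributes (unused here) */
        uint32_t nlb;       /* length in logical blocks */
        uint64_t slba;      /* starting LBA */
    };

    int main(void)
    {
        /* deallocate LBAs 0..7 and 1024..2047 */
        struct dsm_range r[2];

        memset(r, 0, sizeof(r));
        r[0].slba = 0;    r[0].nlb = 8;
        r[1].slba = 1024; r[1].nlb = 1024;

        /* the command itself would carry NR = 1 (0's based range count)
         * and the Deallocate (AD) attribute bit set */
        return 0;
    }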

In order to set the Namespace Preferred Deallocate Granularity and
Alignment fields (NPDG and NPDA), choose a sane minimum discard
granularity of 4KiB. If we are using a passthru device supporting
discard at a 512B granularity, the user should set the discard_granularity
property explicitly. NPDG and NPDA will also account for the
cluster_size of the block driver if required (i.e. for QCOW2).
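
Worked through with example numbers (illustrative values only, mirroring
the calculation done in nvme_ns_init below):

    #include <stdio.h>

    int main(void)
    {
        /* example: 512 byte LBAs on a qcow2 image with default clusters */
        unsigned lbsz         = 512;        /* logical_block_size */
        unsigned discard_gran = 4096;       /* 4 KiB minimum default */
        unsigned cluster_size = 64 * 1024;  /* qcow2 default */

        unsigned npdg = discard_gran / lbsz;    /* 8 */
        if (cluster_size > discard_gran) {
            npdg = cluster_size / lbsz;         /* 128 */
        }

        /* NPDG and NPDA are 0's based values in Identify Namespace */
        printf("npdg/npda field value: %u\n", npdg - 1);   /* 127 */

        return 0;
    }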

See NVM Express 1.3d, Section 6.7 ("Dataset Management command").

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.h    |  2 +
 hw/block/nvme-ns.c | 30 ++++++++++++--
 hw/block/nvme.c    | 98 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 125 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a50..574333caa3f9 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -28,6 +28,7 @@ typedef struct NvmeRequest {
     struct NvmeNamespace    *ns;
     BlockAIOCB              *aiocb;
     uint16_t                status;
+    void                    *opaque;
     NvmeCqe                 cqe;
     NvmeCmd                 cmd;
     BlockAcctCookie         acct;
@@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
     case NVME_CMD_WRITE:            return "NVME_NVM_CMD_WRITE";
     case NVME_CMD_READ:             return "NVME_NVM_CMD_READ";
     case NVME_CMD_WRITE_ZEROES:     return "NVME_NVM_CMD_WRITE_ZEROES";
+    case NVME_CMD_DSM:              return "NVME_NVM_CMD_DSM";
     default:                        return "NVME_NVM_CMD_UNKNOWN";
     }
 }
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index f1cc734c60f5..2d69b5177b51 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -28,10 +28,14 @@
 #include "nvme.h"
 #include "nvme-ns.h"
 
-static void nvme_ns_init(NvmeNamespace *ns)
+#define MIN_DISCARD_GRANULARITY (4 * KiB)
+
+static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
 {
+    BlockDriverInfo bdi;
     NvmeIdNs *id_ns = &ns->id_ns;
     int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
+    int npdg;
 
     ns->id_ns.dlfeat = 0x9;
 
@@ -43,8 +47,19 @@ static void nvme_ns_init(NvmeNamespace *ns)
     id_ns->ncap = id_ns->nsze;
     id_ns->nuse = id_ns->ncap;
 
-    /* support DULBE */
-    id_ns->nsfeat |= 0x4;
+    /* support DULBE and I/O optimization fields */
+    id_ns->nsfeat |= (0x4 | 0x10);
+
+    npdg = ns->blkconf.discard_granularity / ns->blkconf.logical_block_size;
+
+    if (bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi) >= 0 &&
+        bdi.cluster_size > ns->blkconf.discard_granularity) {
+        npdg = bdi.cluster_size / ns->blkconf.logical_block_size;
+    }
+
+    id_ns->npda = id_ns->npdg = npdg - 1;
+
+    return 0;
 }
 
 static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -59,6 +74,11 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
         return -1;
     }
 
+    if (ns->blkconf.discard_granularity == -1) {
+        ns->blkconf.discard_granularity =
+            MAX(ns->blkconf.logical_block_size, MIN_DISCARD_GRANULARITY);
+    }
+
     ns->size = blk_getlength(ns->blkconf.blk);
     if (ns->size < 0) {
         error_setg_errno(errp, -ns->size, "could not get blockdev size");
@@ -92,7 +112,9 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
         return -1;
     }
 
-    nvme_ns_init(ns);
+    if (nvme_ns_init(ns, errp)) {
+        return -1;
+    }
 
     if (nvme_register_namespace(n, ns, errp)) {
         return -1;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 8d75a89fd872..f7f888402b06 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -242,6 +242,7 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
 static void nvme_req_clear(NvmeRequest *req)
 {
     req->ns = NULL;
+    req->opaque = NULL;
     memset(&req->cqe, 0x0, sizeof(req->cqe));
     req->status = NVME_SUCCESS;
 }
@@ -978,6 +979,99 @@ static void nvme_rw_cb(void *opaque, int ret)
     nvme_enqueue_req_completion(nvme_cq(req), req);
 }
 
+static void nvme_aio_discard_cb(void *opaque, int ret)
+{
+    NvmeRequest *req = opaque;
+    uintptr_t *discards = (uintptr_t *)&req->opaque;
+
+    trace_pci_nvme_aio_discard_cb(nvme_cid(req));
+
+    if (ret) {
+        nvme_aio_err(req, ret);
+    }
+
+    (*discards)--;
+
+    if (*discards) {
+        return;
+    }
+
+    nvme_enqueue_req_completion(nvme_cq(req), req);
+}
+
+static uint16_t nvme_dsm(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeNamespace *ns = req->ns;
+    NvmeDsmCmd *dsm = (NvmeDsmCmd *) &req->cmd;
+
+    uint32_t attr = le32_to_cpu(dsm->attributes);
+    uint32_t nr = (le32_to_cpu(dsm->nr) & 0xff) + 1;
+
+    uint16_t status = NVME_SUCCESS;
+
+    trace_pci_nvme_dsm(nvme_cid(req), nvme_nsid(ns), nr, attr);
+
+    if (attr & NVME_DSMGMT_AD) {
+        int64_t offset;
+        size_t len;
+        NvmeDsmRange range[nr];
+        uintptr_t *discards = (uintptr_t *)&req->opaque;
+
+        status = nvme_dma(n, (uint8_t *)range, sizeof(range),
+                          DMA_DIRECTION_TO_DEVICE, req);
+        if (status) {
+            return status;
+        }
+
+        /*
+         * AIO callbacks may be called immediately, so initialize discards to 1
+     * to make sure the callback does not complete the request before
+         * all discards have been issued.
+         */
+        *discards = 1;
+
+        for (int i = 0; i < nr; i++) {
+            uint64_t slba = le64_to_cpu(range[i].slba);
+            uint32_t nlb = le32_to_cpu(range[i].nlb);
+
+            if (nvme_check_bounds(ns, slba, nlb)) {
+                trace_pci_nvme_err_invalid_lba_range(slba, nlb,
+                                                     ns->id_ns.nsze);
+                continue;
+            }
+
+            trace_pci_nvme_dsm_deallocate(nvme_cid(req), nvme_nsid(ns), slba,
+                                          nlb);
+
+            offset = nvme_l2b(ns, slba);
+            len = nvme_l2b(ns, nlb);
+
+            while (len) {
+                size_t bytes = MIN(BDRV_REQUEST_MAX_BYTES, len);
+
+                (*discards)++;
+
+                blk_aio_pdiscard(ns->blkconf.blk, offset, bytes,
+                                 nvme_aio_discard_cb, req);
+
+                offset += bytes;
+                len -= bytes;
+            }
+        }
+
+        /* account for the 1-initialization */
+        (*discards)--;
+
+        if (*discards) {
+            status = NVME_NO_COMPLETE;
+        } else {
+            status = req->status;
+        }
+    }
+
+    return status;
+}
+
 static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
 {
     block_acct_start(blk_get_stats(req->ns->blkconf.blk), &req->acct, 0,
@@ -1107,6 +1201,8 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
         return nvme_rw(n, req);
+    case NVME_CMD_DSM:
+        return nvme_dsm(n, req);
     default:
         trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
@@ -2831,7 +2927,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->cqes = (0x4 << 4) | 0x4;
     id->nn = cpu_to_le32(n->num_namespaces);
     id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROES | NVME_ONCS_TIMESTAMP |
-                           NVME_ONCS_FEATURES);
+                           NVME_ONCS_FEATURES | NVME_ONCS_DSM);
 
     id->vwc = 0x1;
     id->sgls = cpu_to_le32(NVME_CTRL_SGLS_SUPPORT_NO_ALIGN |
-- 
2.29.2




* Re: [PATCH v8 1/5] hw/block/nvme: remove superfluous NvmeCtrl parameter
  2020-11-12 19:59 ` [PATCH v8 1/5] hw/block/nvme: remove superfluous NvmeCtrl parameter Klaus Jensen
@ 2020-11-16 11:29   ` Minwoo Im
  0 siblings, 0 replies; 18+ messages in thread
From: Minwoo Im @ 2020-11-16 11:29 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Keith Busch, Philippe Mathieu-Daudé

On 11/12 20:59, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> nvme_check_bounds has no use of the NvmeCtrl parameter; remove it.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>



* Re: [PATCH v8 2/5] hw/block/nvme: pull aio error handling
  2020-11-12 19:59 ` [PATCH v8 2/5] hw/block/nvme: pull aio error handling Klaus Jensen
@ 2020-11-16 11:36   ` Minwoo Im
  2020-11-16 11:52     ` Klaus Jensen
  2020-11-16 17:57   ` Keith Busch
  1 sibling, 1 reply; 18+ messages in thread
From: Minwoo Im @ 2020-11-16 11:36 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Keith Busch, Philippe Mathieu-Daudé

On 11/12 20:59, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> Add a new function, nvme_aio_err, to handle errors resulting from AIOs
> and use this from the callbacks.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c | 61 +++++++++++++++++++++++++++++--------------------
>  1 file changed, 36 insertions(+), 25 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 51f35e8cf18b..a96a996ddc0d 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -878,6 +878,41 @@ static inline uint16_t nvme_check_bounds(NvmeNamespace *ns, uint64_t slba,
>      return NVME_SUCCESS;
>  }
>  
> +static void nvme_aio_err(NvmeRequest *req, int ret)
> +{
> +    uint16_t status = NVME_SUCCESS;
> +    Error *local_err = NULL;
> +
> +    switch (req->cmd.opcode) {
> +    case NVME_CMD_READ:
> +        status = NVME_UNRECOVERED_READ;
> +        break;
> +    case NVME_CMD_FLUSH:
> +    case NVME_CMD_WRITE:
> +    case NVME_CMD_WRITE_ZEROES:
> +        status = NVME_WRITE_FAULT;
> +        break;
> +    default:
> +        status = NVME_INTERNAL_DEV_ERROR;
> +        break;
> +    }
> +
> +    trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret), status);
> +
> +    error_setg_errno(&local_err, -ret, "aio failed");
> +    error_report_err(local_err);
> +
> +    /*
> +     * Set the command status code to the first encountered error but allow a
> +     * subsequent Internal Device Error to trump it.
> +     */
> +    if (req->status && status != NVME_INTERNAL_DEV_ERROR) {
> +        return;
> +    }

Are we okay with the trace print up there in case this if is taken?
I guess if this if is taken, the trace print may not matter that much.



* Re: [PATCH v8 3/5] hw/block/nvme: add dulbe support
  2020-11-12 19:59 ` [PATCH v8 3/5] hw/block/nvme: add dulbe support Klaus Jensen
@ 2020-11-16 11:43   ` Minwoo Im
  2020-11-16 11:57     ` Klaus Jensen
  2020-11-16 18:00   ` Keith Busch
  1 sibling, 1 reply; 18+ messages in thread
From: Minwoo Im @ 2020-11-16 11:43 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Keith Busch, Philippe Mathieu-Daudé

On 11/12 20:59, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> Add support for reporting the Deallocated or Unwritten Logical Block
> Error (DULBE).
> 
> Rely on the block status flags reported by the block layer and consider
> any block with the BDRV_BLOCK_ZERO flag to be deallocated.
> 
> Multiple factors affect when a Write Zeroes command result in
> deallocation of blocks.
> 
>   * the underlying file system block size
>   * the blockdev format
>   * the 'discard' and 'logical_block_size' parameters
> 
>      format | discard | wz (512B)  wz (4KiB)  wz (64KiB)
>     -----------------------------------------------------
>       qcow2    ignore   n          n          y
>       qcow2    unmap    n          n          y
>       raw      ignore   n          y          y
>       raw      unmap    n          y          y
> 
> So, this works best with an image in raw format and 4KiB LBAs, since
> holes can then be punched on a per-block basis (this assumes a file
> system with a 4kb block size, YMMV). A qcow2 image, uses a cluster size
> of 64KiB by default and blocks will only be marked deallocated if a full
> cluster is zeroed or discarded. However, this *is* consistent with the
> spec since Write Zeroes "should" deallocate the block if the Deallocate
> attribute is set and "may" deallocate if the Deallocate attribute is not
> set. Thus, we always try to deallocate (the BDRV_REQ_MAY_UNMAP flag is
> always set).
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme-ns.h    |  4 ++
>  include/block/nvme.h  |  5 +++
>  hw/block/nvme-ns.c    |  8 ++--
>  hw/block/nvme.c       | 91 ++++++++++++++++++++++++++++++++++++++++++-
>  hw/block/trace-events |  4 ++
>  5 files changed, 107 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 83734f4606e1..44bf6271b744 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -31,6 +31,10 @@ typedef struct NvmeNamespace {
>      NvmeIdNs     id_ns;
>  
>      NvmeNamespaceParams params;
> +
> +    struct {
> +        uint32_t err_rec;
> +    } features;
>  } NvmeNamespace;
>  
>  static inline uint32_t nvme_nsid(NvmeNamespace *ns)
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 8a46d9cf015f..966c3bb304bd 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -687,6 +687,7 @@ enum NvmeStatusCodes {
>      NVME_E2E_REF_ERROR          = 0x0284,
>      NVME_CMP_FAILURE            = 0x0285,
>      NVME_ACCESS_DENIED          = 0x0286,
> +    NVME_DULB                   = 0x0287,
>      NVME_MORE                   = 0x2000,
>      NVME_DNR                    = 0x4000,
>      NVME_NO_COMPLETE            = 0xffff,
> @@ -903,6 +904,9 @@ enum NvmeIdCtrlLpa {
>  #define NVME_AEC_NS_ATTR(aec)       ((aec >> 8) & 0x1)
>  #define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
>  
> +#define NVME_ERR_REC_TLER(err_rec)  (err_rec & 0xffff)
> +#define NVME_ERR_REC_DULBE(err_rec) (err_rec & 0x10000)
> +
>  enum NvmeFeatureIds {
>      NVME_ARBITRATION                = 0x1,
>      NVME_POWER_MANAGEMENT           = 0x2,
> @@ -1023,6 +1027,7 @@ enum NvmeNsIdentifierType {
>  
>  
>  #define NVME_ID_NS_NSFEAT_THIN(nsfeat)      ((nsfeat & 0x1))
> +#define NVME_ID_NS_NSFEAT_DULBE(nsfeat)     ((nsfeat >> 2) & 0x1)
>  #define NVME_ID_NS_FLBAS_EXTENDED(flbas)    ((flbas >> 4) & 0x1)
>  #define NVME_ID_NS_FLBAS_INDEX(flbas)       ((flbas & 0xf))
>  #define NVME_ID_NS_MC_SEPARATE(mc)          ((mc >> 1) & 0x1)
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 31c80cdf5b5f..f1cc734c60f5 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -33,9 +33,7 @@ static void nvme_ns_init(NvmeNamespace *ns)
>      NvmeIdNs *id_ns = &ns->id_ns;
>      int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
>  
> -    if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
> -        ns->id_ns.dlfeat = 0x9;
> -    }
> +    ns->id_ns.dlfeat = 0x9;
>  
>      id_ns->lbaf[lba_index].ds = 31 - clz32(ns->blkconf.logical_block_size);
>  
> @@ -44,6 +42,9 @@ static void nvme_ns_init(NvmeNamespace *ns)
>      /* no thin provisioning */
>      id_ns->ncap = id_ns->nsze;
>      id_ns->nuse = id_ns->ncap;
> +
> +    /* support DULBE */
> +    id_ns->nsfeat |= 0x4;
>  }
>  
>  static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> @@ -92,6 +93,7 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>      }
>  
>      nvme_ns_init(ns);
> +
>      if (nvme_register_namespace(n, ns, errp)) {
>          return -1;
>      }
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index a96a996ddc0d..8d75a89fd872 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -105,6 +105,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
>  
>  static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
>      [NVME_TEMPERATURE_THRESHOLD]    = NVME_FEAT_CAP_CHANGE,
> +    [NVME_ERROR_RECOVERY]           = NVME_FEAT_CAP_CHANGE | NVME_FEAT_CAP_NS,
>      [NVME_VOLATILE_WRITE_CACHE]     = NVME_FEAT_CAP_CHANGE,
>      [NVME_NUMBER_OF_QUEUES]         = NVME_FEAT_CAP_CHANGE,
>      [NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
> @@ -878,6 +879,49 @@ static inline uint16_t nvme_check_bounds(NvmeNamespace *ns, uint64_t slba,
>      return NVME_SUCCESS;
>  }
>  
> +static uint16_t nvme_check_dulbe(NvmeNamespace *ns, uint64_t slba,
> +                                 uint32_t nlb)
> +{
> +    BlockDriverState *bs = blk_bs(ns->blkconf.blk);
> +
> +    int64_t pnum = 0, bytes = nvme_l2b(ns, nlb);
> +    int64_t offset = nvme_l2b(ns, slba);
> +    bool zeroed;
> +    int ret;
> +
> +    Error *local_err = NULL;
> +
> +    /*
> +     * `pnum` holds the number of bytes after offset that shares the same
> +     * allocation status as the byte at offset. If `pnum` is different from
> +     * `bytes`, we should check the allocation status of the next range and
> +     * continue this until all bytes have been checked.
> +     */
> +    do {
> +        bytes -= pnum;
> +
> +        ret = bdrv_block_status(bs, offset, bytes, &pnum, NULL, NULL);
> +        if (ret < 0) {
> +            error_setg_errno(&local_err, -ret, "unable to get block status");
> +            error_report_err(local_err);
> +
> +            return NVME_INTERNAL_DEV_ERROR;
> +        }
> +
> +        zeroed = !!(ret & BDRV_BLOCK_ZERO);
> +
> +        trace_pci_nvme_block_status(offset, bytes, pnum, ret, zeroed);
> +
> +        if (zeroed) {
> +            return NVME_DULB;
> +        }
> +
> +        offset += pnum;
> +    } while (pnum != bytes);
> +
> +    return NVME_SUCCESS;
> +}
> +
>  static void nvme_aio_err(NvmeRequest *req, int ret)
>  {
>      uint16_t status = NVME_SUCCESS;
> @@ -996,6 +1040,15 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
>          goto invalid;
>      }
>  
> +    if (acct == BLOCK_ACCT_READ) {
> +        if (NVME_ERR_REC_DULBE(ns->features.err_rec)) {
> +            status = nvme_check_dulbe(ns, slba, nlb);
> +            if (status) {
> +                goto invalid;
> +            }
> +        }
> +    }
> +
>      status = nvme_map_dptr(n, data_size, req);
>      if (status) {
>          goto invalid;
> @@ -1643,6 +1696,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
>      uint8_t fid = NVME_GETSETFEAT_FID(dw10);
>      NvmeGetFeatureSelect sel = NVME_GETFEAT_SELECT(dw10);
>      uint16_t iv;
> +    NvmeNamespace *ns;
>  
>      static const uint32_t nvme_feature_default[NVME_FID_MAX] = {
>          [NVME_ARBITRATION] = NVME_ARB_AB_NOLIMIT,
> @@ -1705,6 +1759,18 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
>          }
>  
>          return NVME_INVALID_FIELD | NVME_DNR;
> +    case NVME_ERROR_RECOVERY:
> +        if (!nvme_nsid_valid(n, nsid)) {
> +            return NVME_INVALID_NSID | NVME_DNR;
> +        }
> +
> +        ns = nvme_ns(n, nsid);
> +        if (unlikely(!ns)) {
> +            return NVME_INVALID_FIELD | NVME_DNR;
> +        }
> +
> +        result = ns->features.err_rec;
> +        goto out;
>      case NVME_VOLATILE_WRITE_CACHE:
>          result = n->features.vwc;
>          trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
> @@ -1777,7 +1843,7 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
>  
>  static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
>  {
> -    NvmeNamespace *ns;
> +    NvmeNamespace *ns = NULL;
>  
>      NvmeCmd *cmd = &req->cmd;
>      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> @@ -1785,6 +1851,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
>      uint32_t nsid = le32_to_cpu(cmd->nsid);
>      uint8_t fid = NVME_GETSETFEAT_FID(dw10);
>      uint8_t save = NVME_SETFEAT_SAVE(dw10);
> +    int i;
>  
>      trace_pci_nvme_setfeat(nvme_cid(req), nsid, fid, save, dw11);
>  
> @@ -1844,11 +1911,31 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
>                                 NVME_LOG_SMART_INFO);
>          }
>  
> +        break;
> +    case NVME_ERROR_RECOVERY:
> +        if (nsid == NVME_NSID_BROADCAST) {
> +            for (i = 1; i <= n->num_namespaces; i++) {
> +                ns = nvme_ns(n, i);
> +
> +                if (!ns) {
> +                    continue;
> +                }
> +
> +                if (NVME_ID_NS_NSFEAT_DULBE(ns->id_ns.nsfeat)) {
> +                    ns->features.err_rec = dw11;
> +                }
> +            }
> +
> +            break;
> +        }
> +
> +        assert(ns);

Klaus,

Sorry, but can we have an assert for a failure that might happen due to
the user's misbehavior?  Why don't we return NVME_INVALID_NSID for this case?



* Re: [PATCH v8 4/5] nvme: add namespace I/O optimization fields to shared header
  2020-11-12 19:59 ` [PATCH v8 4/5] nvme: add namespace I/O optimization fields to shared header Klaus Jensen
@ 2020-11-16 11:47   ` Minwoo Im
  0 siblings, 0 replies; 18+ messages in thread
From: Minwoo Im @ 2020-11-16 11:47 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Fam Zheng, qemu-block, Klaus Jensen, qemu-devel,
	Max Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé

On 11/12 20:59, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> This adds the NPWG, NPWA, NPDG, NPDA and NOWS family of fields to the
> shared nvme.h header for use by later patches.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Cc: Fam Zheng <fam@euphon.net>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>



* Re: [PATCH v8 2/5] hw/block/nvme: pull aio error handling
  2020-11-16 11:36   ` Minwoo Im
@ 2020-11-16 11:52     ` Klaus Jensen
  0 siblings, 0 replies; 18+ messages in thread
From: Klaus Jensen @ 2020-11-16 11:52 UTC (permalink / raw)
  To: Minwoo Im
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Keith Busch, Philippe Mathieu-Daudé


On Nov 16 20:36, Minwoo Im wrote:
> On 11/12 20:59, Klaus Jensen wrote:
> > From: Klaus Jensen <k.jensen@samsung.com>
> > 
> > Add a new function, nvme_aio_err, to handle errors resulting from AIOs
> > and use this from the callbacks.
> > 
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > ---
> >  hw/block/nvme.c | 61 +++++++++++++++++++++++++++++--------------------
> >  1 file changed, 36 insertions(+), 25 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 51f35e8cf18b..a96a996ddc0d 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -878,6 +878,41 @@ static inline uint16_t nvme_check_bounds(NvmeNamespace *ns, uint64_t slba,
> >      return NVME_SUCCESS;
> >  }
> >  
> > +static void nvme_aio_err(NvmeRequest *req, int ret)
> > +{
> > +    uint16_t status = NVME_SUCCESS;
> > +    Error *local_err = NULL;
> > +
> > +    switch (req->cmd.opcode) {
> > +    case NVME_CMD_READ:
> > +        status = NVME_UNRECOVERED_READ;
> > +        break;
> > +    case NVME_CMD_FLUSH:
> > +    case NVME_CMD_WRITE:
> > +    case NVME_CMD_WRITE_ZEROES:
> > +        status = NVME_WRITE_FAULT;
> > +        break;
> > +    default:
> > +        status = NVME_INTERNAL_DEV_ERROR;
> > +        break;
> > +    }
> > +
> > +    trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret), status);
> > +
> > +    error_setg_errno(&local_err, -ret, "aio failed");
> > +    error_report_err(local_err);
> > +
> > +    /*
> > +     * Set the command status code to the first encountered error but allow a
> > +     * subsequent Internal Device Error to trump it.
> > +     */
> > +    if (req->status && status != NVME_INTERNAL_DEV_ERROR) {
> > +        return;
> > +    }
> 
> Are we okay with the trace print up there in case that this if is taken?
> I guess if this if is taken, the trace print may not that matter.
> 

Yeah, I was thinking we always print the error that corresponds to the
AIO that we are handling right now.

But maybe we should not include the "translated" nvme error in the trace
at all. That might make more sense.



* Re: [PATCH v8 3/5] hw/block/nvme: add dulbe support
  2020-11-16 11:43   ` Minwoo Im
@ 2020-11-16 11:57     ` Klaus Jensen
  0 siblings, 0 replies; 18+ messages in thread
From: Klaus Jensen @ 2020-11-16 11:57 UTC (permalink / raw)
  To: Minwoo Im
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Keith Busch, Philippe Mathieu-Daudé


On Nov 16 20:43, Minwoo Im wrote:
> On 11/12 20:59, Klaus Jensen wrote:
> > From: Klaus Jensen <k.jensen@samsung.com>
> > 
> > Add support for reporting the Deallocated or Unwritten Logical Block
> > Error (DULBE).
> > 
> > Rely on the block status flags reported by the block layer and consider
> > any block with the BDRV_BLOCK_ZERO flag to be deallocated.
> > 
> > Multiple factors affect when a Write Zeroes command result in
> > deallocation of blocks.
> > 
> >   * the underlying file system block size
> >   * the blockdev format
> >   * the 'discard' and 'logical_block_size' parameters
> > 
> >      format | discard | wz (512B)  wz (4KiB)  wz (64KiB)
> >     -----------------------------------------------------
> >       qcow2    ignore   n          n          y
> >       qcow2    unmap    n          n          y
> >       raw      ignore   n          y          y
> >       raw      unmap    n          y          y
> > 
> > So, this works best with an image in raw format and 4KiB LBAs, since
> > holes can then be punched on a per-block basis (this assumes a file
> > system with a 4kb block size, YMMV). A qcow2 image, uses a cluster size
> > of 64KiB by default and blocks will only be marked deallocated if a full
> > cluster is zeroed or discarded. However, this *is* consistent with the
> > spec since Write Zeroes "should" deallocate the block if the Deallocate
> > attribute is set and "may" deallocate if the Deallocate attribute is not
> > set. Thus, we always try to deallocate (the BDRV_REQ_MAY_UNMAP flag is
> > always set).
> > 
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > ---
> >  hw/block/nvme-ns.h    |  4 ++
> >  include/block/nvme.h  |  5 +++
> >  hw/block/nvme-ns.c    |  8 ++--
> >  hw/block/nvme.c       | 91 ++++++++++++++++++++++++++++++++++++++++++-
> >  hw/block/trace-events |  4 ++
> >  5 files changed, 107 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index 83734f4606e1..44bf6271b744 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -31,6 +31,10 @@ typedef struct NvmeNamespace {
> >      NvmeIdNs     id_ns;
> >  
> >      NvmeNamespaceParams params;
> > +
> > +    struct {
> > +        uint32_t err_rec;
> > +    } features;
> >  } NvmeNamespace;
> >  
> >  static inline uint32_t nvme_nsid(NvmeNamespace *ns)
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 8a46d9cf015f..966c3bb304bd 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -687,6 +687,7 @@ enum NvmeStatusCodes {
> >      NVME_E2E_REF_ERROR          = 0x0284,
> >      NVME_CMP_FAILURE            = 0x0285,
> >      NVME_ACCESS_DENIED          = 0x0286,
> > +    NVME_DULB                   = 0x0287,
> >      NVME_MORE                   = 0x2000,
> >      NVME_DNR                    = 0x4000,
> >      NVME_NO_COMPLETE            = 0xffff,
> > @@ -903,6 +904,9 @@ enum NvmeIdCtrlLpa {
> >  #define NVME_AEC_NS_ATTR(aec)       ((aec >> 8) & 0x1)
> >  #define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
> >  
> > +#define NVME_ERR_REC_TLER(err_rec)  (err_rec & 0xffff)
> > +#define NVME_ERR_REC_DULBE(err_rec) (err_rec & 0x10000)
> > +
> >  enum NvmeFeatureIds {
> >      NVME_ARBITRATION                = 0x1,
> >      NVME_POWER_MANAGEMENT           = 0x2,
> > @@ -1023,6 +1027,7 @@ enum NvmeNsIdentifierType {
> >  
> >  
> >  #define NVME_ID_NS_NSFEAT_THIN(nsfeat)      ((nsfeat & 0x1))
> > +#define NVME_ID_NS_NSFEAT_DULBE(nsfeat)     ((nsfeat >> 2) & 0x1)
> >  #define NVME_ID_NS_FLBAS_EXTENDED(flbas)    ((flbas >> 4) & 0x1)
> >  #define NVME_ID_NS_FLBAS_INDEX(flbas)       ((flbas & 0xf))
> >  #define NVME_ID_NS_MC_SEPARATE(mc)          ((mc >> 1) & 0x1)
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 31c80cdf5b5f..f1cc734c60f5 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -33,9 +33,7 @@ static void nvme_ns_init(NvmeNamespace *ns)
> >      NvmeIdNs *id_ns = &ns->id_ns;
> >      int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> >  
> > -    if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
> > -        ns->id_ns.dlfeat = 0x9;
> > -    }
> > +    ns->id_ns.dlfeat = 0x9;
> >  
> >      id_ns->lbaf[lba_index].ds = 31 - clz32(ns->blkconf.logical_block_size);
> >  
> > @@ -44,6 +42,9 @@ static void nvme_ns_init(NvmeNamespace *ns)
> >      /* no thin provisioning */
> >      id_ns->ncap = id_ns->nsze;
> >      id_ns->nuse = id_ns->ncap;
> > +
> > +    /* support DULBE */
> > +    id_ns->nsfeat |= 0x4;
> >  }
> >  
> >  static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > @@ -92,6 +93,7 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> >      }
> >  
> >      nvme_ns_init(ns);
> > +
> >      if (nvme_register_namespace(n, ns, errp)) {
> >          return -1;
> >      }
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index a96a996ddc0d..8d75a89fd872 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -105,6 +105,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
> >  
> >  static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
> >      [NVME_TEMPERATURE_THRESHOLD]    = NVME_FEAT_CAP_CHANGE,
> > +    [NVME_ERROR_RECOVERY]           = NVME_FEAT_CAP_CHANGE | NVME_FEAT_CAP_NS,
> >      [NVME_VOLATILE_WRITE_CACHE]     = NVME_FEAT_CAP_CHANGE,
> >      [NVME_NUMBER_OF_QUEUES]         = NVME_FEAT_CAP_CHANGE,
> >      [NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
> > @@ -878,6 +879,49 @@ static inline uint16_t nvme_check_bounds(NvmeNamespace *ns, uint64_t slba,
> >      return NVME_SUCCESS;
> >  }
> >  
> > +static uint16_t nvme_check_dulbe(NvmeNamespace *ns, uint64_t slba,
> > +                                 uint32_t nlb)
> > +{
> > +    BlockDriverState *bs = blk_bs(ns->blkconf.blk);
> > +
> > +    int64_t pnum = 0, bytes = nvme_l2b(ns, nlb);
> > +    int64_t offset = nvme_l2b(ns, slba);
> > +    bool zeroed;
> > +    int ret;
> > +
> > +    Error *local_err = NULL;
> > +
> > +    /*
> > +     * `pnum` holds the number of bytes after offset that shares the same
> > +     * allocation status as the byte at offset. If `pnum` is different from
> > +     * `bytes`, we should check the allocation status of the next range and
> > +     * continue this until all bytes have been checked.
> > +     */
> > +    do {
> > +        bytes -= pnum;
> > +
> > +        ret = bdrv_block_status(bs, offset, bytes, &pnum, NULL, NULL);
> > +        if (ret < 0) {
> > +            error_setg_errno(&local_err, -ret, "unable to get block status");
> > +            error_report_err(local_err);
> > +
> > +            return NVME_INTERNAL_DEV_ERROR;
> > +        }
> > +
> > +        zeroed = !!(ret & BDRV_BLOCK_ZERO);
> > +
> > +        trace_pci_nvme_block_status(offset, bytes, pnum, ret, zeroed);
> > +
> > +        if (zeroed) {
> > +            return NVME_DULB;
> > +        }
> > +
> > +        offset += pnum;
> > +    } while (pnum != bytes);
> > +
> > +    return NVME_SUCCESS;
> > +}
> > +
> >  static void nvme_aio_err(NvmeRequest *req, int ret)
> >  {
> >      uint16_t status = NVME_SUCCESS;
> > @@ -996,6 +1040,15 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
> >          goto invalid;
> >      }
> >  
> > +    if (acct == BLOCK_ACCT_READ) {
> > +        if (NVME_ERR_REC_DULBE(ns->features.err_rec)) {
> > +            status = nvme_check_dulbe(ns, slba, nlb);
> > +            if (status) {
> > +                goto invalid;
> > +            }
> > +        }
> > +    }
> > +
> >      status = nvme_map_dptr(n, data_size, req);
> >      if (status) {
> >          goto invalid;
> > @@ -1643,6 +1696,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
> >      uint8_t fid = NVME_GETSETFEAT_FID(dw10);
> >      NvmeGetFeatureSelect sel = NVME_GETFEAT_SELECT(dw10);
> >      uint16_t iv;
> > +    NvmeNamespace *ns;
> >  
> >      static const uint32_t nvme_feature_default[NVME_FID_MAX] = {
> >          [NVME_ARBITRATION] = NVME_ARB_AB_NOLIMIT,
> > @@ -1705,6 +1759,18 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
> >          }
> >  
> >          return NVME_INVALID_FIELD | NVME_DNR;
> > +    case NVME_ERROR_RECOVERY:
> > +        if (!nvme_nsid_valid(n, nsid)) {
> > +            return NVME_INVALID_NSID | NVME_DNR;
> > +        }
> > +
> > +        ns = nvme_ns(n, nsid);
> > +        if (unlikely(!ns)) {
> > +            return NVME_INVALID_FIELD | NVME_DNR;
> > +        }
> > +
> > +        result = ns->features.err_rec;
> > +        goto out;
> >      case NVME_VOLATILE_WRITE_CACHE:
> >          result = n->features.vwc;
> >          trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
> > @@ -1777,7 +1843,7 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
> >  
> >  static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
> >  {
> > -    NvmeNamespace *ns;
> > +    NvmeNamespace *ns = NULL;
> >  
> >      NvmeCmd *cmd = &req->cmd;
> >      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > @@ -1785,6 +1851,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
> >      uint32_t nsid = le32_to_cpu(cmd->nsid);
> >      uint8_t fid = NVME_GETSETFEAT_FID(dw10);
> >      uint8_t save = NVME_SETFEAT_SAVE(dw10);
> > +    int i;
> >  
> >      trace_pci_nvme_setfeat(nvme_cid(req), nsid, fid, save, dw11);
> >  
> > @@ -1844,11 +1911,31 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
> >                                 NVME_LOG_SMART_INFO);
> >          }
> >  
> > +        break;
> > +    case NVME_ERROR_RECOVERY:
> > +        if (nsid == NVME_NSID_BROADCAST) {
> > +            for (i = 1; i <= n->num_namespaces; i++) {
> > +                ns = nvme_ns(n, i);
> > +
> > +                if (!ns) {
> > +                    continue;
> > +                }
> > +
> > +                if (NVME_ID_NS_NSFEAT_DULBE(ns->id_ns.nsfeat)) {
> > +                    ns->features.err_rec = dw11;
> > +                }
> > +            }
> > +
> > +            break;
> > +        }
> > +
> > +        assert(ns);
> 
> Klaus,
> 
> Sorry, but can we really use an assert for a failure that might happen due
> to user misbehavior?  Why don't we return NVME_INVALID_NSID in this case?
> 

Oh. I see that this is pretty convoluted.

The assert catches the case where you add a new namespace-specific feature
but forget to give it the NVME_FEAT_CAP_NS capability in the
nvme_feature_cap array.

If NVME_FEAT_CAP_NS is correctly set for the feature, then the code at
the beginning of nvme_set_feature will ensure that the namespace id
given is valid.
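
For reference, the relevant part of that earlier check looks roughly like
the following (a paraphrased sketch, not a verbatim quote of the patch):

    /* only reached for features that have NVME_FEAT_CAP_NS set */
    if (nsid != NVME_NSID_BROADCAST) {
        if (!nvme_nsid_valid(n, nsid)) {
            return NVME_INVALID_NSID | NVME_DNR;
        }

        ns = nvme_ns(n, nsid);
        if (unlikely(!ns)) {
            return NVME_INVALID_FIELD | NVME_DNR;
        }
    }

So for a namespace-specific feature with a non-broadcast nsid, ns has
already been validated by the time the NVME_ERROR_RECOVERY case runs; the
assert can only fire on a programming error in the device model, not on
guest misbehavior.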

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 2/5] hw/block/nvme: pull aio error handling
  2020-11-12 19:59 ` [PATCH v8 2/5] hw/block/nvme: pull aio error handling Klaus Jensen
  2020-11-16 11:36   ` Minwoo Im
@ 2020-11-16 17:57   ` Keith Busch
  2020-11-16 18:18     ` Klaus Jensen
  1 sibling, 1 reply; 18+ messages in thread
From: Keith Busch @ 2020-11-16 17:57 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Philippe Mathieu-Daudé

On Thu, Nov 12, 2020 at 08:59:42PM +0100, Klaus Jensen wrote:
> +static void nvme_aio_err(NvmeRequest *req, int ret)
> +{
> +    uint16_t status = NVME_SUCCESS;
> +    Error *local_err = NULL;
> +
> +    switch (req->cmd.opcode) {
> +    case NVME_CMD_READ:
> +        status = NVME_UNRECOVERED_READ;
> +        break;
> +    case NVME_CMD_FLUSH:
> +    case NVME_CMD_WRITE:
> +    case NVME_CMD_WRITE_ZEROES:
> +        status = NVME_WRITE_FAULT;
> +        break;
> +    default:
> +        status = NVME_INTERNAL_DEV_ERROR;
> +        break;
> +    }

Just curious, is there potentially a more appropriate way to set an nvme
status based on the value of 'ret'? What is 'ret' representing anyway?
Are these errno values?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 3/5] hw/block/nvme: add dulbe support
  2020-11-12 19:59 ` [PATCH v8 3/5] hw/block/nvme: add dulbe support Klaus Jensen
  2020-11-16 11:43   ` Minwoo Im
@ 2020-11-16 18:00   ` Keith Busch
  1 sibling, 0 replies; 18+ messages in thread
From: Keith Busch @ 2020-11-16 18:00 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Philippe Mathieu-Daudé

On Thu, Nov 12, 2020 at 08:59:43PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> Add support for reporting the Deallocated or Unwritten Logical Block
> Error (DULBE).
> 
> Rely on the block status flags reported by the block layer and consider
> any block with the BDRV_BLOCK_ZERO flag to be deallocated.
> 
> Multiple factors affect when a Write Zeroes command results in
> deallocation of blocks.
> 
>   * the underlying file system block size
>   * the blockdev format
>   * the 'discard' and 'logical_block_size' parameters
> 
>      format | discard | wz (512B)  wz (4KiB)  wz (64KiB)
>     -----------------------------------------------------
>       qcow2    ignore   n          n          y
>       qcow2    unmap    n          n          y
>       raw      ignore   n          y          y
>       raw      unmap    n          y          y
> 
> So, this works best with an image in raw format and 4KiB LBAs, since
> holes can then be punched on a per-block basis (this assumes a file
> system with a 4KiB block size, YMMV). A qcow2 image uses a cluster size
> of 64KiB by default and blocks will only be marked deallocated if a full
> cluster is zeroed or discarded. However, this *is* consistent with the
> spec since Write Zeroes "should" deallocate the block if the Deallocate
> attribute is set and "may" deallocate if the Deallocate attribute is not
> set. Thus, we always try to deallocate (the BDRV_REQ_MAY_UNMAP flag is
> always set).

These all sound like reasonable constraints and consistent with the
spec.

Reviewed-by: Keith Busch <kbusch@kernel.org>
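
To connect the table above with the device behaviour: the Write Zeroes
handler always passes BDRV_REQ_MAY_UNMAP down to the block layer, which is
then free to punch a hole when the format and discard settings allow it.
A minimal sketch of that call (illustrative only, not a verbatim quote of
the patch; it assumes the existing nvme_rw_cb completion callback):

    /* translate the LBA range to a byte range and always allow the block
     * layer to unmap; whether a hole is actually punched depends on the
     * format, discard setting and block sizes discussed above */
    int64_t offset = nvme_l2b(ns, slba);
    int64_t count = nvme_l2b(ns, nlb);

    req->aiocb = blk_aio_pwrite_zeroes(ns->blkconf.blk, offset, count,
                                       BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);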


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 5/5] hw/block/nvme: add the dataset management command
  2020-11-12 19:59 ` [PATCH v8 5/5] hw/block/nvme: add the dataset management command Klaus Jensen
@ 2020-11-16 18:01   ` Keith Busch
  0 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2020-11-16 18:01 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Philippe Mathieu-Daudé

On Thu, Nov 12, 2020 at 08:59:45PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> Add support for the Dataset Management command and the Deallocate
> attribute. Deallocation results in discards being sent to the underlying
> block device. Whether or not the blocks are actually deallocated is
> affected by the same factors as Write Zeroes (see previous commit).
> 
>      format | discard | dsm (512B)  dsm (4KiB)  dsm (64KiB)
>     --------------------------------------------------------
>       qcow2    ignore   n           n           n
>       qcow2    unmap    n           n           y
>       raw      ignore   n           n           n
>       raw      unmap    n           y           y
> 
> Again, a raw format and 4KiB LBAs are preferable.
> 
> In order to set the Namespace Preferred Deallocate Granularity and
> Alignment fields (NPDG and NPDA), choose a sane minimum discard
> granularity of 4KiB. If we are using a passthru device supporting
> discard at a 512B granularity, user should set the discard_granularity
> property explicitly. NPDG and NPDA will also account for the
> cluster_size of the block driver if required (i.e. for QCOW2).
> 
> See NVM Express 1.3d, Section 6.7 ("Dataset Management command").

Looks good to me.

Reviewed-by: Keith Busch <kbusch@kernel.org>
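
As a rough illustration of the NPDG/NPDA computation described in the
commit message (a paraphrased sketch, not the patch itself; variable names
are approximate):

    BlockDriverInfo bdi;
    uint32_t lbasz = ns->blkconf.logical_block_size;

    /* start from the discard granularity (4 KiB by default, or whatever
     * the user set via the discard_granularity property) */
    uint32_t npdg = ns->blkconf.discard_granularity / lbasz;

    /* account for the image format's cluster size when it is known;
     * bdrv_get_info is optional, so a failure here is not fatal */
    if (bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi) >= 0 &&
        bdi.cluster_size > ns->blkconf.discard_granularity) {
        npdg = bdi.cluster_size / lbasz;
    }

    /* NPDG and NPDA are 0's based values in units of logical blocks */
    id_ns->npdg = id_ns->npda = npdg - 1;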


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 2/5] hw/block/nvme: pull aio error handling
  2020-11-16 17:57   ` Keith Busch
@ 2020-11-16 18:18     ` Klaus Jensen
  2020-11-17  7:16       ` Klaus Jensen
  0 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2020-11-16 18:18 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Philippe Mathieu-Daudé

On Nov 16 09:57, Keith Busch wrote:
> On Thu, Nov 12, 2020 at 08:59:42PM +0100, Klaus Jensen wrote:
> > +static void nvme_aio_err(NvmeRequest *req, int ret)
> > +{
> > +    uint16_t status = NVME_SUCCESS;
> > +    Error *local_err = NULL;
> > +
> > +    switch (req->cmd.opcode) {
> > +    case NVME_CMD_READ:
> > +        status = NVME_UNRECOVERED_READ;
> > +        break;
> > +    case NVME_CMD_FLUSH:
> > +    case NVME_CMD_WRITE:
> > +    case NVME_CMD_WRITE_ZEROES:
> > +        status = NVME_WRITE_FAULT;
> > +        break;
> > +    default:
> > +        status = NVME_INTERNAL_DEV_ERROR;
> > +        break;
> > +    }
> 
> Just curious, is there potentially a more appropriate way to set an nvme
> status based on the value of 'ret'? What is 'ret' representing anyway?
> Are these errno values?
> 

Yes, it's errno values from down below.

But looking at this more closely, it actually looks like this is where
we should behave as dictated by the rerror and werror drive options.

I'll do a follow up patch to fix that.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 2/5] hw/block/nvme: pull aio error handling
  2020-11-16 18:18     ` Klaus Jensen
@ 2020-11-17  7:16       ` Klaus Jensen
  0 siblings, 0 replies; 18+ messages in thread
From: Klaus Jensen @ 2020-11-17  7:16 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Philippe Mathieu-Daudé

On Nov 16 19:18, Klaus Jensen wrote:
> On Nov 16 09:57, Keith Busch wrote:
> > On Thu, Nov 12, 2020 at 08:59:42PM +0100, Klaus Jensen wrote:
> > > +static void nvme_aio_err(NvmeRequest *req, int ret)
> > > +{
> > > +    uint16_t status = NVME_SUCCESS;
> > > +    Error *local_err = NULL;
> > > +
> > > +    switch (req->cmd.opcode) {
> > > +    case NVME_CMD_READ:
> > > +        status = NVME_UNRECOVERED_READ;
> > > +        break;
> > > +    case NVME_CMD_FLUSH:
> > > +    case NVME_CMD_WRITE:
> > > +    case NVME_CMD_WRITE_ZEROES:
> > > +        status = NVME_WRITE_FAULT;
> > > +        break;
> > > +    default:
> > > +        status = NVME_INTERNAL_DEV_ERROR;
> > > +        break;
> > > +    }
> > 
> > Just curious, is there potentially a more appropriate way to set an nvme
> > status based on the value of 'ret'? What is 'ret' representing anyway?
> > Are these errno values?
> > 
> 
> Yes, it's errno values from down below.
> 
> But looking at this more closely, it actually looks like this is where
> we should behave as dictated by the rerror and werror drive options.
> 
> I'll do a follow up patch to fix that.

So, following up on this after looking more into it.

Currently, the device is basically behaving as if werror and rerror were
both set to "report" - that is, report the error to the guest.

Since we currently do not support werror and rerror, I think it is fine
to behave as if both were set to "report" and return a meaningful status
code that fits the command that failed (if we can).

But I'll start working on a patch to support rerror/werror, since it
would be nice to support.
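
For what it is worth, a possible shape for such a follow-up would be to
consult the configured policy in the aio error path, assuming the generic
block-backend helpers (blk_get_error_action()/blk_error_action()) that
other emulated storage devices use; this is only a sketch under those
assumptions, not an actual patch:

    /* ret is the negative errno from the block layer; blk, is_read and
     * status come from the surrounding completion handler */
    BlockErrorAction action = blk_get_error_action(blk, is_read, -ret);

    switch (action) {
    case BLOCK_ERROR_ACTION_REPORT:
        /* what the device effectively does today */
        req->status = status;
        break;
    case BLOCK_ERROR_ACTION_IGNORE:
        /* complete the command successfully despite the backend error */
        req->status = NVME_SUCCESS;
        break;
    case BLOCK_ERROR_ACTION_STOP:
        /* keep the request around and retry it when the VM is resumed */
        break;
    default:
        break;
    }

    blk_error_action(blk, action, is_read, -ret);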

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 0/5] hw/block/nvme: dulbe and dsm support
  2020-11-12 19:59 [PATCH v8 0/5] hw/block/nvme: dulbe and dsm support Klaus Jensen
                   ` (4 preceding siblings ...)
  2020-11-12 19:59 ` [PATCH v8 5/5] hw/block/nvme: add the dataset management command Klaus Jensen
@ 2020-11-23 20:45 ` Klaus Jensen
  5 siblings, 0 replies; 18+ messages in thread
From: Klaus Jensen @ 2020-11-23 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz, Keith Busch,
	Philippe Mathieu-Daudé

On Nov 12 20:59, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> This adds support for the Deallocated or Unwritten Logical Block error
> recovery feature as well as the Dataset Management command.
> 
> v8:
>   - Move req->opaque clearing to nvme_req_clear.
>   - Add two preparation/cleanup patches.
> 
> v7:
>   - Handle negative return value from bdrv_block_status.
>   - bdrv_get_info may not be supported on all block drivers, so do not
>     consider it a fatal error.
> 
> v6:
>   - Skip the allocation of the discards integer and just use the opaque
>     value directly (Philippe)
>   - Split changes to include/block/nvme.h into a separate patch
>     (Philippe)
>   - Clean up some convoluted checks on the discards value (Philippe)
>   - Use unambiguous units in the commit messages (Philippe)
>   - Stack allocate the range array (Keith)
> 
> v5:
>   - Restore status code from callback (Keith)
> 
> v4:
>   - Removed mixed declaration and code (Keith)
>   - Set NPDG and NPDA and account for the blockdev cluster size.
> 
> Klaus Jensen (5):
>   hw/block/nvme: remove superfluous NvmeCtrl parameter
>   hw/block/nvme: pull aio error handling
>   hw/block/nvme: add dulbe support
>   nvme: add namespace I/O optimization fields to shared header
>   hw/block/nvme: add the dataset management command
> 
>  hw/block/nvme-ns.h    |   4 +
>  hw/block/nvme.h       |   2 +
>  include/block/nvme.h  |  12 +-
>  hw/block/nvme-ns.c    |  34 +++++-
>  hw/block/nvme.c       | 258 ++++++++++++++++++++++++++++++++++++------
>  hw/block/trace-events |   4 +
>  6 files changed, 276 insertions(+), 38 deletions(-)
> 
> -- 
> 2.29.2
> 

Thanks for the reviews everyone; applied to nvme-next.

^ permalink raw reply	[flat|nested] 18+ messages in thread
