qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/2] hw/block/nvme: dulbe and dsm support
@ 2020-10-22 18:49 Klaus Jensen
  2020-10-22 18:49 ` [PATCH v5 1/2] hw/block/nvme: add dulbe support Klaus Jensen
  2020-10-22 18:49 ` [PATCH v5 2/2] hw/block/nvme: add the dataset management command Klaus Jensen
  0 siblings, 2 replies; 7+ messages in thread
From: Klaus Jensen @ 2020-10-22 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz, Keith Busch,
	Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

This adds support for the Deallocated or Unwritten Logical Block error
recovery feature as well as the Dataset Management command.

I wanted to add support for the NPDG and NPDA fields such that the host
could get a hint on how many blocks to request deallocation of for the
deallocation to actually happen, but I cannot find a realiable way to
get the actual block size of the underlying device. If it is an image on
a file system we could typically use the host page size, but if it is a
raw device, we might have 512 byte sectors that we can issue discards
on. And QEMU doesn't seem to provide this without root privileges at
least.

See the two patches for some gotchas.

I also integrated this into my zoned proposal. I'll spare you the v4, nobody
cares anyway. But I put it in my repo[1] for posterity.

  [1]: https://irrelevant.dk/g/pci-nvme.git/tag/?h=zoned-v4.

v5:
  - Restore status code from callback (Keith)

v4:
  - Removed mixed declaration and code (Keith)
  - Set NPDG and NPDA and account for the blockdev cluster size.

Klaus Jensen (2):
  hw/block/nvme: add dulbe support
  hw/block/nvme: add the dataset management command

 hw/block/nvme-ns.h    |   4 +
 hw/block/nvme.h       |   2 +
 include/block/nvme.h  |  12 ++-
 hw/block/nvme-ns.c    |  40 +++++++--
 hw/block/nvme.c       | 184 +++++++++++++++++++++++++++++++++++++++++-
 hw/block/trace-events |   4 +
 6 files changed, 237 insertions(+), 9 deletions(-)

-- 
2.28.0



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v5 1/2] hw/block/nvme: add dulbe support
  2020-10-22 18:49 [PATCH v5 0/2] hw/block/nvme: dulbe and dsm support Klaus Jensen
@ 2020-10-22 18:49 ` Klaus Jensen
  2020-10-22 18:49 ` [PATCH v5 2/2] hw/block/nvme: add the dataset management command Klaus Jensen
  1 sibling, 0 replies; 7+ messages in thread
From: Klaus Jensen @ 2020-10-22 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz, Keith Busch,
	Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Add support for reporting the Deallocated or Unwritten Logical Block
Error (DULBE).

Rely on the block status flags reported by the block layer and consider
any block with the BDRV_BLOCK_ZERO flag to be deallocated.

Multiple factors affect when a Write Zeroes command result in
deallocation of blocks.

  * the underlying file system block size
  * the blockdev format
  * the 'discard' and 'logical_block_size' parameters

     format | discard | wz (512b)  wz (4kb)  wz (64kb)
    ---------------------------------------------------
      qcow2    ignore   n          n         y
      qcow2    unmap    n          n         y
      raw      ignore   n          y         y
      raw      unmap    n          y         y

So, this works best with an image in raw format and 4kb LBAs, since
holes can then be punched on a per-block basis (this assumes a file
system with a 4kb block size, YMMV). A qcow2 image, uses a cluster size
of 64kb by default and blocks will only be marked deallocated if a full
cluster is zeroed or discarded. However, this *is* consistent with the
spec since Write Zeroes "should" deallocate the block if the Deallocate
attribute is set and "may" deallocate if the Deallocate attribute is not
set. Thus, we always try to deallocate (the BDRV_REQ_MAY_UNMAP flag is
always set).

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
---
 hw/block/nvme-ns.h    |  4 +++
 include/block/nvme.h  |  5 +++
 hw/block/nvme-ns.c    |  8 +++--
 hw/block/nvme.c       | 83 +++++++++++++++++++++++++++++++++++++++++--
 hw/block/trace-events |  4 +++
 5 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 83734f4606e1..44bf6271b744 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -31,6 +31,10 @@ typedef struct NvmeNamespace {
     NvmeIdNs     id_ns;
 
     NvmeNamespaceParams params;
+
+    struct {
+        uint32_t err_rec;
+    } features;
 } NvmeNamespace;
 
 static inline uint32_t nvme_nsid(NvmeNamespace *ns)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 8a46d9cf015f..966c3bb304bd 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -687,6 +687,7 @@ enum NvmeStatusCodes {
     NVME_E2E_REF_ERROR          = 0x0284,
     NVME_CMP_FAILURE            = 0x0285,
     NVME_ACCESS_DENIED          = 0x0286,
+    NVME_DULB                   = 0x0287,
     NVME_MORE                   = 0x2000,
     NVME_DNR                    = 0x4000,
     NVME_NO_COMPLETE            = 0xffff,
@@ -903,6 +904,9 @@ enum NvmeIdCtrlLpa {
 #define NVME_AEC_NS_ATTR(aec)       ((aec >> 8) & 0x1)
 #define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
 
+#define NVME_ERR_REC_TLER(err_rec)  (err_rec & 0xffff)
+#define NVME_ERR_REC_DULBE(err_rec) (err_rec & 0x10000)
+
 enum NvmeFeatureIds {
     NVME_ARBITRATION                = 0x1,
     NVME_POWER_MANAGEMENT           = 0x2,
@@ -1023,6 +1027,7 @@ enum NvmeNsIdentifierType {
 
 
 #define NVME_ID_NS_NSFEAT_THIN(nsfeat)      ((nsfeat & 0x1))
+#define NVME_ID_NS_NSFEAT_DULBE(nsfeat)     ((nsfeat >> 2) & 0x1)
 #define NVME_ID_NS_FLBAS_EXTENDED(flbas)    ((flbas >> 4) & 0x1)
 #define NVME_ID_NS_FLBAS_INDEX(flbas)       ((flbas & 0xf))
 #define NVME_ID_NS_MC_SEPARATE(mc)          ((mc >> 1) & 0x1)
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 31c80cdf5b5f..f1cc734c60f5 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -33,9 +33,7 @@ static void nvme_ns_init(NvmeNamespace *ns)
     NvmeIdNs *id_ns = &ns->id_ns;
     int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
 
-    if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
-        ns->id_ns.dlfeat = 0x9;
-    }
+    ns->id_ns.dlfeat = 0x9;
 
     id_ns->lbaf[lba_index].ds = 31 - clz32(ns->blkconf.logical_block_size);
 
@@ -44,6 +42,9 @@ static void nvme_ns_init(NvmeNamespace *ns)
     /* no thin provisioning */
     id_ns->ncap = id_ns->nsze;
     id_ns->nuse = id_ns->ncap;
+
+    /* support DULBE */
+    id_ns->nsfeat |= 0x4;
 }
 
 static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -92,6 +93,7 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     }
 
     nvme_ns_init(ns);
+
     if (nvme_register_namespace(n, ns, errp)) {
         return -1;
     }
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index fa2cba744b57..4ab0705f5a92 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -105,6 +105,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
 
 static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
     [NVME_TEMPERATURE_THRESHOLD]    = NVME_FEAT_CAP_CHANGE,
+    [NVME_ERROR_RECOVERY]           = NVME_FEAT_CAP_CHANGE | NVME_FEAT_CAP_NS,
     [NVME_VOLATILE_WRITE_CACHE]     = NVME_FEAT_CAP_CHANGE,
     [NVME_NUMBER_OF_QUEUES]         = NVME_FEAT_CAP_CHANGE,
     [NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
@@ -878,6 +879,41 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
     return NVME_SUCCESS;
 }
 
+static uint16_t nvme_check_dulbe(NvmeNamespace *ns, uint64_t slba,
+                                 uint32_t nlb)
+{
+    BlockDriverState *bs = blk_bs(ns->blkconf.blk);
+
+    int64_t pnum = 0, bytes = nvme_l2b(ns, nlb);
+    int64_t offset = nvme_l2b(ns, slba);
+    bool zeroed;
+    int ret;
+
+    /*
+     * `pnum` holds the number of bytes after offset that shares the same
+     * allocation status as the byte at offset. If `pnum` is different from
+     * `bytes`, we should check the allocation status of the next range and
+     * continue this until all bytes have been checked.
+     */
+    do {
+        bytes -= pnum;
+
+        ret = bdrv_block_status(bs, offset, bytes, &pnum, NULL, NULL);
+
+        zeroed = !!(ret & BDRV_BLOCK_ZERO);
+
+        trace_pci_nvme_block_status(offset, bytes, pnum, ret, zeroed);
+
+        if (zeroed) {
+            return NVME_DULB;
+        }
+
+        offset += pnum;
+    } while (pnum != bytes);
+
+    return NVME_SUCCESS;
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
     NvmeRequest *req = opaque;
@@ -985,6 +1021,15 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
         goto invalid;
     }
 
+    if (acct == BLOCK_ACCT_READ) {
+        if (NVME_ERR_REC_DULBE(ns->features.err_rec)) {
+            status = nvme_check_dulbe(ns, slba, nlb);
+            if (status) {
+                goto invalid;
+            }
+        }
+    }
+
     status = nvme_map_dptr(n, data_size, req);
     if (status) {
         goto invalid;
@@ -1632,6 +1677,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
     uint8_t fid = NVME_GETSETFEAT_FID(dw10);
     NvmeGetFeatureSelect sel = NVME_GETFEAT_SELECT(dw10);
     uint16_t iv;
+    NvmeNamespace *ns;
 
     static const uint32_t nvme_feature_default[NVME_FID_MAX] = {
         [NVME_ARBITRATION] = NVME_ARB_AB_NOLIMIT,
@@ -1694,6 +1740,18 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
         }
 
         return NVME_INVALID_FIELD | NVME_DNR;
+    case NVME_ERROR_RECOVERY:
+        if (!nvme_nsid_valid(n, nsid)) {
+            return NVME_INVALID_NSID | NVME_DNR;
+        }
+
+        ns = nvme_ns(n, nsid);
+        if (unlikely(!ns)) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+
+        result = ns->features.err_rec;
+        goto out;
     case NVME_VOLATILE_WRITE_CACHE:
         result = n->features.vwc;
         trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -1766,7 +1824,7 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
 
 static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
 {
-    NvmeNamespace *ns;
+    NvmeNamespace *ns = NULL;
 
     NvmeCmd *cmd = &req->cmd;
     uint32_t dw10 = le32_to_cpu(cmd->cdw10);
@@ -1774,6 +1832,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
     uint32_t nsid = le32_to_cpu(cmd->nsid);
     uint8_t fid = NVME_GETSETFEAT_FID(dw10);
     uint8_t save = NVME_SETFEAT_SAVE(dw10);
+    int i;
 
     trace_pci_nvme_setfeat(nvme_cid(req), nsid, fid, save, dw11);
 
@@ -1833,11 +1892,31 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
                                NVME_LOG_SMART_INFO);
         }
 
+        break;
+    case NVME_ERROR_RECOVERY:
+        if (nsid == NVME_NSID_BROADCAST) {
+            for (i = 1; i <= n->num_namespaces; i++) {
+                ns = nvme_ns(n, i);
+
+                if (!ns) {
+                    continue;
+                }
+
+                if (NVME_ID_NS_NSFEAT_DULBE(ns->id_ns.nsfeat)) {
+                    ns->features.err_rec = dw11;
+                }
+            }
+
+            break;
+        }
+
+        assert(ns);
+        ns->features.err_rec = dw11;
         break;
     case NVME_VOLATILE_WRITE_CACHE:
         n->features.vwc = dw11 & 0x1;
 
-        for (int i = 1; i <= n->num_namespaces; i++) {
+        for (i = 1; i <= n->num_namespaces; i++) {
             ns = nvme_ns(n, i);
             if (!ns) {
                 continue;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index c1537e3ac0b0..1ffe0b3f76b5 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -43,6 +43,10 @@ pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opna
 pci_nvme_rw(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
 pci_nvme_write_zeroes(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
+pci_nvme_block_status(int64_t offset, int64_t bytes, int64_t pnum, int ret, bool zeroed) "offset %"PRId64" bytes %"PRId64" pnum %"PRId64" ret 0x%x zeroed %d"
+pci_nvme_dsm(uint16_t cid, uint32_t nsid, uint32_t nr, uint32_t attr) "cid %"PRIu16" nsid %"PRIu32" nr %"PRIu32" attr 0x%"PRIx32""
+pci_nvme_dsm_deallocate(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
+pci_nvme_aio_discard_cb(uint16_t cid) "cid %"PRIu16""
 pci_nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
 pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
 pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v5 2/2] hw/block/nvme: add the dataset management command
  2020-10-22 18:49 [PATCH v5 0/2] hw/block/nvme: dulbe and dsm support Klaus Jensen
  2020-10-22 18:49 ` [PATCH v5 1/2] hw/block/nvme: add dulbe support Klaus Jensen
@ 2020-10-22 18:49 ` Klaus Jensen
  2020-10-22 21:02   ` Philippe Mathieu-Daudé
  1 sibling, 1 reply; 7+ messages in thread
From: Klaus Jensen @ 2020-10-22 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz, Keith Busch,
	Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Add support for the Dataset Management command and the Deallocate
attribute. Deallocation results in discards being sent to the underlying
block device. Whether of not the blocks are actually deallocated is
affected by the same factors as Write Zeroes (see previous commit).

     format | discard | dsm (512b)  dsm (4kb)  dsm (64kb)
    ------------------------------------------------------
      qcow2    ignore   n           n          n
      qcow2    unmap    n           n          y
      raw      ignore   n           n          n
      raw      unmap    n           y          y

Again, a raw format and 4kb LBAs are preferable.

In order to set the Namespace Preferred Deallocate Granularity and
Alignment fields (NPDG and NPDA), choose a sane minimum discard
granularity of 4kb. If we are using a passthru device supporting discard
at a 512b granularity, user should set the discard_granularity property
explicitly. NPDG and NPDA will also account for the cluster_size of the
block driver if required (i.e. for QCOW2).

See NVM Express 1.3d, Section 6.7 ("Dataset Management command").

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.h      |   2 +
 include/block/nvme.h |   7 ++-
 hw/block/nvme-ns.c   |  36 +++++++++++++--
 hw/block/nvme.c      | 101 ++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 140 insertions(+), 6 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a50..574333caa3f9 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -28,6 +28,7 @@ typedef struct NvmeRequest {
     struct NvmeNamespace    *ns;
     BlockAIOCB              *aiocb;
     uint16_t                status;
+    void                    *opaque;
     NvmeCqe                 cqe;
     NvmeCmd                 cmd;
     BlockAcctCookie         acct;
@@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
     case NVME_CMD_WRITE:            return "NVME_NVM_CMD_WRITE";
     case NVME_CMD_READ:             return "NVME_NVM_CMD_READ";
     case NVME_CMD_WRITE_ZEROES:     return "NVME_NVM_CMD_WRITE_ZEROES";
+    case NVME_CMD_DSM:              return "NVME_NVM_CMD_DSM";
     default:                        return "NVME_NVM_CMD_UNKNOWN";
     }
 }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 966c3bb304bd..e95ff6ca9b37 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
     uint16_t    nabspf;
     uint16_t    noiob;
     uint8_t     nvmcap[16];
-    uint8_t     rsvd64[40];
+    uint16_t    npwg;
+    uint16_t    npwa;
+    uint16_t    npdg;
+    uint16_t    npda;
+    uint16_t    nows;
+    uint8_t     rsvd74[30];
     uint8_t     nguid[16];
     uint64_t    eui64;
     NvmeLBAF    lbaf[16];
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index f1cc734c60f5..840651db7256 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -28,10 +28,14 @@
 #include "nvme.h"
 #include "nvme-ns.h"
 
-static void nvme_ns_init(NvmeNamespace *ns)
+#define MIN_DISCARD_GRANULARITY (4 * KiB)
+
+static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
 {
+    BlockDriverInfo bdi;
     NvmeIdNs *id_ns = &ns->id_ns;
     int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
+    int npdg, ret;
 
     ns->id_ns.dlfeat = 0x9;
 
@@ -43,8 +47,25 @@ static void nvme_ns_init(NvmeNamespace *ns)
     id_ns->ncap = id_ns->nsze;
     id_ns->nuse = id_ns->ncap;
 
-    /* support DULBE */
-    id_ns->nsfeat |= 0x4;
+    /* support DULBE and I/O optimization fields */
+    id_ns->nsfeat |= (0x4 | 0x10);
+
+    npdg = ns->blkconf.discard_granularity / ns->blkconf.logical_block_size;
+
+    ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "could not get block driver info");
+        return ret;
+    }
+
+    if (bdi.cluster_size &&
+        bdi.cluster_size > ns->blkconf.discard_granularity) {
+        npdg = bdi.cluster_size / ns->blkconf.logical_block_size;
+    }
+
+    id_ns->npda = id_ns->npdg = npdg - 1;
+
+    return 0;
 }
 
 static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -59,6 +80,11 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
         return -1;
     }
 
+    if (ns->blkconf.discard_granularity == -1) {
+        ns->blkconf.discard_granularity =
+            MAX(ns->blkconf.logical_block_size, MIN_DISCARD_GRANULARITY);
+    }
+
     ns->size = blk_getlength(ns->blkconf.blk);
     if (ns->size < 0) {
         error_setg_errno(errp, -ns->size, "could not get blockdev size");
@@ -92,7 +118,9 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
         return -1;
     }
 
-    nvme_ns_init(ns);
+    if (nvme_ns_init(ns, errp)) {
+        return -1;
+    }
 
     if (nvme_register_namespace(n, ns, errp)) {
         return -1;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4ab0705f5a92..7acb9e9dc38a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -959,6 +959,103 @@ static void nvme_rw_cb(void *opaque, int ret)
     nvme_enqueue_req_completion(nvme_cq(req), req);
 }
 
+static void nvme_aio_discard_cb(void *opaque, int ret)
+{
+    NvmeRequest *req = opaque;
+    int *discards = req->opaque;
+
+    trace_pci_nvme_aio_discard_cb(nvme_cid(req));
+
+    if (ret) {
+        req->status = NVME_INTERNAL_DEV_ERROR;
+        trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret),
+                               req->status);
+    }
+
+    if (discards && --(*discards) > 0) {
+        return;
+    }
+
+    g_free(req->opaque);
+    req->opaque = NULL;
+
+    nvme_enqueue_req_completion(nvme_cq(req), req);
+}
+
+static uint16_t nvme_dsm(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeNamespace *ns = req->ns;
+    NvmeDsmCmd *dsm = (NvmeDsmCmd *) &req->cmd;
+    NvmeDsmRange *range = NULL;
+    int *discards = NULL;
+
+    uint32_t attr = le32_to_cpu(dsm->attributes);
+    uint32_t nr = (le32_to_cpu(dsm->nr) & 0xff) + 1;
+
+    uint16_t status = NVME_SUCCESS;
+
+    trace_pci_nvme_dsm(nvme_cid(req), nvme_nsid(ns), nr, attr);
+
+    if (attr & NVME_DSMGMT_AD) {
+        int64_t offset;
+        size_t len;
+
+        range = g_new(NvmeDsmRange, nr);
+
+        status = nvme_dma(n, (uint8_t *)range, nr * sizeof(NvmeDsmRange),
+                          DMA_DIRECTION_TO_DEVICE, req);
+        if (status) {
+            goto out;
+        }
+
+        discards = g_new(int, 1);
+        *discards = 1;
+        req->opaque = discards;
+
+        for (int i = 0; i < nr; i++) {
+            uint64_t slba = le64_to_cpu(range[i].slba);
+            uint32_t nlb = le32_to_cpu(range[i].nlb);
+
+            if (nvme_check_bounds(n, ns, slba, nlb)) {
+                trace_pci_nvme_err_invalid_lba_range(slba, nlb,
+                                                     ns->id_ns.nsze);
+                continue;
+            }
+
+            trace_pci_nvme_dsm_deallocate(nvme_cid(req), nvme_nsid(ns), slba,
+                                          nlb);
+
+            offset = nvme_l2b(ns, slba);
+            len = nvme_l2b(ns, nlb);
+
+            while (len) {
+                size_t bytes = MIN(BDRV_REQUEST_MAX_BYTES, len);
+
+                (*discards)++;
+
+                blk_aio_pdiscard(ns->blkconf.blk, offset, bytes,
+                                 nvme_aio_discard_cb, req);
+
+                offset += bytes;
+                len -= bytes;
+            }
+        }
+
+        if (--(*discards)) {
+            status = NVME_NO_COMPLETE;
+        } else {
+            g_free(discards);
+            req->opaque = NULL;
+            status = req->status;
+        }
+    }
+
+out:
+    g_free(range);
+
+    return status;
+}
+
 static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
 {
     block_acct_start(blk_get_stats(req->ns->blkconf.blk), &req->acct, 0,
@@ -1088,6 +1185,8 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
         return nvme_rw(n, req);
+    case NVME_CMD_DSM:
+        return nvme_dsm(n, req);
     default:
         trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
@@ -2813,7 +2912,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->cqes = (0x4 << 4) | 0x4;
     id->nn = cpu_to_le32(n->num_namespaces);
     id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROES | NVME_ONCS_TIMESTAMP |
-                           NVME_ONCS_FEATURES);
+                           NVME_ONCS_FEATURES | NVME_ONCS_DSM);
 
     id->vwc = 0x1;
     id->sgls = cpu_to_le32(NVME_CTRL_SGLS_SUPPORT_NO_ALIGN |
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command
  2020-10-22 18:49 ` [PATCH v5 2/2] hw/block/nvme: add the dataset management command Klaus Jensen
@ 2020-10-22 21:02   ` Philippe Mathieu-Daudé
  2020-10-22 21:24     ` Keith Busch
  2020-10-23  5:25     ` Klaus Jensen
  0 siblings, 2 replies; 7+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-10-22 21:02 UTC (permalink / raw)
  To: Klaus Jensen, qemu-devel
  Cc: Kevin Wolf, Klaus Jensen, Keith Busch, qemu-block, Max Reitz

Hi Klaus,

On 10/22/20 8:49 PM, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> Add support for the Dataset Management command and the Deallocate
> attribute. Deallocation results in discards being sent to the underlying
> block device. Whether of not the blocks are actually deallocated is
> affected by the same factors as Write Zeroes (see previous commit).
> 
>       format | discard | dsm (512b)  dsm (4kb)  dsm (64kb)

Please use B/KiB units which are unambiguous (kb is for kbits)
(if you queue this yourself, you can fix when applying, no need
to repost).

>      ------------------------------------------------------
>        qcow2    ignore   n           n          n
>        qcow2    unmap    n           n          y
>        raw      ignore   n           n          n
>        raw      unmap    n           y          y
> 
> Again, a raw format and 4kb LBAs are preferable.
> 
> In order to set the Namespace Preferred Deallocate Granularity and
> Alignment fields (NPDG and NPDA), choose a sane minimum discard
> granularity of 4kb. If we are using a passthru device supporting discard
> at a 512b granularity, user should set the discard_granularity property

Ditto.

> explicitly. NPDG and NPDA will also account for the cluster_size of the
> block driver if required (i.e. for QCOW2).
> 
> See NVM Express 1.3d, Section 6.7 ("Dataset Management command").
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>   hw/block/nvme.h      |   2 +
>   include/block/nvme.h |   7 ++-
>   hw/block/nvme-ns.c   |  36 +++++++++++++--
>   hw/block/nvme.c      | 101 ++++++++++++++++++++++++++++++++++++++++++-
>   4 files changed, 140 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index e080a2318a50..574333caa3f9 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -28,6 +28,7 @@ typedef struct NvmeRequest {
>       struct NvmeNamespace    *ns;
>       BlockAIOCB              *aiocb;
>       uint16_t                status;
> +    void                    *opaque;
>       NvmeCqe                 cqe;
>       NvmeCmd                 cmd;
>       BlockAcctCookie         acct;
> @@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
>       case NVME_CMD_WRITE:            return "NVME_NVM_CMD_WRITE";
>       case NVME_CMD_READ:             return "NVME_NVM_CMD_READ";
>       case NVME_CMD_WRITE_ZEROES:     return "NVME_NVM_CMD_WRITE_ZEROES";
> +    case NVME_CMD_DSM:              return "NVME_NVM_CMD_DSM";
>       default:                        return "NVME_NVM_CMD_UNKNOWN";
>       }
>   }
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 966c3bb304bd..e95ff6ca9b37 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
>       uint16_t    nabspf;
>       uint16_t    noiob;
>       uint8_t     nvmcap[16];
> -    uint8_t     rsvd64[40];
> +    uint16_t    npwg;
> +    uint16_t    npwa;
> +    uint16_t    npdg;
> +    uint16_t    npda;
> +    uint16_t    nows;
> +    uint8_t     rsvd74[30];
>       uint8_t     nguid[16];
>       uint64_t    eui64;
>       NvmeLBAF    lbaf[16];

If you consider "block/nvme.h" shared by 2 different subsystems,
it is better to add the changes in a separate patch. That way
the changes can be acked individually.

> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index f1cc734c60f5..840651db7256 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -28,10 +28,14 @@
>   #include "nvme.h"
>   #include "nvme-ns.h"
>   
> -static void nvme_ns_init(NvmeNamespace *ns)
> +#define MIN_DISCARD_GRANULARITY (4 * KiB)
> +
> +static int nvme_ns_init(NvmeNamespace *ns, Error **errp)

Hmm the Error* argument could be squashed in "hw/block/nvme:
support multiple namespaces". Else better split patch in dumb
units IMHO (maybe a reviewer's taste).

>   {
> +    BlockDriverInfo bdi;
>       NvmeIdNs *id_ns = &ns->id_ns;
>       int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> +    int npdg, ret;
>   
>       ns->id_ns.dlfeat = 0x9;
>   
> @@ -43,8 +47,25 @@ static void nvme_ns_init(NvmeNamespace *ns)
>       id_ns->ncap = id_ns->nsze;
>       id_ns->nuse = id_ns->ncap;
>   
> -    /* support DULBE */
> -    id_ns->nsfeat |= 0x4;
> +    /* support DULBE and I/O optimization fields */
> +    id_ns->nsfeat |= (0x4 | 0x10);

The comment helps, but isn't needed if you use explicit definitions
for these flags. You already introduced the NVME_ID_NS_NSFEAT_DULBE
and NVME_ID_NS_FLBAS_EXTENDED but they are restricted to extract bits.
This is why I personally prefer the registerfields API (see
"hw/registerfields.h").

> +
> +    npdg = ns->blkconf.discard_granularity / ns->blkconf.logical_block_size;
> +
> +    ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "could not get block driver info");
> +        return ret;
> +    }
> +
> +    if (bdi.cluster_size &&
> +        bdi.cluster_size > ns->blkconf.discard_granularity) {
> +        npdg = bdi.cluster_size / ns->blkconf.logical_block_size;
> +    }
> +
> +    id_ns->npda = id_ns->npdg = npdg - 1;
> +
> +    return 0;
>   }
>   
>   static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> @@ -59,6 +80,11 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>           return -1;
>       }
>   
> +    if (ns->blkconf.discard_granularity == -1) {
> +        ns->blkconf.discard_granularity =
> +            MAX(ns->blkconf.logical_block_size, MIN_DISCARD_GRANULARITY);
> +    }
> +
>       ns->size = blk_getlength(ns->blkconf.blk);
>       if (ns->size < 0) {
>           error_setg_errno(errp, -ns->size, "could not get blockdev size");
> @@ -92,7 +118,9 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>           return -1;
>       }
>   
> -    nvme_ns_init(ns);
> +    if (nvme_ns_init(ns, errp)) {
> +        return -1;
> +    }
>   
>       if (nvme_register_namespace(n, ns, errp)) {
>           return -1;
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 4ab0705f5a92..7acb9e9dc38a 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -959,6 +959,103 @@ static void nvme_rw_cb(void *opaque, int ret)
>       nvme_enqueue_req_completion(nvme_cq(req), req);
>   }
>   
> +static void nvme_aio_discard_cb(void *opaque, int ret)
> +{
> +    NvmeRequest *req = opaque;
> +    int *discards = req->opaque;
> +
> +    trace_pci_nvme_aio_discard_cb(nvme_cid(req));
> +
> +    if (ret) {
> +        req->status = NVME_INTERNAL_DEV_ERROR;
> +        trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret),
> +                               req->status);
> +    }
> +
> +    if (discards && --(*discards) > 0) {

This line is hard to read.

> +        return;
> +    }
> +
> +    g_free(req->opaque);
> +    req->opaque = NULL;
> +
> +    nvme_enqueue_req_completion(nvme_cq(req), req);
> +}
> +
> +static uint16_t nvme_dsm(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeNamespace *ns = req->ns;
> +    NvmeDsmCmd *dsm = (NvmeDsmCmd *) &req->cmd;
> +    NvmeDsmRange *range = NULL;

g_autofree?

> +    int *discards = NULL;
> +
> +    uint32_t attr = le32_to_cpu(dsm->attributes);
> +    uint32_t nr = (le32_to_cpu(dsm->nr) & 0xff) + 1;
> +
> +    uint16_t status = NVME_SUCCESS;
> +
> +    trace_pci_nvme_dsm(nvme_cid(req), nvme_nsid(ns), nr, attr);
> +
> +    if (attr & NVME_DSMGMT_AD) {
> +        int64_t offset;
> +        size_t len;
> +
> +        range = g_new(NvmeDsmRange, nr);
> +
> +        status = nvme_dma(n, (uint8_t *)range, nr * sizeof(NvmeDsmRange),
> +                          DMA_DIRECTION_TO_DEVICE, req);
> +        if (status) {
> +            goto out;
> +        }
> +
> +        discards = g_new(int, 1);
> +        *discards = 1;
> +        req->opaque = discards;

If opaque is a pointer, why not instead using it as an uintptr_t
discards directly without using the heap?

> +
> +        for (int i = 0; i < nr; i++) {
> +            uint64_t slba = le64_to_cpu(range[i].slba);
> +            uint32_t nlb = le32_to_cpu(range[i].nlb);
> +
> +            if (nvme_check_bounds(n, ns, slba, nlb)) {
> +                trace_pci_nvme_err_invalid_lba_range(slba, nlb,
> +                                                     ns->id_ns.nsze);
> +                continue;
> +            }
> +
> +            trace_pci_nvme_dsm_deallocate(nvme_cid(req), nvme_nsid(ns), slba,
> +                                          nlb);
> +
> +            offset = nvme_l2b(ns, slba);
> +            len = nvme_l2b(ns, nlb);
> +
> +            while (len) {
> +                size_t bytes = MIN(BDRV_REQUEST_MAX_BYTES, len);
> +
> +                (*discards)++;
> +
> +                blk_aio_pdiscard(ns->blkconf.blk, offset, bytes,
> +                                 nvme_aio_discard_cb, req);
> +
> +                offset += bytes;
> +                len -= bytes;
> +            }
> +        }
> +
> +        if (--(*discards)) {
> +            status = NVME_NO_COMPLETE;
> +        } else {
> +            g_free(discards);
> +            req->opaque = NULL;
> +            status = req->status;
> +        }
> +    }
> +
> +out:
> +    g_free(range);
> +
> +    return status;
> +}
[...]



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command
  2020-10-22 21:02   ` Philippe Mathieu-Daudé
@ 2020-10-22 21:24     ` Keith Busch
  2020-10-23  5:25     ` Klaus Jensen
  1 sibling, 0 replies; 7+ messages in thread
From: Keith Busch @ 2020-10-22 21:24 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz,
	Klaus Jensen

On Thu, Oct 22, 2020 at 11:02:04PM +0200, Philippe Mathieu-Daudé wrote:
> On 10/22/20 8:49 PM, Klaus Jensen wrote:
> > +static uint16_t nvme_dsm(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +    NvmeNamespace *ns = req->ns;
> > +    NvmeDsmCmd *dsm = (NvmeDsmCmd *) &req->cmd;
> > +    NvmeDsmRange *range = NULL;
> 
> g_autofree?

Or just allocate the array on the stack. The upper limit is just 512
entries.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command
  2020-10-22 21:02   ` Philippe Mathieu-Daudé
  2020-10-22 21:24     ` Keith Busch
@ 2020-10-23  5:25     ` Klaus Jensen
  2020-10-23  6:24       ` Klaus Jensen
  1 sibling, 1 reply; 7+ messages in thread
From: Klaus Jensen @ 2020-10-23  5:25 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz, Keith Busch

[-- Attachment #1: Type: text/plain, Size: 9444 bytes --]

On Oct 22 23:02, Philippe Mathieu-Daudé wrote:
> Hi Klaus,
> 

Hi Philippe,

Thanks for your comments!

> On 10/22/20 8:49 PM, Klaus Jensen wrote:
> > From: Klaus Jensen <k.jensen@samsung.com>
> > 
> > Add support for the Dataset Management command and the Deallocate
> > attribute. Deallocation results in discards being sent to the underlying
> > block device. Whether of not the blocks are actually deallocated is
> > affected by the same factors as Write Zeroes (see previous commit).
> > 
> >       format | discard | dsm (512b)  dsm (4kb)  dsm (64kb)
> 
> Please use B/KiB units which are unambiguous (kb is for kbits)
> (if you queue this yourself, you can fix when applying, no need
> to repost).
> 

Thanks, I'll change it.

> >      ------------------------------------------------------
> >        qcow2    ignore   n           n          n
> >        qcow2    unmap    n           n          y
> >        raw      ignore   n           n          n
> >        raw      unmap    n           y          y
> > 
> > Again, a raw format and 4kb LBAs are preferable.
> > 
> > In order to set the Namespace Preferred Deallocate Granularity and
> > Alignment fields (NPDG and NPDA), choose a sane minimum discard
> > granularity of 4kb. If we are using a passthru device supporting discard
> > at a 512b granularity, user should set the discard_granularity property
> 
> Ditto.
> 
> > explicitly. NPDG and NPDA will also account for the cluster_size of the
> > block driver if required (i.e. for QCOW2).
> > 
> > See NVM Express 1.3d, Section 6.7 ("Dataset Management command").
> > 
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > ---
> >   hw/block/nvme.h      |   2 +
> >   include/block/nvme.h |   7 ++-
> >   hw/block/nvme-ns.c   |  36 +++++++++++++--
> >   hw/block/nvme.c      | 101 ++++++++++++++++++++++++++++++++++++++++++-
> >   4 files changed, 140 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index e080a2318a50..574333caa3f9 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -28,6 +28,7 @@ typedef struct NvmeRequest {
> >       struct NvmeNamespace    *ns;
> >       BlockAIOCB              *aiocb;
> >       uint16_t                status;
> > +    void                    *opaque;
> >       NvmeCqe                 cqe;
> >       NvmeCmd                 cmd;
> >       BlockAcctCookie         acct;
> > @@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
> >       case NVME_CMD_WRITE:            return "NVME_NVM_CMD_WRITE";
> >       case NVME_CMD_READ:             return "NVME_NVM_CMD_READ";
> >       case NVME_CMD_WRITE_ZEROES:     return "NVME_NVM_CMD_WRITE_ZEROES";
> > +    case NVME_CMD_DSM:              return "NVME_NVM_CMD_DSM";
> >       default:                        return "NVME_NVM_CMD_UNKNOWN";
> >       }
> >   }
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 966c3bb304bd..e95ff6ca9b37 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
> >       uint16_t    nabspf;
> >       uint16_t    noiob;
> >       uint8_t     nvmcap[16];
> > -    uint8_t     rsvd64[40];
> > +    uint16_t    npwg;
> > +    uint16_t    npwa;
> > +    uint16_t    npdg;
> > +    uint16_t    npda;
> > +    uint16_t    nows;
> > +    uint8_t     rsvd74[30];
> >       uint8_t     nguid[16];
> >       uint64_t    eui64;
> >       NvmeLBAF    lbaf[16];
> 
> If you consider "block/nvme.h" shared by 2 different subsystems,
> it is better to add the changes in a separate patch. That way
> the changes can be acked individually.
> 

Sure. Some other stuff here warrents a v6 I think, so I will split it.

> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index f1cc734c60f5..840651db7256 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -28,10 +28,14 @@
> >   #include "nvme.h"
> >   #include "nvme-ns.h"
> > -static void nvme_ns_init(NvmeNamespace *ns)
> > +#define MIN_DISCARD_GRANULARITY (4 * KiB)
> > +
> > +static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
> 
> Hmm the Error* argument could be squashed in "hw/block/nvme:
> support multiple namespaces". Else better split patch in dumb
> units IMHO (maybe a reviewer's taste).
> 

Yeah, I guess I can squash that in.

> >   {
> > +    BlockDriverInfo bdi;
> >       NvmeIdNs *id_ns = &ns->id_ns;
> >       int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> > +    int npdg, ret;
> >       ns->id_ns.dlfeat = 0x9;
> > @@ -43,8 +47,25 @@ static void nvme_ns_init(NvmeNamespace *ns)
> >       id_ns->ncap = id_ns->nsze;
> >       id_ns->nuse = id_ns->ncap;
> > -    /* support DULBE */
> > -    id_ns->nsfeat |= 0x4;
> > +    /* support DULBE and I/O optimization fields */
> > +    id_ns->nsfeat |= (0x4 | 0x10);
> 
> The comment helps, but isn't needed if you use explicit definitions
> for these flags. You already introduced the NVME_ID_NS_NSFEAT_DULBE
> and NVME_ID_NS_FLBAS_EXTENDED but they are restricted to extract bits.
> This is why I personally prefer the registerfields API (see
> "hw/registerfields.h").
> 

I've been wanting to fix those constants - but the convention that they
only extract bits pre-dates the nvme device and is from when the nvme
block driver was introduced - I have just been following the precedence
by defining them like that.

> > +
> > +    npdg = ns->blkconf.discard_granularity / ns->blkconf.logical_block_size;
> > +
> > +    ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "could not get block driver info");
> > +        return ret;
> > +    }
> > +
> > +    if (bdi.cluster_size &&
> > +        bdi.cluster_size > ns->blkconf.discard_granularity) {
> > +        npdg = bdi.cluster_size / ns->blkconf.logical_block_size;
> > +    }
> > +
> > +    id_ns->npda = id_ns->npdg = npdg - 1;
> > +
> > +    return 0;
> >   }
> >   static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > @@ -59,6 +80,11 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> >           return -1;
> >       }
> > +    if (ns->blkconf.discard_granularity == -1) {
> > +        ns->blkconf.discard_granularity =
> > +            MAX(ns->blkconf.logical_block_size, MIN_DISCARD_GRANULARITY);
> > +    }
> > +
> >       ns->size = blk_getlength(ns->blkconf.blk);
> >       if (ns->size < 0) {
> >           error_setg_errno(errp, -ns->size, "could not get blockdev size");
> > @@ -92,7 +118,9 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> >           return -1;
> >       }
> > -    nvme_ns_init(ns);
> > +    if (nvme_ns_init(ns, errp)) {
> > +        return -1;
> > +    }
> >       if (nvme_register_namespace(n, ns, errp)) {
> >           return -1;
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 4ab0705f5a92..7acb9e9dc38a 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -959,6 +959,103 @@ static void nvme_rw_cb(void *opaque, int ret)
> >       nvme_enqueue_req_completion(nvme_cq(req), req);
> >   }
> > +static void nvme_aio_discard_cb(void *opaque, int ret)
> > +{
> > +    NvmeRequest *req = opaque;
> > +    int *discards = req->opaque;
> > +
> > +    trace_pci_nvme_aio_discard_cb(nvme_cid(req));
> > +
> > +    if (ret) {
> > +        req->status = NVME_INTERNAL_DEV_ERROR;
> > +        trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret),
> > +                               req->status);
> > +    }
> > +
> > +    if (discards && --(*discards) > 0) {
> 
> This line is hard to read.
> 

Yes. Probably too clever for my own good. I'll fix it up.

> > +        return;
> > +    }
> > +
> > +    g_free(req->opaque);
> > +    req->opaque = NULL;
> > +
> > +    nvme_enqueue_req_completion(nvme_cq(req), req);
> > +}
> > +
> > +static uint16_t nvme_dsm(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +    NvmeNamespace *ns = req->ns;
> > +    NvmeDsmCmd *dsm = (NvmeDsmCmd *) &req->cmd;
> > +    NvmeDsmRange *range = NULL;
> 
> g_autofree?
> 

What sorcery is this?! I think I'll just replace it with a stack
allocation like Keith suggested.

> > +    int *discards = NULL;
> > +
> > +    uint32_t attr = le32_to_cpu(dsm->attributes);
> > +    uint32_t nr = (le32_to_cpu(dsm->nr) & 0xff) + 1;
> > +
> > +    uint16_t status = NVME_SUCCESS;
> > +
> > +    trace_pci_nvme_dsm(nvme_cid(req), nvme_nsid(ns), nr, attr);
> > +
> > +    if (attr & NVME_DSMGMT_AD) {
> > +        int64_t offset;
> > +        size_t len;
> > +
> > +        range = g_new(NvmeDsmRange, nr);
> > +
> > +        status = nvme_dma(n, (uint8_t *)range, nr * sizeof(NvmeDsmRange),
> > +                          DMA_DIRECTION_TO_DEVICE, req);
> > +        if (status) {
> > +            goto out;
> > +        }
> > +
> > +        discards = g_new(int, 1);
> > +        *discards = 1;
> > +        req->opaque = discards;
> 
> If opaque is a pointer, why not instead using it as an uintptr_t
> discards directly without using the heap?
> 

It was a "keep it simple, stupid" thing to just do the allocation. DSM
is typically not on the fast path ;)

But there is really no reason not to use that here.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command
  2020-10-23  5:25     ` Klaus Jensen
@ 2020-10-23  6:24       ` Klaus Jensen
  0 siblings, 0 replies; 7+ messages in thread
From: Klaus Jensen @ 2020-10-23  6:24 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz, Keith Busch

[-- Attachment #1: Type: text/plain, Size: 964 bytes --]

On Oct 23 07:25, Klaus Jensen wrote:
> On Oct 22 23:02, Philippe Mathieu-Daudé wrote:
> > On 10/22/20 8:49 PM, Klaus Jensen wrote:
> > > -    /* support DULBE */
> > > -    id_ns->nsfeat |= 0x4;
> > > +    /* support DULBE and I/O optimization fields */
> > > +    id_ns->nsfeat |= (0x4 | 0x10);
> > 
> > The comment helps, but isn't needed if you use explicit definitions
> > for these flags. You already introduced the NVME_ID_NS_NSFEAT_DULBE
> > and NVME_ID_NS_FLBAS_EXTENDED but they are restricted to extract bits.
> > This is why I personally prefer the registerfields API (see
> > "hw/registerfields.h").
> > 
> 
> I've been wanting to fix those constants - but the convention that they
> only extract bits pre-dates the nvme device and is from when the nvme
> block driver was introduced - I have just been following the precedence
> by defining them like that.
> 

I did not know about the hw/registerfields.h API. Looks promising!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-10-23  6:28 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-22 18:49 [PATCH v5 0/2] hw/block/nvme: dulbe and dsm support Klaus Jensen
2020-10-22 18:49 ` [PATCH v5 1/2] hw/block/nvme: add dulbe support Klaus Jensen
2020-10-22 18:49 ` [PATCH v5 2/2] hw/block/nvme: add the dataset management command Klaus Jensen
2020-10-22 21:02   ` Philippe Mathieu-Daudé
2020-10-22 21:24     ` Keith Busch
2020-10-23  5:25     ` Klaus Jensen
2020-10-23  6:24       ` Klaus Jensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).