Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command

From: Klaus Jensen <its@irrelevant.dk>
To: "Philippe Mathieu-Daudé" <philmd@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	qemu-block@nongnu.org, Klaus Jensen <k.jensen@samsung.com>,
	qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>,
	Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command
Date: Fri, 23 Oct 2020 07:25:12 +0200	[thread overview]
Message-ID: <20201023052512.GA244769@apples.localdomain> (raw)
In-Reply-To: <05c41ab7-039f-f327-c6a0-5864430f5ba6@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 9444 bytes --]

On Oct 22 23:02, Philippe Mathieu-Daudé wrote:
> Hi Klaus,
> 

Hi Philippe,

Thanks for your comments!

> On 10/22/20 8:49 PM, Klaus Jensen wrote:
> > From: Klaus Jensen <k.jensen@samsung.com>
> > 
> > Add support for the Dataset Management command and the Deallocate
> > attribute. Deallocation results in discards being sent to the underlying
> > block device. Whether of not the blocks are actually deallocated is
> > affected by the same factors as Write Zeroes (see previous commit).
> > 
> >       format | discard | dsm (512b)  dsm (4kb)  dsm (64kb)
> 
> Please use B/KiB units which are unambiguous (kb is for kbits)
> (if you queue this yourself, you can fix when applying, no need
> to repost).
> 

Thanks, I'll change it.

> >      ------------------------------------------------------
> >        qcow2    ignore   n           n          n
> >        qcow2    unmap    n           n          y
> >        raw      ignore   n           n          n
> >        raw      unmap    n           y          y
> > 
> > Again, a raw format and 4kb LBAs are preferable.
> > 
> > In order to set the Namespace Preferred Deallocate Granularity and
> > Alignment fields (NPDG and NPDA), choose a sane minimum discard
> > granularity of 4kb. If we are using a passthru device supporting discard
> > at a 512b granularity, user should set the discard_granularity property
> 
> Ditto.
> 
> > explicitly. NPDG and NPDA will also account for the cluster_size of the
> > block driver if required (i.e. for QCOW2).
> > 
> > See NVM Express 1.3d, Section 6.7 ("Dataset Management command").
> > 
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > ---
> >   hw/block/nvme.h      |   2 +
> >   include/block/nvme.h |   7 ++-
> >   hw/block/nvme-ns.c   |  36 +++++++++++++--
> >   hw/block/nvme.c      | 101 ++++++++++++++++++++++++++++++++++++++++++-
> >   4 files changed, 140 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index e080a2318a50..574333caa3f9 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -28,6 +28,7 @@ typedef struct NvmeRequest {
> >       struct NvmeNamespace    *ns;
> >       BlockAIOCB              *aiocb;
> >       uint16_t                status;
> > +    void                    *opaque;
> >       NvmeCqe                 cqe;
> >       NvmeCmd                 cmd;
> >       BlockAcctCookie         acct;
> > @@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
> >       case NVME_CMD_WRITE:            return "NVME_NVM_CMD_WRITE";
> >       case NVME_CMD_READ:             return "NVME_NVM_CMD_READ";
> >       case NVME_CMD_WRITE_ZEROES:     return "NVME_NVM_CMD_WRITE_ZEROES";
> > +    case NVME_CMD_DSM:              return "NVME_NVM_CMD_DSM";
> >       default:                        return "NVME_NVM_CMD_UNKNOWN";
> >       }
> >   }
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 966c3bb304bd..e95ff6ca9b37 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
> >       uint16_t    nabspf;
> >       uint16_t    noiob;
> >       uint8_t     nvmcap[16];
> > -    uint8_t     rsvd64[40];
> > +    uint16_t    npwg;
> > +    uint16_t    npwa;
> > +    uint16_t    npdg;
> > +    uint16_t    npda;
> > +    uint16_t    nows;
> > +    uint8_t     rsvd74[30];
> >       uint8_t     nguid[16];
> >       uint64_t    eui64;
> >       NvmeLBAF    lbaf[16];
> 
> If you consider "block/nvme.h" shared by 2 different subsystems,
> it is better to add the changes in a separate patch. That way
> the changes can be acked individually.
> 

Sure. Some other stuff here warrents a v6 I think, so I will split it.

> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index f1cc734c60f5..840651db7256 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -28,10 +28,14 @@
> >   #include "nvme.h"
> >   #include "nvme-ns.h"
> > -static void nvme_ns_init(NvmeNamespace *ns)
> > +#define MIN_DISCARD_GRANULARITY (4 * KiB)
> > +
> > +static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
> 
> Hmm the Error* argument could be squashed in "hw/block/nvme:
> support multiple namespaces". Else better split patch in dumb
> units IMHO (maybe a reviewer's taste).
> 

Yeah, I guess I can squash that in.

> >   {
> > +    BlockDriverInfo bdi;
> >       NvmeIdNs *id_ns = &ns->id_ns;
> >       int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> > +    int npdg, ret;
> >       ns->id_ns.dlfeat = 0x9;
> > @@ -43,8 +47,25 @@ static void nvme_ns_init(NvmeNamespace *ns)
> >       id_ns->ncap = id_ns->nsze;
> >       id_ns->nuse = id_ns->ncap;
> > -    /* support DULBE */
> > -    id_ns->nsfeat |= 0x4;
> > +    /* support DULBE and I/O optimization fields */
> > +    id_ns->nsfeat |= (0x4 | 0x10);
> 
> The comment helps, but isn't needed if you use explicit definitions
> for these flags. You already introduced the NVME_ID_NS_NSFEAT_DULBE
> and NVME_ID_NS_FLBAS_EXTENDED but they are restricted to extract bits.
> This is why I personally prefer the registerfields API (see
> "hw/registerfields.h").
> 

I've been wanting to fix those constants - but the convention that they
only extract bits pre-dates the nvme device and is from when the nvme
block driver was introduced - I have just been following the precedence
by defining them like that.

> > +
> > +    npdg = ns->blkconf.discard_granularity / ns->blkconf.logical_block_size;
> > +
> > +    ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "could not get block driver info");
> > +        return ret;
> > +    }
> > +
> > +    if (bdi.cluster_size &&
> > +        bdi.cluster_size > ns->blkconf.discard_granularity) {
> > +        npdg = bdi.cluster_size / ns->blkconf.logical_block_size;
> > +    }
> > +
> > +    id_ns->npda = id_ns->npdg = npdg - 1;
> > +
> > +    return 0;
> >   }
> >   static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > @@ -59,6 +80,11 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> >           return -1;
> >       }
> > +    if (ns->blkconf.discard_granularity == -1) {
> > +        ns->blkconf.discard_granularity =
> > +            MAX(ns->blkconf.logical_block_size, MIN_DISCARD_GRANULARITY);
> > +    }
> > +
> >       ns->size = blk_getlength(ns->blkconf.blk);
> >       if (ns->size < 0) {
> >           error_setg_errno(errp, -ns->size, "could not get blockdev size");
> > @@ -92,7 +118,9 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> >           return -1;
> >       }
> > -    nvme_ns_init(ns);
> > +    if (nvme_ns_init(ns, errp)) {
> > +        return -1;
> > +    }
> >       if (nvme_register_namespace(n, ns, errp)) {
> >           return -1;
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 4ab0705f5a92..7acb9e9dc38a 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -959,6 +959,103 @@ static void nvme_rw_cb(void *opaque, int ret)
> >       nvme_enqueue_req_completion(nvme_cq(req), req);
> >   }
> > +static void nvme_aio_discard_cb(void *opaque, int ret)
> > +{
> > +    NvmeRequest *req = opaque;
> > +    int *discards = req->opaque;
> > +
> > +    trace_pci_nvme_aio_discard_cb(nvme_cid(req));
> > +
> > +    if (ret) {
> > +        req->status = NVME_INTERNAL_DEV_ERROR;
> > +        trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret),
> > +                               req->status);
> > +    }
> > +
> > +    if (discards && --(*discards) > 0) {
> 
> This line is hard to read.
> 

Yes. Probably too clever for my own good. I'll fix it up.

> > +        return;
> > +    }
> > +
> > +    g_free(req->opaque);
> > +    req->opaque = NULL;
> > +
> > +    nvme_enqueue_req_completion(nvme_cq(req), req);
> > +}
> > +
> > +static uint16_t nvme_dsm(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +    NvmeNamespace *ns = req->ns;
> > +    NvmeDsmCmd *dsm = (NvmeDsmCmd *) &req->cmd;
> > +    NvmeDsmRange *range = NULL;
> 
> g_autofree?
> 

What sorcery is this?! I think I'll just replace it with a stack
allocation like Keith suggested.

> > +    int *discards = NULL;
> > +
> > +    uint32_t attr = le32_to_cpu(dsm->attributes);
> > +    uint32_t nr = (le32_to_cpu(dsm->nr) & 0xff) + 1;
> > +
> > +    uint16_t status = NVME_SUCCESS;
> > +
> > +    trace_pci_nvme_dsm(nvme_cid(req), nvme_nsid(ns), nr, attr);
> > +
> > +    if (attr & NVME_DSMGMT_AD) {
> > +        int64_t offset;
> > +        size_t len;
> > +
> > +        range = g_new(NvmeDsmRange, nr);
> > +
> > +        status = nvme_dma(n, (uint8_t *)range, nr * sizeof(NvmeDsmRange),
> > +                          DMA_DIRECTION_TO_DEVICE, req);
> > +        if (status) {
> > +            goto out;
> > +        }
> > +
> > +        discards = g_new(int, 1);
> > +        *discards = 1;
> > +        req->opaque = discards;
> 
> If opaque is a pointer, why not instead using it as an uintptr_t
> discards directly without using the heap?
> 

It was a "keep it simple, stupid" thing to just do the allocation. DSM
is typically not on the fast path ;)

But there is really no reason not to use that here.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]