All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
@ 2017-03-31  8:41 Haozhong Zhang
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address Haozhong Zhang
                   ` (5 more replies)
  0 siblings, 6 replies; 38+ messages in thread
From: Haozhong Zhang @ 2017-03-31  8:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: dan.j.williams, Haozhong Zhang, Michael S. Tsirkin,
	Igor Mammedov, Xiao Guangrong, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

This patch series constructs the flush hint address structures for
nvdimm devices in QEMU.

It's of course not for 2.9. I send it out early in order to get
comments on one point I'm uncertain (see the detailed explanation
below). Thanks for any comments in advance!


Background
---------------
Flush hint address structure is a substructure of NFIT and specifies
one or more addresses, namely Flush Hint Addresses. Software can write
to any one of these flush hint addresses to cause any preceding writes
to the NVDIMM region to be flushed out of the intervening platform
buffers to the targeted NVDIMM. More details can be found in ACPI Spec
6.1, Section 5.2.25.8 "Flush Hint Address Structure".


Why is it RFC?
---------------
RFC is added because I'm not sure whether the way in this patch series
that allocates the guest flush hint addresses is right.

QEMU needs to trap guest accesses (at least for writes) to the flush
hint addresses in order to perform the necessary flush on the host
back store. Therefore, QEMU needs to create IO memory regions that
cover those flush hint addresses. In order to create those IO memory
regions, QEMU needs to know the flush hint addresses or their offsets
to other known memory regions in advance. So far looks good.

Flush hint addresses are in the guest address space. Looking at how
the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural
to take the same way for flush hint addresses, i.e. let the guest
firmware allocate from free addresses and patch them in the flush hint
address structure. (*Please correct me If my following understand is wrong*)
However, the current allocation and pointer patching are transparent
to QEMU, so QEMU will be unaware of the flush hint addresses, and
consequently have no way to create corresponding IO memory regions in
order to trap guest accesses.

Alternatively, this patch series moves the allocation of flush hint
addresses to QEMU:

1. (Patch 1) We reserve an address range after the end address of each
   nvdimm device. Its size is specified by the user via a new pc-dimm
   option 'reserved-size'.

   For the following example,
        -object memory-backend-file,id=mem0,size=4G,...
        -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
        -device pc-dimm,id=dimm1,...
   if dimm0 is allocated to address N ~ N+4G, the address of dimm1
   will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
   dimm0.

2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
   structure for each nvdimm device, it will allocate them from the
   above reserved area, e.g. the flush hint addresses of above dimm0
   are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
   this way, so QEMU can easily create IO memory regions for them.

   If the reserved area is not present or too small, QEMU will report
   errors.


How to test?
---------------
Add options 'flush-hint' and 'reserved-size' when creating a nvdimm
device, e.g.
    qemu-system-x86_64 -machine pc,nvdimm \
                       -m 4G,slots=4,maxmem=128G \
                       -object memory-backend-file,id=mem1,share,mem-path=/dev/pmem0 \
                       -device nvdimm,id=nv1,memdev=mem1,reserved-size=4K,flush-hint \
                       ...

The guest OS should be able to find a flush hint address structure in
NFIT. For guest Linux kernel v4.8 or later which supports flush hint,
if QEMU is built with NVDIMM_DEBUG = 1 in include/hw/mem/nvdimm.h, it
will print debug messages like
    nvdimm: Write Flush Hint: offset 0x0, data 0x1
    nvdimm: Write Flush Hint: offset 0x4, data 0x0
when linux performs flush via flush hint address.



Haozhong Zhang (4):
  pc-dimm: add 'reserved-size' to reserve address range after the ending address
  nvdimm: add functions to initialize and perform flush on back store
  nvdimm acpi: record the cache line size in AcpiNVDIMMState
  nvdimm acpi: build flush hint address structure if required

 hw/acpi/nvdimm.c         | 111 ++++++++++++++++++++++++++++++++++++++++++++---
 hw/i386/pc.c             |   5 ++-
 hw/i386/pc_piix.c        |   2 +-
 hw/i386/pc_q35.c         |   2 +-
 hw/mem/nvdimm.c          |  48 ++++++++++++++++++++
 hw/mem/pc-dimm.c         |  48 ++++++++++++++++++--
 include/hw/mem/nvdimm.h  |  20 ++++++++-
 include/hw/mem/pc-dimm.h |   2 +
 8 files changed, 224 insertions(+), 14 deletions(-)

-- 
2.10.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address
  2017-03-31  8:41 [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Haozhong Zhang
@ 2017-03-31  8:41 ` Haozhong Zhang
  2017-04-06 10:24   ` Stefan Hajnoczi
  2017-04-06 11:50   ` Xiao Guangrong
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store Haozhong Zhang
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 38+ messages in thread
From: Haozhong Zhang @ 2017-03-31  8:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: dan.j.williams, Haozhong Zhang, Michael S. Tsirkin,
	Igor Mammedov, Xiao Guangrong

If option 'reserved-size=RSVD' is present, QEMU will reserve an
address range of size 'RSVD' after the ending address of pc-dimm
device.

For the following example,
    -object memory-backend-file,id=mem0,size=4G,...
    -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
    -device pc-dimm,id=dimm1,...
if dimm0 is allocated to address N ~ N+4G, the address range of
dimm1 will start from N+4G+4K or higher.

Its current usage is to reserve spaces for flush hint addresses
of nvdimm devices.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 hw/mem/pc-dimm.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++---
 include/hw/mem/pc-dimm.h |  2 ++
 2 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 9e8dab0..13dcd71 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -28,6 +28,7 @@
 #include "sysemu/kvm.h"
 #include "trace.h"
 #include "hw/virtio/vhost.h"
+#include "exec/address-spaces.h"
 
 typedef struct pc_dimms_capacity {
      uint64_t size;
@@ -44,7 +45,12 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
     MemoryRegion *vmstate_mr = ddc->get_vmstate_memory_region(dimm);
     Error *local_err = NULL;
     uint64_t existing_dimms_capacity = 0;
-    uint64_t addr;
+    uint64_t addr, size = memory_region_size(mr);
+
+    size += object_property_get_int(OBJECT(dimm), PC_DIMM_RSVD_PROP, &local_err);
+    if (local_err) {
+        goto out;
+    }
 
     addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, &local_err);
     if (local_err) {
@@ -54,7 +60,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
     addr = pc_dimm_get_free_addr(hpms->base,
                                  memory_region_size(&hpms->mr),
                                  !addr ? NULL : &addr, align,
-                                 memory_region_size(mr), &local_err);
+                                 size, &local_err);
     if (local_err) {
         goto out;
     }
@@ -64,7 +70,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
         goto out;
     }
 
-    if (existing_dimms_capacity + memory_region_size(mr) >
+    if (existing_dimms_capacity + size >
         machine->maxram_size - machine->ram_size) {
         error_setg(&local_err, "not enough space, currently 0x%" PRIx64
                    " in use of total hot pluggable 0x" RAM_ADDR_FMT,
@@ -315,6 +321,9 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
         PCDIMMDevice *dimm = item->data;
         uint64_t dimm_size = object_property_get_int(OBJECT(dimm),
                                                      PC_DIMM_SIZE_PROP,
+                                                     errp) +
+                             object_property_get_int(OBJECT(dimm),
+                                                     PC_DIMM_RSVD_PROP,
                                                      errp);
         if (errp && *errp) {
             goto out;
@@ -382,6 +391,37 @@ static void pc_dimm_check_memdev_is_busy(Object *obj, const char *name,
     error_propagate(errp, local_err);
 }
 
+static void pc_dimm_get_reserved_size(Object *obj, Visitor *v, const char *name,
+                                      void *opaque, Error **errp)
+{
+    PCDIMMDevice *dimm = PC_DIMM(obj);
+    uint64_t value = dimm->reserved_size;
+
+    visit_type_size(v, name, &value, errp);
+}
+
+static void pc_dimm_set_reserved_size(Object *obj, Visitor *v, const char *name,
+                                      void *opaque, Error **errp)
+{
+    PCDIMMDevice *dimm = PC_DIMM(obj);
+    Error *local_err = NULL;
+    uint64_t value;
+
+    if (dimm->reserved_size) {
+        error_setg(&local_err, "cannot change 'reserved-size'");
+        goto out;
+    }
+
+    visit_type_size(v, name, &value, &local_err);
+    if (local_err) {
+        goto out;
+    }
+    dimm->reserved_size = value;
+
+ out:
+    error_propagate(errp, local_err);
+}
+
 static void pc_dimm_init(Object *obj)
 {
     PCDIMMDevice *dimm = PC_DIMM(obj);
@@ -393,6 +433,8 @@ static void pc_dimm_init(Object *obj)
                              pc_dimm_check_memdev_is_busy,
                              OBJ_PROP_LINK_UNREF_ON_RELEASE,
                              &error_abort);
+    object_property_add(obj, PC_DIMM_RSVD_PROP, "int", pc_dimm_get_reserved_size,
+                        pc_dimm_set_reserved_size, NULL, NULL, &error_abort);
 }
 
 static void pc_dimm_realize(DeviceState *dev, Error **errp)
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index 1e483f2..99c4dfd 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -33,6 +33,7 @@
 #define PC_DIMM_NODE_PROP "node"
 #define PC_DIMM_SIZE_PROP "size"
 #define PC_DIMM_MEMDEV_PROP "memdev"
+#define PC_DIMM_RSVD_PROP "reserved-size"
 
 #define PC_DIMM_UNASSIGNED_SLOT -1
 
@@ -53,6 +54,7 @@ typedef struct PCDIMMDevice {
     uint64_t addr;
     uint32_t node;
     int32_t slot;
+    uint64_t reserved_size;
     HostMemoryBackend *hostmem;
 } PCDIMMDevice;
 
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store
  2017-03-31  8:41 [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Haozhong Zhang
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address Haozhong Zhang
@ 2017-03-31  8:41 ` Haozhong Zhang
  2017-04-06 11:52   ` Xiao Guangrong
  2017-04-20 13:12   ` Igor Mammedov
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 3/4] nvdimm acpi: record the cache line size in AcpiNVDIMMState Haozhong Zhang
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 38+ messages in thread
From: Haozhong Zhang @ 2017-03-31  8:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: dan.j.williams, Haozhong Zhang, Michael S. Tsirkin,
	Igor Mammedov, Xiao Guangrong

fsync() is used to persist modifications to the back store. If the
host NVDIMM is used as the back store, fsync() on Linux will trigger
the write to the host flush hint address.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 hw/mem/nvdimm.c         | 22 ++++++++++++++++++++++
 include/hw/mem/nvdimm.h | 13 +++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index db896b0..484ab8b 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -78,6 +78,26 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
     return &nvdimm->nvdimm_mr;
 }
 
+static void nvdimm_flush_init(NVDIMMDevice *nvdimm, MemoryRegion *hostmem_mr)
+{
+    if (nvdimm->flush_hint_enabled) {
+        nvdimm->backend_fd = memory_region_get_fd(hostmem_mr);
+    } else {
+        nvdimm->backend_fd = -1;
+    }
+}
+
+void nvdimm_flush(NVDIMMDevice *nvdimm)
+{
+    if (nvdimm->backend_fd != -1) {
+        /*
+         * If the backend store is a physical NVDIMM device, fsync()
+         * will trigger the flush via the flush hint on the host device.
+         */
+        fsync(nvdimm->backend_fd);
+    }
+}
+
 static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
 {
     MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem, errp);
@@ -105,6 +125,8 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
     memory_region_init_alias(&nvdimm->nvdimm_mr, OBJECT(dimm),
                              "nvdimm-memory", mr, 0, pmem_size);
     nvdimm->nvdimm_mr.align = align;
+
+    nvdimm_flush_init(nvdimm, mr);
 }
 
 /*
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index 03e1ff9..eb71f41 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -71,6 +71,18 @@ struct NVDIMMDevice {
      * guest via ACPI NFIT and _FIT method if NVDIMM hotplug is supported.
      */
     MemoryRegion nvdimm_mr;
+
+    /*
+     * If true, a flush hint address structure will be built for this
+     * NVDIMM device.
+     */
+    bool flush_hint_enabled;
+    /*
+     * File descriptor of the backend store, which is used in nvdimm
+     * flush.  It's cached in NVDIMMDevice rather than being fetched
+     * at each request in order to accelerate the flush a little bit.
+     */
+    int backend_fd;
 };
 typedef struct NVDIMMDevice NVDIMMDevice;
 
@@ -132,4 +144,5 @@ void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
                        uint32_t ram_slots);
 void nvdimm_plug(AcpiNVDIMMState *state);
 void nvdimm_acpi_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev);
+void nvdimm_flush(NVDIMMDevice *nvdimm);
 #endif
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [Qemu-devel] [RFC PATCH 3/4] nvdimm acpi: record the cache line size in AcpiNVDIMMState
  2017-03-31  8:41 [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Haozhong Zhang
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address Haozhong Zhang
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store Haozhong Zhang
@ 2017-03-31  8:41 ` Haozhong Zhang
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required Haozhong Zhang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 38+ messages in thread
From: Haozhong Zhang @ 2017-03-31  8:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: dan.j.williams, Haozhong Zhang, Michael S. Tsirkin,
	Igor Mammedov, Xiao Guangrong, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

Software is allowed to write up to a cache line of data to the flush
hint address (ACPI spec 6.1, Table 5-135). NVDIMM ACPI code needs this
parameter to decide the address space size for flush hint addresses.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 hw/acpi/nvdimm.c        | 5 ++++-
 hw/i386/pc_piix.c       | 2 +-
 hw/i386/pc_q35.c        | 2 +-
 include/hw/mem/nvdimm.h | 5 ++++-
 4 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 8e7d6ec..ea2ac3e 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -881,7 +881,8 @@ void nvdimm_acpi_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev)
 }
 
 void nvdimm_init_acpi_state(AcpiNVDIMMState *state, MemoryRegion *io,
-                            FWCfgState *fw_cfg, Object *owner)
+                            FWCfgState *fw_cfg, Object *owner,
+                            unsigned int cache_line_size)
 {
     memory_region_init_io(&state->io_mr, owner, &nvdimm_dsm_ops, state,
                           "nvdimm-acpi-io", NVDIMM_ACPI_IO_LEN);
@@ -893,6 +894,8 @@ void nvdimm_init_acpi_state(AcpiNVDIMMState *state, MemoryRegion *io,
                     state->dsm_mem->len);
 
     nvdimm_init_fit_buffer(&state->fit_buf);
+
+    state->cache_line_size = cache_line_size;
 }
 
 #define NVDIMM_COMMON_DSM       "NCAL"
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 9f102aa..81dd379 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -298,7 +298,7 @@ static void pc_init1(MachineState *machine,
 
     if (pcms->acpi_nvdimm_state.is_enabled) {
         nvdimm_init_acpi_state(&pcms->acpi_nvdimm_state, system_io,
-                               pcms->fw_cfg, OBJECT(pcms));
+                               pcms->fw_cfg, OBJECT(pcms), 64);
     }
 }
 
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index dd792a8..19f5515 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -272,7 +272,7 @@ static void pc_q35_init(MachineState *machine)
 
     if (pcms->acpi_nvdimm_state.is_enabled) {
         nvdimm_init_acpi_state(&pcms->acpi_nvdimm_state, system_io,
-                               pcms->fw_cfg, OBJECT(pcms));
+                               pcms->fw_cfg, OBJECT(pcms), 64);
     }
 }
 
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index eb71f41..888def6 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -134,11 +134,14 @@ struct AcpiNVDIMMState {
 
     /* the IO region used by OSPM to transfer control to QEMU. */
     MemoryRegion io_mr;
+
+    unsigned int cache_line_size;
 };
 typedef struct AcpiNVDIMMState AcpiNVDIMMState;
 
 void nvdimm_init_acpi_state(AcpiNVDIMMState *state, MemoryRegion *io,
-                            FWCfgState *fw_cfg, Object *owner);
+                            FWCfgState *fw_cfg, Object *owner,
+                            unsigned int cache_line_size);
 void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
                        BIOSLinker *linker, AcpiNVDIMMState *state,
                        uint32_t ram_slots);
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required
  2017-03-31  8:41 [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Haozhong Zhang
                   ` (2 preceding siblings ...)
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 3/4] nvdimm acpi: record the cache line size in AcpiNVDIMMState Haozhong Zhang
@ 2017-03-31  8:41 ` Haozhong Zhang
  2017-04-06 10:13   ` Stefan Hajnoczi
                     ` (2 more replies)
  2017-04-06  9:39 ` [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Xiao Guangrong
  2017-04-06  9:43 ` Stefan Hajnoczi
  5 siblings, 3 replies; 38+ messages in thread
From: Haozhong Zhang @ 2017-03-31  8:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: dan.j.williams, Haozhong Zhang, Michael S. Tsirkin,
	Igor Mammedov, Xiao Guangrong, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost


Add an boolean option 'flush-hint' to device 'nvdimm'. If it's on, a
flush hint address structure will be constructed for each nvdimm
device.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 hw/acpi/nvdimm.c        | 106 +++++++++++++++++++++++++++++++++++++++++++++---
 hw/i386/pc.c            |   5 ++-
 hw/mem/nvdimm.c         |  26 ++++++++++++
 include/hw/mem/nvdimm.h |   2 +-
 4 files changed, 132 insertions(+), 7 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index ea2ac3e..a7ff0b2 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -32,6 +32,8 @@
 #include "hw/acpi/bios-linker-loader.h"
 #include "hw/nvram/fw_cfg.h"
 #include "hw/mem/nvdimm.h"
+#include "exec/address-spaces.h"
+#include "qapi/error.h"
 
 static int nvdimm_device_list(Object *obj, void *opaque)
 {
@@ -168,6 +170,22 @@ struct NvdimmNfitControlRegion {
 typedef struct NvdimmNfitControlRegion NvdimmNfitControlRegion;
 
 /*
+ * NVDIMM Flush Hint Address Structure
+ *
+ * It enables the data durability mechanism via ACPI.
+ */
+struct NvdimmNfitFlushHintAddress {
+    uint16_t type;
+    uint16_t length;
+    uint32_t nfit_handle;
+    uint16_t nr_flush_hint_addr;
+    uint8_t  reserved[6];
+#define NR_FLUSH_HINT_ADDR 1
+    uint64_t flush_hint_addr[NR_FLUSH_HINT_ADDR];
+} QEMU_PACKED;
+typedef struct NvdimmNfitFlushHintAddress NvdimmNfitFlushHintAddress;
+
+/*
  * Module serial number is a unique number for each device. We use the
  * slot id of NVDIMM device to generate this number so that each device
  * associates with a different number.
@@ -343,10 +361,79 @@ static void nvdimm_build_structure_dcr(GArray *structures, DeviceState *dev)
                                          (DSM) in DSM Spec Rev1.*/);
 }
 
-static GArray *nvdimm_build_device_structure(void)
+static uint64_t nvdimm_flush_hint_read(void *opaque, hwaddr addr, unsigned size)
+{
+    return 0;
+}
+
+static void nvdimm_flush_hint_write(void *opaque, hwaddr addr,
+                                    uint64_t data, unsigned size)
+{
+    nvdimm_debug("Write Flush Hint: offset 0x%"HWADDR_PRIx", data 0x%"PRIx64"\n",
+                 addr, data);
+    nvdimm_flush((NVDIMMDevice *)opaque);
+}
+
+static const MemoryRegionOps nvdimm_flush_hint_ops = {
+    .read = nvdimm_flush_hint_read,
+    .write = nvdimm_flush_hint_write,
+};
+
+/*
+ * ACPI 6.0: 5.2.25.7 Flush Hint Address Structure
+ */
+static void nvdimm_build_structure_flush_hint(GArray *structures,
+                                              DeviceState *dev,
+                                              unsigned int cache_line_size,
+                                              Error **errp)
+{
+    NvdimmNfitFlushHintAddress *flush_hint;
+    int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP, NULL);
+    PCDIMMDevice *dimm = PC_DIMM(dev);
+    NVDIMMDevice *nvdimm = NVDIMM(dev);
+    uint64_t addr;
+    unsigned int i;
+    MemoryRegion *mr;
+    Error *local_err = NULL;
+
+    if (!nvdimm->flush_hint_enabled) {
+        return;
+    }
+
+    if (cache_line_size * NR_FLUSH_HINT_ADDR > dimm->reserved_size) {
+        error_setg(&local_err,
+                   "insufficient reserved space for flush hint buffers");
+        goto out;
+    }
+
+    addr = object_property_get_int(OBJECT(dev), PC_DIMM_ADDR_PROP, NULL);
+    addr += object_property_get_int(OBJECT(dev), PC_DIMM_SIZE_PROP, NULL);
+
+    flush_hint = acpi_data_push(structures, sizeof(*flush_hint));
+    flush_hint->type = cpu_to_le16(6 /* Flush Hint Address Structure */);
+    flush_hint->length = cpu_to_le16(sizeof(*flush_hint));
+    flush_hint->nfit_handle = cpu_to_le32(nvdimm_slot_to_handle(slot));
+    flush_hint->nr_flush_hint_addr = cpu_to_le16(NR_FLUSH_HINT_ADDR);
+
+    for (i = 0; i < NR_FLUSH_HINT_ADDR; i++, addr += cache_line_size) {
+        flush_hint->flush_hint_addr[i] = cpu_to_le64(addr);
+
+        mr = g_new0(MemoryRegion, 1);
+        memory_region_init_io(mr, OBJECT(dev), &nvdimm_flush_hint_ops, nvdimm,
+                              "nvdimm-flush-hint", cache_line_size);
+        memory_region_add_subregion(get_system_memory(), addr, mr);
+    }
+
+ out:
+    error_propagate(errp, local_err);
+}
+
+static GArray *nvdimm_build_device_structure(AcpiNVDIMMState *state,
+                                             Error **errp)
 {
     GSList *device_list = nvdimm_get_device_list();
     GArray *structures = g_array_new(false, true /* clear */, 1);
+    Error *local_err = NULL;
 
     for (; device_list; device_list = device_list->next) {
         DeviceState *dev = device_list->data;
@@ -362,9 +449,17 @@ static GArray *nvdimm_build_device_structure(void)
 
         /* build NVDIMM Control Region Structure. */
         nvdimm_build_structure_dcr(structures, dev);
+
+        /* build Flush Hint Address Structure */
+        nvdimm_build_structure_flush_hint(structures, dev,
+                                          state->cache_line_size, &local_err);
+        if (local_err) {
+            break;
+        }
     }
     g_slist_free(device_list);
 
+    error_propagate(errp, local_err);
     return structures;
 }
 
@@ -373,16 +468,17 @@ static void nvdimm_init_fit_buffer(NvdimmFitBuffer *fit_buf)
     fit_buf->fit = g_array_new(false, true /* clear */, 1);
 }
 
-static void nvdimm_build_fit_buffer(NvdimmFitBuffer *fit_buf)
+static void nvdimm_build_fit_buffer(AcpiNVDIMMState *state, Error **errp)
 {
+    NvdimmFitBuffer *fit_buf = &state->fit_buf;
     g_array_free(fit_buf->fit, true);
-    fit_buf->fit = nvdimm_build_device_structure();
+    fit_buf->fit = nvdimm_build_device_structure(state, errp);
     fit_buf->dirty = true;
 }
 
-void nvdimm_plug(AcpiNVDIMMState *state)
+void nvdimm_plug(AcpiNVDIMMState *state, Error **errp)
 {
-    nvdimm_build_fit_buffer(&state->fit_buf);
+    nvdimm_build_fit_buffer(state, errp);
 }
 
 static void nvdimm_build_nfit(AcpiNVDIMMState *state, GArray *table_offsets,
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index d24388e..da4a5d7 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1718,7 +1718,10 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
                        "nvdimm is not enabled: missing 'nvdimm' in '-M'");
             goto out;
         }
-        nvdimm_plug(&pcms->acpi_nvdimm_state);
+        nvdimm_plug(&pcms->acpi_nvdimm_state, &local_err);
+        if (local_err) {
+            goto out;
+        }
     }
 
     hhc = HOTPLUG_HANDLER_GET_CLASS(pcms->acpi_dev);
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 484ab8b..a26908b 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -64,11 +64,37 @@ out:
     error_propagate(errp, local_err);
 }
 
+static bool nvdimm_get_flush_hint(Object *obj, Error **errp)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+    return nvdimm->flush_hint_enabled;
+}
+
+static void nvdimm_set_flush_hint(Object *obj, bool val, Error **errp)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+    Error *local_err = NULL;
+
+    if (nvdimm->flush_hint_enabled) {
+        error_setg(&local_err, "cannot change property value");
+        goto out;
+    }
+
+    nvdimm->flush_hint_enabled = val;
+
+ out:
+    error_propagate(errp, local_err);
+}
+
 static void nvdimm_init(Object *obj)
 {
     object_property_add(obj, "label-size", "int",
                         nvdimm_get_label_size, nvdimm_set_label_size, NULL,
                         NULL, NULL);
+    object_property_add_bool(obj, "flush-hint",
+                             nvdimm_get_flush_hint, nvdimm_set_flush_hint,
+                             NULL);
 }
 
 static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index 888def6..726f4d9 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -145,7 +145,7 @@ void nvdimm_init_acpi_state(AcpiNVDIMMState *state, MemoryRegion *io,
 void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
                        BIOSLinker *linker, AcpiNVDIMMState *state,
                        uint32_t ram_slots);
-void nvdimm_plug(AcpiNVDIMMState *state);
+void nvdimm_plug(AcpiNVDIMMState *state, Error **errp);
 void nvdimm_acpi_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev);
 void nvdimm_flush(NVDIMMDevice *nvdimm);
 #endif
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-03-31  8:41 [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Haozhong Zhang
                   ` (3 preceding siblings ...)
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required Haozhong Zhang
@ 2017-04-06  9:39 ` Xiao Guangrong
  2017-04-06  9:58   ` Haozhong Zhang
  2017-04-06  9:43 ` Stefan Hajnoczi
  5 siblings, 1 reply; 38+ messages in thread
From: Xiao Guangrong @ 2017-04-06  9:39 UTC (permalink / raw)
  To: Haozhong Zhang, qemu-devel
  Cc: dan.j.williams, Michael S. Tsirkin, Igor Mammedov, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost



On 31/03/2017 4:41 PM, Haozhong Zhang wrote:
> This patch series constructs the flush hint address structures for
> nvdimm devices in QEMU.
>
> It's of course not for 2.9. I send it out early in order to get
> comments on one point I'm uncertain (see the detailed explanation
> below). Thanks for any comments in advance!
>
>
> Background
> ---------------
> Flush hint address structure is a substructure of NFIT and specifies
> one or more addresses, namely Flush Hint Addresses. Software can write
> to any one of these flush hint addresses to cause any preceding writes
> to the NVDIMM region to be flushed out of the intervening platform
> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>
>
> Why is it RFC?
> ---------------
> RFC is added because I'm not sure whether the way in this patch series
> that allocates the guest flush hint addresses is right.
>
> QEMU needs to trap guest accesses (at least for writes) to the flush
> hint addresses in order to perform the necessary flush on the host
> back store. Therefore, QEMU needs to create IO memory regions that
> cover those flush hint addresses. In order to create those IO memory
> regions, QEMU needs to know the flush hint addresses or their offsets
> to other known memory regions in advance. So far looks good.
>
> Flush hint addresses are in the guest address space. Looking at how
> the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural
> to take the same way for flush hint addresses, i.e. let the guest
> firmware allocate from free addresses and patch them in the flush hint
> address structure. (*Please correct me If my following understand is wrong*)
> However, the current allocation and pointer patching are transparent
> to QEMU, so QEMU will be unaware of the flush hint addresses, and
> consequently have no way to create corresponding IO memory regions in
> order to trap guest accesses.

Er, it is awkward and flush-hint-table is static which may not be
easily patched.

>
> Alternatively, this patch series moves the allocation of flush hint
> addresses to QEMU:
>
> 1. (Patch 1) We reserve an address range after the end address of each
>    nvdimm device. Its size is specified by the user via a new pc-dimm
>    option 'reserved-size'.
>

We should make it only work for nvdimm?

>    For the following example,
>         -object memory-backend-file,id=mem0,size=4G,...
>         -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
>         -device pc-dimm,id=dimm1,...
>    if dimm0 is allocated to address N ~ N+4G, the address of dimm1
>    will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
>    dimm0.
>
> 2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
>    structure for each nvdimm device, it will allocate them from the
>    above reserved area, e.g. the flush hint addresses of above dimm0
>    are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
>    this way, so QEMU can easily create IO memory regions for them.
>
>    If the reserved area is not present or too small, QEMU will report
>    errors.
>

We should make 'reserved-size' always be page-aligned and should be
transparent to the user, i.e, automatically reserve 4k if 'flush-hint'
is specified?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-03-31  8:41 [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Haozhong Zhang
                   ` (4 preceding siblings ...)
  2017-04-06  9:39 ` [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Xiao Guangrong
@ 2017-04-06  9:43 ` Stefan Hajnoczi
  2017-04-06 10:31   ` Haozhong Zhang
                     ` (3 more replies)
  5 siblings, 4 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-04-06  9:43 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Xiao Guangrong, Michael S. Tsirkin, Eduardo Habkost,
	Paolo Bonzini, Igor Mammedov, dan.j.williams, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 3873 bytes --]

On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> This patch series constructs the flush hint address structures for
> nvdimm devices in QEMU.
> 
> It's of course not for 2.9. I send it out early in order to get
> comments on one point I'm uncertain (see the detailed explanation
> below). Thanks for any comments in advance!
> Background
> ---------------

Extra background:

Flush Hint Addresses are necessary because:

1. Some hardware configurations may require them.  In other words, a
   cache flush instruction is not enough to persist data.

2. The host file system may need fsync(2) calls (e.g. to persist
   metadata changes).

Without Flush Hint Addresses only some NVDIMM configurations actually
guarantee data persistence.

> Flush hint address structure is a substructure of NFIT and specifies
> one or more addresses, namely Flush Hint Addresses. Software can write
> to any one of these flush hint addresses to cause any preceding writes
> to the NVDIMM region to be flushed out of the intervening platform
> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".

Do you have performance data?  I'm concerned that Flush Hint Address
hardware interface is not virtualization-friendly.

In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:

  wmb();
  for (i = 0; i < nd_region->ndr_mappings; i++)
      if (ndrd_get_flush_wpq(ndrd, i, 0))
          writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
  wmb();

That looks pretty lightweight - it's an MMIO write between write
barriers.

This patch implements the MMIO write like this:

  void nvdimm_flush(NVDIMMDevice *nvdimm)
  {
      if (nvdimm->backend_fd != -1) {
          /*
           * If the backend store is a physical NVDIMM device, fsync()
           * will trigger the flush via the flush hint on the host device.
           */
          fsync(nvdimm->backend_fd);
      }
  }

The MMIO store instruction turned into a synchronous fsync(2) system
call plus vmexit/vmenter and QEMU userspace context switch:

1. The vcpu blocks during the fsync(2) system call.  The MMIO write
   instruction has an unexpected and huge latency.

2. The vcpu thread holds the QEMU global mutex so all other threads
   (including the monitor) are blocked during fsync(2).  Other vcpu
   threads may block if they vmexit.

It is hard to implement this efficiently in QEMU.  This is why I said
the hardware interface is not virtualization-friendly.  It's cheap on
real hardware but expensive under virtualization.

We should think about the optimal way of implementing Flush Hint
Addresses in QEMU.  But if there is no reasonable way to implement them
then I think it's better *not* to implement them, just like the Block
Window feature which is also not virtualization-friendly.  Users who
want a block device can use virtio-blk.  I don't think NVDIMM Block
Window can achieve better performance than virtio-blk under
virtualization (although I'm happy to be proven wrong).

Some ideas for a faster implementation:

1. Use memory_region_clear_global_locking() to avoid taking the QEMU
   global mutex.  Little synchronization is necessary as long as the
   NVDIMM device isn't hot unplugged (not yet supported anyway).

2. Can the host kernel provide a way to mmap Address Flush Hints from
   the physical NVDIMM in cases where the configuration does not require
   host kernel interception?  That way QEMU can map the physical
   NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
   is bypassed and performance would be good.

I'm not sure there is anything we can do to make the case where the host
kernel wants an fsync(2) fast :(.

Benchmark results would be important for deciding how big the problem
is.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-06  9:39 ` [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Xiao Guangrong
@ 2017-04-06  9:58   ` Haozhong Zhang
  2017-04-06 11:46     ` Xiao Guangrong
  0 siblings, 1 reply; 38+ messages in thread
From: Haozhong Zhang @ 2017-04-06  9:58 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: qemu-devel, dan.j.williams, Michael S. Tsirkin, Igor Mammedov,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

On 04/06/17 17:39 +0800, Xiao Guangrong wrote:
> 
> 
> On 31/03/2017 4:41 PM, Haozhong Zhang wrote:
> > This patch series constructs the flush hint address structures for
> > nvdimm devices in QEMU.
> > 
> > It's of course not for 2.9. I send it out early in order to get
> > comments on one point I'm uncertain (see the detailed explanation
> > below). Thanks for any comments in advance!
> > 
> > 
> > Background
> > ---------------
> > Flush hint address structure is a substructure of NFIT and specifies
> > one or more addresses, namely Flush Hint Addresses. Software can write
> > to any one of these flush hint addresses to cause any preceding writes
> > to the NVDIMM region to be flushed out of the intervening platform
> > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> > 
> > 
> > Why is it RFC?
> > ---------------
> > RFC is added because I'm not sure whether the way in this patch series
> > that allocates the guest flush hint addresses is right.
> > 
> > QEMU needs to trap guest accesses (at least for writes) to the flush
> > hint addresses in order to perform the necessary flush on the host
> > back store. Therefore, QEMU needs to create IO memory regions that
> > cover those flush hint addresses. In order to create those IO memory
> > regions, QEMU needs to know the flush hint addresses or their offsets
> > to other known memory regions in advance. So far looks good.
> > 
> > Flush hint addresses are in the guest address space. Looking at how
> > the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural
> > to take the same way for flush hint addresses, i.e. let the guest
> > firmware allocate from free addresses and patch them in the flush hint
> > address structure. (*Please correct me If my following understand is wrong*)
> > However, the current allocation and pointer patching are transparent
> > to QEMU, so QEMU will be unaware of the flush hint addresses, and
> > consequently have no way to create corresponding IO memory regions in
> > order to trap guest accesses.
> 
> Er, it is awkward and flush-hint-table is static which may not be
> easily patched.
> 
> > 
> > Alternatively, this patch series moves the allocation of flush hint
> > addresses to QEMU:
> > 
> > 1. (Patch 1) We reserve an address range after the end address of each
> >    nvdimm device. Its size is specified by the user via a new pc-dimm
> >    option 'reserved-size'.
> > 
> 
> We should make it only work for nvdimm?
>

Yes, we can check whether the machine option 'nvdimm' is present when
plugging the nvdimm.

> >    For the following example,
> >         -object memory-backend-file,id=mem0,size=4G,...
> >         -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
> >         -device pc-dimm,id=dimm1,...
> >    if dimm0 is allocated to address N ~ N+4G, the address of dimm1
> >    will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
> >    dimm0.
> > 
> > 2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
> >    structure for each nvdimm device, it will allocate them from the
> >    above reserved area, e.g. the flush hint addresses of above dimm0
> >    are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
> >    this way, so QEMU can easily create IO memory regions for them.
> > 
> >    If the reserved area is not present or too small, QEMU will report
> >    errors.
> > 
> 
> We should make 'reserved-size' always be page-aligned and should be
> transparent to the user, i.e, automatically reserve 4k if 'flush-hint'
> is specified?
>

4K alignment is already enforced by current memory plug code.

About the automatic reservation, is a non-zero default value
acceptable by qemu design/convention in general?

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required Haozhong Zhang
@ 2017-04-06 10:13   ` Stefan Hajnoczi
  2017-04-06 10:53     ` Haozhong Zhang
  2017-04-07 15:51     ` Dan Williams
  2017-04-06 10:25   ` Stefan Hajnoczi
  2017-04-20 11:22   ` Igor Mammedov
  2 siblings, 2 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-04-06 10:13 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Xiao Guangrong, Michael S. Tsirkin, Eduardo Habkost,
	Paolo Bonzini, Igor Mammedov, dan.j.williams, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 3211 bytes --]

On Fri, Mar 31, 2017 at 04:41:47PM +0800, Haozhong Zhang wrote:
> 
> Add an boolean option 'flush-hint' to device 'nvdimm'. If it's on, a
> flush hint address structure will be constructed for each nvdimm
> device.

Users should not need to set the flush hint option.  NVDIMM
configurations that persist data properly without Flush Hint Addresses
shouldn't use them (for best performance).  Configurations that rely on
flush hints *must* use them to guarantee data integrity.

I don't remember if there's a way to detect the configuration from host
userspace, but we should focus on setting this correctly without
requiring users to know which setting is necessary.

> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  hw/acpi/nvdimm.c        | 106 +++++++++++++++++++++++++++++++++++++++++++++---
>  hw/i386/pc.c            |   5 ++-
>  hw/mem/nvdimm.c         |  26 ++++++++++++
>  include/hw/mem/nvdimm.h |   2 +-
>  4 files changed, 132 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index ea2ac3e..a7ff0b2 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -32,6 +32,8 @@
>  #include "hw/acpi/bios-linker-loader.h"
>  #include "hw/nvram/fw_cfg.h"
>  #include "hw/mem/nvdimm.h"
> +#include "exec/address-spaces.h"
> +#include "qapi/error.h"
>  
>  static int nvdimm_device_list(Object *obj, void *opaque)
>  {
> @@ -168,6 +170,22 @@ struct NvdimmNfitControlRegion {
>  typedef struct NvdimmNfitControlRegion NvdimmNfitControlRegion;
>  
>  /*
> + * NVDIMM Flush Hint Address Structure
> + *
> + * It enables the data durability mechanism via ACPI.
> + */
> +struct NvdimmNfitFlushHintAddress {
> +    uint16_t type;
> +    uint16_t length;
> +    uint32_t nfit_handle;
> +    uint16_t nr_flush_hint_addr;
> +    uint8_t  reserved[6];
> +#define NR_FLUSH_HINT_ADDR 1
> +    uint64_t flush_hint_addr[NR_FLUSH_HINT_ADDR];

How should the number of flush hints be set?  This patch hardcodes it to
1 but the Linux nvdimm drivers tries to balance between flush hint
addresses to improve performance (to prevent cache line bouncing?).

> +} QEMU_PACKED;
> +typedef struct NvdimmNfitFlushHintAddress NvdimmNfitFlushHintAddress;
> +
> +/*
>   * Module serial number is a unique number for each device. We use the
>   * slot id of NVDIMM device to generate this number so that each device
>   * associates with a different number.
> @@ -343,10 +361,79 @@ static void nvdimm_build_structure_dcr(GArray *structures, DeviceState *dev)
>                                           (DSM) in DSM Spec Rev1.*/);
>  }
>  
> -static GArray *nvdimm_build_device_structure(void)
> +static uint64_t nvdimm_flush_hint_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    return 0;
> +}
> +
> +static void nvdimm_flush_hint_write(void *opaque, hwaddr addr,
> +                                    uint64_t data, unsigned size)
> +{
> +    nvdimm_debug("Write Flush Hint: offset 0x%"HWADDR_PRIx", data 0x%"PRIx64"\n",
> +                 addr, data);
> +    nvdimm_flush((NVDIMMDevice *)opaque);

C automatically casts void * to any other pointer type.  The cast is
unnecessary.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address Haozhong Zhang
@ 2017-04-06 10:24   ` Stefan Hajnoczi
  2017-04-06 10:46     ` Haozhong Zhang
  2017-04-06 11:50   ` Xiao Guangrong
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-04-06 10:24 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, dan.j.williams, Xiao Guangrong, Igor Mammedov,
	Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 5872 bytes --]

On Fri, Mar 31, 2017 at 04:41:44PM +0800, Haozhong Zhang wrote:
> If option 'reserved-size=RSVD' is present, QEMU will reserve an
> address range of size 'RSVD' after the ending address of pc-dimm
> device.
> 
> For the following example,
>     -object memory-backend-file,id=mem0,size=4G,...
>     -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...

"reserved-size" is not a clear name.  I suggest calling the property
"num-flush-hints" (default 0).  QEMU can calculate the actual size in
bytes.

  -device nvdimm,num-flush-hints=1

QEMU will use one flush hint and reserve enough space (e.g. 1 page) for
the MMIO region.

>     -device pc-dimm,id=dimm1,...
> if dimm0 is allocated to address N ~ N+4G, the address range of
> dimm1 will start from N+4G+4K or higher.
> 
> Its current usage is to reserve spaces for flush hint addresses
> of nvdimm devices.
> 
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  hw/mem/pc-dimm.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++---
>  include/hw/mem/pc-dimm.h |  2 ++
>  2 files changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
> index 9e8dab0..13dcd71 100644
> --- a/hw/mem/pc-dimm.c
> +++ b/hw/mem/pc-dimm.c
> @@ -28,6 +28,7 @@
>  #include "sysemu/kvm.h"
>  #include "trace.h"
>  #include "hw/virtio/vhost.h"
> +#include "exec/address-spaces.h"
>  
>  typedef struct pc_dimms_capacity {
>       uint64_t size;
> @@ -44,7 +45,12 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
>      MemoryRegion *vmstate_mr = ddc->get_vmstate_memory_region(dimm);
>      Error *local_err = NULL;
>      uint64_t existing_dimms_capacity = 0;
> -    uint64_t addr;
> +    uint64_t addr, size = memory_region_size(mr);
> +
> +    size += object_property_get_int(OBJECT(dimm), PC_DIMM_RSVD_PROP, &local_err);
> +    if (local_err) {
> +        goto out;
> +    }
>  
>      addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, &local_err);
>      if (local_err) {
> @@ -54,7 +60,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
>      addr = pc_dimm_get_free_addr(hpms->base,
>                                   memory_region_size(&hpms->mr),
>                                   !addr ? NULL : &addr, align,
> -                                 memory_region_size(mr), &local_err);
> +                                 size, &local_err);
>      if (local_err) {
>          goto out;
>      }
> @@ -64,7 +70,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
>          goto out;
>      }
>  
> -    if (existing_dimms_capacity + memory_region_size(mr) >
> +    if (existing_dimms_capacity + size >
>          machine->maxram_size - machine->ram_size) {
>          error_setg(&local_err, "not enough space, currently 0x%" PRIx64
>                     " in use of total hot pluggable 0x" RAM_ADDR_FMT,
> @@ -315,6 +321,9 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
>          PCDIMMDevice *dimm = item->data;
>          uint64_t dimm_size = object_property_get_int(OBJECT(dimm),
>                                                       PC_DIMM_SIZE_PROP,
> +                                                     errp) +
> +                             object_property_get_int(OBJECT(dimm),
> +                                                     PC_DIMM_RSVD_PROP,
>                                                       errp);
>          if (errp && *errp) {
>              goto out;
> @@ -382,6 +391,37 @@ static void pc_dimm_check_memdev_is_busy(Object *obj, const char *name,
>      error_propagate(errp, local_err);
>  }
>  
> +static void pc_dimm_get_reserved_size(Object *obj, Visitor *v, const char *name,
> +                                      void *opaque, Error **errp)
> +{
> +    PCDIMMDevice *dimm = PC_DIMM(obj);
> +    uint64_t value = dimm->reserved_size;
> +
> +    visit_type_size(v, name, &value, errp);
> +}
> +
> +static void pc_dimm_set_reserved_size(Object *obj, Visitor *v, const char *name,
> +                                      void *opaque, Error **errp)
> +{
> +    PCDIMMDevice *dimm = PC_DIMM(obj);
> +    Error *local_err = NULL;
> +    uint64_t value;
> +
> +    if (dimm->reserved_size) {
> +        error_setg(&local_err, "cannot change 'reserved-size'");
> +        goto out;
> +    }
> +
> +    visit_type_size(v, name, &value, &local_err);
> +    if (local_err) {
> +        goto out;
> +    }
> +    dimm->reserved_size = value;
> +
> + out:
> +    error_propagate(errp, local_err);
> +}
> +
>  static void pc_dimm_init(Object *obj)
>  {
>      PCDIMMDevice *dimm = PC_DIMM(obj);
> @@ -393,6 +433,8 @@ static void pc_dimm_init(Object *obj)
>                               pc_dimm_check_memdev_is_busy,
>                               OBJ_PROP_LINK_UNREF_ON_RELEASE,
>                               &error_abort);
> +    object_property_add(obj, PC_DIMM_RSVD_PROP, "int", pc_dimm_get_reserved_size,
> +                        pc_dimm_set_reserved_size, NULL, NULL, &error_abort);
>  }
>  
>  static void pc_dimm_realize(DeviceState *dev, Error **errp)
> diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
> index 1e483f2..99c4dfd 100644
> --- a/include/hw/mem/pc-dimm.h
> +++ b/include/hw/mem/pc-dimm.h
> @@ -33,6 +33,7 @@
>  #define PC_DIMM_NODE_PROP "node"
>  #define PC_DIMM_SIZE_PROP "size"
>  #define PC_DIMM_MEMDEV_PROP "memdev"
> +#define PC_DIMM_RSVD_PROP "reserved-size"
>  
>  #define PC_DIMM_UNASSIGNED_SLOT -1
>  
> @@ -53,6 +54,7 @@ typedef struct PCDIMMDevice {
>      uint64_t addr;
>      uint32_t node;
>      int32_t slot;
> +    uint64_t reserved_size;
>      HostMemoryBackend *hostmem;
>  } PCDIMMDevice;
>  
> -- 
> 2.10.1
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required Haozhong Zhang
  2017-04-06 10:13   ` Stefan Hajnoczi
@ 2017-04-06 10:25   ` Stefan Hajnoczi
  2017-04-20 11:22   ` Igor Mammedov
  2 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-04-06 10:25 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Xiao Guangrong, Michael S. Tsirkin, Eduardo Habkost,
	Paolo Bonzini, Igor Mammedov, dan.j.williams, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 2259 bytes --]

On Fri, Mar 31, 2017 at 04:41:47PM +0800, Haozhong Zhang wrote:
> +/*
> + * ACPI 6.0: 5.2.25.7 Flush Hint Address Structure
> + */
> +static void nvdimm_build_structure_flush_hint(GArray *structures,
> +                                              DeviceState *dev,
> +                                              unsigned int cache_line_size,
> +                                              Error **errp)
> +{
> +    NvdimmNfitFlushHintAddress *flush_hint;
> +    int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP, NULL);
> +    PCDIMMDevice *dimm = PC_DIMM(dev);
> +    NVDIMMDevice *nvdimm = NVDIMM(dev);
> +    uint64_t addr;
> +    unsigned int i;
> +    MemoryRegion *mr;
> +    Error *local_err = NULL;
> +
> +    if (!nvdimm->flush_hint_enabled) {
> +        return;
> +    }
> +
> +    if (cache_line_size * NR_FLUSH_HINT_ADDR > dimm->reserved_size) {
> +        error_setg(&local_err,
> +                   "insufficient reserved space for flush hint buffers");
> +        goto out;
> +    }
> +
> +    addr = object_property_get_int(OBJECT(dev), PC_DIMM_ADDR_PROP, NULL);
> +    addr += object_property_get_int(OBJECT(dev), PC_DIMM_SIZE_PROP, NULL);
> +
> +    flush_hint = acpi_data_push(structures, sizeof(*flush_hint));
> +    flush_hint->type = cpu_to_le16(6 /* Flush Hint Address Structure */);
> +    flush_hint->length = cpu_to_le16(sizeof(*flush_hint));
> +    flush_hint->nfit_handle = cpu_to_le32(nvdimm_slot_to_handle(slot));
> +    flush_hint->nr_flush_hint_addr = cpu_to_le16(NR_FLUSH_HINT_ADDR);
> +
> +    for (i = 0; i < NR_FLUSH_HINT_ADDR; i++, addr += cache_line_size) {
> +        flush_hint->flush_hint_addr[i] = cpu_to_le64(addr);
> +
> +        mr = g_new0(MemoryRegion, 1);
> +        memory_region_init_io(mr, OBJECT(dev), &nvdimm_flush_hint_ops, nvdimm,
> +                              "nvdimm-flush-hint", cache_line_size);
> +        memory_region_add_subregion(get_system_memory(), addr, mr);

It would be cleaner to create the memory region in the nvdimm device
code instead of the acpi code.

The acpi code just needs to get the number of flush hints and the MMIO
page address from the nvdimm device.  It doesn't need to create the
memory region object or provide the read/write functions.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-06  9:43 ` Stefan Hajnoczi
@ 2017-04-06 10:31   ` Haozhong Zhang
  2017-04-07 14:38     ` Stefan Hajnoczi
  2017-04-06 12:02   ` Xiao Guangrong
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Haozhong Zhang @ 2017-04-06 10:31 UTC (permalink / raw)
  To: Stefan Hajnoczi, dan.j.williams
  Cc: qemu-devel, Xiao Guangrong, Michael S. Tsirkin, Eduardo Habkost,
	Paolo Bonzini, Igor Mammedov, Richard Henderson

On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > This patch series constructs the flush hint address structures for
> > nvdimm devices in QEMU.
> > 
> > It's of course not for 2.9. I send it out early in order to get
> > comments on one point I'm uncertain (see the detailed explanation
> > below). Thanks for any comments in advance!
> > Background
> > ---------------
> 
> Extra background:
> 
> Flush Hint Addresses are necessary because:
> 
> 1. Some hardware configurations may require them.  In other words, a
>    cache flush instruction is not enough to persist data.
> 
> 2. The host file system may need fsync(2) calls (e.g. to persist
>    metadata changes).
> 
> Without Flush Hint Addresses only some NVDIMM configurations actually
> guarantee data persistence.
> 
> > Flush hint address structure is a substructure of NFIT and specifies
> > one or more addresses, namely Flush Hint Addresses. Software can write
> > to any one of these flush hint addresses to cause any preceding writes
> > to the NVDIMM region to be flushed out of the intervening platform
> > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> 
> Do you have performance data?  I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.
>

I haven't tested how much vNVDIMM performance drops with this patch
series.

I tested the fsycn latency of a regular file on the bare metal by
writing 1 GB random data to a file (on ext4 fs on SSD) and then
performing fsync. The average latency of fsync in that case is 3 ms.
I currently don't have NVDIMM hardware, so I cannot get its latency
data. Anyway, as your comment below, the latency should be larger for
VM.

> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> 
>   wmb();
>   for (i = 0; i < nd_region->ndr_mappings; i++)
>       if (ndrd_get_flush_wpq(ndrd, i, 0))
>           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>   wmb();
> 
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
> 
> This patch implements the MMIO write like this:
> 
>   void nvdimm_flush(NVDIMMDevice *nvdimm)
>   {
>       if (nvdimm->backend_fd != -1) {
>           /*
>            * If the backend store is a physical NVDIMM device, fsync()
>            * will trigger the flush via the flush hint on the host device.
>            */
>           fsync(nvdimm->backend_fd);
>       }
>   }
> 
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
> 
> 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>    instruction has an unexpected and huge latency.
> 
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2).  Other vcpu
>    threads may block if they vmexit.
> 
> It is hard to implement this efficiently in QEMU.  This is why I said
> the hardware interface is not virtualization-friendly.  It's cheap on
> real hardware but expensive under virtualization.
>

I don't have the NVDIMM hardware, so I don't know the latency of
writing to host flush hint address. Dan, do you have any latency data
on the bare metal?

> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU.  But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly.  Users who
> want a block device can use virtio-blk.  I don't think NVDIMM Block
> Window can achieve better performance than virtio-blk under
> virtualization (although I'm happy to be proven wrong).
> 
> Some ideas for a faster implementation:
> 
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex.  Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
>

ACPI spec does not say it allows or disallows multiple writes to the
same flush hint address in parallel. If it can, I think we can remove
the global locking requirement for the MMIO memory region of the flush
hint address of vNVDIMM.

> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not require
>    host kernel interception?  That way QEMU can map the physical
>    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>    is bypassed and performance would be good.
>

It may work if the backend store of vNVDIMM is the physical NVDIMM
region and the latency of writing to host flush hint address is much
cheaper then performing fsync.

However, if the backend store is a regular file, then we still need to
use fsync.

> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.
> 
> Benchmark results would be important for deciding how big the problem
> is.

Let me collect performance data w/ and w/o this patch series.

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address
  2017-04-06 10:24   ` Stefan Hajnoczi
@ 2017-04-06 10:46     ` Haozhong Zhang
  2017-04-07 13:46       ` Stefan Hajnoczi
  0 siblings, 1 reply; 38+ messages in thread
From: Haozhong Zhang @ 2017-04-06 10:46 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, dan.j.williams, Xiao Guangrong, Igor Mammedov,
	Michael S. Tsirkin

On 04/06/17 11:24 +0100, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:44PM +0800, Haozhong Zhang wrote:
> > If option 'reserved-size=RSVD' is present, QEMU will reserve an
> > address range of size 'RSVD' after the ending address of pc-dimm
> > device.
> > 
> > For the following example,
> >     -object memory-backend-file,id=mem0,size=4G,...
> >     -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
> 
> "reserved-size" is not a clear name.  I suggest calling the property
> "num-flush-hints" (default 0).  QEMU can calculate the actual size in
> bytes.
> 
>   -device nvdimm,num-flush-hints=1
> 
> QEMU will use one flush hint and reserve enough space (e.g. 1 page) for
> the MMIO region.
>

Each flush hint can be as small as one cache line size which is also
the size used in this patch series.

We need to calculate the size of all flush hints in pc_dimm_memory_plug(),
because when building the flush hint address structure we need to know
the address of flush hints.

IIUC, pc_dimm_memory_plug() is not specific to x86, so it's better
take a general way to get the vcpu cache line size in pc_dimm_memory_plug(),
which seemingly lacks in QEMU (though I believe it should be doable).

To make things simple, I leave the size decision to users, and check
whether it's large enough when building the flush hint address
structures in patch 4.


Haozhong

> >     -device pc-dimm,id=dimm1,...
> > if dimm0 is allocated to address N ~ N+4G, the address range of
> > dimm1 will start from N+4G+4K or higher.
> > 
> > Its current usage is to reserve spaces for flush hint addresses
> > of nvdimm devices.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > ---
> >  hw/mem/pc-dimm.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++---
> >  include/hw/mem/pc-dimm.h |  2 ++
> >  2 files changed, 47 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
> > index 9e8dab0..13dcd71 100644
> > --- a/hw/mem/pc-dimm.c
> > +++ b/hw/mem/pc-dimm.c
> > @@ -28,6 +28,7 @@
> >  #include "sysemu/kvm.h"
> >  #include "trace.h"
> >  #include "hw/virtio/vhost.h"
> > +#include "exec/address-spaces.h"
> >  
> >  typedef struct pc_dimms_capacity {
> >       uint64_t size;
> > @@ -44,7 +45,12 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
> >      MemoryRegion *vmstate_mr = ddc->get_vmstate_memory_region(dimm);
> >      Error *local_err = NULL;
> >      uint64_t existing_dimms_capacity = 0;
> > -    uint64_t addr;
> > +    uint64_t addr, size = memory_region_size(mr);
> > +
> > +    size += object_property_get_int(OBJECT(dimm), PC_DIMM_RSVD_PROP, &local_err);
> > +    if (local_err) {
> > +        goto out;
> > +    }
> >  
> >      addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, &local_err);
> >      if (local_err) {
> > @@ -54,7 +60,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
> >      addr = pc_dimm_get_free_addr(hpms->base,
> >                                   memory_region_size(&hpms->mr),
> >                                   !addr ? NULL : &addr, align,
> > -                                 memory_region_size(mr), &local_err);
> > +                                 size, &local_err);
> >      if (local_err) {
> >          goto out;
> >      }
> > @@ -64,7 +70,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
> >          goto out;
> >      }
> >  
> > -    if (existing_dimms_capacity + memory_region_size(mr) >
> > +    if (existing_dimms_capacity + size >
> >          machine->maxram_size - machine->ram_size) {
> >          error_setg(&local_err, "not enough space, currently 0x%" PRIx64
> >                     " in use of total hot pluggable 0x" RAM_ADDR_FMT,
> > @@ -315,6 +321,9 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
> >          PCDIMMDevice *dimm = item->data;
> >          uint64_t dimm_size = object_property_get_int(OBJECT(dimm),
> >                                                       PC_DIMM_SIZE_PROP,
> > +                                                     errp) +
> > +                             object_property_get_int(OBJECT(dimm),
> > +                                                     PC_DIMM_RSVD_PROP,
> >                                                       errp);
> >          if (errp && *errp) {
> >              goto out;
> > @@ -382,6 +391,37 @@ static void pc_dimm_check_memdev_is_busy(Object *obj, const char *name,
> >      error_propagate(errp, local_err);
> >  }
> >  
> > +static void pc_dimm_get_reserved_size(Object *obj, Visitor *v, const char *name,
> > +                                      void *opaque, Error **errp)
> > +{
> > +    PCDIMMDevice *dimm = PC_DIMM(obj);
> > +    uint64_t value = dimm->reserved_size;
> > +
> > +    visit_type_size(v, name, &value, errp);
> > +}
> > +
> > +static void pc_dimm_set_reserved_size(Object *obj, Visitor *v, const char *name,
> > +                                      void *opaque, Error **errp)
> > +{
> > +    PCDIMMDevice *dimm = PC_DIMM(obj);
> > +    Error *local_err = NULL;
> > +    uint64_t value;
> > +
> > +    if (dimm->reserved_size) {
> > +        error_setg(&local_err, "cannot change 'reserved-size'");
> > +        goto out;
> > +    }
> > +
> > +    visit_type_size(v, name, &value, &local_err);
> > +    if (local_err) {
> > +        goto out;
> > +    }
> > +    dimm->reserved_size = value;
> > +
> > + out:
> > +    error_propagate(errp, local_err);
> > +}
> > +
> >  static void pc_dimm_init(Object *obj)
> >  {
> >      PCDIMMDevice *dimm = PC_DIMM(obj);
> > @@ -393,6 +433,8 @@ static void pc_dimm_init(Object *obj)
> >                               pc_dimm_check_memdev_is_busy,
> >                               OBJ_PROP_LINK_UNREF_ON_RELEASE,
> >                               &error_abort);
> > +    object_property_add(obj, PC_DIMM_RSVD_PROP, "int", pc_dimm_get_reserved_size,
> > +                        pc_dimm_set_reserved_size, NULL, NULL, &error_abort);
> >  }
> >  
> >  static void pc_dimm_realize(DeviceState *dev, Error **errp)
> > diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
> > index 1e483f2..99c4dfd 100644
> > --- a/include/hw/mem/pc-dimm.h
> > +++ b/include/hw/mem/pc-dimm.h
> > @@ -33,6 +33,7 @@
> >  #define PC_DIMM_NODE_PROP "node"
> >  #define PC_DIMM_SIZE_PROP "size"
> >  #define PC_DIMM_MEMDEV_PROP "memdev"
> > +#define PC_DIMM_RSVD_PROP "reserved-size"
> >  
> >  #define PC_DIMM_UNASSIGNED_SLOT -1
> >  
> > @@ -53,6 +54,7 @@ typedef struct PCDIMMDevice {
> >      uint64_t addr;
> >      uint32_t node;
> >      int32_t slot;
> > +    uint64_t reserved_size;
> >      HostMemoryBackend *hostmem;
> >  } PCDIMMDevice;
> >  
> > -- 
> > 2.10.1
> > 
> > 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required
  2017-04-06 10:13   ` Stefan Hajnoczi
@ 2017-04-06 10:53     ` Haozhong Zhang
  2017-04-07 14:41       ` Stefan Hajnoczi
  2017-04-07 15:51     ` Dan Williams
  1 sibling, 1 reply; 38+ messages in thread
From: Haozhong Zhang @ 2017-04-06 10:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Xiao Guangrong, Michael S. Tsirkin, Eduardo Habkost,
	Paolo Bonzini, Igor Mammedov, dan.j.williams, Richard Henderson

On 04/06/17 11:13 +0100, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:47PM +0800, Haozhong Zhang wrote:
> > 
> > Add an boolean option 'flush-hint' to device 'nvdimm'. If it's on, a
> > flush hint address structure will be constructed for each nvdimm
> > device.
> 
> Users should not need to set the flush hint option.  NVDIMM
> configurations that persist data properly without Flush Hint Addresses
> shouldn't use them (for best performance).  Configurations that rely on
> flush hints *must* use them to guarantee data integrity.

It's for backwards compatibility, i.e. migrating a VM on QEMU w/o
flush hint support to another one w/ flush hint support. By using a
flush-hint option and making it disabled by default, users can ensure
both QEMU provide the same VM configuration.

Haozhong

> 
> I don't remember if there's a way to detect the configuration from host
> userspace, but we should focus on setting this correctly without
> requiring users to know which setting is necessary.
> 
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > ---
> >  hw/acpi/nvdimm.c        | 106 +++++++++++++++++++++++++++++++++++++++++++++---
> >  hw/i386/pc.c            |   5 ++-
> >  hw/mem/nvdimm.c         |  26 ++++++++++++
> >  include/hw/mem/nvdimm.h |   2 +-
> >  4 files changed, 132 insertions(+), 7 deletions(-)
> > 
> > diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> > index ea2ac3e..a7ff0b2 100644
> > --- a/hw/acpi/nvdimm.c
> > +++ b/hw/acpi/nvdimm.c
> > @@ -32,6 +32,8 @@
> >  #include "hw/acpi/bios-linker-loader.h"
> >  #include "hw/nvram/fw_cfg.h"
> >  #include "hw/mem/nvdimm.h"
> > +#include "exec/address-spaces.h"
> > +#include "qapi/error.h"
> >  
> >  static int nvdimm_device_list(Object *obj, void *opaque)
> >  {
> > @@ -168,6 +170,22 @@ struct NvdimmNfitControlRegion {
> >  typedef struct NvdimmNfitControlRegion NvdimmNfitControlRegion;
> >  
> >  /*
> > + * NVDIMM Flush Hint Address Structure
> > + *
> > + * It enables the data durability mechanism via ACPI.
> > + */
> > +struct NvdimmNfitFlushHintAddress {
> > +    uint16_t type;
> > +    uint16_t length;
> > +    uint32_t nfit_handle;
> > +    uint16_t nr_flush_hint_addr;
> > +    uint8_t  reserved[6];
> > +#define NR_FLUSH_HINT_ADDR 1
> > +    uint64_t flush_hint_addr[NR_FLUSH_HINT_ADDR];
> 
> How should the number of flush hints be set?  This patch hardcodes it to
> 1 but the Linux nvdimm drivers tries to balance between flush hint
> addresses to improve performance (to prevent cache line bouncing?).
> 
> > +} QEMU_PACKED;
> > +typedef struct NvdimmNfitFlushHintAddress NvdimmNfitFlushHintAddress;
> > +
> > +/*
> >   * Module serial number is a unique number for each device. We use the
> >   * slot id of NVDIMM device to generate this number so that each device
> >   * associates with a different number.
> > @@ -343,10 +361,79 @@ static void nvdimm_build_structure_dcr(GArray *structures, DeviceState *dev)
> >                                           (DSM) in DSM Spec Rev1.*/);
> >  }
> >  
> > -static GArray *nvdimm_build_device_structure(void)
> > +static uint64_t nvdimm_flush_hint_read(void *opaque, hwaddr addr, unsigned size)
> > +{
> > +    return 0;
> > +}
> > +
> > +static void nvdimm_flush_hint_write(void *opaque, hwaddr addr,
> > +                                    uint64_t data, unsigned size)
> > +{
> > +    nvdimm_debug("Write Flush Hint: offset 0x%"HWADDR_PRIx", data 0x%"PRIx64"\n",
> > +                 addr, data);
> > +    nvdimm_flush((NVDIMMDevice *)opaque);
> 
> C automatically casts void * to any other pointer type.  The cast is
> unnecessary.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-06  9:58   ` Haozhong Zhang
@ 2017-04-06 11:46     ` Xiao Guangrong
  0 siblings, 0 replies; 38+ messages in thread
From: Xiao Guangrong @ 2017-04-06 11:46 UTC (permalink / raw)
  To: qemu-devel, dan.j.williams, Michael S. Tsirkin, Igor Mammedov,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost



On 04/06/2017 05:58 PM, Haozhong Zhang wrote:
> On 04/06/17 17:39 +0800, Xiao Guangrong wrote:
>>
>>
>> On 31/03/2017 4:41 PM, Haozhong Zhang wrote:
>>> This patch series constructs the flush hint address structures for
>>> nvdimm devices in QEMU.
>>>
>>> It's of course not for 2.9. I send it out early in order to get
>>> comments on one point I'm uncertain (see the detailed explanation
>>> below). Thanks for any comments in advance!
>>>
>>>
>>> Background
>>> ---------------
>>> Flush hint address structure is a substructure of NFIT and specifies
>>> one or more addresses, namely Flush Hint Addresses. Software can write
>>> to any one of these flush hint addresses to cause any preceding writes
>>> to the NVDIMM region to be flushed out of the intervening platform
>>> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>>> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>>>
>>>
>>> Why is it RFC?
>>> ---------------
>>> RFC is added because I'm not sure whether the way in this patch series
>>> that allocates the guest flush hint addresses is right.
>>>
>>> QEMU needs to trap guest accesses (at least for writes) to the flush
>>> hint addresses in order to perform the necessary flush on the host
>>> back store. Therefore, QEMU needs to create IO memory regions that
>>> cover those flush hint addresses. In order to create those IO memory
>>> regions, QEMU needs to know the flush hint addresses or their offsets
>>> to other known memory regions in advance. So far looks good.
>>>
>>> Flush hint addresses are in the guest address space. Looking at how
>>> the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural
>>> to take the same way for flush hint addresses, i.e. let the guest
>>> firmware allocate from free addresses and patch them in the flush hint
>>> address structure. (*Please correct me If my following understand is wrong*)
>>> However, the current allocation and pointer patching are transparent
>>> to QEMU, so QEMU will be unaware of the flush hint addresses, and
>>> consequently have no way to create corresponding IO memory regions in
>>> order to trap guest accesses.
>>
>> Er, it is awkward and flush-hint-table is static which may not be
>> easily patched.
>>
>>>
>>> Alternatively, this patch series moves the allocation of flush hint
>>> addresses to QEMU:
>>>
>>> 1. (Patch 1) We reserve an address range after the end address of each
>>>    nvdimm device. Its size is specified by the user via a new pc-dimm
>>>    option 'reserved-size'.
>>>
>>
>> We should make it only work for nvdimm?
>>
>
> Yes, we can check whether the machine option 'nvdimm' is present when
> plugging the nvdimm.
>
>>>    For the following example,
>>>         -object memory-backend-file,id=mem0,size=4G,...
>>>         -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
>>>         -device pc-dimm,id=dimm1,...
>>>    if dimm0 is allocated to address N ~ N+4G, the address of dimm1
>>>    will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
>>>    dimm0.
>>>
>>> 2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
>>>    structure for each nvdimm device, it will allocate them from the
>>>    above reserved area, e.g. the flush hint addresses of above dimm0
>>>    are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
>>>    this way, so QEMU can easily create IO memory regions for them.
>>>
>>>    If the reserved area is not present or too small, QEMU will report
>>>    errors.
>>>
>>
>> We should make 'reserved-size' always be page-aligned and should be
>> transparent to the user, i.e, automatically reserve 4k if 'flush-hint'
>> is specified?
>>
>
> 4K alignment is already enforced by current memory plug code.
>
> About the automatic reservation, is a non-zero default value
> acceptable by qemu design/convention in general?

Needn't make it as a user-visible parameter, just a field contained in
dimm-dev struct or nvdimm-dev struct indicates the reserved size is
okay.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address Haozhong Zhang
  2017-04-06 10:24   ` Stefan Hajnoczi
@ 2017-04-06 11:50   ` Xiao Guangrong
  1 sibling, 0 replies; 38+ messages in thread
From: Xiao Guangrong @ 2017-04-06 11:50 UTC (permalink / raw)
  To: Haozhong Zhang, qemu-devel
  Cc: dan.j.williams, Xiao Guangrong, Igor Mammedov, Michael S. Tsirkin



On 03/31/2017 04:41 PM, Haozhong Zhang wrote:
> If option 'reserved-size=RSVD' is present, QEMU will reserve an
> address range of size 'RSVD' after the ending address of pc-dimm
> device.
>
> For the following example,
>     -object memory-backend-file,id=mem0,size=4G,...
>     -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
>     -device pc-dimm,id=dimm1,...
> if dimm0 is allocated to address N ~ N+4G, the address range of
> dimm1 will start from N+4G+4K or higher.
>
> Its current usage is to reserve spaces for flush hint addresses
> of nvdimm devices.
>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  hw/mem/pc-dimm.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++---
>  include/hw/mem/pc-dimm.h |  2 ++
>  2 files changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
> index 9e8dab0..13dcd71 100644
> --- a/hw/mem/pc-dimm.c
> +++ b/hw/mem/pc-dimm.c
> @@ -28,6 +28,7 @@
>  #include "sysemu/kvm.h"
>  #include "trace.h"
>  #include "hw/virtio/vhost.h"
> +#include "exec/address-spaces.h"
>
>  typedef struct pc_dimms_capacity {
>       uint64_t size;
> @@ -44,7 +45,12 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
>      MemoryRegion *vmstate_mr = ddc->get_vmstate_memory_region(dimm);
>      Error *local_err = NULL;
>      uint64_t existing_dimms_capacity = 0;
> -    uint64_t addr;
> +    uint64_t addr, size = memory_region_size(mr);
> +
> +    size += object_property_get_int(OBJECT(dimm), PC_DIMM_RSVD_PROP, &local_err);
> +    if (local_err) {
> +        goto out;
> +    }
>
>      addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, &local_err);
>      if (local_err) {
> @@ -54,7 +60,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
>      addr = pc_dimm_get_free_addr(hpms->base,
>                                   memory_region_size(&hpms->mr),
>                                   !addr ? NULL : &addr, align,
> -                                 memory_region_size(mr), &local_err);
> +                                 size, &local_err);
>      if (local_err) {
>          goto out;
>      }
> @@ -64,7 +70,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
>          goto out;
>      }
>
> -    if (existing_dimms_capacity + memory_region_size(mr) >
> +    if (existing_dimms_capacity + size >
>          machine->maxram_size - machine->ram_size) {
>          error_setg(&local_err, "not enough space, currently 0x%" PRIx64
>                     " in use of total hot pluggable 0x" RAM_ADDR_FMT,
> @@ -315,6 +321,9 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
>          PCDIMMDevice *dimm = item->data;
>          uint64_t dimm_size = object_property_get_int(OBJECT(dimm),
>                                                       PC_DIMM_SIZE_PROP,
> +                                                     errp) +
> +                             object_property_get_int(OBJECT(dimm),
> +                                                     PC_DIMM_RSVD_PROP,
>                                                       errp);
>          if (errp && *errp) {
>              goto out;
> @@ -382,6 +391,37 @@ static void pc_dimm_check_memdev_is_busy(Object *obj, const char *name,
>      error_propagate(errp, local_err);
>  }
>

Should count the reserved size in pc_existing_dimms_capacity_internal()
too.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store Haozhong Zhang
@ 2017-04-06 11:52   ` Xiao Guangrong
  2017-04-11  8:22     ` Haozhong Zhang
  2017-04-20 13:12   ` Igor Mammedov
  1 sibling, 1 reply; 38+ messages in thread
From: Xiao Guangrong @ 2017-04-06 11:52 UTC (permalink / raw)
  To: Haozhong Zhang, qemu-devel
  Cc: dan.j.williams, Xiao Guangrong, Igor Mammedov, Michael S. Tsirkin



On 03/31/2017 04:41 PM, Haozhong Zhang wrote:
> fsync() is used to persist modifications to the back store. If the
> host NVDIMM is used as the back store, fsync() on Linux will trigger
> the write to the host flush hint address.
>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  hw/mem/nvdimm.c         | 22 ++++++++++++++++++++++
>  include/hw/mem/nvdimm.h | 13 +++++++++++++
>  2 files changed, 35 insertions(+)
>
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index db896b0..484ab8b 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -78,6 +78,26 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
>      return &nvdimm->nvdimm_mr;
>  }
>
> +static void nvdimm_flush_init(NVDIMMDevice *nvdimm, MemoryRegion *hostmem_mr)
> +{
> +    if (nvdimm->flush_hint_enabled) {
> +        nvdimm->backend_fd = memory_region_get_fd(hostmem_mr);

Hmm, IIRC host-mem-file does not initalize backend_fd at all.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-06  9:43 ` Stefan Hajnoczi
  2017-04-06 10:31   ` Haozhong Zhang
@ 2017-04-06 12:02   ` Xiao Guangrong
  2017-04-11  8:41     ` Haozhong Zhang
  2017-04-06 14:32   ` Dan Williams
  2017-04-11  6:34   ` Haozhong Zhang
  3 siblings, 1 reply; 38+ messages in thread
From: Xiao Guangrong @ 2017-04-06 12:02 UTC (permalink / raw)
  To: Stefan Hajnoczi, Haozhong Zhang
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Igor Mammedov,
	Paolo Bonzini, dan.j.williams, Richard Henderson, Xiao Guangrong



On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
>> This patch series constructs the flush hint address structures for
>> nvdimm devices in QEMU.
>>
>> It's of course not for 2.9. I send it out early in order to get
>> comments on one point I'm uncertain (see the detailed explanation
>> below). Thanks for any comments in advance!
>> Background
>> ---------------
>
> Extra background:
>
> Flush Hint Addresses are necessary because:
>
> 1. Some hardware configurations may require them.  In other words, a
>    cache flush instruction is not enough to persist data.
>
> 2. The host file system may need fsync(2) calls (e.g. to persist
>    metadata changes).
>
> Without Flush Hint Addresses only some NVDIMM configurations actually
> guarantee data persistence.
>
>> Flush hint address structure is a substructure of NFIT and specifies
>> one or more addresses, namely Flush Hint Addresses. Software can write
>> to any one of these flush hint addresses to cause any preceding writes
>> to the NVDIMM region to be flushed out of the intervening platform
>> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>
> Do you have performance data?  I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.
>
> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>
>   wmb();
>   for (i = 0; i < nd_region->ndr_mappings; i++)
>       if (ndrd_get_flush_wpq(ndrd, i, 0))
>           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>   wmb();
>
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
>
> This patch implements the MMIO write like this:
>
>   void nvdimm_flush(NVDIMMDevice *nvdimm)
>   {
>       if (nvdimm->backend_fd != -1) {
>           /*
>            * If the backend store is a physical NVDIMM device, fsync()
>            * will trigger the flush via the flush hint on the host device.
>            */
>           fsync(nvdimm->backend_fd);
>       }
>   }
>
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
>
> 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>    instruction has an unexpected and huge latency.
>
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2).  Other vcpu
>    threads may block if they vmexit.
>
> It is hard to implement this efficiently in QEMU.  This is why I said
> the hardware interface is not virtualization-friendly.  It's cheap on
> real hardware but expensive under virtualization.
>
> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU.  But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly.  Users who
> want a block device can use virtio-blk.  I don't think NVDIMM Block
> Window can achieve better performance than virtio-blk under
> virtualization (although I'm happy to be proven wrong).
>
> Some ideas for a faster implementation:
>
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex.  Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
>
> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not require
>    host kernel interception?  That way QEMU can map the physical
>    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>    is bypassed and performance would be good.
>
> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.

Good point.

We can assume flush-CPU-cache-to-make-persistence is always
available on Intel's hardware so that flush-hint-table is not
needed if the vNVDIMM is based on a real Intel's NVDIMM device.

If the vNVDIMM device is based on the regular file, i think
fsync is the bottleneck rather than this mmio-virtualization. :(

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-06  9:43 ` Stefan Hajnoczi
  2017-04-06 10:31   ` Haozhong Zhang
  2017-04-06 12:02   ` Xiao Guangrong
@ 2017-04-06 14:32   ` Dan Williams
  2017-04-07 14:31     ` Stefan Hajnoczi
  2017-04-11  6:34   ` Haozhong Zhang
  3 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2017-04-06 14:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Haozhong Zhang, qemu-devel, Xiao Guangrong, Michael S. Tsirkin,
	Eduardo Habkost, Paolo Bonzini, Igor Mammedov, Richard Henderson

On Thu, Apr 6, 2017 at 2:43 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
>> This patch series constructs the flush hint address structures for
>> nvdimm devices in QEMU.
>>
>> It's of course not for 2.9. I send it out early in order to get
>> comments on one point I'm uncertain (see the detailed explanation
>> below). Thanks for any comments in advance!
>> Background
>> ---------------
>
> Extra background:
>
> Flush Hint Addresses are necessary because:
>
> 1. Some hardware configurations may require them.  In other words, a
>    cache flush instruction is not enough to persist data.
>
> 2. The host file system may need fsync(2) calls (e.g. to persist
>    metadata changes).
>
> Without Flush Hint Addresses only some NVDIMM configurations actually
> guarantee data persistence.
>
>> Flush hint address structure is a substructure of NFIT and specifies
>> one or more addresses, namely Flush Hint Addresses. Software can write
>> to any one of these flush hint addresses to cause any preceding writes
>> to the NVDIMM region to be flushed out of the intervening platform
>> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>
> Do you have performance data?  I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.
>
> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>
>   wmb();
>   for (i = 0; i < nd_region->ndr_mappings; i++)
>       if (ndrd_get_flush_wpq(ndrd, i, 0))
>           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>   wmb();
>
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
>
> This patch implements the MMIO write like this:
>
>   void nvdimm_flush(NVDIMMDevice *nvdimm)
>   {
>       if (nvdimm->backend_fd != -1) {
>           /*
>            * If the backend store is a physical NVDIMM device, fsync()
>            * will trigger the flush via the flush hint on the host device.
>            */
>           fsync(nvdimm->backend_fd);
>       }
>   }
>
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
>
> 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>    instruction has an unexpected and huge latency.
>
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2).  Other vcpu
>    threads may block if they vmexit.
>
> It is hard to implement this efficiently in QEMU.  This is why I said
> the hardware interface is not virtualization-friendly.  It's cheap on
> real hardware but expensive under virtualization.
>
> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU.  But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly.  Users who
> want a block device can use virtio-blk.  I don't think NVDIMM Block
> Window can achieve better performance than virtio-blk under
> virtualization (although I'm happy to be proven wrong).
>
> Some ideas for a faster implementation:
>
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex.  Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
>
> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not require
>    host kernel interception?  That way QEMU can map the physical
>    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>    is bypassed and performance would be good.
>
> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.

A few thoughts:

* You don't need to call fsync in the case that the host file is a
/dev/daxX.Y device. Since there is no intervening filesystem.

* While the host implementation is fast it still is not meant to be
called very frequently. We could also look to make a paravirtualized
way for it complete asynchronously if the guest just wants to post a
flush request and get notified of a completion at a later time.

* We're still actively looking for ways to make it safe and efficient
to minimize sync overhead if an mmap write did not dirty filesystem
data, so hopefully we can make that fsync call overhead smaller over
time.

* I don't think we can just skip implementing support for this flush.
Yes it's virtualization unfriendly, but it completely defeats the
persistence of persistent memory if the host filesystem is free to
throw away writes that the guest expects to find after a surprise
power loss.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address
  2017-04-06 10:46     ` Haozhong Zhang
@ 2017-04-07 13:46       ` Stefan Hajnoczi
  2017-04-11  8:57         ` Haozhong Zhang
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 13:46 UTC (permalink / raw)
  To: qemu-devel, dan.j.williams, Xiao Guangrong, Igor Mammedov,
	Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 1731 bytes --]

On Thu, Apr 06, 2017 at 06:46:49PM +0800, Haozhong Zhang wrote:
> On 04/06/17 11:24 +0100, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:44PM +0800, Haozhong Zhang wrote:
> > > If option 'reserved-size=RSVD' is present, QEMU will reserve an
> > > address range of size 'RSVD' after the ending address of pc-dimm
> > > device.
> > > 
> > > For the following example,
> > >     -object memory-backend-file,id=mem0,size=4G,...
> > >     -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
> > 
> > "reserved-size" is not a clear name.  I suggest calling the property
> > "num-flush-hints" (default 0).  QEMU can calculate the actual size in
> > bytes.
> > 
> >   -device nvdimm,num-flush-hints=1
> > 
> > QEMU will use one flush hint and reserve enough space (e.g. 1 page) for
> > the MMIO region.
> >
> 
> Each flush hint can be as small as one cache line size which is also
> the size used in this patch series.
> 
> We need to calculate the size of all flush hints in pc_dimm_memory_plug(),
> because when building the flush hint address structure we need to know
> the address of flush hints.
> 
> IIUC, pc_dimm_memory_plug() is not specific to x86, so it's better
> take a general way to get the vcpu cache line size in pc_dimm_memory_plug(),
> which seemingly lacks in QEMU (though I believe it should be doable).
> 
> To make things simple, I leave the size decision to users, and check
> whether it's large enough when building the flush hint address
> structures in patch 4.

Do you see my concern that "reserved-size" is not a good property?

 * How does the user choose a sensible value?

 * Why is it called "reserved-size" instead of "flush-hints-size"?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-06 14:32   ` Dan Williams
@ 2017-04-07 14:31     ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 14:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: Haozhong Zhang, qemu-devel, Xiao Guangrong, Michael S. Tsirkin,
	Eduardo Habkost, Paolo Bonzini, Igor Mammedov, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 1063 bytes --]

On Thu, Apr 06, 2017 at 07:32:01AM -0700, Dan Williams wrote:
> On Thu, Apr 6, 2017 at 2:43 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> * I don't think we can just skip implementing support for this flush.
> Yes it's virtualization unfriendly, but it completely defeats the
> persistence of persistent memory if the host filesystem is free to
> throw away writes that the guest expects to find after a surprise
> power loss.

Right, if we don't implement Address Flush Hints then the only safe
option for configurations that require them would be to print an error
and refuse to start.

This synchronous fsync(2) business is like taking a host page fault or
being scheduled out by the host scheduler.  It means the vcpu is stuck
for a little bit.  Hopefully not long enough to upset the guest kernel
or the application.

One of the things that guest software can do to help minimize the
performance impact is to avoid holding locks across the address flush
mmio write instruction.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-06 10:31   ` Haozhong Zhang
@ 2017-04-07 14:38     ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 14:38 UTC (permalink / raw)
  To: dan.j.williams, qemu-devel, Xiao Guangrong, Michael S. Tsirkin,
	Eduardo Habkost, Paolo Bonzini, Igor Mammedov, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 1478 bytes --]

On Thu, Apr 06, 2017 at 06:31:17PM +0800, Haozhong Zhang wrote:
> On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > We should think about the optimal way of implementing Flush Hint
> > Addresses in QEMU.  But if there is no reasonable way to implement them
> > then I think it's better *not* to implement them, just like the Block
> > Window feature which is also not virtualization-friendly.  Users who
> > want a block device can use virtio-blk.  I don't think NVDIMM Block
> > Window can achieve better performance than virtio-blk under
> > virtualization (although I'm happy to be proven wrong).
> > 
> > Some ideas for a faster implementation:
> > 
> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >    global mutex.  Little synchronization is necessary as long as the
> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
> >
> 
> ACPI spec does not say it allows or disallows multiple writes to the
> same flush hint address in parallel. If it can, I think we can remove
> the global locking requirement for the MMIO memory region of the flush
> hint address of vNVDIMM.

The Linux code tries to spread the writes but two CPUs can write to the
same address sometimes.

It doesn't matter if two vcpus access the same address because QEMU just
invokes fsync(2).  The host kernel has synchronization to make the fsync
safe.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required
  2017-04-06 10:53     ` Haozhong Zhang
@ 2017-04-07 14:41       ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 14:41 UTC (permalink / raw)
  To: qemu-devel, Xiao Guangrong, Michael S. Tsirkin, Eduardo Habkost,
	Paolo Bonzini, Igor Mammedov, dan.j.williams, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 1122 bytes --]

On Thu, Apr 06, 2017 at 06:53:09PM +0800, Haozhong Zhang wrote:
> On 04/06/17 11:13 +0100, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:47PM +0800, Haozhong Zhang wrote:
> > > 
> > > Add an boolean option 'flush-hint' to device 'nvdimm'. If it's on, a
> > > flush hint address structure will be constructed for each nvdimm
> > > device.
> > 
> > Users should not need to set the flush hint option.  NVDIMM
> > configurations that persist data properly without Flush Hint Addresses
> > shouldn't use them (for best performance).  Configurations that rely on
> > flush hints *must* use them to guarantee data integrity.
> 
> It's for backwards compatibility, i.e. migrating a VM on QEMU w/o
> flush hint support to another one w/ flush hint support. By using a
> flush-hint option and making it disabled by default, users can ensure
> both QEMU provide the same VM configuration.

I think QEMU should play a role in deciding whether to use Address Flush
Hints or not.

We should not require the user to set a sensible value.  If they get it
wrong then they may suffer data loss!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required
  2017-04-06 10:13   ` Stefan Hajnoczi
  2017-04-06 10:53     ` Haozhong Zhang
@ 2017-04-07 15:51     ` Dan Williams
  1 sibling, 0 replies; 38+ messages in thread
From: Dan Williams @ 2017-04-07 15:51 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Haozhong Zhang, qemu-devel, Xiao Guangrong, Michael S. Tsirkin,
	Eduardo Habkost, Paolo Bonzini, Igor Mammedov, Richard Henderson

On Thu, Apr 6, 2017 at 3:13 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 04:41:47PM +0800, Haozhong Zhang wrote:
>>
>> Add an boolean option 'flush-hint' to device 'nvdimm'. If it's on, a
>> flush hint address structure will be constructed for each nvdimm
>> device.
>
> Users should not need to set the flush hint option.  NVDIMM
> configurations that persist data properly without Flush Hint Addresses
> shouldn't use them (for best performance).  Configurations that rely on
> flush hints *must* use them to guarantee data integrity.
>
> I don't remember if there's a way to detect the configuration from host
> userspace, but we should focus on setting this correctly without
> requiring users to know which setting is necessary.
>
>> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
>> ---
>>  hw/acpi/nvdimm.c        | 106 +++++++++++++++++++++++++++++++++++++++++++++---
>>  hw/i386/pc.c            |   5 ++-
>>  hw/mem/nvdimm.c         |  26 ++++++++++++
>>  include/hw/mem/nvdimm.h |   2 +-
>>  4 files changed, 132 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
>> index ea2ac3e..a7ff0b2 100644
>> --- a/hw/acpi/nvdimm.c
>> +++ b/hw/acpi/nvdimm.c
>> @@ -32,6 +32,8 @@
>>  #include "hw/acpi/bios-linker-loader.h"
>>  #include "hw/nvram/fw_cfg.h"
>>  #include "hw/mem/nvdimm.h"
>> +#include "exec/address-spaces.h"
>> +#include "qapi/error.h"
>>
>>  static int nvdimm_device_list(Object *obj, void *opaque)
>>  {
>> @@ -168,6 +170,22 @@ struct NvdimmNfitControlRegion {
>>  typedef struct NvdimmNfitControlRegion NvdimmNfitControlRegion;
>>
>>  /*
>> + * NVDIMM Flush Hint Address Structure
>> + *
>> + * It enables the data durability mechanism via ACPI.
>> + */
>> +struct NvdimmNfitFlushHintAddress {
>> +    uint16_t type;
>> +    uint16_t length;
>> +    uint32_t nfit_handle;
>> +    uint16_t nr_flush_hint_addr;
>> +    uint8_t  reserved[6];
>> +#define NR_FLUSH_HINT_ADDR 1
>> +    uint64_t flush_hint_addr[NR_FLUSH_HINT_ADDR];
>
> How should the number of flush hints be set?  This patch hardcodes it to
> 1 but the Linux nvdimm drivers tries to balance between flush hint
> addresses to improve performance (to prevent cache line bouncing?).

No, 1 should be fine for qemu. The reason for multiple flush hints is
to accommodate multiple flush queues in the hardware. Since each flush
takes some amount of time you can get more parallelism  if multiple
threads are waiting for the same flush event, rather than being queued
serially. I don't think you can realize the same parallelism with
fsync() calls.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-06  9:43 ` Stefan Hajnoczi
                     ` (2 preceding siblings ...)
  2017-04-06 14:32   ` Dan Williams
@ 2017-04-11  6:34   ` Haozhong Zhang
  2017-04-18 10:15     ` Stefan Hajnoczi
  3 siblings, 1 reply; 38+ messages in thread
From: Haozhong Zhang @ 2017-04-11  6:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Xiao Guangrong, Michael S. Tsirkin, Eduardo Habkost,
	Paolo Bonzini, Igor Mammedov, dan.j.williams, Richard Henderson

On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > This patch series constructs the flush hint address structures for
> > nvdimm devices in QEMU.
> > 
> > It's of course not for 2.9. I send it out early in order to get
> > comments on one point I'm uncertain (see the detailed explanation
> > below). Thanks for any comments in advance!
> > Background
> > ---------------
> 
> Extra background:
> 
> Flush Hint Addresses are necessary because:
> 
> 1. Some hardware configurations may require them.  In other words, a
>    cache flush instruction is not enough to persist data.
> 
> 2. The host file system may need fsync(2) calls (e.g. to persist
>    metadata changes).
> 
> Without Flush Hint Addresses only some NVDIMM configurations actually
> guarantee data persistence.
> 
> > Flush hint address structure is a substructure of NFIT and specifies
> > one or more addresses, namely Flush Hint Addresses. Software can write
> > to any one of these flush hint addresses to cause any preceding writes
> > to the NVDIMM region to be flushed out of the intervening platform
> > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> 
> Do you have performance data?  I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.

Some performance data below.

Host HW config:
  CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz x 2 sockets w/ HT enabled
  MEM: 64 GB

  As I don't have NVDIMM hardware, so I use files in ext4 fs on a
  normal SATA SSD as the back storage of vNVDIMM.


Host SW config:
  Kernel: 4.10.1
  QEMU: commit ea2afcf with this patch series applied.


Guest config:
  For flush hint enabled case, the following QEMU options are used
    -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
    -m 4G,slots=4,maxmem=128G \
    -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
    -device nvdimm,id=nv1,memdev=mem1,reserved-size=4K,flush-hint \
    -hda GUEST_DISK_IMG -serial pty

  For flush hint disabled case, the following QEMU options are used
    -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
    -m 4G,slots=4,maxmem=128G \
    -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
    -device nvdimm,id=nv1,memdev=mem1 \
    -hda GUEST_DISK_IMG -serial pty

  nvm-img used above is created in ext4 fs on the host SSD by
    dd if=/dev/zero of=nvm-img bs=1G count=8

  Guest kernel: 4.11.0-rc4


Benchmark in guest:
  mkfs.ext4 /dev/pmem0
  mount -o dax /dev/pmem0 /mnt
  dd if=/dev/zero of=/mnt/data bs=1G count=7 # warm up EPT mapping
  rm /mnt/data                               #
  dd if=/dev/zero of=/mnt/data bs=1G count=7

  and record the write speed reported by the last 'dd' command.


Result:
  - Flush hint disabled
    Vary from 161 MB/s to 708 MB/s, depending on how many fs/device
    flush operations are performed on the host side during the guest
    'dd'.

  - Flush hint enabled
  
    Vary from 164 MB/s to 546 MB/s, depending on how long fsync() in
    QEMU takes. Usually, there is at least one fsync() during one 'dd'
    command that takes several seconds (the worst one takes 39 s).

    To be worse, during those long host-side fsync() operations, guest
    kernel complained stalls.


Some thoughts:

- If the non-NVDIMM hardware is used as the back store of vNVDIMM,
  QEMU may perform the host-side flush operations asynchronously with
  VM, which will not block VM too long but sacrifice the durability
  guarantee.

- If physical NVDIMM is used as the back store and ADR is supported on
  the host, QEMU can rely on ADR to guarantee the data durability and
  will not need to emulate flush hint for guest.

- If physical NVDIMM is used as the back store and ADR is not
  supported on the host, QEMU will still need to emulate flush hint
  for guest and need to use a fast approach other than fsync() to
  trigger writes to host flush hint.

  Could kernel expose an interface to allow the userland (i.e. QEMU in
  this case) to directly write to the flush hint of a NVDIMM region?


Haozhong

> 
> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> 
>   wmb();
>   for (i = 0; i < nd_region->ndr_mappings; i++)
>       if (ndrd_get_flush_wpq(ndrd, i, 0))
>           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>   wmb();
> 
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
> 
> This patch implements the MMIO write like this:
> 
>   void nvdimm_flush(NVDIMMDevice *nvdimm)
>   {
>       if (nvdimm->backend_fd != -1) {
>           /*
>            * If the backend store is a physical NVDIMM device, fsync()
>            * will trigger the flush via the flush hint on the host device.
>            */
>           fsync(nvdimm->backend_fd);
>       }
>   }
> 
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
> 
> 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>    instruction has an unexpected and huge latency.
> 
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2).  Other vcpu
>    threads may block if they vmexit.
> 
> It is hard to implement this efficiently in QEMU.  This is why I said
> the hardware interface is not virtualization-friendly.  It's cheap on
> real hardware but expensive under virtualization.
> 
> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU.  But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly.  Users who
> want a block device can use virtio-blk.  I don't think NVDIMM Block
> Window can achieve better performance than virtio-blk under
> virtualization (although I'm happy to be proven wrong).
> 
> Some ideas for a faster implementation:
> 
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex.  Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
> 
> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not require
>    host kernel interception?  That way QEMU can map the physical
>    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>    is bypassed and performance would be good.
> 
> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.
> 
> Benchmark results would be important for deciding how big the problem
> is.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store
  2017-04-06 11:52   ` Xiao Guangrong
@ 2017-04-11  8:22     ` Haozhong Zhang
  2017-04-11  8:29       ` Haozhong Zhang
  0 siblings, 1 reply; 38+ messages in thread
From: Haozhong Zhang @ 2017-04-11  8:22 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: qemu-devel, dan.j.williams, Xiao Guangrong, Igor Mammedov,
	Michael S. Tsirkin

On 04/06/17 19:52 +0800, Xiao Guangrong wrote:
> 
> 
> On 03/31/2017 04:41 PM, Haozhong Zhang wrote:
> > fsync() is used to persist modifications to the back store. If the
> > host NVDIMM is used as the back store, fsync() on Linux will trigger
> > the write to the host flush hint address.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > ---
> >  hw/mem/nvdimm.c         | 22 ++++++++++++++++++++++
> >  include/hw/mem/nvdimm.h | 13 +++++++++++++
> >  2 files changed, 35 insertions(+)
> > 
> > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> > index db896b0..484ab8b 100644
> > --- a/hw/mem/nvdimm.c
> > +++ b/hw/mem/nvdimm.c
> > @@ -78,6 +78,26 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
> >      return &nvdimm->nvdimm_mr;
> >  }
> > 
> > +static void nvdimm_flush_init(NVDIMMDevice *nvdimm, MemoryRegion *hostmem_mr)
> > +{
> > +    if (nvdimm->flush_hint_enabled) {
> > +        nvdimm->backend_fd = memory_region_get_fd(hostmem_mr);
> 
> Hmm, IIRC host-mem-file does not initalize backend_fd at all.

Oops, forgot to add this part. Thanks for remind.

Haozhong

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store
  2017-04-11  8:22     ` Haozhong Zhang
@ 2017-04-11  8:29       ` Haozhong Zhang
  2017-04-11 11:55         ` Xiao Guangrong
  0 siblings, 1 reply; 38+ messages in thread
From: Haozhong Zhang @ 2017-04-11  8:29 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: qemu-devel, dan.j.williams, Xiao Guangrong, Igor Mammedov,
	Michael S. Tsirkin

On 04/11/17 16:22 +0800, Haozhong Zhang wrote:
> On 04/06/17 19:52 +0800, Xiao Guangrong wrote:
> > 
> > 
> > On 03/31/2017 04:41 PM, Haozhong Zhang wrote:
> > > fsync() is used to persist modifications to the back store. If the
> > > host NVDIMM is used as the back store, fsync() on Linux will trigger
> > > the write to the host flush hint address.
> > > 
> > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > > ---
> > >  hw/mem/nvdimm.c         | 22 ++++++++++++++++++++++
> > >  include/hw/mem/nvdimm.h | 13 +++++++++++++
> > >  2 files changed, 35 insertions(+)
> > > 
> > > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> > > index db896b0..484ab8b 100644
> > > --- a/hw/mem/nvdimm.c
> > > +++ b/hw/mem/nvdimm.c
> > > @@ -78,6 +78,26 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
> > >      return &nvdimm->nvdimm_mr;
> > >  }
> > > 
> > > +static void nvdimm_flush_init(NVDIMMDevice *nvdimm, MemoryRegion *hostmem_mr)
> > > +{
> > > +    if (nvdimm->flush_hint_enabled) {
> > > +        nvdimm->backend_fd = memory_region_get_fd(hostmem_mr);
> > 
> > Hmm, IIRC host-mem-file does not initalize backend_fd at all.
> 
> Oops, forgot to add this part. Thanks for remind.

Sorry, I replied too quick.

For hostmem-file, hostmem_mr->ram_block->fd is set in file_ram_alloc(),
which can be got by memory_region_get_fd() here.

Haozhong

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-06 12:02   ` Xiao Guangrong
@ 2017-04-11  8:41     ` Haozhong Zhang
  2017-04-11 14:56       ` Dan Williams
  0 siblings, 1 reply; 38+ messages in thread
From: Haozhong Zhang @ 2017-04-11  8:41 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Stefan Hajnoczi, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Igor Mammedov, Paolo Bonzini, dan.j.williams, Richard Henderson,
	Xiao Guangrong

On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
> 
> 
> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > > This patch series constructs the flush hint address structures for
> > > nvdimm devices in QEMU.
> > > 
> > > It's of course not for 2.9. I send it out early in order to get
> > > comments on one point I'm uncertain (see the detailed explanation
> > > below). Thanks for any comments in advance!
> > > Background
> > > ---------------
> > 
> > Extra background:
> > 
> > Flush Hint Addresses are necessary because:
> > 
> > 1. Some hardware configurations may require them.  In other words, a
> >    cache flush instruction is not enough to persist data.
> > 
> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >    metadata changes).
> > 
> > Without Flush Hint Addresses only some NVDIMM configurations actually
> > guarantee data persistence.
> > 
> > > Flush hint address structure is a substructure of NFIT and specifies
> > > one or more addresses, namely Flush Hint Addresses. Software can write
> > > to any one of these flush hint addresses to cause any preceding writes
> > > to the NVDIMM region to be flushed out of the intervening platform
> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> > 
> > Do you have performance data?  I'm concerned that Flush Hint Address
> > hardware interface is not virtualization-friendly.
> > 
> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> > 
> >   wmb();
> >   for (i = 0; i < nd_region->ndr_mappings; i++)
> >       if (ndrd_get_flush_wpq(ndrd, i, 0))
> >           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >   wmb();
> > 
> > That looks pretty lightweight - it's an MMIO write between write
> > barriers.
> > 
> > This patch implements the MMIO write like this:
> > 
> >   void nvdimm_flush(NVDIMMDevice *nvdimm)
> >   {
> >       if (nvdimm->backend_fd != -1) {
> >           /*
> >            * If the backend store is a physical NVDIMM device, fsync()
> >            * will trigger the flush via the flush hint on the host device.
> >            */
> >           fsync(nvdimm->backend_fd);
> >       }
> >   }
> > 
> > The MMIO store instruction turned into a synchronous fsync(2) system
> > call plus vmexit/vmenter and QEMU userspace context switch:
> > 
> > 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
> >    instruction has an unexpected and huge latency.
> > 
> > 2. The vcpu thread holds the QEMU global mutex so all other threads
> >    (including the monitor) are blocked during fsync(2).  Other vcpu
> >    threads may block if they vmexit.
> > 
> > It is hard to implement this efficiently in QEMU.  This is why I said
> > the hardware interface is not virtualization-friendly.  It's cheap on
> > real hardware but expensive under virtualization.
> > 
> > We should think about the optimal way of implementing Flush Hint
> > Addresses in QEMU.  But if there is no reasonable way to implement them
> > then I think it's better *not* to implement them, just like the Block
> > Window feature which is also not virtualization-friendly.  Users who
> > want a block device can use virtio-blk.  I don't think NVDIMM Block
> > Window can achieve better performance than virtio-blk under
> > virtualization (although I'm happy to be proven wrong).
> > 
> > Some ideas for a faster implementation:
> > 
> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >    global mutex.  Little synchronization is necessary as long as the
> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
> > 
> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
> >    the physical NVDIMM in cases where the configuration does not require
> >    host kernel interception?  That way QEMU can map the physical
> >    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
> >    is bypassed and performance would be good.
> > 
> > I'm not sure there is anything we can do to make the case where the host
> > kernel wants an fsync(2) fast :(.
> 
> Good point.
> 
> We can assume flush-CPU-cache-to-make-persistence is always
> available on Intel's hardware so that flush-hint-table is not
> needed if the vNVDIMM is based on a real Intel's NVDIMM device.
>

We can let users of qemu (e.g. libvirt) detect whether the backend
device supports ADR, and pass 'flush-hint' option to qemu only if ADR
is not supported.

> If the vNVDIMM device is based on the regular file, i think
> fsync is the bottleneck rather than this mmio-virtualization. :(
> 

Yes, fsync() on the regular file is the bottleneck. We may either

1/ perform the host-side flush in an asynchronous way which will not
   block vcpu too long,

or

2/ not provide strong durability guarantee for non-NVDIMM backend and
   not emulate flush-hint for guest at all. (I know 1/ does not
   provide strong durability guarantee either).

Haozhong

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address
  2017-04-07 13:46       ` Stefan Hajnoczi
@ 2017-04-11  8:57         ` Haozhong Zhang
  2017-04-20 10:54           ` Igor Mammedov
  0 siblings, 1 reply; 38+ messages in thread
From: Haozhong Zhang @ 2017-04-11  8:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, dan.j.williams, Xiao Guangrong, Igor Mammedov,
	Michael S. Tsirkin

On 04/07/17 14:46 +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 06, 2017 at 06:46:49PM +0800, Haozhong Zhang wrote:
> > On 04/06/17 11:24 +0100, Stefan Hajnoczi wrote:
> > > On Fri, Mar 31, 2017 at 04:41:44PM +0800, Haozhong Zhang wrote:
> > > > If option 'reserved-size=RSVD' is present, QEMU will reserve an
> > > > address range of size 'RSVD' after the ending address of pc-dimm
> > > > device.
> > > > 
> > > > For the following example,
> > > >     -object memory-backend-file,id=mem0,size=4G,...
> > > >     -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
> > > 
> > > "reserved-size" is not a clear name.  I suggest calling the property
> > > "num-flush-hints" (default 0).  QEMU can calculate the actual size in
> > > bytes.
> > > 
> > >   -device nvdimm,num-flush-hints=1
> > > 
> > > QEMU will use one flush hint and reserve enough space (e.g. 1 page) for
> > > the MMIO region.
> > >
> > 
> > Each flush hint can be as small as one cache line size which is also
> > the size used in this patch series.
> > 
> > We need to calculate the size of all flush hints in pc_dimm_memory_plug(),
> > because when building the flush hint address structure we need to know
> > the address of flush hints.
> > 
> > IIUC, pc_dimm_memory_plug() is not specific to x86, so it's better
> > take a general way to get the vcpu cache line size in pc_dimm_memory_plug(),
> > which seemingly lacks in QEMU (though I believe it should be doable).
> > 
> > To make things simple, I leave the size decision to users, and check
> > whether it's large enough when building the flush hint address
> > structures in patch 4.
> 
> Do you see my concern that "reserved-size" is not a good property?
> 
>  * How does the user choose a sensible value?

A user has to calculate the size by himself by multiplying the number
of flush hint address (currently it's hardcoded to 1) with the cache
line size (e.g. 64 bytes on x86) and properly align it, or simply uses
4KB or even a larger value.

In case the reserved-size is too small, QEMU will complain and abort,
so the user still has chance to know what goes wrong.

> 
>  * Why is it called "reserved-size" instead of "flush-hints-size"?
>

This property is made as a property of pc-dimm rather than nvdimm,
which could be used for other purposes, e.g. adjust the alignment the
base address of hotplugged pc-dimm (I remember old linux kernels on
x86-64 requires hotplugged memory chunk be 128MB aligned). Of course,
it can be limited to be a nvdimm-only property, and renamed to
'flush-hints-size' in that case.

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store
  2017-04-11  8:29       ` Haozhong Zhang
@ 2017-04-11 11:55         ` Xiao Guangrong
  0 siblings, 0 replies; 38+ messages in thread
From: Xiao Guangrong @ 2017-04-11 11:55 UTC (permalink / raw)
  To: qemu-devel, dan.j.williams, Xiao Guangrong, Igor Mammedov,
	Michael S. Tsirkin



On 04/11/2017 04:29 PM, Haozhong Zhang wrote:
> On 04/11/17 16:22 +0800, Haozhong Zhang wrote:
>> On 04/06/17 19:52 +0800, Xiao Guangrong wrote:
>>>
>>>
>>> On 03/31/2017 04:41 PM, Haozhong Zhang wrote:
>>>> fsync() is used to persist modifications to the back store. If the
>>>> host NVDIMM is used as the back store, fsync() on Linux will trigger
>>>> the write to the host flush hint address.
>>>>
>>>> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
>>>> ---
>>>>  hw/mem/nvdimm.c         | 22 ++++++++++++++++++++++
>>>>  include/hw/mem/nvdimm.h | 13 +++++++++++++
>>>>  2 files changed, 35 insertions(+)
>>>>
>>>> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
>>>> index db896b0..484ab8b 100644
>>>> --- a/hw/mem/nvdimm.c
>>>> +++ b/hw/mem/nvdimm.c
>>>> @@ -78,6 +78,26 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
>>>>      return &nvdimm->nvdimm_mr;
>>>>  }
>>>>
>>>> +static void nvdimm_flush_init(NVDIMMDevice *nvdimm, MemoryRegion *hostmem_mr)
>>>> +{
>>>> +    if (nvdimm->flush_hint_enabled) {
>>>> +        nvdimm->backend_fd = memory_region_get_fd(hostmem_mr);
>>>
>>> Hmm, IIRC host-mem-file does not initalize backend_fd at all.
>>
>> Oops, forgot to add this part. Thanks for remind.
>
> Sorry, I replied too quick.
>
> For hostmem-file, hostmem_mr->ram_block->fd is set in file_ram_alloc(),
> which can be got by memory_region_get_fd() here.

Okay, my wrong memory and did not check the code during patch
review. :)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-11  8:41     ` Haozhong Zhang
@ 2017-04-11 14:56       ` Dan Williams
  2017-04-20 19:49         ` Dan Williams
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2017-04-11 14:56 UTC (permalink / raw)
  To: Xiao Guangrong, Stefan Hajnoczi, Eduardo Habkost,
	Michael S. Tsirkin, qemu-devel, Igor Mammedov, Paolo Bonzini,
	Dan J Williams, Richard Henderson, Xiao Guangrong,
	Christoph Hellwig

[ adding Christoph ]

On Tue, Apr 11, 2017 at 1:41 AM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
>>
>>
>> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
>> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
>> > > This patch series constructs the flush hint address structures for
>> > > nvdimm devices in QEMU.
>> > >
>> > > It's of course not for 2.9. I send it out early in order to get
>> > > comments on one point I'm uncertain (see the detailed explanation
>> > > below). Thanks for any comments in advance!
>> > > Background
>> > > ---------------
>> >
>> > Extra background:
>> >
>> > Flush Hint Addresses are necessary because:
>> >
>> > 1. Some hardware configurations may require them.  In other words, a
>> >    cache flush instruction is not enough to persist data.
>> >
>> > 2. The host file system may need fsync(2) calls (e.g. to persist
>> >    metadata changes).
>> >
>> > Without Flush Hint Addresses only some NVDIMM configurations actually
>> > guarantee data persistence.
>> >
>> > > Flush hint address structure is a substructure of NFIT and specifies
>> > > one or more addresses, namely Flush Hint Addresses. Software can write
>> > > to any one of these flush hint addresses to cause any preceding writes
>> > > to the NVDIMM region to be flushed out of the intervening platform
>> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>> >
>> > Do you have performance data?  I'm concerned that Flush Hint Address
>> > hardware interface is not virtualization-friendly.
>> >
>> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>> >
>> >   wmb();
>> >   for (i = 0; i < nd_region->ndr_mappings; i++)
>> >       if (ndrd_get_flush_wpq(ndrd, i, 0))
>> >           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>> >   wmb();
>> >
>> > That looks pretty lightweight - it's an MMIO write between write
>> > barriers.
>> >
>> > This patch implements the MMIO write like this:
>> >
>> >   void nvdimm_flush(NVDIMMDevice *nvdimm)
>> >   {
>> >       if (nvdimm->backend_fd != -1) {
>> >           /*
>> >            * If the backend store is a physical NVDIMM device, fsync()
>> >            * will trigger the flush via the flush hint on the host device.
>> >            */
>> >           fsync(nvdimm->backend_fd);
>> >       }
>> >   }
>> >
>> > The MMIO store instruction turned into a synchronous fsync(2) system
>> > call plus vmexit/vmenter and QEMU userspace context switch:
>> >
>> > 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>> >    instruction has an unexpected and huge latency.
>> >
>> > 2. The vcpu thread holds the QEMU global mutex so all other threads
>> >    (including the monitor) are blocked during fsync(2).  Other vcpu
>> >    threads may block if they vmexit.
>> >
>> > It is hard to implement this efficiently in QEMU.  This is why I said
>> > the hardware interface is not virtualization-friendly.  It's cheap on
>> > real hardware but expensive under virtualization.
>> >
>> > We should think about the optimal way of implementing Flush Hint
>> > Addresses in QEMU.  But if there is no reasonable way to implement them
>> > then I think it's better *not* to implement them, just like the Block
>> > Window feature which is also not virtualization-friendly.  Users who
>> > want a block device can use virtio-blk.  I don't think NVDIMM Block
>> > Window can achieve better performance than virtio-blk under
>> > virtualization (although I'm happy to be proven wrong).
>> >
>> > Some ideas for a faster implementation:
>> >
>> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>> >    global mutex.  Little synchronization is necessary as long as the
>> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
>> >
>> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
>> >    the physical NVDIMM in cases where the configuration does not require
>> >    host kernel interception?  That way QEMU can map the physical
>> >    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>> >    is bypassed and performance would be good.
>> >
>> > I'm not sure there is anything we can do to make the case where the host
>> > kernel wants an fsync(2) fast :(.
>>
>> Good point.
>>
>> We can assume flush-CPU-cache-to-make-persistence is always
>> available on Intel's hardware so that flush-hint-table is not
>> needed if the vNVDIMM is based on a real Intel's NVDIMM device.
>>
>
> We can let users of qemu (e.g. libvirt) detect whether the backend
> device supports ADR, and pass 'flush-hint' option to qemu only if ADR
> is not supported.
>

There currently is no ACPI mechanism to detect the presence of ADR.
Also, you still need the flush for fs metadata management.

>> If the vNVDIMM device is based on the regular file, i think
>> fsync is the bottleneck rather than this mmio-virtualization. :(
>>
>
> Yes, fsync() on the regular file is the bottleneck. We may either
>
> 1/ perform the host-side flush in an asynchronous way which will not
>    block vcpu too long,
>
> or
>
> 2/ not provide strong durability guarantee for non-NVDIMM backend and
>    not emulate flush-hint for guest at all. (I know 1/ does not
>    provide strong durability guarantee either).

or

3/ Use device-dax as a stop-gap until we can get an efficient fsync()
overhead reduction (or bypass) mechanism built and accepted for
filesystem-dax.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-11  6:34   ` Haozhong Zhang
@ 2017-04-18 10:15     ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-04-18 10:15 UTC (permalink / raw)
  To: qemu-devel, Xiao Guangrong, Michael S. Tsirkin, Eduardo Habkost,
	Paolo Bonzini, Igor Mammedov, dan.j.williams, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 7907 bytes --]

On Tue, Apr 11, 2017 at 02:34:26PM +0800, Haozhong Zhang wrote:
> On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > > This patch series constructs the flush hint address structures for
> > > nvdimm devices in QEMU.
> > > 
> > > It's of course not for 2.9. I send it out early in order to get
> > > comments on one point I'm uncertain (see the detailed explanation
> > > below). Thanks for any comments in advance!
> > > Background
> > > ---------------
> > 
> > Extra background:
> > 
> > Flush Hint Addresses are necessary because:
> > 
> > 1. Some hardware configurations may require them.  In other words, a
> >    cache flush instruction is not enough to persist data.
> > 
> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >    metadata changes).
> > 
> > Without Flush Hint Addresses only some NVDIMM configurations actually
> > guarantee data persistence.
> > 
> > > Flush hint address structure is a substructure of NFIT and specifies
> > > one or more addresses, namely Flush Hint Addresses. Software can write
> > > to any one of these flush hint addresses to cause any preceding writes
> > > to the NVDIMM region to be flushed out of the intervening platform
> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> > 
> > Do you have performance data?  I'm concerned that Flush Hint Address
> > hardware interface is not virtualization-friendly.
> 
> Some performance data below.
> 
> Host HW config:
>   CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz x 2 sockets w/ HT enabled
>   MEM: 64 GB
> 
>   As I don't have NVDIMM hardware, so I use files in ext4 fs on a
>   normal SATA SSD as the back storage of vNVDIMM.
> 
> 
> Host SW config:
>   Kernel: 4.10.1
>   QEMU: commit ea2afcf with this patch series applied.
> 
> 
> Guest config:
>   For flush hint enabled case, the following QEMU options are used
>     -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
>     -m 4G,slots=4,maxmem=128G \
>     -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
>     -device nvdimm,id=nv1,memdev=mem1,reserved-size=4K,flush-hint \
>     -hda GUEST_DISK_IMG -serial pty
> 
>   For flush hint disabled case, the following QEMU options are used
>     -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
>     -m 4G,slots=4,maxmem=128G \
>     -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
>     -device nvdimm,id=nv1,memdev=mem1 \
>     -hda GUEST_DISK_IMG -serial pty
> 
>   nvm-img used above is created in ext4 fs on the host SSD by
>     dd if=/dev/zero of=nvm-img bs=1G count=8
> 
>   Guest kernel: 4.11.0-rc4
> 
> 
> Benchmark in guest:
>   mkfs.ext4 /dev/pmem0
>   mount -o dax /dev/pmem0 /mnt
>   dd if=/dev/zero of=/mnt/data bs=1G count=7 # warm up EPT mapping
>   rm /mnt/data                               #
>   dd if=/dev/zero of=/mnt/data bs=1G count=7
> 
>   and record the write speed reported by the last 'dd' command.
> 
> 
> Result:
>   - Flush hint disabled
>     Vary from 161 MB/s to 708 MB/s, depending on how many fs/device
>     flush operations are performed on the host side during the guest
>     'dd'.
> 
>   - Flush hint enabled
>   
>     Vary from 164 MB/s to 546 MB/s, depending on how long fsync() in
>     QEMU takes. Usually, there is at least one fsync() during one 'dd'
>     command that takes several seconds (the worst one takes 39 s).
> 
>     To be worse, during those long host-side fsync() operations, guest
>     kernel complained stalls.

I'm surprised that maximum throughput was 708 MB/s.  The guest is
DAX-aware and the write(2) syscall is a memcpy.  I expected higher
numbers without flush hints.

Also strange that throughput varied so greatly.  A benchmark that varies
4x is not useful since it's hard to tell if anything <4x indicates a
significant performance difference.  In other words, the noise is huge!

What results do you get on the host?

Dan: Any comments on this benchmark and is there a recommended way to
benchmark NVDIMM?

> Some thoughts:
> 
> - If the non-NVDIMM hardware is used as the back store of vNVDIMM,
>   QEMU may perform the host-side flush operations asynchronously with
>   VM, which will not block VM too long but sacrifice the durability
>   guarantee.
> 
> - If physical NVDIMM is used as the back store and ADR is supported on
>   the host, QEMU can rely on ADR to guarantee the data durability and
>   will not need to emulate flush hint for guest.
> 
> - If physical NVDIMM is used as the back store and ADR is not
>   supported on the host, QEMU will still need to emulate flush hint
>   for guest and need to use a fast approach other than fsync() to
>   trigger writes to host flush hint.
> 
>   Could kernel expose an interface to allow the userland (i.e. QEMU in
>   this case) to directly write to the flush hint of a NVDIMM region?
> 
> 
> Haozhong
> 
> > 
> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> > 
> >   wmb();
> >   for (i = 0; i < nd_region->ndr_mappings; i++)
> >       if (ndrd_get_flush_wpq(ndrd, i, 0))
> >           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >   wmb();
> > 
> > That looks pretty lightweight - it's an MMIO write between write
> > barriers.
> > 
> > This patch implements the MMIO write like this:
> > 
> >   void nvdimm_flush(NVDIMMDevice *nvdimm)
> >   {
> >       if (nvdimm->backend_fd != -1) {
> >           /*
> >            * If the backend store is a physical NVDIMM device, fsync()
> >            * will trigger the flush via the flush hint on the host device.
> >            */
> >           fsync(nvdimm->backend_fd);
> >       }
> >   }
> > 
> > The MMIO store instruction turned into a synchronous fsync(2) system
> > call plus vmexit/vmenter and QEMU userspace context switch:
> > 
> > 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
> >    instruction has an unexpected and huge latency.
> > 
> > 2. The vcpu thread holds the QEMU global mutex so all other threads
> >    (including the monitor) are blocked during fsync(2).  Other vcpu
> >    threads may block if they vmexit.
> > 
> > It is hard to implement this efficiently in QEMU.  This is why I said
> > the hardware interface is not virtualization-friendly.  It's cheap on
> > real hardware but expensive under virtualization.
> > 
> > We should think about the optimal way of implementing Flush Hint
> > Addresses in QEMU.  But if there is no reasonable way to implement them
> > then I think it's better *not* to implement them, just like the Block
> > Window feature which is also not virtualization-friendly.  Users who
> > want a block device can use virtio-blk.  I don't think NVDIMM Block
> > Window can achieve better performance than virtio-blk under
> > virtualization (although I'm happy to be proven wrong).
> > 
> > Some ideas for a faster implementation:
> > 
> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >    global mutex.  Little synchronization is necessary as long as the
> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
> > 
> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
> >    the physical NVDIMM in cases where the configuration does not require
> >    host kernel interception?  That way QEMU can map the physical
> >    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
> >    is bypassed and performance would be good.
> > 
> > I'm not sure there is anything we can do to make the case where the host
> > kernel wants an fsync(2) fast :(.
> > 
> > Benchmark results would be important for deciding how big the problem
> > is.
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address
  2017-04-11  8:57         ` Haozhong Zhang
@ 2017-04-20 10:54           ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2017-04-20 10:54 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Stefan Hajnoczi, qemu-devel, dan.j.williams, Xiao Guangrong,
	Michael S. Tsirkin

On Tue, 11 Apr 2017 16:57:00 +0800
Haozhong Zhang <haozhong.zhang@intel.com> wrote:

> On 04/07/17 14:46 +0100, Stefan Hajnoczi wrote:
> > On Thu, Apr 06, 2017 at 06:46:49PM +0800, Haozhong Zhang wrote:  
> > > On 04/06/17 11:24 +0100, Stefan Hajnoczi wrote:  
> > > > On Fri, Mar 31, 2017 at 04:41:44PM +0800, Haozhong Zhang wrote:  
> > > > > If option 'reserved-size=RSVD' is present, QEMU will reserve an
> > > > > address range of size 'RSVD' after the ending address of pc-dimm
> > > > > device.
> > > > > 
> > > > > For the following example,
> > > > >     -object memory-backend-file,id=mem0,size=4G,...
> > > > >     -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...  
> > > > 
> > > > "reserved-size" is not a clear name.  I suggest calling the property
> > > > "num-flush-hints" (default 0).  QEMU can calculate the actual size in
> > > > bytes.
> > > > 
> > > >   -device nvdimm,num-flush-hints=1
I'm for this approach as well

I see that in patchset you hardcode cache line size to 64 for pc/q35
machines. What you could do is to introduce machine or cpu callback to
fetch cache line size and then you can calculate needed MMIO region size
using it and num-flush-hints in target independent way

> > > > 
> > > > QEMU will use one flush hint and reserve enough space (e.g. 1 page) for
> > > > the MMIO region.
> > > >  
> > > 
> > > Each flush hint can be as small as one cache line size which is also
> > > the size used in this patch series.
> > > 
> > > We need to calculate the size of all flush hints in pc_dimm_memory_plug(),
> > > because when building the flush hint address structure we need to know
> > > the address of flush hints.
> > > 
> > > IIUC, pc_dimm_memory_plug() is not specific to x86, so it's better
> > > take a general way to get the vcpu cache line size in pc_dimm_memory_plug(),
> > > which seemingly lacks in QEMU (though I believe it should be doable).
> > > 
> > > To make things simple, I leave the size decision to users, and check
> > > whether it's large enough when building the flush hint address
> > > structures in patch 4.  
> > 
> > Do you see my concern that "reserved-size" is not a good property?
> > 
> >  * How does the user choose a sensible value?  
> 
> A user has to calculate the size by himself by multiplying the number
> of flush hint address (currently it's hardcoded to 1) with the cache
> line size (e.g. 64 bytes on x86) and properly align it, or simply uses
> 4KB or even a larger value.
> 
> In case the reserved-size is too small, QEMU will complain and abort,
> so the user still has chance to know what goes wrong.
> 
> > 
> >  * Why is it called "reserved-size" instead of "flush-hints-size"?
> >  
> 
> This property is made as a property of pc-dimm rather than nvdimm,
> which could be used for other purposes, e.g. adjust the alignment the
> base address of hotplugged pc-dimm (I remember old linux kernels on
> x86-64 requires hotplugged memory chunk be 128MB aligned).
it would be rather awkward to use with pc-dimm:
 1: it would affect not the dimm for which it's specified but the following one
 2: to set it to correct value that would align base address of the next dimm
    to need value, the user first would have to know (set manually) a base address
    of dimm for which reserved-size is provided.

Users would be better off if he/she just provided properly aligned base address
instead of reserved-size.

> Of course,
> it can be limited to be a nvdimm-only property, and renamed to
> 'flush-hints-size' in that case.
pls move this property to nvdimm

> 
> Thanks,
> Haozhong

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required Haozhong Zhang
  2017-04-06 10:13   ` Stefan Hajnoczi
  2017-04-06 10:25   ` Stefan Hajnoczi
@ 2017-04-20 11:22   ` Igor Mammedov
  2 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2017-04-20 11:22 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, dan.j.williams, Michael S. Tsirkin, Xiao Guangrong,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

On Fri, 31 Mar 2017 16:41:47 +0800
Haozhong Zhang <haozhong.zhang@intel.com> wrote:

> Add an boolean option 'flush-hint' to device 'nvdimm'. If it's on, a
> flush hint address structure will be constructed for each nvdimm
> device.
> 
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  hw/acpi/nvdimm.c        | 106 +++++++++++++++++++++++++++++++++++++++++++++---
>  hw/i386/pc.c            |   5 ++-
>  hw/mem/nvdimm.c         |  26 ++++++++++++
>  include/hw/mem/nvdimm.h |   2 +-
>  4 files changed, 132 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index ea2ac3e..a7ff0b2 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -32,6 +32,8 @@
>  #include "hw/acpi/bios-linker-loader.h"
>  #include "hw/nvram/fw_cfg.h"
>  #include "hw/mem/nvdimm.h"
> +#include "exec/address-spaces.h"
> +#include "qapi/error.h"
>  
>  static int nvdimm_device_list(Object *obj, void *opaque)
>  {
> @@ -168,6 +170,22 @@ struct NvdimmNfitControlRegion {
>  typedef struct NvdimmNfitControlRegion NvdimmNfitControlRegion;
>  
>  /*
> + * NVDIMM Flush Hint Address Structure
> + *
> + * It enables the data durability mechanism via ACPI.
> + */
> +struct NvdimmNfitFlushHintAddress {
> +    uint16_t type;
> +    uint16_t length;
> +    uint32_t nfit_handle;
> +    uint16_t nr_flush_hint_addr;
> +    uint8_t  reserved[6];
> +#define NR_FLUSH_HINT_ADDR 1
> +    uint64_t flush_hint_addr[NR_FLUSH_HINT_ADDR];
> +} QEMU_PACKED;
> +typedef struct NvdimmNfitFlushHintAddress NvdimmNfitFlushHintAddress;
> +
> +/*
>   * Module serial number is a unique number for each device. We use the
>   * slot id of NVDIMM device to generate this number so that each device
>   * associates with a different number.
> @@ -343,10 +361,79 @@ static void nvdimm_build_structure_dcr(GArray *structures, DeviceState *dev)
>                                           (DSM) in DSM Spec Rev1.*/);
>  }
>  
> -static GArray *nvdimm_build_device_structure(void)
> +static uint64_t nvdimm_flush_hint_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    return 0;
> +}
> +
> +static void nvdimm_flush_hint_write(void *opaque, hwaddr addr,
> +                                    uint64_t data, unsigned size)
> +{
> +    nvdimm_debug("Write Flush Hint: offset 0x%"HWADDR_PRIx", data 0x%"PRIx64"\n",
> +                 addr, data);
> +    nvdimm_flush((NVDIMMDevice *)opaque);
> +}
> +
> +static const MemoryRegionOps nvdimm_flush_hint_ops = {
> +    .read = nvdimm_flush_hint_read,
> +    .write = nvdimm_flush_hint_write,
> +};
> +
> +/*
> + * ACPI 6.0: 5.2.25.7 Flush Hint Address Structure
> + */
> +static void nvdimm_build_structure_flush_hint(GArray *structures,
> +                                              DeviceState *dev,
> +                                              unsigned int cache_line_size,
> +                                              Error **errp)
> +{
> +    NvdimmNfitFlushHintAddress *flush_hint;
> +    int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP, NULL);
> +    PCDIMMDevice *dimm = PC_DIMM(dev);
> +    NVDIMMDevice *nvdimm = NVDIMM(dev);
> +    uint64_t addr;
> +    unsigned int i;
> +    MemoryRegion *mr;
> +    Error *local_err = NULL;
> +
> +    if (!nvdimm->flush_hint_enabled) {
> +        return;
> +    }
> +
> +    if (cache_line_size * NR_FLUSH_HINT_ADDR > dimm->reserved_size) {
> +        error_setg(&local_err,
> +                   "insufficient reserved space for flush hint buffers");
> +        goto out;
> +    }
> +
> +    addr = object_property_get_int(OBJECT(dev), PC_DIMM_ADDR_PROP, NULL);
> +    addr += object_property_get_int(OBJECT(dev), PC_DIMM_SIZE_PROP, NULL);
> +
> +    flush_hint = acpi_data_push(structures, sizeof(*flush_hint));
> +    flush_hint->type = cpu_to_le16(6 /* Flush Hint Address Structure */);
> +    flush_hint->length = cpu_to_le16(sizeof(*flush_hint));
> +    flush_hint->nfit_handle = cpu_to_le32(nvdimm_slot_to_handle(slot));
> +    flush_hint->nr_flush_hint_addr = cpu_to_le16(NR_FLUSH_HINT_ADDR);



> +    for (i = 0; i < NR_FLUSH_HINT_ADDR; i++, addr += cache_line_size) {
> +        flush_hint->flush_hint_addr[i] = cpu_to_le64(addr);
> +
> +        mr = g_new0(MemoryRegion, 1);
> +        memory_region_init_io(mr, OBJECT(dev), &nvdimm_flush_hint_ops, nvdimm,
> +                              "nvdimm-flush-hint", cache_line_size);
> +        memory_region_add_subregion(get_system_memory(), addr, mr);
this hunk should be in nvdimm_plug() and use hotplug_memory MR
also it this region might consume upto 1G of address space due to
1Gb alignment per slot.

Alternatively instead of mapping flush hint after nvdimm, you can
use the approach used for label_data and cut out a piece for
flush hint from the end of nvdimm's address range and possibly
use backend's MR as parent for it, doing mapping in nvdimm_realize().

Then flush hint would affect only nvdimm code and won't touch generic
pc-dimm code.

> +    }
> +
> + out:
> +    error_propagate(errp, local_err);
> +}
> +
> +static GArray *nvdimm_build_device_structure(AcpiNVDIMMState *state,
> +                                             Error **errp)
>  {
>      GSList *device_list = nvdimm_get_device_list();
>      GArray *structures = g_array_new(false, true /* clear */, 1);
> +    Error *local_err = NULL;
>  
>      for (; device_list; device_list = device_list->next) {
>          DeviceState *dev = device_list->data;
> @@ -362,9 +449,17 @@ static GArray *nvdimm_build_device_structure(void)
>  
>          /* build NVDIMM Control Region Structure. */
>          nvdimm_build_structure_dcr(structures, dev);
> +
> +        /* build Flush Hint Address Structure */
> +        nvdimm_build_structure_flush_hint(structures, dev,
> +                                          state->cache_line_size, &local_err);
> +        if (local_err) {
> +            break;
> +        }
>      }
>      g_slist_free(device_list);
>  
> +    error_propagate(errp, local_err);
>      return structures;
>  }
>  
> @@ -373,16 +468,17 @@ static void nvdimm_init_fit_buffer(NvdimmFitBuffer *fit_buf)
>      fit_buf->fit = g_array_new(false, true /* clear */, 1);
>  }
>  
> -static void nvdimm_build_fit_buffer(NvdimmFitBuffer *fit_buf)
> +static void nvdimm_build_fit_buffer(AcpiNVDIMMState *state, Error **errp)
>  {
> +    NvdimmFitBuffer *fit_buf = &state->fit_buf;
>      g_array_free(fit_buf->fit, true);
> -    fit_buf->fit = nvdimm_build_device_structure();
> +    fit_buf->fit = nvdimm_build_device_structure(state, errp);
>      fit_buf->dirty = true;
>  }
>  
> -void nvdimm_plug(AcpiNVDIMMState *state)
> +void nvdimm_plug(AcpiNVDIMMState *state, Error **errp)
>  {
> -    nvdimm_build_fit_buffer(&state->fit_buf);
> +    nvdimm_build_fit_buffer(state, errp);
>  }
>  
>  static void nvdimm_build_nfit(AcpiNVDIMMState *state, GArray *table_offsets,
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index d24388e..da4a5d7 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1718,7 +1718,10 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
>                         "nvdimm is not enabled: missing 'nvdimm' in '-M'");
>              goto out;
>          }
> -        nvdimm_plug(&pcms->acpi_nvdimm_state);
> +        nvdimm_plug(&pcms->acpi_nvdimm_state, &local_err);
> +        if (local_err) {
> +            goto out;
> +        }
>      }
>  
>      hhc = HOTPLUG_HANDLER_GET_CLASS(pcms->acpi_dev);
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index 484ab8b..a26908b 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -64,11 +64,37 @@ out:
>      error_propagate(errp, local_err);
>  }
>  
> +static bool nvdimm_get_flush_hint(Object *obj, Error **errp)
> +{
> +    NVDIMMDevice *nvdimm = NVDIMM(obj);
> +
> +    return nvdimm->flush_hint_enabled;
> +}
> +
> +static void nvdimm_set_flush_hint(Object *obj, bool val, Error **errp)
> +{
> +    NVDIMMDevice *nvdimm = NVDIMM(obj);
> +    Error *local_err = NULL;
> +
> +    if (nvdimm->flush_hint_enabled) {
> +        error_setg(&local_err, "cannot change property value");
> +        goto out;
> +    }
> +
> +    nvdimm->flush_hint_enabled = val;
> +
> + out:
> +    error_propagate(errp, local_err);
> +}
> +
>  static void nvdimm_init(Object *obj)
>  {
>      object_property_add(obj, "label-size", "int",
>                          nvdimm_get_label_size, nvdimm_set_label_size, NULL,
>                          NULL, NULL);
> +    object_property_add_bool(obj, "flush-hint",
> +                             nvdimm_get_flush_hint, nvdimm_set_flush_hint,
> +                             NULL);
>  }
>  
>  static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> index 888def6..726f4d9 100644
> --- a/include/hw/mem/nvdimm.h
> +++ b/include/hw/mem/nvdimm.h
> @@ -145,7 +145,7 @@ void nvdimm_init_acpi_state(AcpiNVDIMMState *state, MemoryRegion *io,
>  void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
>                         BIOSLinker *linker, AcpiNVDIMMState *state,
>                         uint32_t ram_slots);
> -void nvdimm_plug(AcpiNVDIMMState *state);
> +void nvdimm_plug(AcpiNVDIMMState *state, Error **errp);
>  void nvdimm_acpi_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev);
>  void nvdimm_flush(NVDIMMDevice *nvdimm);
>  #endif

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store
  2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store Haozhong Zhang
  2017-04-06 11:52   ` Xiao Guangrong
@ 2017-04-20 13:12   ` Igor Mammedov
  1 sibling, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2017-04-20 13:12 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, dan.j.williams, Michael S. Tsirkin, Xiao Guangrong

On Fri, 31 Mar 2017 16:41:45 +0800
Haozhong Zhang <haozhong.zhang@intel.com> wrote:

> fsync() is used to persist modifications to the back store. If the
s/back/backing/

> host NVDIMM is used as the back store, fsync() on Linux will trigger
> the write to the host flush hint address.
> 
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  hw/mem/nvdimm.c         | 22 ++++++++++++++++++++++
>  include/hw/mem/nvdimm.h | 13 +++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index db896b0..484ab8b 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -78,6 +78,26 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
>      return &nvdimm->nvdimm_mr;
>  }
>  
> +static void nvdimm_flush_init(NVDIMMDevice *nvdimm, MemoryRegion *hostmem_mr)
> +{
> +    if (nvdimm->flush_hint_enabled) {
> +        nvdimm->backend_fd = memory_region_get_fd(hostmem_mr);
> +    } else {
> +        nvdimm->backend_fd = -1;
> +    }
> +}
> +
> +void nvdimm_flush(NVDIMMDevice *nvdimm)
> +{
> +    if (nvdimm->backend_fd != -1) {
if backend isn't file and user asked for flush_hint, he/she silently won't get it.
maybe we should fail nvdimm_realize() if backend is not file.


> +        /*
> +         * If the backend store is a physical NVDIMM device, fsync()
> +         * will trigger the flush via the flush hint on the host device.
> +         */
> +        fsync(nvdimm->backend_fd);
> +    }
> +}
> +
>  static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
>  {
>      MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem, errp);
> @@ -105,6 +125,8 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
>      memory_region_init_alias(&nvdimm->nvdimm_mr, OBJECT(dimm),
>                               "nvdimm-memory", mr, 0, pmem_size);
>      nvdimm->nvdimm_mr.align = align;
> +
> +    nvdimm_flush_init(nvdimm, mr);
>  }
>  
>  /*
> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> index 03e1ff9..eb71f41 100644
> --- a/include/hw/mem/nvdimm.h
> +++ b/include/hw/mem/nvdimm.h
> @@ -71,6 +71,18 @@ struct NVDIMMDevice {
>       * guest via ACPI NFIT and _FIT method if NVDIMM hotplug is supported.
>       */
>      MemoryRegion nvdimm_mr;
> +
> +    /*
> +     * If true, a flush hint address structure will be built for this
> +     * NVDIMM device.
> +     */
> +    bool flush_hint_enabled;
> +    /*
> +     * File descriptor of the backend store, which is used in nvdimm
> +     * flush.  It's cached in NVDIMMDevice rather than being fetched
> +     * at each request in order to accelerate the flush a little bit.
> +     */
> +    int backend_fd;
>  };
>  typedef struct NVDIMMDevice NVDIMMDevice;
>  
> @@ -132,4 +144,5 @@ void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
>                         uint32_t ram_slots);
>  void nvdimm_plug(AcpiNVDIMMState *state);
>  void nvdimm_acpi_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev);
> +void nvdimm_flush(NVDIMMDevice *nvdimm);
>  #endif

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-11 14:56       ` Dan Williams
@ 2017-04-20 19:49         ` Dan Williams
  2017-04-21 13:56           ` Stefan Hajnoczi
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2017-04-20 19:49 UTC (permalink / raw)
  To: Xiao Guangrong, Stefan Hajnoczi, Eduardo Habkost,
	Michael S. Tsirkin, qemu-devel, Igor Mammedov, Paolo Bonzini,
	Dan J Williams, Richard Henderson, Xiao Guangrong,
	Christoph Hellwig

On Tue, Apr 11, 2017 at 7:56 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> [ adding Christoph ]
>
> On Tue, Apr 11, 2017 at 1:41 AM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
>> On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
>>>
>>>
>>> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
>>> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
>>> > > This patch series constructs the flush hint address structures for
>>> > > nvdimm devices in QEMU.
>>> > >
>>> > > It's of course not for 2.9. I send it out early in order to get
>>> > > comments on one point I'm uncertain (see the detailed explanation
>>> > > below). Thanks for any comments in advance!
>>> > > Background
>>> > > ---------------
>>> >
>>> > Extra background:
>>> >
>>> > Flush Hint Addresses are necessary because:
>>> >
>>> > 1. Some hardware configurations may require them.  In other words, a
>>> >    cache flush instruction is not enough to persist data.
>>> >
>>> > 2. The host file system may need fsync(2) calls (e.g. to persist
>>> >    metadata changes).
>>> >
>>> > Without Flush Hint Addresses only some NVDIMM configurations actually
>>> > guarantee data persistence.
>>> >
>>> > > Flush hint address structure is a substructure of NFIT and specifies
>>> > > one or more addresses, namely Flush Hint Addresses. Software can write
>>> > > to any one of these flush hint addresses to cause any preceding writes
>>> > > to the NVDIMM region to be flushed out of the intervening platform
>>> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>>> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>>> >
>>> > Do you have performance data?  I'm concerned that Flush Hint Address
>>> > hardware interface is not virtualization-friendly.
>>> >
>>> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>>> >
>>> >   wmb();
>>> >   for (i = 0; i < nd_region->ndr_mappings; i++)
>>> >       if (ndrd_get_flush_wpq(ndrd, i, 0))
>>> >           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>>> >   wmb();
>>> >
>>> > That looks pretty lightweight - it's an MMIO write between write
>>> > barriers.
>>> >
>>> > This patch implements the MMIO write like this:
>>> >
>>> >   void nvdimm_flush(NVDIMMDevice *nvdimm)
>>> >   {
>>> >       if (nvdimm->backend_fd != -1) {
>>> >           /*
>>> >            * If the backend store is a physical NVDIMM device, fsync()
>>> >            * will trigger the flush via the flush hint on the host device.
>>> >            */
>>> >           fsync(nvdimm->backend_fd);
>>> >       }
>>> >   }
>>> >
>>> > The MMIO store instruction turned into a synchronous fsync(2) system
>>> > call plus vmexit/vmenter and QEMU userspace context switch:
>>> >
>>> > 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>>> >    instruction has an unexpected and huge latency.
>>> >
>>> > 2. The vcpu thread holds the QEMU global mutex so all other threads
>>> >    (including the monitor) are blocked during fsync(2).  Other vcpu
>>> >    threads may block if they vmexit.
>>> >
>>> > It is hard to implement this efficiently in QEMU.  This is why I said
>>> > the hardware interface is not virtualization-friendly.  It's cheap on
>>> > real hardware but expensive under virtualization.
>>> >
>>> > We should think about the optimal way of implementing Flush Hint
>>> > Addresses in QEMU.  But if there is no reasonable way to implement them
>>> > then I think it's better *not* to implement them, just like the Block
>>> > Window feature which is also not virtualization-friendly.  Users who
>>> > want a block device can use virtio-blk.  I don't think NVDIMM Block
>>> > Window can achieve better performance than virtio-blk under
>>> > virtualization (although I'm happy to be proven wrong).
>>> >
>>> > Some ideas for a faster implementation:
>>> >
>>> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>>> >    global mutex.  Little synchronization is necessary as long as the
>>> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
>>> >
>>> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
>>> >    the physical NVDIMM in cases where the configuration does not require
>>> >    host kernel interception?  That way QEMU can map the physical
>>> >    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>>> >    is bypassed and performance would be good.
>>> >
>>> > I'm not sure there is anything we can do to make the case where the host
>>> > kernel wants an fsync(2) fast :(.
>>>
>>> Good point.
>>>
>>> We can assume flush-CPU-cache-to-make-persistence is always
>>> available on Intel's hardware so that flush-hint-table is not
>>> needed if the vNVDIMM is based on a real Intel's NVDIMM device.
>>>
>>
>> We can let users of qemu (e.g. libvirt) detect whether the backend
>> device supports ADR, and pass 'flush-hint' option to qemu only if ADR
>> is not supported.
>>
>
> There currently is no ACPI mechanism to detect the presence of ADR.
> Also, you still need the flush for fs metadata management.
>
>>> If the vNVDIMM device is based on the regular file, i think
>>> fsync is the bottleneck rather than this mmio-virtualization. :(
>>>
>>
>> Yes, fsync() on the regular file is the bottleneck. We may either
>>
>> 1/ perform the host-side flush in an asynchronous way which will not
>>    block vcpu too long,
>>
>> or
>>
>> 2/ not provide strong durability guarantee for non-NVDIMM backend and
>>    not emulate flush-hint for guest at all. (I know 1/ does not
>>    provide strong durability guarantee either).
>
> or
>
> 3/ Use device-dax as a stop-gap until we can get an efficient fsync()
> overhead reduction (or bypass) mechanism built and accepted for
> filesystem-dax.

I didn't realize we have a bigger problem with host filesystem-fsync
and that WPQ exits will not save us. Applications that use device-dax
in the guest may never trigger a WPQ flush, because userspace flushing
with device-dax is expected to be safe. WPQ flush was never meant to
be a persistency mechanism the way it is proposed here, it's only
meant to minimize the fallout from potential ADR failure. My apologies
for insinuating that it was viable.

So, until we solve this userspace flushing problem virtualization must
not pass through any file except a device-dax instance for any
production workload.

Also these performance overheads seem prohibitive. We really want to
take whatever fsync minimization / bypass mechanism we come up with on
the host into a fast para-virtualized interface for the guest. Guests
need to be able to avoid hypervisor and host syscall overhead in the
fast path.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-20 19:49         ` Dan Williams
@ 2017-04-21 13:56           ` Stefan Hajnoczi
  2017-04-21 19:14             ` Dan Williams
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-04-21 13:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Xiao Guangrong, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Igor Mammedov, Paolo Bonzini, Richard Henderson, Xiao Guangrong,
	Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 7599 bytes --]

On Thu, Apr 20, 2017 at 12:49:21PM -0700, Dan Williams wrote:
> On Tue, Apr 11, 2017 at 7:56 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> > [ adding Christoph ]
> >
> > On Tue, Apr 11, 2017 at 1:41 AM, Haozhong Zhang
> > <haozhong.zhang@intel.com> wrote:
> >> On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
> >>>
> >>>
> >>> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> >>> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> >>> > > This patch series constructs the flush hint address structures for
> >>> > > nvdimm devices in QEMU.
> >>> > >
> >>> > > It's of course not for 2.9. I send it out early in order to get
> >>> > > comments on one point I'm uncertain (see the detailed explanation
> >>> > > below). Thanks for any comments in advance!
> >>> > > Background
> >>> > > ---------------
> >>> >
> >>> > Extra background:
> >>> >
> >>> > Flush Hint Addresses are necessary because:
> >>> >
> >>> > 1. Some hardware configurations may require them.  In other words, a
> >>> >    cache flush instruction is not enough to persist data.
> >>> >
> >>> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >>> >    metadata changes).
> >>> >
> >>> > Without Flush Hint Addresses only some NVDIMM configurations actually
> >>> > guarantee data persistence.
> >>> >
> >>> > > Flush hint address structure is a substructure of NFIT and specifies
> >>> > > one or more addresses, namely Flush Hint Addresses. Software can write
> >>> > > to any one of these flush hint addresses to cause any preceding writes
> >>> > > to the NVDIMM region to be flushed out of the intervening platform
> >>> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> >>> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> >>> >
> >>> > Do you have performance data?  I'm concerned that Flush Hint Address
> >>> > hardware interface is not virtualization-friendly.
> >>> >
> >>> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> >>> >
> >>> >   wmb();
> >>> >   for (i = 0; i < nd_region->ndr_mappings; i++)
> >>> >       if (ndrd_get_flush_wpq(ndrd, i, 0))
> >>> >           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >>> >   wmb();
> >>> >
> >>> > That looks pretty lightweight - it's an MMIO write between write
> >>> > barriers.
> >>> >
> >>> > This patch implements the MMIO write like this:
> >>> >
> >>> >   void nvdimm_flush(NVDIMMDevice *nvdimm)
> >>> >   {
> >>> >       if (nvdimm->backend_fd != -1) {
> >>> >           /*
> >>> >            * If the backend store is a physical NVDIMM device, fsync()
> >>> >            * will trigger the flush via the flush hint on the host device.
> >>> >            */
> >>> >           fsync(nvdimm->backend_fd);
> >>> >       }
> >>> >   }
> >>> >
> >>> > The MMIO store instruction turned into a synchronous fsync(2) system
> >>> > call plus vmexit/vmenter and QEMU userspace context switch:
> >>> >
> >>> > 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
> >>> >    instruction has an unexpected and huge latency.
> >>> >
> >>> > 2. The vcpu thread holds the QEMU global mutex so all other threads
> >>> >    (including the monitor) are blocked during fsync(2).  Other vcpu
> >>> >    threads may block if they vmexit.
> >>> >
> >>> > It is hard to implement this efficiently in QEMU.  This is why I said
> >>> > the hardware interface is not virtualization-friendly.  It's cheap on
> >>> > real hardware but expensive under virtualization.
> >>> >
> >>> > We should think about the optimal way of implementing Flush Hint
> >>> > Addresses in QEMU.  But if there is no reasonable way to implement them
> >>> > then I think it's better *not* to implement them, just like the Block
> >>> > Window feature which is also not virtualization-friendly.  Users who
> >>> > want a block device can use virtio-blk.  I don't think NVDIMM Block
> >>> > Window can achieve better performance than virtio-blk under
> >>> > virtualization (although I'm happy to be proven wrong).
> >>> >
> >>> > Some ideas for a faster implementation:
> >>> >
> >>> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >>> >    global mutex.  Little synchronization is necessary as long as the
> >>> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
> >>> >
> >>> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
> >>> >    the physical NVDIMM in cases where the configuration does not require
> >>> >    host kernel interception?  That way QEMU can map the physical
> >>> >    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
> >>> >    is bypassed and performance would be good.
> >>> >
> >>> > I'm not sure there is anything we can do to make the case where the host
> >>> > kernel wants an fsync(2) fast :(.
> >>>
> >>> Good point.
> >>>
> >>> We can assume flush-CPU-cache-to-make-persistence is always
> >>> available on Intel's hardware so that flush-hint-table is not
> >>> needed if the vNVDIMM is based on a real Intel's NVDIMM device.
> >>>
> >>
> >> We can let users of qemu (e.g. libvirt) detect whether the backend
> >> device supports ADR, and pass 'flush-hint' option to qemu only if ADR
> >> is not supported.
> >>
> >
> > There currently is no ACPI mechanism to detect the presence of ADR.
> > Also, you still need the flush for fs metadata management.
> >
> >>> If the vNVDIMM device is based on the regular file, i think
> >>> fsync is the bottleneck rather than this mmio-virtualization. :(
> >>>
> >>
> >> Yes, fsync() on the regular file is the bottleneck. We may either
> >>
> >> 1/ perform the host-side flush in an asynchronous way which will not
> >>    block vcpu too long,
> >>
> >> or
> >>
> >> 2/ not provide strong durability guarantee for non-NVDIMM backend and
> >>    not emulate flush-hint for guest at all. (I know 1/ does not
> >>    provide strong durability guarantee either).
> >
> > or
> >
> > 3/ Use device-dax as a stop-gap until we can get an efficient fsync()
> > overhead reduction (or bypass) mechanism built and accepted for
> > filesystem-dax.
> 
> I didn't realize we have a bigger problem with host filesystem-fsync
> and that WPQ exits will not save us. Applications that use device-dax
> in the guest may never trigger a WPQ flush, because userspace flushing
> with device-dax is expected to be safe. WPQ flush was never meant to
> be a persistency mechanism the way it is proposed here, it's only
> meant to minimize the fallout from potential ADR failure. My apologies
> for insinuating that it was viable.
> 
> So, until we solve this userspace flushing problem virtualization must
> not pass through any file except a device-dax instance for any
> production workload.

Okay.  That's what I've assumed up until now and I think distros will
document this limitation.

> Also these performance overheads seem prohibitive. We really want to
> take whatever fsync minimization / bypass mechanism we come up with on
> the host into a fast para-virtualized interface for the guest. Guests
> need to be able to avoid hypervisor and host syscall overhead in the
> fast path.

It's hard to avoid the hypervisor if the host kernel file system needs
an fsync() to persist everything.  There should be a fast path where the
host file is preallocated and no fancy file system features are in use
(e.g. deduplication, copy-on-write snapshots) where host file systems
don't need fsync().

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
  2017-04-21 13:56           ` Stefan Hajnoczi
@ 2017-04-21 19:14             ` Dan Williams
  0 siblings, 0 replies; 38+ messages in thread
From: Dan Williams @ 2017-04-21 19:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Xiao Guangrong, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Igor Mammedov, Paolo Bonzini, Richard Henderson, Xiao Guangrong,
	Christoph Hellwig, linux-xfs, linux-fsdevel

[ adding xfs and fsdevel ]

On Fri, Apr 21, 2017 at 6:56 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
[..]
>> >>> If the vNVDIMM device is based on the regular file, i think
>> >>> fsync is the bottleneck rather than this mmio-virtualization. :(
>> >>>
>> >>
>> >> Yes, fsync() on the regular file is the bottleneck. We may either
>> >>
>> >> 1/ perform the host-side flush in an asynchronous way which will not
>> >>    block vcpu too long,
>> >>
>> >> or
>> >>
>> >> 2/ not provide strong durability guarantee for non-NVDIMM backend and
>> >>    not emulate flush-hint for guest at all. (I know 1/ does not
>> >>    provide strong durability guarantee either).
>> >
>> > or
>> >
>> > 3/ Use device-dax as a stop-gap until we can get an efficient fsync()
>> > overhead reduction (or bypass) mechanism built and accepted for
>> > filesystem-dax.
>>
>> I didn't realize we have a bigger problem with host filesystem-fsync
>> and that WPQ exits will not save us. Applications that use device-dax
>> in the guest may never trigger a WPQ flush, because userspace flushing
>> with device-dax is expected to be safe. WPQ flush was never meant to
>> be a persistency mechanism the way it is proposed here, it's only
>> meant to minimize the fallout from potential ADR failure. My apologies
>> for insinuating that it was viable.
>>
>> So, until we solve this userspace flushing problem virtualization must
>> not pass through any file except a device-dax instance for any
>> production workload.
>
> Okay.  That's what I've assumed up until now and I think distros will
> document this limitation.
>
>> Also these performance overheads seem prohibitive. We really want to
>> take whatever fsync minimization / bypass mechanism we come up with on
>> the host into a fast para-virtualized interface for the guest. Guests
>> need to be able to avoid hypervisor and host syscall overhead in the
>> fast path.
>
> It's hard to avoid the hypervisor if the host kernel file system needs
> an fsync() to persist everything.  There should be a fast path where the
> host file is preallocated and no fancy file system features are in use
> (e.g. deduplication, copy-on-write snapshots) where host file systems
> don't need fsync().
>

So we've gone around and around on this with XFS folks with various
levels of disagreement about how to achieve synchronous faulting or
disabling metadata updates for a file. I think at some point someone
is going to want some fancy filesystem feature *and* DAX *and* still
want it to be fast.

The current problem is that if you want to checkpoint persistence at a
high rate, think committing updates to a tree data structure, an
fsync() call is going to burn a lot of cycles just to find out that
there is nothing to do in most cases. Especially when you've only
touched a couple pointers in a cache line, calling into the kernel to
sync those writes is a ridiculous proposition.

One of the current ideas to resolve this is instead of trying to
implement synchronous faulting and wrestle with the constraints that
puts on fs-fault paths, is to instead have synchronous notification of
metadata dirtying events to userspace. That notification mechanism
would be associated with something like an fsync2() library call that
knows how to bypass sys_fsync in the common case. Of course, this is
still in the idea phase, so until we can get a proof-of-concept on its
feet this is all subject to further debate.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2017-04-21 19:14 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-31  8:41 [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Haozhong Zhang
2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address Haozhong Zhang
2017-04-06 10:24   ` Stefan Hajnoczi
2017-04-06 10:46     ` Haozhong Zhang
2017-04-07 13:46       ` Stefan Hajnoczi
2017-04-11  8:57         ` Haozhong Zhang
2017-04-20 10:54           ` Igor Mammedov
2017-04-06 11:50   ` Xiao Guangrong
2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store Haozhong Zhang
2017-04-06 11:52   ` Xiao Guangrong
2017-04-11  8:22     ` Haozhong Zhang
2017-04-11  8:29       ` Haozhong Zhang
2017-04-11 11:55         ` Xiao Guangrong
2017-04-20 13:12   ` Igor Mammedov
2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 3/4] nvdimm acpi: record the cache line size in AcpiNVDIMMState Haozhong Zhang
2017-03-31  8:41 ` [Qemu-devel] [RFC PATCH 4/4] nvdimm acpi: build flush hint address structure if required Haozhong Zhang
2017-04-06 10:13   ` Stefan Hajnoczi
2017-04-06 10:53     ` Haozhong Zhang
2017-04-07 14:41       ` Stefan Hajnoczi
2017-04-07 15:51     ` Dan Williams
2017-04-06 10:25   ` Stefan Hajnoczi
2017-04-20 11:22   ` Igor Mammedov
2017-04-06  9:39 ` [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure Xiao Guangrong
2017-04-06  9:58   ` Haozhong Zhang
2017-04-06 11:46     ` Xiao Guangrong
2017-04-06  9:43 ` Stefan Hajnoczi
2017-04-06 10:31   ` Haozhong Zhang
2017-04-07 14:38     ` Stefan Hajnoczi
2017-04-06 12:02   ` Xiao Guangrong
2017-04-11  8:41     ` Haozhong Zhang
2017-04-11 14:56       ` Dan Williams
2017-04-20 19:49         ` Dan Williams
2017-04-21 13:56           ` Stefan Hajnoczi
2017-04-21 19:14             ` Dan Williams
2017-04-06 14:32   ` Dan Williams
2017-04-07 14:31     ` Stefan Hajnoczi
2017-04-11  6:34   ` Haozhong Zhang
2017-04-18 10:15     ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.