All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v2 0/4] nvdimm: fixes for (non-)dax backends
@ 2017-06-06  7:22 Haozhong Zhang
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 1/4] nvdimm: add a macro for property "label-size" Haozhong Zhang
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-06  7:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Stefan Hajnoczi, Dan Williams, Haozhong Zhang

This is v2 of patch series
https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg05919.html.

Changes in v2:
* Patch 2: use major/minor numbers of device and sysfs to check
  whether a backend file refers to DAX device. (Dan & Stefan)

* Patch 3: set ACPI_NFIT_MEM_NOT_ARMED bit of state flags in guest
  NVDIMM region mapping structure. This behavior is controlled by a
  new QEMU nvdimm option "restrict". See details in patch 3. (Dan)

* Patch 4: drop the "align" property and get the alignment of DAX
  device by QEMU itself.

* Wrap Linux specific code by __linux__. (Stefan)


Haozhong Zhang (4):
  nvdimm: add a macro for property "label-size"
  nvdimm: warn if the backend is not a DAX device
  nvdimm: add a boolean option "restrict"
  util/mmap-alloc: account for DAX device in qemu_fd_getpagesize

 hw/acpi/nvdimm.c        | 16 ++++++++++
 hw/mem/nvdimm.c         | 44 +++++++++++++++++++++++++-
 include/hw/mem/nvdimm.h |  8 +++++
 include/qemu/osdep.h    | 19 +++++++++++
 util/mmap-alloc.c       |  6 ++++
 util/osdep.c            | 84 +++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 176 insertions(+), 1 deletion(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Qemu-devel] [PATCH v2 1/4] nvdimm: add a macro for property "label-size"
  2017-06-06  7:22 [Qemu-devel] [PATCH v2 0/4] nvdimm: fixes for (non-)dax backends Haozhong Zhang
@ 2017-06-06  7:22 ` Haozhong Zhang
  2017-06-07 15:29   ` Stefan Hajnoczi
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device Haozhong Zhang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-06  7:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Stefan Hajnoczi, Dan Williams, Haozhong Zhang

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 hw/mem/nvdimm.c         | 2 +-
 include/hw/mem/nvdimm.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index db896b0bb6..a9b0863f20 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -66,7 +66,7 @@ out:
 
 static void nvdimm_init(Object *obj)
 {
-    object_property_add(obj, "label-size", "int",
+    object_property_add(obj, NVDIMM_LABEL_SIZE_PROP, "int",
                         nvdimm_get_label_size, nvdimm_set_label_size, NULL,
                         NULL, NULL);
 }
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index 03e1ff9558..f1f3987055 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -47,6 +47,9 @@
 #define NVDIMM_CLASS(oc) OBJECT_CLASS_CHECK(NVDIMMClass, (oc), TYPE_NVDIMM)
 #define NVDIMM_GET_CLASS(obj) OBJECT_GET_CLASS(NVDIMMClass, (obj), \
                                                TYPE_NVDIMM)
+
+#define NVDIMM_LABEL_SIZE_PROP "label-size"
+
 struct NVDIMMDevice {
     /* private */
     PCDIMMDevice parent_obj;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device
  2017-06-06  7:22 [Qemu-devel] [PATCH v2 0/4] nvdimm: fixes for (non-)dax backends Haozhong Zhang
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 1/4] nvdimm: add a macro for property "label-size" Haozhong Zhang
@ 2017-06-06  7:22 ` Haozhong Zhang
  2017-06-06 17:59   ` Dan Williams
                     ` (2 more replies)
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict" Haozhong Zhang
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 4/4] util/mmap-alloc: account for DAX device in qemu_fd_getpagesize Haozhong Zhang
  3 siblings, 3 replies; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-06  7:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Stefan Hajnoczi, Dan Williams, Haozhong Zhang

Applications in Linux guest that use device-dax never trigger flush
that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
device-dax, QEMU cannot guarantee the persistence of guest writes.
Before solving this flushing problem, QEMU should warn users if the
host backend is not device-dax.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
---
 hw/mem/nvdimm.c      |  6 ++++++
 include/qemu/osdep.h |  9 ++++++++
 util/osdep.c         | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 76 insertions(+)

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index a9b0863f20..b23542fbdf 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -26,6 +26,7 @@
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "hw/mem/nvdimm.h"
+#include "qemu/error-report.h"
 
 static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
                                   void *opaque, Error **errp)
@@ -84,6 +85,11 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
     NVDIMMDevice *nvdimm = NVDIMM(dimm);
     uint64_t align, pmem_size, size = memory_region_size(mr);
 
+    if (!qemu_fd_is_dev_dax(memory_region_get_fd(mr))) {
+        error_report("warning: nvdimm backend does not look like a DAX device, "
+                     "unable to guarantee persistence of guest writes");
+    }
+
     align = memory_region_get_alignment(mr);
 
     pmem_size = size - nvdimm->label_size;
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 1c9f5e260c..7f26af371e 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -470,4 +470,13 @@ char *qemu_get_pid_name(pid_t pid);
  */
 pid_t qemu_fork(Error **errp);
 
+/**
+ * qemu_fd_is_dev_dax:
+ *
+ * Check whether @fd describes a DAX device.
+ *
+ * Returns true if it is; otherwise, return false.
+ */
+bool qemu_fd_is_dev_dax(int fd);
+
 #endif
diff --git a/util/osdep.c b/util/osdep.c
index a2863c8e53..02881f96bc 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -471,3 +471,64 @@ writev(int fd, const struct iovec *iov, int iov_cnt)
     return readv_writev(fd, iov, iov_cnt, true);
 }
 #endif
+
+#ifdef __linux__
+static ssize_t qemu_dev_dax_sysfs_read(int fd, const char *entry,
+                                       char *buf, size_t len)
+{
+    ssize_t read_bytes;
+    struct stat st;
+    unsigned int major, minor;
+    char *path, *pos;
+    int sysfs_fd;
+
+    if (fstat(fd, &st)) {
+        return 0;
+    }
+
+    major = major(st.st_rdev);
+    minor = minor(st.st_rdev);
+    path = g_strdup_printf("/sys/dev/char/%u:%u/%s", major, minor, entry);
+
+    sysfs_fd = open(path, O_RDONLY);
+    g_free(path);
+    if (sysfs_fd == -1) {
+        return 0;
+    }
+
+    read_bytes = read(sysfs_fd, buf, len - 1);
+    close(sysfs_fd);
+    if (read_bytes > 0) {
+        buf[read_bytes] = '\0';
+        pos = g_strstr_len(buf, read_bytes, "\n");
+        if (pos) {
+            *pos = '\0';
+        }
+    }
+
+    return read_bytes;
+}
+#endif /* __linux__ */
+
+bool qemu_fd_is_dev_dax(int fd)
+{
+    bool is_dax = false;
+
+#ifdef __linux__
+    char devtype[7];
+    ssize_t len;
+
+    if (fd == -1) {
+        return false;
+    }
+
+    len = qemu_dev_dax_sysfs_read(fd, "device/devtype",
+                                  devtype, sizeof(devtype));
+    if (len <= 0) {
+        return false;
+    }
+    is_dax = !strncmp(devtype, "nd_dax", len);
+#endif /* __linux__ */
+
+    return is_dax;
+}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict"
  2017-06-06  7:22 [Qemu-devel] [PATCH v2 0/4] nvdimm: fixes for (non-)dax backends Haozhong Zhang
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 1/4] nvdimm: add a macro for property "label-size" Haozhong Zhang
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device Haozhong Zhang
@ 2017-06-06  7:22 ` Haozhong Zhang
  2017-06-07 15:27   ` Stefan Hajnoczi
  2017-06-08 12:56   ` Michael S. Tsirkin
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 4/4] util/mmap-alloc: account for DAX device in qemu_fd_getpagesize Haozhong Zhang
  3 siblings, 2 replies; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-06  7:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Stefan Hajnoczi, Dan Williams, Haozhong Zhang

If a vNVDIMM device is not backed by a DAX device and its "restrict"
option is enabled, bit 3 of state flags in its region mapping
structure will be set, in order to notify the guest of the lack of
write persistence guarantee. Once this bit is set, the guest OS may
mark the vNVDIMM device as read-only.

This option is disabled by default for backwards compatibility. It's
recommended to enable for the formal usage.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 hw/acpi/nvdimm.c        | 16 ++++++++++++++++
 hw/mem/nvdimm.c         | 38 +++++++++++++++++++++++++++++++++++++-
 include/hw/mem/nvdimm.h |  5 +++++
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 8e7d6ec034..fd1ef6dc65 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -138,6 +138,8 @@ struct NvdimmNfitMemDev {
 } QEMU_PACKED;
 typedef struct NvdimmNfitMemDev NvdimmNfitMemDev;
 
+#define ACPI_NFIT_MEM_NOT_ARMED    (1 << 3)
+
 /*
  * NVDIMM Control Region Structure
  *
@@ -289,6 +291,10 @@ nvdimm_build_structure_memdev(GArray *structures, DeviceState *dev)
     int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
                                             NULL);
     uint32_t handle = nvdimm_slot_to_handle(slot);
+    bool dev_dax = object_property_get_bool(OBJECT(dev), NVDIMM_DEV_DAX_PROP,
+                                            NULL);
+    bool restrict_mode = object_property_get_bool(OBJECT(dev),
+                                                  NVDIMM_RESTRICT_PROP, NULL);
 
     nfit_memdev = acpi_data_push(structures, sizeof(*nfit_memdev));
 
@@ -312,6 +318,16 @@ nvdimm_build_structure_memdev(GArray *structures, DeviceState *dev)
 
     /* Only one interleave for PMEM. */
     nfit_memdev->interleave_ways = cpu_to_le16(1);
+
+    /*
+     * If a vNVDIMM device in the restrict mode and is not backed by a
+     * DAX device, QEMU will set ACPI_NFIT_MEM_NOT_ARMED bit of state
+     * flags in its region mapping structure, in order to notify the
+     * guest of the lack of write persistence guarantee.
+     */
+    if (!dev_dax && restrict_mode) {
+        nfit_memdev->flags = cpu_to_le16(ACPI_NFIT_MEM_NOT_ARMED);
+    }
 }
 
 /*
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index b23542fbdf..cda416e5c8 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -65,11 +65,46 @@ out:
     error_propagate(errp, local_err);
 }
 
+static bool nvdimm_get_backend_dev_dax(Object *obj, Error **errp)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+    return nvdimm->backend_dev_dax;
+}
+
+static bool nvdimm_get_restrict(Object *obj, Error **errp)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+    return nvdimm->restrict_mode;
+}
+
+static void nvdimm_set_restrict(Object *obj, bool val, Error **errp)
+{
+    DeviceState *dev = DEVICE(obj);
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+    Error *local_err = NULL;
+
+    if (dev->realized) {
+        error_setg(&local_err, "cannot change property value");
+        goto out;
+    }
+
+    nvdimm->restrict_mode = val;
+
+ out:
+    error_propagate(errp, local_err);
+}
+
 static void nvdimm_init(Object *obj)
 {
     object_property_add(obj, NVDIMM_LABEL_SIZE_PROP, "int",
                         nvdimm_get_label_size, nvdimm_set_label_size, NULL,
                         NULL, NULL);
+    object_property_add_bool(obj, NVDIMM_DEV_DAX_PROP,
+                             nvdimm_get_backend_dev_dax, NULL, NULL);
+    object_property_add_bool(obj, NVDIMM_RESTRICT_PROP,
+                             nvdimm_get_restrict, nvdimm_set_restrict, NULL);
 }
 
 static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
@@ -85,7 +120,8 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
     NVDIMMDevice *nvdimm = NVDIMM(dimm);
     uint64_t align, pmem_size, size = memory_region_size(mr);
 
-    if (!qemu_fd_is_dev_dax(memory_region_get_fd(mr))) {
+    nvdimm->backend_dev_dax = qemu_fd_is_dev_dax(memory_region_get_fd(mr));
+    if (!nvdimm->backend_dev_dax) {
         error_report("warning: nvdimm backend does not look like a DAX device, "
                      "unable to guarantee persistence of guest writes");
     }
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index f1f3987055..2fbe0d7858 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -49,6 +49,8 @@
                                                TYPE_NVDIMM)
 
 #define NVDIMM_LABEL_SIZE_PROP "label-size"
+#define NVDIMM_DEV_DAX_PROP    "dev-dax"
+#define NVDIMM_RESTRICT_PROP   "restrict"
 
 struct NVDIMMDevice {
     /* private */
@@ -74,6 +76,9 @@ struct NVDIMMDevice {
      * guest via ACPI NFIT and _FIT method if NVDIMM hotplug is supported.
      */
     MemoryRegion nvdimm_mr;
+
+    bool backend_dev_dax;
+    bool restrict_mode;
 };
 typedef struct NVDIMMDevice NVDIMMDevice;
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Qemu-devel] [PATCH v2 4/4] util/mmap-alloc: account for DAX device in qemu_fd_getpagesize
  2017-06-06  7:22 [Qemu-devel] [PATCH v2 0/4] nvdimm: fixes for (non-)dax backends Haozhong Zhang
                   ` (2 preceding siblings ...)
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict" Haozhong Zhang
@ 2017-06-06  7:22 ` Haozhong Zhang
  2017-06-07 15:29   ` Stefan Hajnoczi
  3 siblings, 1 reply; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-06  7:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Stefan Hajnoczi, Dan Williams, Haozhong Zhang

A DAX device may require a page size different than getpagesize().

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 include/qemu/osdep.h | 10 ++++++++++
 util/mmap-alloc.c    |  6 ++++++
 util/osdep.c         | 23 +++++++++++++++++++++++
 3 files changed, 39 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 7f26af371e..51aca88f0d 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -479,4 +479,14 @@ pid_t qemu_fork(Error **errp);
  */
 bool qemu_fd_is_dev_dax(int fd);
 
+/**
+ * qemu_get_dev_dax_align:
+ *
+ * Get the suggested alignment of a DAX device descried by @fd.
+ *
+ * Return a non-zero suggested alignment on success; return 0 if @fd
+ * does not describe a DAX device, or if it fails to get alignment
+ * of the DAX device.
+ */
+size_t qemu_get_dev_dax_align(int fd);
 #endif
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 3ec029a9ea..2d8af64b44 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -25,8 +25,14 @@ size_t qemu_fd_getpagesize(int fd)
 #ifdef CONFIG_LINUX
     struct statfs fs;
     int ret;
+    size_t align;
 
     if (fd != -1) {
+        align = qemu_get_dev_dax_align(fd);
+        if (align) {
+            return align;
+        }
+
         do {
             ret = fstatfs(fd, &fs);
         } while (ret != 0 && errno == EINTR);
diff --git a/util/osdep.c b/util/osdep.c
index 02881f96bc..e7837d30b4 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -532,3 +532,26 @@ bool qemu_fd_is_dev_dax(int fd)
 
     return is_dax;
 }
+
+size_t qemu_get_dev_dax_align(int fd)
+{
+    size_t align = 0;
+
+#ifdef __linux__
+    char buf[12];
+    ssize_t len;
+
+    if (!qemu_fd_is_dev_dax(fd)) {
+        return 0;
+    }
+    len = qemu_dev_dax_sysfs_read(fd, "device/align", buf, sizeof(buf));
+    if (len <= 0) {
+        return 0;
+    }
+    if (qemu_strtosz(buf, NULL, &align) < 0) {
+        align = 0;
+    }
+#endif /* __linux__ */
+
+    return align;
+}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device Haozhong Zhang
@ 2017-06-06 17:59   ` Dan Williams
  2017-06-08  1:07     ` Haozhong Zhang
  2017-06-07 15:14   ` Stefan Hajnoczi
  2017-06-08 12:51   ` Michael S. Tsirkin
  2 siblings, 1 reply; 22+ messages in thread
From: Dan Williams @ 2017-06-06 17:59 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Stefan Hajnoczi

On Tue, Jun 6, 2017 at 12:22 AM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> Applications in Linux guest that use device-dax never trigger flush
> that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> device-dax, QEMU cannot guarantee the persistence of guest writes.
> Before solving this flushing problem, QEMU should warn users if the
> host backend is not device-dax.
>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
> ---
>  hw/mem/nvdimm.c      |  6 ++++++
>  include/qemu/osdep.h |  9 ++++++++
>  util/osdep.c         | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 76 insertions(+)
>
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index a9b0863f20..b23542fbdf 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -26,6 +26,7 @@
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
>  #include "hw/mem/nvdimm.h"
> +#include "qemu/error-report.h"
>
>  static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
>                                    void *opaque, Error **errp)
> @@ -84,6 +85,11 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
>      NVDIMMDevice *nvdimm = NVDIMM(dimm);
>      uint64_t align, pmem_size, size = memory_region_size(mr);
>
> +    if (!qemu_fd_is_dev_dax(memory_region_get_fd(mr))) {
> +        error_report("warning: nvdimm backend does not look like a DAX device, "
> +                     "unable to guarantee persistence of guest writes");
> +    }
> +
>      align = memory_region_get_alignment(mr);
>
>      pmem_size = size - nvdimm->label_size;
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 1c9f5e260c..7f26af371e 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -470,4 +470,13 @@ char *qemu_get_pid_name(pid_t pid);
>   */
>  pid_t qemu_fork(Error **errp);
>
> +/**
> + * qemu_fd_is_dev_dax:
> + *
> + * Check whether @fd describes a DAX device.
> + *
> + * Returns true if it is; otherwise, return false.
> + */
> +bool qemu_fd_is_dev_dax(int fd);
> +
>  #endif
> diff --git a/util/osdep.c b/util/osdep.c
> index a2863c8e53..02881f96bc 100644
> --- a/util/osdep.c
> +++ b/util/osdep.c
> @@ -471,3 +471,64 @@ writev(int fd, const struct iovec *iov, int iov_cnt)
>      return readv_writev(fd, iov, iov_cnt, true);
>  }
>  #endif
> +
> +#ifdef __linux__
> +static ssize_t qemu_dev_dax_sysfs_read(int fd, const char *entry,
> +                                       char *buf, size_t len)
> +{
> +    ssize_t read_bytes;
> +    struct stat st;
> +    unsigned int major, minor;
> +    char *path, *pos;
> +    int sysfs_fd;
> +
> +    if (fstat(fd, &st)) {
> +        return 0;
> +    }
> +
> +    major = major(st.st_rdev);
> +    minor = minor(st.st_rdev);
> +    path = g_strdup_printf("/sys/dev/char/%u:%u/%s", major, minor, entry);
> +
> +    sysfs_fd = open(path, O_RDONLY);
> +    g_free(path);
> +    if (sysfs_fd == -1) {
> +        return 0;
> +    }
> +
> +    read_bytes = read(sysfs_fd, buf, len - 1);
> +    close(sysfs_fd);
> +    if (read_bytes > 0) {
> +        buf[read_bytes] = '\0';
> +        pos = g_strstr_len(buf, read_bytes, "\n");
> +        if (pos) {
> +            *pos = '\0';
> +        }
> +    }
> +
> +    return read_bytes;
> +}
> +#endif /* __linux__ */
> +
> +bool qemu_fd_is_dev_dax(int fd)
> +{
> +    bool is_dax = false;
> +
> +#ifdef __linux__
> +    char devtype[7];
> +    ssize_t len;
> +
> +    if (fd == -1) {
> +        return false;
> +    }
> +
> +    len = qemu_dev_dax_sysfs_read(fd, "device/devtype",
> +                                  devtype, sizeof(devtype));
> +    if (len <= 0) {
> +        return false;
> +    }
> +    is_dax = !strncmp(devtype, "nd_dax", len);

There's no guarantee that device-dax instances are always parented by
an "nd_dax" device-type. A more reliable check is to see if
"/sys/dev/char/%u:%u/subsystem" points to "/sys/class/dax".

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device Haozhong Zhang
  2017-06-06 17:59   ` Dan Williams
@ 2017-06-07 15:14   ` Stefan Hajnoczi
  2017-06-08  1:07     ` Haozhong Zhang
  2017-06-08 12:51   ` Michael S. Tsirkin
  2 siblings, 1 reply; 22+ messages in thread
From: Stefan Hajnoczi @ 2017-06-07 15:14 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Dan Williams

[-- Attachment #1: Type: text/plain, Size: 1284 bytes --]

On Tue, Jun 06, 2017 at 03:22:27PM +0800, Haozhong Zhang wrote:
> diff --git a/util/osdep.c b/util/osdep.c
> index a2863c8e53..02881f96bc 100644
> --- a/util/osdep.c
> +++ b/util/osdep.c
> @@ -471,3 +471,64 @@ writev(int fd, const struct iovec *iov, int iov_cnt)
>      return readv_writev(fd, iov, iov_cnt, true);
>  }
>  #endif
> +
> +#ifdef __linux__
> +static ssize_t qemu_dev_dax_sysfs_read(int fd, const char *entry,
> +                                       char *buf, size_t len)
> +{
> +    ssize_t read_bytes;
> +    struct stat st;
> +    unsigned int major, minor;
> +    char *path, *pos;
> +    int sysfs_fd;
> +
> +    if (fstat(fd, &st)) {
> +        return 0;
> +    }
> +
> +    major = major(st.st_rdev);
> +    minor = minor(st.st_rdev);
> +    path = g_strdup_printf("/sys/dev/char/%u:%u/%s", major, minor, entry);
> +
> +    sysfs_fd = open(path, O_RDONLY);
> +    g_free(path);
> +    if (sysfs_fd == -1) {
> +        return 0;
> +    }
> +
> +    read_bytes = read(sysfs_fd, buf, len - 1);
> +    close(sysfs_fd);
> +    if (read_bytes > 0) {
> +        buf[read_bytes] = '\0';
> +        pos = g_strstr_len(buf, read_bytes, "\n");
> +        if (pos) {
> +            *pos = '\0';
> +        }

Should read_bytes be adjusted since we made the string shorter?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict"
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict" Haozhong Zhang
@ 2017-06-07 15:27   ` Stefan Hajnoczi
  2017-06-08  1:45     ` Haozhong Zhang
  2017-06-08  6:39     ` Haozhong Zhang
  2017-06-08 12:56   ` Michael S. Tsirkin
  1 sibling, 2 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2017-06-07 15:27 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Dan Williams

[-- Attachment #1: Type: text/plain, Size: 932 bytes --]

On Tue, Jun 06, 2017 at 03:22:28PM +0800, Haozhong Zhang wrote:
> If a vNVDIMM device is not backed by a DAX device and its "restrict"
> option is enabled, bit 3 of state flags in its region mapping
> structure will be set, in order to notify the guest of the lack of
> write persistence guarantee. Once this bit is set, the guest OS may
> mark the vNVDIMM device as read-only.
> 
> This option is disabled by default for backwards compatibility. It's
> recommended to enable for the formal usage.

Good idea.  I think the following is cleaner:

DEFINE_PROP_ON_OFF_AUTO("readonly") on the 'nvdimm' device.  The
following states are available:

 * 'on' - ACPI_NFIT_MEM_NOT_ARMED is set
 * 'off' - ACPI_NFIT_MEM_NOT_ARMED is clear
 * 'auto' - ACPI_NFIT_MEM_NOT_ARMED set if backend is not persistent

This new property defaults to 'auto'.  Machine types older than
pc-i440fx-2.10 and pc-q35-2.10 default to 'on'.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/4] util/mmap-alloc: account for DAX device in qemu_fd_getpagesize
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 4/4] util/mmap-alloc: account for DAX device in qemu_fd_getpagesize Haozhong Zhang
@ 2017-06-07 15:29   ` Stefan Hajnoczi
  0 siblings, 0 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2017-06-07 15:29 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Dan Williams

[-- Attachment #1: Type: text/plain, Size: 430 bytes --]

On Tue, Jun 06, 2017 at 03:22:29PM +0800, Haozhong Zhang wrote:
> A DAX device may require a page size different than getpagesize().
> 
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  include/qemu/osdep.h | 10 ++++++++++
>  util/mmap-alloc.c    |  6 ++++++
>  util/osdep.c         | 23 +++++++++++++++++++++++
>  3 files changed, 39 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/4] nvdimm: add a macro for property "label-size"
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 1/4] nvdimm: add a macro for property "label-size" Haozhong Zhang
@ 2017-06-07 15:29   ` Stefan Hajnoczi
  0 siblings, 0 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2017-06-07 15:29 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Dan Williams

[-- Attachment #1: Type: text/plain, Size: 301 bytes --]

On Tue, Jun 06, 2017 at 03:22:26PM +0800, Haozhong Zhang wrote:
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
>  hw/mem/nvdimm.c         | 2 +-
>  include/hw/mem/nvdimm.h | 3 +++
>  2 files changed, 4 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device
  2017-06-06 17:59   ` Dan Williams
@ 2017-06-08  1:07     ` Haozhong Zhang
  0 siblings, 0 replies; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-08  1:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Stefan Hajnoczi

On 06/06/17 10:59 -0700, Dan Williams wrote:
> On Tue, Jun 6, 2017 at 12:22 AM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > Applications in Linux guest that use device-dax never trigger flush
> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> > device-dax, QEMU cannot guarantee the persistence of guest writes.
> > Before solving this flushing problem, QEMU should warn users if the
> > host backend is not device-dax.
> >
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
> > ---
> >  hw/mem/nvdimm.c      |  6 ++++++
> >  include/qemu/osdep.h |  9 ++++++++
> >  util/osdep.c         | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 76 insertions(+)
> >
> > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> > index a9b0863f20..b23542fbdf 100644
> > --- a/hw/mem/nvdimm.c
> > +++ b/hw/mem/nvdimm.c
> > @@ -26,6 +26,7 @@
> >  #include "qapi/error.h"
> >  #include "qapi/visitor.h"
> >  #include "hw/mem/nvdimm.h"
> > +#include "qemu/error-report.h"
> >
> >  static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
> >                                    void *opaque, Error **errp)
> > @@ -84,6 +85,11 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
> >      NVDIMMDevice *nvdimm = NVDIMM(dimm);
> >      uint64_t align, pmem_size, size = memory_region_size(mr);
> >
> > +    if (!qemu_fd_is_dev_dax(memory_region_get_fd(mr))) {
> > +        error_report("warning: nvdimm backend does not look like a DAX device, "
> > +                     "unable to guarantee persistence of guest writes");
> > +    }
> > +
> >      align = memory_region_get_alignment(mr);
> >
> >      pmem_size = size - nvdimm->label_size;
> > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > index 1c9f5e260c..7f26af371e 100644
> > --- a/include/qemu/osdep.h
> > +++ b/include/qemu/osdep.h
> > @@ -470,4 +470,13 @@ char *qemu_get_pid_name(pid_t pid);
> >   */
> >  pid_t qemu_fork(Error **errp);
> >
> > +/**
> > + * qemu_fd_is_dev_dax:
> > + *
> > + * Check whether @fd describes a DAX device.
> > + *
> > + * Returns true if it is; otherwise, return false.
> > + */
> > +bool qemu_fd_is_dev_dax(int fd);
> > +
> >  #endif
> > diff --git a/util/osdep.c b/util/osdep.c
> > index a2863c8e53..02881f96bc 100644
> > --- a/util/osdep.c
> > +++ b/util/osdep.c
> > @@ -471,3 +471,64 @@ writev(int fd, const struct iovec *iov, int iov_cnt)
> >      return readv_writev(fd, iov, iov_cnt, true);
> >  }
> >  #endif
> > +
> > +#ifdef __linux__
> > +static ssize_t qemu_dev_dax_sysfs_read(int fd, const char *entry,
> > +                                       char *buf, size_t len)
> > +{
> > +    ssize_t read_bytes;
> > +    struct stat st;
> > +    unsigned int major, minor;
> > +    char *path, *pos;
> > +    int sysfs_fd;
> > +
> > +    if (fstat(fd, &st)) {
> > +        return 0;
> > +    }
> > +
> > +    major = major(st.st_rdev);
> > +    minor = minor(st.st_rdev);
> > +    path = g_strdup_printf("/sys/dev/char/%u:%u/%s", major, minor, entry);
> > +
> > +    sysfs_fd = open(path, O_RDONLY);
> > +    g_free(path);
> > +    if (sysfs_fd == -1) {
> > +        return 0;
> > +    }
> > +
> > +    read_bytes = read(sysfs_fd, buf, len - 1);
> > +    close(sysfs_fd);
> > +    if (read_bytes > 0) {
> > +        buf[read_bytes] = '\0';
> > +        pos = g_strstr_len(buf, read_bytes, "\n");
> > +        if (pos) {
> > +            *pos = '\0';
> > +        }
> > +    }
> > +
> > +    return read_bytes;
> > +}
> > +#endif /* __linux__ */
> > +
> > +bool qemu_fd_is_dev_dax(int fd)
> > +{
> > +    bool is_dax = false;
> > +
> > +#ifdef __linux__
> > +    char devtype[7];
> > +    ssize_t len;
> > +
> > +    if (fd == -1) {
> > +        return false;
> > +    }
> > +
> > +    len = qemu_dev_dax_sysfs_read(fd, "device/devtype",
> > +                                  devtype, sizeof(devtype));
> > +    if (len <= 0) {
> > +        return false;
> > +    }
> > +    is_dax = !strncmp(devtype, "nd_dax", len);
> 
> There's no guarantee that device-dax instances are always parented by
> an "nd_dax" device-type. A more reliable check is to see if
> "/sys/dev/char/%u:%u/subsystem" points to "/sys/class/dax".

Thanks for pointing out this, will change in the next version.

Haozhong

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device
  2017-06-07 15:14   ` Stefan Hajnoczi
@ 2017-06-08  1:07     ` Haozhong Zhang
  0 siblings, 0 replies; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-08  1:07 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Dan Williams

On 06/07/17 16:14 +0100, Stefan Hajnoczi wrote:
> On Tue, Jun 06, 2017 at 03:22:27PM +0800, Haozhong Zhang wrote:
> > diff --git a/util/osdep.c b/util/osdep.c
> > index a2863c8e53..02881f96bc 100644
> > --- a/util/osdep.c
> > +++ b/util/osdep.c
> > @@ -471,3 +471,64 @@ writev(int fd, const struct iovec *iov, int iov_cnt)
> >      return readv_writev(fd, iov, iov_cnt, true);
> >  }
> >  #endif
> > +
> > +#ifdef __linux__
> > +static ssize_t qemu_dev_dax_sysfs_read(int fd, const char *entry,
> > +                                       char *buf, size_t len)
> > +{
> > +    ssize_t read_bytes;
> > +    struct stat st;
> > +    unsigned int major, minor;
> > +    char *path, *pos;
> > +    int sysfs_fd;
> > +
> > +    if (fstat(fd, &st)) {
> > +        return 0;
> > +    }
> > +
> > +    major = major(st.st_rdev);
> > +    minor = minor(st.st_rdev);
> > +    path = g_strdup_printf("/sys/dev/char/%u:%u/%s", major, minor, entry);
> > +
> > +    sysfs_fd = open(path, O_RDONLY);
> > +    g_free(path);
> > +    if (sysfs_fd == -1) {
> > +        return 0;
> > +    }
> > +
> > +    read_bytes = read(sysfs_fd, buf, len - 1);
> > +    close(sysfs_fd);
> > +    if (read_bytes > 0) {
> > +        buf[read_bytes] = '\0';
> > +        pos = g_strstr_len(buf, read_bytes, "\n");
> > +        if (pos) {
> > +            *pos = '\0';
> > +        }
> 
> Should read_bytes be adjusted since we made the string shorter?

Yes, I'll change in the next version.

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict"
  2017-06-07 15:27   ` Stefan Hajnoczi
@ 2017-06-08  1:45     ` Haozhong Zhang
  2017-06-08 10:19       ` Stefan Hajnoczi
  2017-06-08  6:39     ` Haozhong Zhang
  1 sibling, 1 reply; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-08  1:45 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Dan Williams

On 06/07/17 16:27 +0100, Stefan Hajnoczi wrote:
> On Tue, Jun 06, 2017 at 03:22:28PM +0800, Haozhong Zhang wrote:
> > If a vNVDIMM device is not backed by a DAX device and its "restrict"
> > option is enabled, bit 3 of state flags in its region mapping
> > structure will be set, in order to notify the guest of the lack of
> > write persistence guarantee. Once this bit is set, the guest OS may
> > mark the vNVDIMM device as read-only.
> > 
> > This option is disabled by default for backwards compatibility. It's
> > recommended to enable for the formal usage.
> 
> Good idea.  I think the following is cleaner:
> 
> DEFINE_PROP_ON_OFF_AUTO("readonly") on the 'nvdimm' device.  The
> following states are available:
> 
>  * 'on' - ACPI_NFIT_MEM_NOT_ARMED is set
>  * 'off' - ACPI_NFIT_MEM_NOT_ARMED is clear
>  * 'auto' - ACPI_NFIT_MEM_NOT_ARMED set if backend is not persistent
> 
> This new property defaults to 'auto'.  Machine types older than
> pc-i440fx-2.10 and pc-q35-2.10 default to 'on'.

Shouldn't it be 'off' on older machine types? The older machine types
and older QEMU never check the backend and never set ACPI_NFIT_MEM_NOT_ARMED.

Haozhong

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict"
  2017-06-07 15:27   ` Stefan Hajnoczi
  2017-06-08  1:45     ` Haozhong Zhang
@ 2017-06-08  6:39     ` Haozhong Zhang
  2017-06-08 10:18       ` Stefan Hajnoczi
  1 sibling, 1 reply; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-08  6:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Dan Williams

On 06/07/17 16:27 +0100, Stefan Hajnoczi wrote:
> On Tue, Jun 06, 2017 at 03:22:28PM +0800, Haozhong Zhang wrote:
> > If a vNVDIMM device is not backed by a DAX device and its "restrict"
> > option is enabled, bit 3 of state flags in its region mapping
> > structure will be set, in order to notify the guest of the lack of
> > write persistence guarantee. Once this bit is set, the guest OS may
> > mark the vNVDIMM device as read-only.
> > 
> > This option is disabled by default for backwards compatibility. It's
> > recommended to enable for the formal usage.
> 
> Good idea.  I think the following is cleaner:
> 
> DEFINE_PROP_ON_OFF_AUTO("readonly") on the 'nvdimm' device.  The
> following states are available:
> 
>  * 'on' - ACPI_NFIT_MEM_NOT_ARMED is set
>  * 'off' - ACPI_NFIT_MEM_NOT_ARMED is clear
>  * 'auto' - ACPI_NFIT_MEM_NOT_ARMED set if backend is not persistent
> 
> This new property defaults to 'auto'.  Machine types older than
> pc-i440fx-2.10 and pc-q35-2.10 default to 'on'.

I think the the name "readonly" is not precise, because QEMU only sets
one bit and does not prevent guest writes. It's guest decision to
treat the vNVDIMM devices as read-only (e.g. Linux kernel).

We may use "unsafe-write" instead.

Haozhong

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict"
  2017-06-08  6:39     ` Haozhong Zhang
@ 2017-06-08 10:18       ` Stefan Hajnoczi
  0 siblings, 0 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2017-06-08 10:18 UTC (permalink / raw)
  To: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Dan Williams

[-- Attachment #1: Type: text/plain, Size: 1479 bytes --]

On Thu, Jun 08, 2017 at 02:39:48PM +0800, Haozhong Zhang wrote:
> On 06/07/17 16:27 +0100, Stefan Hajnoczi wrote:
> > On Tue, Jun 06, 2017 at 03:22:28PM +0800, Haozhong Zhang wrote:
> > > If a vNVDIMM device is not backed by a DAX device and its "restrict"
> > > option is enabled, bit 3 of state flags in its region mapping
> > > structure will be set, in order to notify the guest of the lack of
> > > write persistence guarantee. Once this bit is set, the guest OS may
> > > mark the vNVDIMM device as read-only.
> > > 
> > > This option is disabled by default for backwards compatibility. It's
> > > recommended to enable for the formal usage.
> > 
> > Good idea.  I think the following is cleaner:
> > 
> > DEFINE_PROP_ON_OFF_AUTO("readonly") on the 'nvdimm' device.  The
> > following states are available:
> > 
> >  * 'on' - ACPI_NFIT_MEM_NOT_ARMED is set
> >  * 'off' - ACPI_NFIT_MEM_NOT_ARMED is clear
> >  * 'auto' - ACPI_NFIT_MEM_NOT_ARMED set if backend is not persistent
> > 
> > This new property defaults to 'auto'.  Machine types older than
> > pc-i440fx-2.10 and pc-q35-2.10 default to 'on'.
> 
> I think the the name "readonly" is not precise, because QEMU only sets
> one bit and does not prevent guest writes. It's guest decision to
> treat the vNVDIMM devices as read-only (e.g. Linux kernel).
> 
> We may use "unsafe-write" instead.

I agree, "readonly" isn't accurate.  I would use the NFIT terminology
and call it "armed".

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict"
  2017-06-08  1:45     ` Haozhong Zhang
@ 2017-06-08 10:19       ` Stefan Hajnoczi
  0 siblings, 0 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2017-06-08 10:19 UTC (permalink / raw)
  To: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Dan Williams

[-- Attachment #1: Type: text/plain, Size: 1361 bytes --]

On Thu, Jun 08, 2017 at 09:45:29AM +0800, Haozhong Zhang wrote:
> On 06/07/17 16:27 +0100, Stefan Hajnoczi wrote:
> > On Tue, Jun 06, 2017 at 03:22:28PM +0800, Haozhong Zhang wrote:
> > > If a vNVDIMM device is not backed by a DAX device and its "restrict"
> > > option is enabled, bit 3 of state flags in its region mapping
> > > structure will be set, in order to notify the guest of the lack of
> > > write persistence guarantee. Once this bit is set, the guest OS may
> > > mark the vNVDIMM device as read-only.
> > > 
> > > This option is disabled by default for backwards compatibility. It's
> > > recommended to enable for the formal usage.
> > 
> > Good idea.  I think the following is cleaner:
> > 
> > DEFINE_PROP_ON_OFF_AUTO("readonly") on the 'nvdimm' device.  The
> > following states are available:
> > 
> >  * 'on' - ACPI_NFIT_MEM_NOT_ARMED is set
> >  * 'off' - ACPI_NFIT_MEM_NOT_ARMED is clear
> >  * 'auto' - ACPI_NFIT_MEM_NOT_ARMED set if backend is not persistent
> > 
> > This new property defaults to 'auto'.  Machine types older than
> > pc-i440fx-2.10 and pc-q35-2.10 default to 'on'.
> 
> Shouldn't it be 'off' on older machine types? The older machine types
> and older QEMU never check the backend and never set ACPI_NFIT_MEM_NOT_ARMED.

You are right.  'readonly' should be 'off' for older machine types.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device Haozhong Zhang
  2017-06-06 17:59   ` Dan Williams
  2017-06-07 15:14   ` Stefan Hajnoczi
@ 2017-06-08 12:51   ` Michael S. Tsirkin
  2017-06-12  3:18     ` Haozhong Zhang
  2 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2017-06-08 12:51 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Igor Mammedov, Xiao Guangrong, Stefan Hajnoczi, Dan Williams

On Tue, Jun 06, 2017 at 03:22:27PM +0800, Haozhong Zhang wrote:
> Applications in Linux guest that use device-dax never trigger flush
> that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> device-dax, QEMU cannot guarantee the persistence of guest writes.
> Before solving this flushing problem, QEMU should warn users if the
> host backend is not device-dax.

No one reads warnings unless things fail but they can be a debugging aid
if they do. But they have to be simple and robust to be helpful for
that. This one seems to me too complex and fragile.


> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com

Don't include this line in commit log please. Put it after --- if you
must.

> ---
>  hw/mem/nvdimm.c      |  6 ++++++
>  include/qemu/osdep.h |  9 ++++++++
>  util/osdep.c         | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 76 insertions(+)
> 
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index a9b0863f20..b23542fbdf 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -26,6 +26,7 @@
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
>  #include "hw/mem/nvdimm.h"
> +#include "qemu/error-report.h"
>  
>  static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
>                                    void *opaque, Error **errp)
> @@ -84,6 +85,11 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
>      NVDIMMDevice *nvdimm = NVDIMM(dimm);
>      uint64_t align, pmem_size, size = memory_region_size(mr);
>  
> +    if (!qemu_fd_is_dev_dax(memory_region_get_fd(mr))) {
> +        error_report("warning: nvdimm backend does not look like a DAX device, "
> +                     "unable to guarantee persistence of guest writes");
> +    }
> +
>      align = memory_region_get_alignment(mr);
>  
>      pmem_size = size - nvdimm->label_size;
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 1c9f5e260c..7f26af371e 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -470,4 +470,13 @@ char *qemu_get_pid_name(pid_t pid);
>   */
>  pid_t qemu_fork(Error **errp);
>  
> +/**
> + * qemu_fd_is_dev_dax:
> + *
> + * Check whether @fd describes a DAX device.
> + *
> + * Returns true if it is; otherwise, return false.
> + */
> +bool qemu_fd_is_dev_dax(int fd);
> +
>  #endif
> diff --git a/util/osdep.c b/util/osdep.c
> index a2863c8e53..02881f96bc 100644
> --- a/util/osdep.c
> +++ b/util/osdep.c
> @@ -471,3 +471,64 @@ writev(int fd, const struct iovec *iov, int iov_cnt)
>      return readv_writev(fd, iov, iov_cnt, true);
>  }
>  #endif
> +
> +#ifdef __linux__
> +static ssize_t qemu_dev_dax_sysfs_read(int fd, const char *entry,
> +                                       char *buf, size_t len)
> +{
> +    ssize_t read_bytes;
> +    struct stat st;
> +    unsigned int major, minor;
> +    char *path, *pos;
> +    int sysfs_fd;
> +
> +    if (fstat(fd, &st)) {
> +        return 0;
> +    }
> +
> +    major = major(st.st_rdev);
> +    minor = minor(st.st_rdev);
> +    path = g_strdup_printf("/sys/dev/char/%u:%u/%s", major, minor, entry);
> +
> +    sysfs_fd = open(path, O_RDONLY);
> +    g_free(path);
> +    if (sysfs_fd == -1) {
> +        return 0;
> +    }
> +
> +    read_bytes = read(sysfs_fd, buf, len - 1);
> +    close(sysfs_fd);
> +    if (read_bytes > 0) {
> +        buf[read_bytes] = '\0';
> +        pos = g_strstr_len(buf, read_bytes, "\n");
> +        if (pos) {
> +            *pos = '\0';
> +        }
> +    }
> +
> +    return read_bytes;
> +}
> +#endif /* __linux__ */
> +
> +bool qemu_fd_is_dev_dax(int fd)
> +{
> +    bool is_dax = false;
> +
> +#ifdef __linux__
> +    char devtype[7];
> +    ssize_t len;
> +
> +    if (fd == -1) {
> +        return false;
> +    }
> +
> +    len = qemu_dev_dax_sysfs_read(fd, "device/devtype",
> +                                  devtype, sizeof(devtype));
> +    if (len <= 0) {
> +        return false;
> +    }

This can easily trigger false positives e.g. if sysfs access
is blocked by selinux.


> +    is_dax = !strncmp(devtype, "nd_dax", len);

E.g. will return true on any string starting with nd_dax.
And it's not clear non-dax devices can't guarantee safety.

> +#endif /* __linux__ */
> +
> +    return is_dax;
> +}
> -- 
> 2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict"
  2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict" Haozhong Zhang
  2017-06-07 15:27   ` Stefan Hajnoczi
@ 2017-06-08 12:56   ` Michael S. Tsirkin
  2017-06-09  9:34     ` Stefan Hajnoczi
  2017-06-12  1:18     ` Haozhong Zhang
  1 sibling, 2 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2017-06-08 12:56 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Igor Mammedov, Xiao Guangrong, Stefan Hajnoczi, Dan Williams

On Tue, Jun 06, 2017 at 03:22:28PM +0800, Haozhong Zhang wrote:
> If a vNVDIMM device is not backed by a DAX device and its "restrict"
> option is enabled, bit 3 of state flags in its region mapping
> structure will be set, in order to notify the guest of the lack of
> write persistence guarantee. Once this bit is set, the guest OS may
> mark the vNVDIMM device as read-only.
> 
> This option is disabled by default for backwards compatibility. It's
> recommended to enable for the formal usage.
> 
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>

Seems wrong to me. E.g. it won't work in a nested
virt setup. What if backend is dax but is not armed?
Can't the armed bit of the backing device be tested?
Name "restrict" is also confusing. Can we reuse cache=
options? E.g. cache=unsafe etc.

> ---
>  hw/acpi/nvdimm.c        | 16 ++++++++++++++++
>  hw/mem/nvdimm.c         | 38 +++++++++++++++++++++++++++++++++++++-
>  include/hw/mem/nvdimm.h |  5 +++++
>  3 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 8e7d6ec034..fd1ef6dc65 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -138,6 +138,8 @@ struct NvdimmNfitMemDev {
>  } QEMU_PACKED;
>  typedef struct NvdimmNfitMemDev NvdimmNfitMemDev;
>  
> +#define ACPI_NFIT_MEM_NOT_ARMED    (1 << 3)
> +
>  /*
>   * NVDIMM Control Region Structure
>   *
> @@ -289,6 +291,10 @@ nvdimm_build_structure_memdev(GArray *structures, DeviceState *dev)
>      int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
>                                              NULL);
>      uint32_t handle = nvdimm_slot_to_handle(slot);
> +    bool dev_dax = object_property_get_bool(OBJECT(dev), NVDIMM_DEV_DAX_PROP,
> +                                            NULL);
> +    bool restrict_mode = object_property_get_bool(OBJECT(dev),
> +                                                  NVDIMM_RESTRICT_PROP, NULL);
>  
>      nfit_memdev = acpi_data_push(structures, sizeof(*nfit_memdev));
>  
> @@ -312,6 +318,16 @@ nvdimm_build_structure_memdev(GArray *structures, DeviceState *dev)
>  
>      /* Only one interleave for PMEM. */
>      nfit_memdev->interleave_ways = cpu_to_le16(1);
> +
> +    /*
> +     * If a vNVDIMM device in the restrict mode and is not backed by a
> +     * DAX device, QEMU will set ACPI_NFIT_MEM_NOT_ARMED bit of state
> +     * flags in its region mapping structure, in order to notify the
> +     * guest of the lack of write persistence guarantee.
> +     */
> +    if (!dev_dax && restrict_mode) {
> +        nfit_memdev->flags = cpu_to_le16(ACPI_NFIT_MEM_NOT_ARMED);
> +    }
>  }
>  
>  /*
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index b23542fbdf..cda416e5c8 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -65,11 +65,46 @@ out:
>      error_propagate(errp, local_err);
>  }
>  
> +static bool nvdimm_get_backend_dev_dax(Object *obj, Error **errp)
> +{
> +    NVDIMMDevice *nvdimm = NVDIMM(obj);
> +
> +    return nvdimm->backend_dev_dax;
> +}
> +
> +static bool nvdimm_get_restrict(Object *obj, Error **errp)
> +{
> +    NVDIMMDevice *nvdimm = NVDIMM(obj);
> +
> +    return nvdimm->restrict_mode;
> +}
> +
> +static void nvdimm_set_restrict(Object *obj, bool val, Error **errp)
> +{
> +    DeviceState *dev = DEVICE(obj);
> +    NVDIMMDevice *nvdimm = NVDIMM(obj);
> +    Error *local_err = NULL;
> +
> +    if (dev->realized) {
> +        error_setg(&local_err, "cannot change property value");
> +        goto out;
> +    }
> +
> +    nvdimm->restrict_mode = val;
> +
> + out:
> +    error_propagate(errp, local_err);
> +}
> +
>  static void nvdimm_init(Object *obj)
>  {
>      object_property_add(obj, NVDIMM_LABEL_SIZE_PROP, "int",
>                          nvdimm_get_label_size, nvdimm_set_label_size, NULL,
>                          NULL, NULL);
> +    object_property_add_bool(obj, NVDIMM_DEV_DAX_PROP,
> +                             nvdimm_get_backend_dev_dax, NULL, NULL);
> +    object_property_add_bool(obj, NVDIMM_RESTRICT_PROP,
> +                             nvdimm_get_restrict, nvdimm_set_restrict, NULL);
>  }
>  
>  static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
> @@ -85,7 +120,8 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
>      NVDIMMDevice *nvdimm = NVDIMM(dimm);
>      uint64_t align, pmem_size, size = memory_region_size(mr);
>  
> -    if (!qemu_fd_is_dev_dax(memory_region_get_fd(mr))) {
> +    nvdimm->backend_dev_dax = qemu_fd_is_dev_dax(memory_region_get_fd(mr));
> +    if (!nvdimm->backend_dev_dax) {
>          error_report("warning: nvdimm backend does not look like a DAX device, "
>                       "unable to guarantee persistence of guest writes");
>      }
> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> index f1f3987055..2fbe0d7858 100644
> --- a/include/hw/mem/nvdimm.h
> +++ b/include/hw/mem/nvdimm.h
> @@ -49,6 +49,8 @@
>                                                 TYPE_NVDIMM)
>  
>  #define NVDIMM_LABEL_SIZE_PROP "label-size"
> +#define NVDIMM_DEV_DAX_PROP    "dev-dax"
> +#define NVDIMM_RESTRICT_PROP   "restrict"
>  
>  struct NVDIMMDevice {
>      /* private */
> @@ -74,6 +76,9 @@ struct NVDIMMDevice {
>       * guest via ACPI NFIT and _FIT method if NVDIMM hotplug is supported.
>       */
>      MemoryRegion nvdimm_mr;
> +
> +    bool backend_dev_dax;
> +    bool restrict_mode;
>  };
>  typedef struct NVDIMMDevice NVDIMMDevice;
>  
> -- 
> 2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict"
  2017-06-08 12:56   ` Michael S. Tsirkin
@ 2017-06-09  9:34     ` Stefan Hajnoczi
  2017-06-12  1:18     ` Haozhong Zhang
  1 sibling, 0 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2017-06-09  9:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Haozhong Zhang, qemu-devel, Igor Mammedov, Xiao Guangrong, Dan Williams

[-- Attachment #1: Type: text/plain, Size: 1342 bytes --]

On Thu, Jun 08, 2017 at 03:56:42PM +0300, Michael S. Tsirkin wrote:
> On Tue, Jun 06, 2017 at 03:22:28PM +0800, Haozhong Zhang wrote:
> > If a vNVDIMM device is not backed by a DAX device and its "restrict"
> > option is enabled, bit 3 of state flags in its region mapping
> > structure will be set, in order to notify the guest of the lack of
> > write persistence guarantee. Once this bit is set, the guest OS may
> > mark the vNVDIMM device as read-only.
> > 
> > This option is disabled by default for backwards compatibility. It's
> > recommended to enable for the formal usage.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> 
> Seems wrong to me. E.g. it won't work in a nested
> virt setup. What if backend is dax but is not armed?
> Can't the armed bit of the backing device be tested?
> Name "restrict" is also confusing. Can we reuse cache=
> options? E.g. cache=unsafe etc.

The -drive cache= options (writeback, writethrough, none, directsync,
unsafe) are confusing and considered legacy options.  The new options
are -drive
cache.writeback=on|off,cache.direct=on|off,cache.no-flush=on|off.

I suggested to call the option -device nvdimm,armed=auto|on|off in
another email.  "Armed" is the term used by the NVDIMM/NFIT
specification and it has an NVDIMM-specific meaning.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict"
  2017-06-08 12:56   ` Michael S. Tsirkin
  2017-06-09  9:34     ` Stefan Hajnoczi
@ 2017-06-12  1:18     ` Haozhong Zhang
  1 sibling, 0 replies; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-12  1:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Igor Mammedov, Xiao Guangrong, Stefan Hajnoczi, Dan Williams

On 06/08/17 15:56 +0300, Michael S. Tsirkin wrote:
> On Tue, Jun 06, 2017 at 03:22:28PM +0800, Haozhong Zhang wrote:
> > If a vNVDIMM device is not backed by a DAX device and its "restrict"
> > option is enabled, bit 3 of state flags in its region mapping
> > structure will be set, in order to notify the guest of the lack of
> > write persistence guarantee. Once this bit is set, the guest OS may
> > mark the vNVDIMM device as read-only.
> > 
> > This option is disabled by default for backwards compatibility. It's
> > recommended to enable for the formal usage.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> 
> Seems wrong to me. E.g. it won't work in a nested
> virt setup. What if backend is dax but is not armed?
> Can't the armed bit of the backing device be tested?

If the not-arm bit of a host NVDIMM region is set, Linux NVDIMM driver
will make it read-only, and QEMU will fail when it tries to mmap it
with flags (PROT_READ | PROT_WRITE). Thus, we don't need to check
whether the host region is not armed.

> Name "restrict" is also confusing. Can we reuse cache=
> options? E.g. cache=unsafe etc.

I agree the name is confusing, and would like to use the name 'armed'
suggested by Stefan.

Thanks,
Haozhong

> 
> > ---
> >  hw/acpi/nvdimm.c        | 16 ++++++++++++++++
> >  hw/mem/nvdimm.c         | 38 +++++++++++++++++++++++++++++++++++++-
> >  include/hw/mem/nvdimm.h |  5 +++++
> >  3 files changed, 58 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> > index 8e7d6ec034..fd1ef6dc65 100644
> > --- a/hw/acpi/nvdimm.c
> > +++ b/hw/acpi/nvdimm.c
> > @@ -138,6 +138,8 @@ struct NvdimmNfitMemDev {
> >  } QEMU_PACKED;
> >  typedef struct NvdimmNfitMemDev NvdimmNfitMemDev;
> >  
> > +#define ACPI_NFIT_MEM_NOT_ARMED    (1 << 3)
> > +
> >  /*
> >   * NVDIMM Control Region Structure
> >   *
> > @@ -289,6 +291,10 @@ nvdimm_build_structure_memdev(GArray *structures, DeviceState *dev)
> >      int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
> >                                              NULL);
> >      uint32_t handle = nvdimm_slot_to_handle(slot);
> > +    bool dev_dax = object_property_get_bool(OBJECT(dev), NVDIMM_DEV_DAX_PROP,
> > +                                            NULL);
> > +    bool restrict_mode = object_property_get_bool(OBJECT(dev),
> > +                                                  NVDIMM_RESTRICT_PROP, NULL);
> >  
> >      nfit_memdev = acpi_data_push(structures, sizeof(*nfit_memdev));
> >  
> > @@ -312,6 +318,16 @@ nvdimm_build_structure_memdev(GArray *structures, DeviceState *dev)
> >  
> >      /* Only one interleave for PMEM. */
> >      nfit_memdev->interleave_ways = cpu_to_le16(1);
> > +
> > +    /*
> > +     * If a vNVDIMM device in the restrict mode and is not backed by a
> > +     * DAX device, QEMU will set ACPI_NFIT_MEM_NOT_ARMED bit of state
> > +     * flags in its region mapping structure, in order to notify the
> > +     * guest of the lack of write persistence guarantee.
> > +     */
> > +    if (!dev_dax && restrict_mode) {
> > +        nfit_memdev->flags = cpu_to_le16(ACPI_NFIT_MEM_NOT_ARMED);
> > +    }
> >  }
> >  
> >  /*
> > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> > index b23542fbdf..cda416e5c8 100644
> > --- a/hw/mem/nvdimm.c
> > +++ b/hw/mem/nvdimm.c
> > @@ -65,11 +65,46 @@ out:
> >      error_propagate(errp, local_err);
> >  }
> >  
> > +static bool nvdimm_get_backend_dev_dax(Object *obj, Error **errp)
> > +{
> > +    NVDIMMDevice *nvdimm = NVDIMM(obj);
> > +
> > +    return nvdimm->backend_dev_dax;
> > +}
> > +
> > +static bool nvdimm_get_restrict(Object *obj, Error **errp)
> > +{
> > +    NVDIMMDevice *nvdimm = NVDIMM(obj);
> > +
> > +    return nvdimm->restrict_mode;
> > +}
> > +
> > +static void nvdimm_set_restrict(Object *obj, bool val, Error **errp)
> > +{
> > +    DeviceState *dev = DEVICE(obj);
> > +    NVDIMMDevice *nvdimm = NVDIMM(obj);
> > +    Error *local_err = NULL;
> > +
> > +    if (dev->realized) {
> > +        error_setg(&local_err, "cannot change property value");
> > +        goto out;
> > +    }
> > +
> > +    nvdimm->restrict_mode = val;
> > +
> > + out:
> > +    error_propagate(errp, local_err);
> > +}
> > +
> >  static void nvdimm_init(Object *obj)
> >  {
> >      object_property_add(obj, NVDIMM_LABEL_SIZE_PROP, "int",
> >                          nvdimm_get_label_size, nvdimm_set_label_size, NULL,
> >                          NULL, NULL);
> > +    object_property_add_bool(obj, NVDIMM_DEV_DAX_PROP,
> > +                             nvdimm_get_backend_dev_dax, NULL, NULL);
> > +    object_property_add_bool(obj, NVDIMM_RESTRICT_PROP,
> > +                             nvdimm_get_restrict, nvdimm_set_restrict, NULL);
> >  }
> >  
> >  static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
> > @@ -85,7 +120,8 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
> >      NVDIMMDevice *nvdimm = NVDIMM(dimm);
> >      uint64_t align, pmem_size, size = memory_region_size(mr);
> >  
> > -    if (!qemu_fd_is_dev_dax(memory_region_get_fd(mr))) {
> > +    nvdimm->backend_dev_dax = qemu_fd_is_dev_dax(memory_region_get_fd(mr));
> > +    if (!nvdimm->backend_dev_dax) {
> >          error_report("warning: nvdimm backend does not look like a DAX device, "
> >                       "unable to guarantee persistence of guest writes");
> >      }
> > diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> > index f1f3987055..2fbe0d7858 100644
> > --- a/include/hw/mem/nvdimm.h
> > +++ b/include/hw/mem/nvdimm.h
> > @@ -49,6 +49,8 @@
> >                                                 TYPE_NVDIMM)
> >  
> >  #define NVDIMM_LABEL_SIZE_PROP "label-size"
> > +#define NVDIMM_DEV_DAX_PROP    "dev-dax"
> > +#define NVDIMM_RESTRICT_PROP   "restrict"
> >  
> >  struct NVDIMMDevice {
> >      /* private */
> > @@ -74,6 +76,9 @@ struct NVDIMMDevice {
> >       * guest via ACPI NFIT and _FIT method if NVDIMM hotplug is supported.
> >       */
> >      MemoryRegion nvdimm_mr;
> > +
> > +    bool backend_dev_dax;
> > +    bool restrict_mode;
> >  };
> >  typedef struct NVDIMMDevice NVDIMMDevice;
> >  
> > -- 
> > 2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device
  2017-06-08 12:51   ` Michael S. Tsirkin
@ 2017-06-12  3:18     ` Haozhong Zhang
  2017-06-12  3:25       ` Michael S. Tsirkin
  0 siblings, 1 reply; 22+ messages in thread
From: Haozhong Zhang @ 2017-06-12  3:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Igor Mammedov, Xiao Guangrong, Stefan Hajnoczi, Dan Williams

On 06/08/17 15:51 +0300, Michael S. Tsirkin wrote:
> On Tue, Jun 06, 2017 at 03:22:27PM +0800, Haozhong Zhang wrote:
> > Applications in Linux guest that use device-dax never trigger flush
> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> > device-dax, QEMU cannot guarantee the persistence of guest writes.
> > Before solving this flushing problem, QEMU should warn users if the
> > host backend is not device-dax.
> 
> No one reads warnings unless things fail but they can be a debugging aid
> if they do. But they have to be simple and robust to be helpful for
> that. This one seems to me too complex and fragile.
>
> 
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
> 
> Don't include this line in commit log please. Put it after --- if you
> must.
> 
> > ---
> >  hw/mem/nvdimm.c      |  6 ++++++
> >  include/qemu/osdep.h |  9 ++++++++
> >  util/osdep.c         | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 76 insertions(+)
> > 
> > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> > index a9b0863f20..b23542fbdf 100644
> > --- a/hw/mem/nvdimm.c
> > +++ b/hw/mem/nvdimm.c
> > @@ -26,6 +26,7 @@
> >  #include "qapi/error.h"
> >  #include "qapi/visitor.h"
> >  #include "hw/mem/nvdimm.h"
> > +#include "qemu/error-report.h"
> >  
> >  static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
> >                                    void *opaque, Error **errp)
> > @@ -84,6 +85,11 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
> >      NVDIMMDevice *nvdimm = NVDIMM(dimm);
> >      uint64_t align, pmem_size, size = memory_region_size(mr);
> >  
> > +    if (!qemu_fd_is_dev_dax(memory_region_get_fd(mr))) {
> > +        error_report("warning: nvdimm backend does not look like a DAX device, "
> > +                     "unable to guarantee persistence of guest writes");
> > +    }
> > +
> >      align = memory_region_get_alignment(mr);
> >  
> >      pmem_size = size - nvdimm->label_size;
> > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > index 1c9f5e260c..7f26af371e 100644
> > --- a/include/qemu/osdep.h
> > +++ b/include/qemu/osdep.h
> > @@ -470,4 +470,13 @@ char *qemu_get_pid_name(pid_t pid);
> >   */
> >  pid_t qemu_fork(Error **errp);
> >  
> > +/**
> > + * qemu_fd_is_dev_dax:
> > + *
> > + * Check whether @fd describes a DAX device.
> > + *
> > + * Returns true if it is; otherwise, return false.
> > + */
> > +bool qemu_fd_is_dev_dax(int fd);
> > +
> >  #endif
> > diff --git a/util/osdep.c b/util/osdep.c
> > index a2863c8e53..02881f96bc 100644
> > --- a/util/osdep.c
> > +++ b/util/osdep.c
> > @@ -471,3 +471,64 @@ writev(int fd, const struct iovec *iov, int iov_cnt)
> >      return readv_writev(fd, iov, iov_cnt, true);
> >  }
> >  #endif
> > +
> > +#ifdef __linux__
> > +static ssize_t qemu_dev_dax_sysfs_read(int fd, const char *entry,
> > +                                       char *buf, size_t len)
> > +{
> > +    ssize_t read_bytes;
> > +    struct stat st;
> > +    unsigned int major, minor;
> > +    char *path, *pos;
> > +    int sysfs_fd;
> > +
> > +    if (fstat(fd, &st)) {
> > +        return 0;
> > +    }
> > +
> > +    major = major(st.st_rdev);
> > +    minor = minor(st.st_rdev);
> > +    path = g_strdup_printf("/sys/dev/char/%u:%u/%s", major, minor, entry);
> > +
> > +    sysfs_fd = open(path, O_RDONLY);
> > +    g_free(path);
> > +    if (sysfs_fd == -1) {
> > +        return 0;
> > +    }
> > +
> > +    read_bytes = read(sysfs_fd, buf, len - 1);
> > +    close(sysfs_fd);
> > +    if (read_bytes > 0) {
> > +        buf[read_bytes] = '\0';
> > +        pos = g_strstr_len(buf, read_bytes, "\n");
> > +        if (pos) {
> > +            *pos = '\0';
> > +        }
> > +    }
> > +
> > +    return read_bytes;
> > +}
> > +#endif /* __linux__ */
> > +
> > +bool qemu_fd_is_dev_dax(int fd)
> > +{
> > +    bool is_dax = false;
> > +
> > +#ifdef __linux__
> > +    char devtype[7];
> > +    ssize_t len;
> > +
> > +    if (fd == -1) {
> > +        return false;
> > +    }
> > +
> > +    len = qemu_dev_dax_sysfs_read(fd, "device/devtype",
> > +                                  devtype, sizeof(devtype));
> > +    if (len <= 0) {
> > +        return false;
> > +    }
> 
> This can easily trigger false positives e.g. if sysfs access
> is blocked by selinux.
>

A part of userland interface of nvdimm is exposed via sysfs, so QEMU
has to access to certain sysfs entries in order to, e.g. perform the
DAX check in this patch and get alignment requirement in patch 4.

I agree it's not robust if the permission is not properly granted. We
may either
1) require users to grant the permissions to QEMU and document the
   requirements
 or
2) get the information from sysfs from the outside of QEMU
   (e.g. libvirt) which has the permission and then pass to QEMU.

Which one do you think is better?

> 
> > +    is_dax = !strncmp(devtype, "nd_dax", len);
> 
> E.g. will return true on any string starting with nd_dax.
> And it's not clear non-dax devices can't guarantee safety.

This check can be done by checking whether /sys/dev/char/MAJ:MIN/subsystem
points to /sys/class/dax, as Dan suggested, if we decide to access
sysfs in QEMU.

Thanks,
Haozhong

> 
> > +#endif /* __linux__ */
> > +
> > +    return is_dax;
> > +}
> > -- 
> > 2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device
  2017-06-12  3:18     ` Haozhong Zhang
@ 2017-06-12  3:25       ` Michael S. Tsirkin
  0 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2017-06-12  3:25 UTC (permalink / raw)
  To: qemu-devel, Igor Mammedov, Xiao Guangrong, Stefan Hajnoczi, Dan Williams

On Mon, Jun 12, 2017 at 11:18:21AM +0800, Haozhong Zhang wrote:
> On 06/08/17 15:51 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jun 06, 2017 at 03:22:27PM +0800, Haozhong Zhang wrote:
> > > Applications in Linux guest that use device-dax never trigger flush
> > > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> > > device-dax, QEMU cannot guarantee the persistence of guest writes.
> > > Before solving this flushing problem, QEMU should warn users if the
> > > host backend is not device-dax.
> > 
> > No one reads warnings unless things fail but they can be a debugging aid
> > if they do. But they have to be simple and robust to be helpful for
> > that. This one seems to me too complex and fragile.
> >
> > 
> > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > > Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
> > 
> > Don't include this line in commit log please. Put it after --- if you
> > must.
> > 
> > > ---
> > >  hw/mem/nvdimm.c      |  6 ++++++
> > >  include/qemu/osdep.h |  9 ++++++++
> > >  util/osdep.c         | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 76 insertions(+)
> > > 
> > > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> > > index a9b0863f20..b23542fbdf 100644
> > > --- a/hw/mem/nvdimm.c
> > > +++ b/hw/mem/nvdimm.c
> > > @@ -26,6 +26,7 @@
> > >  #include "qapi/error.h"
> > >  #include "qapi/visitor.h"
> > >  #include "hw/mem/nvdimm.h"
> > > +#include "qemu/error-report.h"
> > >  
> > >  static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
> > >                                    void *opaque, Error **errp)
> > > @@ -84,6 +85,11 @@ static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
> > >      NVDIMMDevice *nvdimm = NVDIMM(dimm);
> > >      uint64_t align, pmem_size, size = memory_region_size(mr);
> > >  
> > > +    if (!qemu_fd_is_dev_dax(memory_region_get_fd(mr))) {
> > > +        error_report("warning: nvdimm backend does not look like a DAX device, "
> > > +                     "unable to guarantee persistence of guest writes");
> > > +    }
> > > +
> > >      align = memory_region_get_alignment(mr);
> > >  
> > >      pmem_size = size - nvdimm->label_size;
> > > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > > index 1c9f5e260c..7f26af371e 100644
> > > --- a/include/qemu/osdep.h
> > > +++ b/include/qemu/osdep.h
> > > @@ -470,4 +470,13 @@ char *qemu_get_pid_name(pid_t pid);
> > >   */
> > >  pid_t qemu_fork(Error **errp);
> > >  
> > > +/**
> > > + * qemu_fd_is_dev_dax:
> > > + *
> > > + * Check whether @fd describes a DAX device.
> > > + *
> > > + * Returns true if it is; otherwise, return false.
> > > + */
> > > +bool qemu_fd_is_dev_dax(int fd);
> > > +
> > >  #endif
> > > diff --git a/util/osdep.c b/util/osdep.c
> > > index a2863c8e53..02881f96bc 100644
> > > --- a/util/osdep.c
> > > +++ b/util/osdep.c
> > > @@ -471,3 +471,64 @@ writev(int fd, const struct iovec *iov, int iov_cnt)
> > >      return readv_writev(fd, iov, iov_cnt, true);
> > >  }
> > >  #endif
> > > +
> > > +#ifdef __linux__
> > > +static ssize_t qemu_dev_dax_sysfs_read(int fd, const char *entry,
> > > +                                       char *buf, size_t len)
> > > +{
> > > +    ssize_t read_bytes;
> > > +    struct stat st;
> > > +    unsigned int major, minor;
> > > +    char *path, *pos;
> > > +    int sysfs_fd;
> > > +
> > > +    if (fstat(fd, &st)) {
> > > +        return 0;
> > > +    }
> > > +
> > > +    major = major(st.st_rdev);
> > > +    minor = minor(st.st_rdev);
> > > +    path = g_strdup_printf("/sys/dev/char/%u:%u/%s", major, minor, entry);
> > > +
> > > +    sysfs_fd = open(path, O_RDONLY);
> > > +    g_free(path);
> > > +    if (sysfs_fd == -1) {
> > > +        return 0;
> > > +    }
> > > +
> > > +    read_bytes = read(sysfs_fd, buf, len - 1);
> > > +    close(sysfs_fd);
> > > +    if (read_bytes > 0) {
> > > +        buf[read_bytes] = '\0';
> > > +        pos = g_strstr_len(buf, read_bytes, "\n");
> > > +        if (pos) {
> > > +            *pos = '\0';
> > > +        }
> > > +    }
> > > +
> > > +    return read_bytes;
> > > +}
> > > +#endif /* __linux__ */
> > > +
> > > +bool qemu_fd_is_dev_dax(int fd)
> > > +{
> > > +    bool is_dax = false;
> > > +
> > > +#ifdef __linux__
> > > +    char devtype[7];
> > > +    ssize_t len;
> > > +
> > > +    if (fd == -1) {
> > > +        return false;
> > > +    }
> > > +
> > > +    len = qemu_dev_dax_sysfs_read(fd, "device/devtype",
> > > +                                  devtype, sizeof(devtype));
> > > +    if (len <= 0) {
> > > +        return false;
> > > +    }
> > 
> > This can easily trigger false positives e.g. if sysfs access
> > is blocked by selinux.
> >
> 
> A part of userland interface of nvdimm is exposed via sysfs, so QEMU
> has to access to certain sysfs entries in order to, e.g. perform the
> DAX check in this patch and get alignment requirement in patch 4.
> 
> I agree it's not robust if the permission is not properly granted. We
> may either
> 1) require users to grant the permissions to QEMU and document the
>    requirements
>  or
> 2) get the information from sysfs from the outside of QEMU
>    (e.g. libvirt) which has the permission and then pass to QEMU.
> 
> Which one do you think is better?

2) since that allows emulating things without hardware dax.

> > 
> > > +    is_dax = !strncmp(devtype, "nd_dax", len);
> > 
> > E.g. will return true on any string starting with nd_dax.
> > And it's not clear non-dax devices can't guarantee safety.
> 
> This check can be done by checking whether /sys/dev/char/MAJ:MIN/subsystem
> points to /sys/class/dax, as Dan suggested, if we decide to access
> sysfs in QEMU.
> 
> Thanks,
> Haozhong
> 
> > 
> > > +#endif /* __linux__ */
> > > +
> > > +    return is_dax;
> > > +}
> > > -- 
> > > 2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-06-12  3:25 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-06  7:22 [Qemu-devel] [PATCH v2 0/4] nvdimm: fixes for (non-)dax backends Haozhong Zhang
2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 1/4] nvdimm: add a macro for property "label-size" Haozhong Zhang
2017-06-07 15:29   ` Stefan Hajnoczi
2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 2/4] nvdimm: warn if the backend is not a DAX device Haozhong Zhang
2017-06-06 17:59   ` Dan Williams
2017-06-08  1:07     ` Haozhong Zhang
2017-06-07 15:14   ` Stefan Hajnoczi
2017-06-08  1:07     ` Haozhong Zhang
2017-06-08 12:51   ` Michael S. Tsirkin
2017-06-12  3:18     ` Haozhong Zhang
2017-06-12  3:25       ` Michael S. Tsirkin
2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 3/4] nvdimm: add a boolean option "restrict" Haozhong Zhang
2017-06-07 15:27   ` Stefan Hajnoczi
2017-06-08  1:45     ` Haozhong Zhang
2017-06-08 10:19       ` Stefan Hajnoczi
2017-06-08  6:39     ` Haozhong Zhang
2017-06-08 10:18       ` Stefan Hajnoczi
2017-06-08 12:56   ` Michael S. Tsirkin
2017-06-09  9:34     ` Stefan Hajnoczi
2017-06-12  1:18     ` Haozhong Zhang
2017-06-06  7:22 ` [Qemu-devel] [PATCH v2 4/4] util/mmap-alloc: account for DAX device in qemu_fd_getpagesize Haozhong Zhang
2017-06-07 15:29   ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.