All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
@ 2017-05-26  2:32 Haozhong Zhang
  2017-05-26  2:32 ` [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment Haozhong Zhang
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Haozhong Zhang @ 2017-05-26  2:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Haozhong Zhang, Michael S. Tsirkin, Igor Mammedov,
	Xiao Guangrong, Stefan Hajnoczi, Dan Williams

Applications in Linux guest that use device-dax never trigger flush
that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
device-dax, QEMU cannot guarantee the persistence of guest writes.
Before solving this flushing problem, QEMU should warn users if the
host backend is not device-dax.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
---
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
Resend because the wrong maintainer email address was used.
---
 hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index db896b0bb6..c7bb407f33 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -26,6 +26,7 @@
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "hw/mem/nvdimm.h"
+#include "qemu/error-report.h"
 
 static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
                                   void *opaque, Error **errp)
@@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
     return &nvdimm->nvdimm_mr;
 }
 
+static void nvdimm_check_dax(HostMemoryBackend *hostmem)
+{
+    char *mem_path =
+        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
+    char *dev_name = NULL, *sysfs_path = NULL;
+    bool is_dax = false;
+
+    if (!mem_path) {
+        goto out;
+    }
+
+    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
+        goto out;
+    }
+
+    dev_name = mem_path + strlen("/dev/");
+    sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name);
+    if (access(sysfs_path, F_OK)) {
+        goto out;
+    }
+
+    is_dax = true;
+
+ out:
+    if (!is_dax) {
+        error_report("warning: nvdimm backend %s is not DAX device, "
+                     "unable to guarantee persistence of guest writes",
+                     mem_path ?: "RAM");
+    }
+
+    g_free(sysfs_path);
+    g_free(mem_path);
+}
+
 static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
 {
     MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem, errp);
     NVDIMMDevice *nvdimm = NVDIMM(dimm);
     uint64_t align, pmem_size, size = memory_region_size(mr);
 
+    nvdimm_check_dax(dimm->hostmem);
+
     align = memory_region_get_alignment(mr);
 
     pmem_size = size - nvdimm->label_size;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment
  2017-05-26  2:32 [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Haozhong Zhang
@ 2017-05-26  2:32 ` Haozhong Zhang
  2017-05-26  6:39   ` Marc-André Lureau
  2017-05-26  3:34 ` [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Dan Williams
  2017-06-01 12:00 ` Xiao Guangrong
  2 siblings, 1 reply; 21+ messages in thread
From: Haozhong Zhang @ 2017-05-26  2:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Haozhong Zhang, Eduardo Habkost, Igor Mammedov, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson, Xiao Guangrong,
	Stefan Hajnoczi, Dan Williams

file_ram_alloc() currently maps the backend file via mmap to a virtual
address aligned to the value returned by qemu_fd_getpagesize(). When a
DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel
mmap implementation may require an alignment larger than what
qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail.

This commit adds an attribute 'align' to hostmem-file, so that users
can specify a proper alignment that satisfies the kernel requirement.

If 'align' is not specified or is 0, the value returned by
qemu_fd_get_pagesize() will be used as before.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Crosthwaite <crosthwaite.peter@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
Resend because the wrong maintainer email address was used.
---
 backends/hostmem-file.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 exec.c                  |  8 +++++++-
 include/exec/memory.h   |  2 ++
 memory.c                |  2 ++
 numa.c                  |  2 +-
 5 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index fc4ef46d11..d44fb41b55 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -33,6 +33,7 @@ struct HostMemoryBackendFile {
 
     bool share;
     char *mem_path;
+    uint64_t align;
 };
 
 static void
@@ -57,7 +58,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
         path = object_get_canonical_path(OBJECT(backend));
         memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
                                  path,
-                                 backend->size, fb->share,
+                                 backend->size, fb->align, fb->share,
                                  fb->mem_path, errp);
         g_free(path);
     }
@@ -104,6 +105,40 @@ static void file_memory_backend_set_share(Object *o, bool value, Error **errp)
 }
 
 static void
+file_memory_backend_get_align(Object *o, Visitor *v, const char *name,
+                              void *opaque, Error **errp)
+{
+    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
+    uint64_t val = fb->align;
+
+    visit_type_size(v, name, &val, errp);
+}
+
+static void
+file_memory_backend_set_align(Object *o, Visitor *v, const char *name,
+                              void *opaque, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
+    Error *local_err = NULL;
+    uint64_t val;
+
+    if (host_memory_backend_mr_inited(backend)) {
+        error_setg(&local_err, "cannot change property value");
+        goto out;
+    }
+
+    visit_type_size(v, name, &val, &local_err);
+    if (local_err) {
+        goto out;
+    }
+    fb->align = val;
+
+ out:
+    error_propagate(errp, local_err);
+}
+
+static void
 file_backend_class_init(ObjectClass *oc, void *data)
 {
     HostMemoryBackendClass *bc = MEMORY_BACKEND_CLASS(oc);
@@ -116,6 +151,10 @@ file_backend_class_init(ObjectClass *oc, void *data)
     object_class_property_add_str(oc, "mem-path",
         get_mem_path, set_mem_path,
         &error_abort);
+    object_class_property_add(oc, "align", "int",
+        file_memory_backend_get_align,
+        file_memory_backend_set_align,
+        NULL, NULL, &error_abort);
 }
 
 static void file_backend_instance_finalize(Object *o)
diff --git a/exec.c b/exec.c
index ff16f04f2b..5bb62e2e98 100644
--- a/exec.c
+++ b/exec.c
@@ -1549,7 +1549,13 @@ static void *file_ram_alloc(RAMBlock *block,
     }
 
     block->page_size = qemu_fd_getpagesize(fd);
-    block->mr->align = block->page_size;
+    if (block->mr->align % block->page_size) {
+        error_setg(errp, "alignment 0x%" PRIx64 " must be "
+                   "multiple of page size 0x%" PRIx64,
+                   block->mr->align, block->page_size);
+        goto error;
+    }
+    block->mr->align = MAX(block->page_size, block->mr->align);
 #if defined(__s390x__)
     if (kvm_enabled()) {
         block->mr->align = MAX(block->mr->align, QEMU_VMALLOC_ALIGN);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 99e0f54d86..05d3d0da3b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -441,6 +441,7 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *        must be unique within any device
  * @size: size of the region.
+ * @align: alignment of the region.
  * @share: %true if memory must be mmaped with the MAP_SHARED flag
  * @path: the path in which to allocate the RAM.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -449,6 +450,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
                                       struct Object *owner,
                                       const char *name,
                                       uint64_t size,
+                                      uint64_t align,
                                       bool share,
                                       const char *path,
                                       Error **errp);
diff --git a/memory.c b/memory.c
index b727f5ec0e..5165b9aa08 100644
--- a/memory.c
+++ b/memory.c
@@ -1386,6 +1386,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
                                       struct Object *owner,
                                       const char *name,
                                       uint64_t size,
+                                      uint64_t align,
                                       bool share,
                                       const char *path,
                                       Error **errp)
@@ -1394,6 +1395,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
+    mr->align = align;
     mr->ram_block = qemu_ram_alloc_from_file(size, mr, share, path, errp);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
diff --git a/numa.c b/numa.c
index ca731455e9..39a25aa1d2 100644
--- a/numa.c
+++ b/numa.c
@@ -541,7 +541,7 @@ static void allocate_system_memory_nonnuma(MemoryRegion *mr, Object *owner,
     if (mem_path) {
 #ifdef __linux__
         Error *err = NULL;
-        memory_region_init_ram_from_file(mr, owner, name, ram_size, false,
+        memory_region_init_ram_from_file(mr, owner, name, ram_size, 0, false,
                                          mem_path, &err);
         if (err) {
             error_report_err(err);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
  2017-05-26  2:32 [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Haozhong Zhang
  2017-05-26  2:32 ` [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment Haozhong Zhang
@ 2017-05-26  3:34 ` Dan Williams
  2017-05-26  4:30   ` Haozhong Zhang
  2017-05-26 14:38   ` Daniel P. Berrange
  2017-06-01 12:00 ` Xiao Guangrong
  2 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2017-05-26  3:34 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong,
	Stefan Hajnoczi

On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> Applications in Linux guest that use device-dax never trigger flush
> that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> device-dax, QEMU cannot guarantee the persistence of guest writes.
> Before solving this flushing problem, QEMU should warn users if the
> host backend is not device-dax.

I think this needs to be stronger than a "warn" it needs to be
explicitly forbidden when it is known to be unsafe.

>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
> ---
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> Cc: Stefan Hajnoczi <stefanha@gmail.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
> Resend because the wrong maintainer email address was used.
> ---
>  hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
>
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index db896b0bb6..c7bb407f33 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -26,6 +26,7 @@
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
>  #include "hw/mem/nvdimm.h"
> +#include "qemu/error-report.h"
>
>  static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
>                                    void *opaque, Error **errp)
> @@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
>      return &nvdimm->nvdimm_mr;
>  }
>
> +static void nvdimm_check_dax(HostMemoryBackend *hostmem)
> +{
> +    char *mem_path =
> +        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
> +    char *dev_name = NULL, *sysfs_path = NULL;
> +    bool is_dax = false;
> +
> +    if (!mem_path) {
> +        goto out;
> +    }
> +
> +    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
> +        goto out;
> +    }

What if we're using a symlink to a device-dax instance? The should
explicitly be looking up the major / minor number of the device (after
following any symlinks) and verifying that it is device-dax by
checking /sys/dev/char/$major:$minor refers to a device-dax instance.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
  2017-05-26  3:34 ` [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Dan Williams
@ 2017-05-26  4:30   ` Haozhong Zhang
  2017-05-26 14:28     ` Dan Williams
  2017-05-26 14:38   ` Daniel P. Berrange
  1 sibling, 1 reply; 21+ messages in thread
From: Haozhong Zhang @ 2017-05-26  4:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong, qemu-devel,
	Michael S. Tsirkin

On 05/25/17 20:34 -0700, Dan Williams wrote:
> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > Applications in Linux guest that use device-dax never trigger flush
> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> > device-dax, QEMU cannot guarantee the persistence of guest writes.
> > Before solving this flushing problem, QEMU should warn users if the
> > host backend is not device-dax.
> 
> I think this needs to be stronger than a "warn" it needs to be
> explicitly forbidden when it is known to be unsafe.
>

I understand your worry and am not object to your suggestion, but
forbidden will change the existing behavior that allows such unsafe
usage. Let's wait for other maintainers' comments.

> >
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com
> > ---
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Igor Mammedov <imammedo@redhat.com>
> > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > Cc: Stefan Hajnoczi <stefanha@gmail.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > ---
> > Resend because the wrong maintainer email address was used.
> > ---
> >  hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++
> >  1 file changed, 37 insertions(+)
> >
> > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> > index db896b0bb6..c7bb407f33 100644
> > --- a/hw/mem/nvdimm.c
> > +++ b/hw/mem/nvdimm.c
> > @@ -26,6 +26,7 @@
> >  #include "qapi/error.h"
> >  #include "qapi/visitor.h"
> >  #include "hw/mem/nvdimm.h"
> > +#include "qemu/error-report.h"
> >
> >  static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
> >                                    void *opaque, Error **errp)
> > @@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
> >      return &nvdimm->nvdimm_mr;
> >  }
> >
> > +static void nvdimm_check_dax(HostMemoryBackend *hostmem)
> > +{
> > +    char *mem_path =
> > +        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
> > +    char *dev_name = NULL, *sysfs_path = NULL;
> > +    bool is_dax = false;
> > +
> > +    if (!mem_path) {
> > +        goto out;
> > +    }
> > +
> > +    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
> > +        goto out;
> > +    }
> 
> What if we're using a symlink to a device-dax instance? The should
> explicitly be looking up the major / minor number of the device (after
> following any symlinks) and verifying that it is device-dax by
> checking /sys/dev/char/$major:$minor refers to a device-dax instance.
> 

I'll follow to your suggestion in the next version.

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment
  2017-05-26  2:32 ` [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment Haozhong Zhang
@ 2017-05-26  6:39   ` Marc-André Lureau
       [not found]     ` <20170526065132.ayapfebbft5lyigd@hz-desktop>
  0 siblings, 1 reply; 21+ messages in thread
From: Marc-André Lureau @ 2017-05-26  6:39 UTC (permalink / raw)
  To: Haozhong Zhang, qemu-devel
  Cc: Eduardo Habkost, Peter Crosthwaite, Stefan Hajnoczi,
	Xiao Guangrong, Igor Mammedov, Paolo Bonzini, Dan Williams,
	Richard Henderson

Hi

On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com>
wrote:

> file_ram_alloc() currently maps the backend file via mmap to a virtual
> address aligned to the value returned by qemu_fd_getpagesize(). When a
> DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel
> mmap implementation may require an alignment larger than what
> qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail.
>

How is the accepted value queried? Any chance to configure it
automatically?



> This commit adds an attribute 'align' to hostmem-file, so that users
> can specify a proper alignment that satisfies the kernel requirement.
>
> If 'align' is not specified or is 0, the value returned by
> qemu_fd_get_pagesize() will be used as before.
>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Peter Crosthwaite <crosthwaite.peter@gmail.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> Cc: Stefan Hajnoczi <stefanha@gmail.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
> Resend because the wrong maintainer email address was used.
> ---
>  backends/hostmem-file.c | 41 ++++++++++++++++++++++++++++++++++++++++-
>  exec.c                  |  8 +++++++-
>  include/exec/memory.h   |  2 ++
>  memory.c                |  2 ++
>  numa.c                  |  2 +-
>  5 files changed, 52 insertions(+), 3 deletions(-)
>
> diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
> index fc4ef46d11..d44fb41b55 100644
> --- a/backends/hostmem-file.c
> +++ b/backends/hostmem-file.c
> @@ -33,6 +33,7 @@ struct HostMemoryBackendFile {
>
>      bool share;
>      char *mem_path;
> +    uint64_t align;
>  };
>
>  static void
> @@ -57,7 +58,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend,
> Error **errp)
>          path = object_get_canonical_path(OBJECT(backend));
>          memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
>                                   path,
> -                                 backend->size, fb->share,
> +                                 backend->size, fb->align, fb->share,
>                                   fb->mem_path, errp);
>          g_free(path);
>      }
> @@ -104,6 +105,40 @@ static void file_memory_backend_set_share(Object *o,
> bool value, Error **errp)
>  }
>
>  static void
> +file_memory_backend_get_align(Object *o, Visitor *v, const char *name,
> +                              void *opaque, Error **errp)
> +{
> +    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
> +    uint64_t val = fb->align;
> +
> +    visit_type_size(v, name, &val, errp);
> +}
> +
> +static void
> +file_memory_backend_set_align(Object *o, Visitor *v, const char *name,
> +                              void *opaque, Error **errp)
> +{
> +    HostMemoryBackend *backend = MEMORY_BACKEND(o);
> +    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
> +    Error *local_err = NULL;
> +    uint64_t val;
> +
> +    if (host_memory_backend_mr_inited(backend)) {
> +        error_setg(&local_err, "cannot change property value");
> +        goto out;
> +    }
> +
> +    visit_type_size(v, name, &val, &local_err);
> +    if (local_err) {
> +        goto out;
> +    }
> +    fb->align = val;
> +
> + out:
> +    error_propagate(errp, local_err);
> +}
> +
> +static void
>  file_backend_class_init(ObjectClass *oc, void *data)
>  {
>      HostMemoryBackendClass *bc = MEMORY_BACKEND_CLASS(oc);
> @@ -116,6 +151,10 @@ file_backend_class_init(ObjectClass *oc, void *data)
>      object_class_property_add_str(oc, "mem-path",
>          get_mem_path, set_mem_path,
>          &error_abort);
> +    object_class_property_add(oc, "align", "int",
>

It would make sense to make it "size" instead of "int"


> +        file_memory_backend_get_align,
> +        file_memory_backend_set_align,
> +        NULL, NULL, &error_abort);
>  }
>
>  static void file_backend_instance_finalize(Object *o)
> diff --git a/exec.c b/exec.c
> index ff16f04f2b..5bb62e2e98 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1549,7 +1549,13 @@ static void *file_ram_alloc(RAMBlock *block,
>      }
>
>      block->page_size = qemu_fd_getpagesize(fd);
> -    block->mr->align = block->page_size;
> +    if (block->mr->align % block->page_size) {
> +        error_setg(errp, "alignment 0x%" PRIx64 " must be "
> +                   "multiple of page size 0x%" PRIx64,
> +                   block->mr->align, block->page_size);
> +        goto error;
> +    }
> +    block->mr->align = MAX(block->page_size, block->mr->align);
>  #if defined(__s390x__)
>      if (kvm_enabled()) {
>          block->mr->align = MAX(block->mr->align, QEMU_VMALLOC_ALIGN);
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 99e0f54d86..05d3d0da3b 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -441,6 +441,7 @@ void memory_region_init_resizeable_ram(MemoryRegion
> *mr,
>   * @name: Region name, becomes part of RAMBlock name used in migration
> stream
>   *        must be unique within any device
>   * @size: size of the region.
> + * @align: alignment of the region.
>   * @share: %true if memory must be mmaped with the MAP_SHARED flag
>   * @path: the path in which to allocate the RAM.
>   * @errp: pointer to Error*, to store an error if it happens.
> @@ -449,6 +450,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
>                                        struct Object *owner,
>                                        const char *name,
>                                        uint64_t size,
> +                                      uint64_t align,
>                                        bool share,
>                                        const char *path,
>                                        Error **errp);
> diff --git a/memory.c b/memory.c
> index b727f5ec0e..5165b9aa08 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1386,6 +1386,7 @@ void memory_region_init_ram_from_file(MemoryRegion
> *mr,
>                                        struct Object *owner,
>                                        const char *name,
>                                        uint64_t size,
> +                                      uint64_t align,
>                                        bool share,
>                                        const char *path,
>                                        Error **errp)
> @@ -1394,6 +1395,7 @@ void memory_region_init_ram_from_file(MemoryRegion
> *mr,
>      mr->ram = true;
>      mr->terminates = true;
>      mr->destructor = memory_region_destructor_ram;
> +    mr->align = align;
>      mr->ram_block = qemu_ram_alloc_from_file(size, mr, share, path, errp);
>      mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
>  }
> diff --git a/numa.c b/numa.c
> index ca731455e9..39a25aa1d2 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -541,7 +541,7 @@ static void
> allocate_system_memory_nonnuma(MemoryRegion *mr, Object *owner,
>      if (mem_path) {
>  #ifdef __linux__
>          Error *err = NULL;
> -        memory_region_init_ram_from_file(mr, owner, name, ram_size, false,
> +        memory_region_init_ram_from_file(mr, owner, name, ram_size, 0,
> false,
>                                           mem_path, &err);
>          if (err) {
>              error_report_err(err);
> --
> 2.11.0
>
>
otherwise, looks good to me
-- 
Marc-André Lureau

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment
       [not found]     ` <20170526065132.ayapfebbft5lyigd@hz-desktop>
@ 2017-05-26  7:05       ` Marc-André Lureau
       [not found]         ` <20170526071654.kxcqflvpo2tvcrvu@hz-desktop>
  0 siblings, 1 reply; 21+ messages in thread
From: Marc-André Lureau @ 2017-05-26  7:05 UTC (permalink / raw)
  To: qemu-devel, Eduardo Habkost, Peter Crosthwaite, Stefan Hajnoczi,
	Xiao Guangrong, Igor Mammedov, Paolo Bonzini, Dan Williams,
	Richard Henderson

Hi

On Fri, May 26, 2017 at 10:51 AM Haozhong Zhang <haozhong.zhang@intel.com>
wrote:

> On 05/26/17 06:39 +0000, Marc-André Lureau wrote:
> > Hi
> >
> > On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com
> >
> > wrote:
> >
> > > file_ram_alloc() currently maps the backend file via mmap to a virtual
> > > address aligned to the value returned by qemu_fd_getpagesize(). When a
> > > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel
> > > mmap implementation may require an alignment larger than what
> > > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail.
> > >
> >
> > How is the accepted value queried? Any chance to configure it
> > automatically?
>
> Take /dev/dax0.0 for example. The value can be read from
> /sys/class/dax/dax0.0/device/dax_region/align.
>

Should this work be left to management layer, or could qemu figure it out
by itself (using udev?)


> [..]
> > > +static void
> > >  file_backend_class_init(ObjectClass *oc, void *data)
> > >  {
> > >      HostMemoryBackendClass *bc = MEMORY_BACKEND_CLASS(oc);
> > > @@ -116,6 +151,10 @@ file_backend_class_init(ObjectClass *oc, void
> *data)
> > >      object_class_property_add_str(oc, "mem-path",
> > >          get_mem_path, set_mem_path,
> > >          &error_abort);
> > > +    object_class_property_add(oc, "align", "int",
> > >
> >
> > It would make sense to make it "size" instead of "int"
> >
>
> will change in the next version
>

thanks
-- 
Marc-André Lureau

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment
       [not found]         ` <20170526071654.kxcqflvpo2tvcrvu@hz-desktop>
@ 2017-05-26 14:24           ` Dan Williams
  2017-05-26 18:55             ` Eduardo Habkost
  2017-05-30 12:17             ` Paolo Bonzini
  0 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2017-05-26 14:24 UTC (permalink / raw)
  To: qemu-devel, Eduardo Habkost, Peter Crosthwaite, Stefan Hajnoczi,
	Xiao Guangrong, Igor Mammedov, Paolo Bonzini, Dan Williams,
	Richard Henderson

On Fri, May 26, 2017 at 12:16 AM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 05/26/17 07:05 +0000, Marc-André Lureau wrote:
>> Hi
>>
>> On Fri, May 26, 2017 at 10:51 AM Haozhong Zhang <haozhong.zhang@intel.com>
>> wrote:
>>
>> > On 05/26/17 06:39 +0000, Marc-André Lureau wrote:
>> > > Hi
>> > >
>> > > On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com
>> > >
>> > > wrote:
>> > >
>> > > > file_ram_alloc() currently maps the backend file via mmap to a virtual
>> > > > address aligned to the value returned by qemu_fd_getpagesize(). When a
>> > > > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel
>> > > > mmap implementation may require an alignment larger than what
>> > > > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail.
>> > > >
>> > >
>> > > How is the accepted value queried? Any chance to configure it
>> > > automatically?
>> >
>> > Take /dev/dax0.0 for example. The value can be read from
>> > /sys/class/dax/dax0.0/device/dax_region/align.
>> >
>>
>> Should this work be left to management layer, or could qemu figure it out
>> by itself (using udev?)
>>
>
> For DAX device only, QEMU can figure out the proper alignment by
> itself. However, I'm not sure whether there are other non-DAX cases
> requiring non-default alignment, so I think it's better to just add an
> interface (i.e. align attribute) in QEMU and let other management
> tools (e.g. libvirt?) fill a proper value.

I can't imagine any cases where you would want to specify an
alignment. If it's regular file mmap any alignment is fine, and if
it's device-dax only the configured alignment of the device instance
is allowed. So, I don't think this should be a configurable option,
just read it from the device instance and you're done.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
  2017-05-26  4:30   ` Haozhong Zhang
@ 2017-05-26 14:28     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2017-05-26 14:28 UTC (permalink / raw)
  To: Dan Williams, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong,
	qemu-devel, Michael S. Tsirkin

On Thu, May 25, 2017 at 9:30 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 05/25/17 20:34 -0700, Dan Williams wrote:
>> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > Applications in Linux guest that use device-dax never trigger flush
>> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
>> > device-dax, QEMU cannot guarantee the persistence of guest writes.
>> > Before solving this flushing problem, QEMU should warn users if the
>> > host backend is not device-dax.
>>
>> I think this needs to be stronger than a "warn" it needs to be
>> explicitly forbidden when it is known to be unsafe.
>>
>
> I understand your worry and am not object to your suggestion, but
> forbidden will change the existing behavior that allows such unsafe
> usage. Let's wait for other maintainers' comments.

Changing existing behavior is the point. Even if a qemu maintainer
thought it was acceptable to allow the dangerous behavior with a
warning I would continue to assert that we need to block it by
default. I'm not opposed to adding a new configuration option like
"allow dangerous pmem" to override the new default behavior, but I
think this patch is incorrect to just emit a message.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
  2017-05-26  3:34 ` [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Dan Williams
  2017-05-26  4:30   ` Haozhong Zhang
@ 2017-05-26 14:38   ` Daniel P. Berrange
  2017-05-26 15:25     ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Daniel P. Berrange @ 2017-05-26 14:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: Haozhong Zhang, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong,
	qemu-devel, Michael S. Tsirkin

On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote:
> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > Applications in Linux guest that use device-dax never trigger flush
> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> > device-dax, QEMU cannot guarantee the persistence of guest writes.
> > Before solving this flushing problem, QEMU should warn users if the
> > host backend is not device-dax.
> 
> I think this needs to be stronger than a "warn" it needs to be
> explicitly forbidden when it is known to be unsafe.

I think users should have the choice in what they want to do -
QEMU should not artifically block it.  There are plenty of things
in QEMU that are potentially unsafe in some usage scenarios, but
we just document how to use them in a safe manner & any caveats
that apply. Higher level applications above QEMU can then consider
how they want to apply a usage policy to meet the needs of their
usage scenario.

Having an emulated DAX device that doesn't guarantee persistence
is no different to having an emulated disk device that never flushes
to underlying host storage.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
  2017-05-26 14:38   ` Daniel P. Berrange
@ 2017-05-26 15:25     ` Dan Williams
  2017-05-27  1:13       ` Haozhong Zhang
  2017-05-30  9:20       ` Daniel P. Berrange
  0 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2017-05-26 15:25 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Haozhong Zhang, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong,
	qemu-devel, Michael S. Tsirkin

On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote:
>> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > Applications in Linux guest that use device-dax never trigger flush
>> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
>> > device-dax, QEMU cannot guarantee the persistence of guest writes.
>> > Before solving this flushing problem, QEMU should warn users if the
>> > host backend is not device-dax.
>>
>> I think this needs to be stronger than a "warn" it needs to be
>> explicitly forbidden when it is known to be unsafe.
>
> I think users should have the choice in what they want to do -
> QEMU should not artifically block it.  There are plenty of things
> in QEMU that are potentially unsafe in some usage scenarios, but
> we just document how to use them in a safe manner & any caveats
> that apply. Higher level applications above QEMU can then consider
> how they want to apply a usage policy to meet the needs of their
> usage scenario.
>
> Having an emulated DAX device that doesn't guarantee persistence
> is no different to having an emulated disk device that never flushes
> to underlying host storage.
>

It is different in the sense that the contract of when the guest
should assume persistence is specified by when the write completes to
the virtual disk. In the case of the virtual NFIT we are currently
lying to the guest about that platform persistence guarantee even if
the hypervisor is emulating pmem with volatile memory.

In other words, I agree that it should be possible to tell the guest
to assume it is pmem when it is not, but we need more granularity in
the configuration to communicate the capabilities correctly to the
guest. It seems the NFIT memory device state flag
ACPI_NFIT_MEM_NOT_ARMED is a good way to communicate pmem safety to
the guest. Can we add a new knob to control the polarity of that flag
and make ACPI_NFIT_MEM_NOT_ARMED being set the default case when using
regular file mmap and clear the flag by default in the device-dax
case?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment
  2017-05-26 14:24           ` Dan Williams
@ 2017-05-26 18:55             ` Eduardo Habkost
  2017-05-26 20:50               ` Dan Williams
  2017-05-27  1:04               ` Haozhong Zhang
  2017-05-30 12:17             ` Paolo Bonzini
  1 sibling, 2 replies; 21+ messages in thread
From: Eduardo Habkost @ 2017-05-26 18:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: qemu-devel, Peter Crosthwaite, Stefan Hajnoczi, Xiao Guangrong,
	Igor Mammedov, Paolo Bonzini, Richard Henderson

On Fri, May 26, 2017 at 07:24:26AM -0700, Dan Williams wrote:
> On Fri, May 26, 2017 at 12:16 AM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > On 05/26/17 07:05 +0000, Marc-André Lureau wrote:
> >> Hi
> >>
> >> On Fri, May 26, 2017 at 10:51 AM Haozhong Zhang <haozhong.zhang@intel.com>
> >> wrote:
> >>
> >> > On 05/26/17 06:39 +0000, Marc-André Lureau wrote:
> >> > > Hi
> >> > >
> >> > > On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > file_ram_alloc() currently maps the backend file via mmap to a virtual
> >> > > > address aligned to the value returned by qemu_fd_getpagesize(). When a
> >> > > > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel
> >> > > > mmap implementation may require an alignment larger than what
> >> > > > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail.
> >> > > >
> >> > >
> >> > > How is the accepted value queried? Any chance to configure it
> >> > > automatically?
> >> >
> >> > Take /dev/dax0.0 for example. The value can be read from
> >> > /sys/class/dax/dax0.0/device/dax_region/align.
> >> >
> >>
> >> Should this work be left to management layer, or could qemu figure it out
> >> by itself (using udev?)
> >>
> >
> > For DAX device only, QEMU can figure out the proper alignment by
> > itself. However, I'm not sure whether there are other non-DAX cases
> > requiring non-default alignment, so I think it's better to just add an
> > interface (i.e. align attribute) in QEMU and let other management
> > tools (e.g. libvirt?) fill a proper value.
> 
> I can't imagine any cases where you would want to specify an
> alignment. If it's regular file mmap any alignment is fine, and if
> it's device-dax only the configured alignment of the device instance
> is allowed. So, I don't think this should be a configurable option,
> just read it from the device instance and you're done.

Agreed.

BTW, there's no generic interface to ask the kernel what's the
required mmap() alignment for a file?

-- 
Eduardo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment
  2017-05-26 18:55             ` Eduardo Habkost
@ 2017-05-26 20:50               ` Dan Williams
  2017-05-27  1:04               ` Haozhong Zhang
  1 sibling, 0 replies; 21+ messages in thread
From: Dan Williams @ 2017-05-26 20:50 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: qemu-devel, Peter Crosthwaite, Stefan Hajnoczi, Xiao Guangrong,
	Igor Mammedov, Paolo Bonzini, Richard Henderson

On Fri, May 26, 2017 at 11:55 AM, Eduardo Habkost <ehabkost@redhat.com> wrote:
> On Fri, May 26, 2017 at 07:24:26AM -0700, Dan Williams wrote:
>> On Fri, May 26, 2017 at 12:16 AM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > On 05/26/17 07:05 +0000, Marc-André Lureau wrote:
>> >> Hi
>> >>
>> >> On Fri, May 26, 2017 at 10:51 AM Haozhong Zhang <haozhong.zhang@intel.com>
>> >> wrote:
>> >>
>> >> > On 05/26/17 06:39 +0000, Marc-André Lureau wrote:
>> >> > > Hi
>> >> > >
>> >> > > On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com
>> >> > >
>> >> > > wrote:
>> >> > >
>> >> > > > file_ram_alloc() currently maps the backend file via mmap to a virtual
>> >> > > > address aligned to the value returned by qemu_fd_getpagesize(). When a
>> >> > > > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel
>> >> > > > mmap implementation may require an alignment larger than what
>> >> > > > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail.
>> >> > > >
>> >> > >
>> >> > > How is the accepted value queried? Any chance to configure it
>> >> > > automatically?
>> >> >
>> >> > Take /dev/dax0.0 for example. The value can be read from
>> >> > /sys/class/dax/dax0.0/device/dax_region/align.
>> >> >
>> >>
>> >> Should this work be left to management layer, or could qemu figure it out
>> >> by itself (using udev?)
>> >>
>> >
>> > For DAX device only, QEMU can figure out the proper alignment by
>> > itself. However, I'm not sure whether there are other non-DAX cases
>> > requiring non-default alignment, so I think it's better to just add an
>> > interface (i.e. align attribute) in QEMU and let other management
>> > tools (e.g. libvirt?) fill a proper value.
>>
>> I can't imagine any cases where you would want to specify an
>> alignment. If it's regular file mmap any alignment is fine, and if
>> it's device-dax only the configured alignment of the device instance
>> is allowed. So, I don't think this should be a configurable option,
>> just read it from the device instance and you're done.
>
> Agreed.
>
> BTW, there's no generic interface to ask the kernel what's the
> required mmap() alignment for a file?

It's only this /dev/dax device-file case where the alignment is
strictly enforced. The required alignment is advertised in sysfs, for
example:

# ls -l /dev/dax0.0
crw------- 1 root root 253, 0 May 26 06:41 /dev/dax0.0
# cat /sys/dev/char/253\:0/device/align
2097152

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment
  2017-05-26 18:55             ` Eduardo Habkost
  2017-05-26 20:50               ` Dan Williams
@ 2017-05-27  1:04               ` Haozhong Zhang
  1 sibling, 0 replies; 21+ messages in thread
From: Haozhong Zhang @ 2017-05-27  1:04 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Dan Williams, Peter Crosthwaite, Stefan Hajnoczi, qemu-devel,
	Xiao Guangrong, Paolo Bonzini, Igor Mammedov, Richard Henderson

On 05/26/17 15:55 -0300, Eduardo Habkost wrote:
> On Fri, May 26, 2017 at 07:24:26AM -0700, Dan Williams wrote:
> > On Fri, May 26, 2017 at 12:16 AM, Haozhong Zhang
> > <haozhong.zhang@intel.com> wrote:
> > > On 05/26/17 07:05 +0000, Marc-André Lureau wrote:
> > >> Hi
> > >>
> > >> On Fri, May 26, 2017 at 10:51 AM Haozhong Zhang <haozhong.zhang@intel.com>
> > >> wrote:
> > >>
> > >> > On 05/26/17 06:39 +0000, Marc-André Lureau wrote:
> > >> > > Hi
> > >> > >
> > >> > > On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com
> > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > file_ram_alloc() currently maps the backend file via mmap to a virtual
> > >> > > > address aligned to the value returned by qemu_fd_getpagesize(). When a
> > >> > > > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel
> > >> > > > mmap implementation may require an alignment larger than what
> > >> > > > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail.
> > >> > > >
> > >> > >
> > >> > > How is the accepted value queried? Any chance to configure it
> > >> > > automatically?
> > >> >
> > >> > Take /dev/dax0.0 for example. The value can be read from
> > >> > /sys/class/dax/dax0.0/device/dax_region/align.
> > >> >
> > >>
> > >> Should this work be left to management layer, or could qemu figure it out
> > >> by itself (using udev?)
> > >>
> > >
> > > For DAX device only, QEMU can figure out the proper alignment by
> > > itself. However, I'm not sure whether there are other non-DAX cases
> > > requiring non-default alignment, so I think it's better to just add an
> > > interface (i.e. align attribute) in QEMU and let other management
> > > tools (e.g. libvirt?) fill a proper value.
> > 
> > I can't imagine any cases where you would want to specify an
> > alignment. If it's regular file mmap any alignment is fine, and if
> > it's device-dax only the configured alignment of the device instance
> > is allowed. So, I don't think this should be a configurable option,
> > just read it from the device instance and you're done.
> 
> Agreed.
>

Ok, I'll drop this attribute and let QEMU figure out the alignment.

Thanks,
Haozhong

> BTW, there's no generic interface to ask the kernel what's the
> required mmap() alignment for a file?
> 
> -- 
> Eduardo
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
  2017-05-26 15:25     ` Dan Williams
@ 2017-05-27  1:13       ` Haozhong Zhang
  2017-05-30  9:20       ` Daniel P. Berrange
  1 sibling, 0 replies; 21+ messages in thread
From: Haozhong Zhang @ 2017-05-27  1:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: Daniel P. Berrange, Stefan Hajnoczi, Igor Mammedov,
	Xiao Guangrong, qemu-devel, Michael S. Tsirkin

On 05/26/17 08:25 -0700, Dan Williams wrote:
> On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote:
> >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > Applications in Linux guest that use device-dax never trigger flush
> >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> >> > device-dax, QEMU cannot guarantee the persistence of guest writes.
> >> > Before solving this flushing problem, QEMU should warn users if the
> >> > host backend is not device-dax.
> >>
> >> I think this needs to be stronger than a "warn" it needs to be
> >> explicitly forbidden when it is known to be unsafe.
> >
> > I think users should have the choice in what they want to do -
> > QEMU should not artifically block it.  There are plenty of things
> > in QEMU that are potentially unsafe in some usage scenarios, but
> > we just document how to use them in a safe manner & any caveats
> > that apply. Higher level applications above QEMU can then consider
> > how they want to apply a usage policy to meet the needs of their
> > usage scenario.
> >
> > Having an emulated DAX device that doesn't guarantee persistence
> > is no different to having an emulated disk device that never flushes
> > to underlying host storage.
> >
> 
> It is different in the sense that the contract of when the guest
> should assume persistence is specified by when the write completes to
> the virtual disk. In the case of the virtual NFIT we are currently
> lying to the guest about that platform persistence guarantee even if
> the hypervisor is emulating pmem with volatile memory.
> 
> In other words, I agree that it should be possible to tell the guest
> to assume it is pmem when it is not, but we need more granularity in
> the configuration to communicate the capabilities correctly to the
> guest. It seems the NFIT memory device state flag
> ACPI_NFIT_MEM_NOT_ARMED is a good way to communicate pmem safety to
> the guest. Can we add a new knob to control the polarity of that flag
> and make ACPI_NFIT_MEM_NOT_ARMED being set the default case when using
> regular file mmap and clear the flag by default in the device-dax
> case?

Yes, we can set ACPI_NFIT_MEM_NOT_ARMED if the host backend is not
device-dax.

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
  2017-05-26 15:25     ` Dan Williams
  2017-05-27  1:13       ` Haozhong Zhang
@ 2017-05-30  9:20       ` Daniel P. Berrange
  2017-05-30 11:41         ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Daniel P. Berrange @ 2017-05-30  9:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Haozhong Zhang, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong,
	qemu-devel, Michael S. Tsirkin

On Fri, May 26, 2017 at 08:25:20AM -0700, Dan Williams wrote:
> On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote:
> >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > Applications in Linux guest that use device-dax never trigger flush
> >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
> >> > device-dax, QEMU cannot guarantee the persistence of guest writes.
> >> > Before solving this flushing problem, QEMU should warn users if the
> >> > host backend is not device-dax.
> >>
> >> I think this needs to be stronger than a "warn" it needs to be
> >> explicitly forbidden when it is known to be unsafe.
> >
> > I think users should have the choice in what they want to do -
> > QEMU should not artifically block it.  There are plenty of things
> > in QEMU that are potentially unsafe in some usage scenarios, but
> > we just document how to use them in a safe manner & any caveats
> > that apply. Higher level applications above QEMU can then consider
> > how they want to apply a usage policy to meet the needs of their
> > usage scenario.
> >
> > Having an emulated DAX device that doesn't guarantee persistence
> > is no different to having an emulated disk device that never flushes
> > to underlying host storage.
> >
> 
> It is different in the sense that the contract of when the guest
> should assume persistence is specified by when the write completes to
> the virtual disk. In the case of the virtual NFIT we are currently
> lying to the guest about that platform persistence guarantee even if
> the hypervisor is emulating pmem with volatile memory.

We equally lie to the guest about persistence of disks, when the
disk is run with certain cache= settings, or when the disk backing file
is on tmpfs. It is simply a choice the mgmt application makes about
whether to provide backing storage that is capable of satsifying the
guarantees implied by the guest device model. So I'm still not seeing
a compelling reason to artifically block this with DAX.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
  2017-05-30  9:20       ` Daniel P. Berrange
@ 2017-05-30 11:41         ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2017-05-30 11:41 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Haozhong Zhang, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong,
	qemu-devel, Michael S. Tsirkin

On Tue, May 30, 2017 at 2:20 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> On Fri, May 26, 2017 at 08:25:20AM -0700, Dan Williams wrote:
>> On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
>> > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote:
>> >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang
>> >> <haozhong.zhang@intel.com> wrote:
>> >> > Applications in Linux guest that use device-dax never trigger flush
>> >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not
>> >> > device-dax, QEMU cannot guarantee the persistence of guest writes.
>> >> > Before solving this flushing problem, QEMU should warn users if the
>> >> > host backend is not device-dax.
>> >>
>> >> I think this needs to be stronger than a "warn" it needs to be
>> >> explicitly forbidden when it is known to be unsafe.
>> >
>> > I think users should have the choice in what they want to do -
>> > QEMU should not artifically block it.  There are plenty of things
>> > in QEMU that are potentially unsafe in some usage scenarios, but
>> > we just document how to use them in a safe manner & any caveats
>> > that apply. Higher level applications above QEMU can then consider
>> > how they want to apply a usage policy to meet the needs of their
>> > usage scenario.
>> >
>> > Having an emulated DAX device that doesn't guarantee persistence
>> > is no different to having an emulated disk device that never flushes
>> > to underlying host storage.
>> >
>>
>> It is different in the sense that the contract of when the guest
>> should assume persistence is specified by when the write completes to
>> the virtual disk. In the case of the virtual NFIT we are currently
>> lying to the guest about that platform persistence guarantee even if
>> the hypervisor is emulating pmem with volatile memory.
>
> We equally lie to the guest about persistence of disks, when the
> disk is run with certain cache= settings, or when the disk backing file
> is on tmpfs. It is simply a choice the mgmt application makes about
> whether to provide backing storage that is capable of satsifying the
> guarantees implied by the guest device model. So I'm still not seeing
> a compelling reason to artifically block this with DAX.

Agreed, artificially blocking is not a good path, but for completeness
we still need a way to control the ACPI "not armed" health state flag
and a sane default for that parameter.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment
  2017-05-26 14:24           ` Dan Williams
  2017-05-26 18:55             ` Eduardo Habkost
@ 2017-05-30 12:17             ` Paolo Bonzini
  2017-05-30 19:16               ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2017-05-30 12:17 UTC (permalink / raw)
  To: Dan Williams, qemu-devel, Eduardo Habkost, Peter Crosthwaite,
	Stefan Hajnoczi, Xiao Guangrong, Igor Mammedov,
	Richard Henderson



On 26/05/2017 16:24, Dan Williams wrote:
>> For DAX device only, QEMU can figure out the proper alignment by
>> itself. However, I'm not sure whether there are other non-DAX cases
>> requiring non-default alignment, so I think it's better to just add an
>> interface (i.e. align attribute) in QEMU and let other management
>> tools (e.g. libvirt?) fill a proper value.
> I can't imagine any cases where you would want to specify an
> alignment. If it's regular file mmap any alignment is fine, and if
> it's device-dax only the configured alignment of the device instance
> is allowed. So, I don't think this should be a configurable option,
> just read it from the device instance and you're done.

A 2M or 1G alignment lets KVM use EPT hugepages if the host physical
addresses are contiguous and 2M- or 1G-aligned.

QEMU only does this for hugetlbfs currently, where the requirement on
the host physical addresses is always satisfied.  Would the same apply
to NVDIMM device DAX?

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment
  2017-05-30 12:17             ` Paolo Bonzini
@ 2017-05-30 19:16               ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2017-05-30 19:16 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, Eduardo Habkost, Peter Crosthwaite, Stefan Hajnoczi,
	Xiao Guangrong, Igor Mammedov, Richard Henderson

On Tue, May 30, 2017 at 5:17 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 26/05/2017 16:24, Dan Williams wrote:
>>> For DAX device only, QEMU can figure out the proper alignment by
>>> itself. However, I'm not sure whether there are other non-DAX cases
>>> requiring non-default alignment, so I think it's better to just add an
>>> interface (i.e. align attribute) in QEMU and let other management
>>> tools (e.g. libvirt?) fill a proper value.
>> I can't imagine any cases where you would want to specify an
>> alignment. If it's regular file mmap any alignment is fine, and if
>> it's device-dax only the configured alignment of the device instance
>> is allowed. So, I don't think this should be a configurable option,
>> just read it from the device instance and you're done.
>
> A 2M or 1G alignment lets KVM use EPT hugepages if the host physical
> addresses are contiguous and 2M- or 1G-aligned.
>
> QEMU only does this for hugetlbfs currently, where the requirement on
> the host physical addresses is always satisfied.  Would the same apply
> to NVDIMM device DAX?
>

Yes, I believe it is similar. Device-dax guarantees and enforces (mmap
area must be aligned) a given fault granularity.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
  2017-05-26  2:32 [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Haozhong Zhang
  2017-05-26  2:32 ` [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment Haozhong Zhang
  2017-05-26  3:34 ` [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Dan Williams
@ 2017-06-01 12:00 ` Xiao Guangrong
  2017-06-01 13:53     ` [Qemu-devel] " Dan Williams
  2 siblings, 1 reply; 21+ messages in thread
From: Xiao Guangrong @ 2017-06-01 12:00 UTC (permalink / raw)
  To: Haozhong Zhang, qemu-devel
  Cc: Michael S. Tsirkin, Igor Mammedov, Stefan Hajnoczi, Dan Williams



On 05/26/2017 10:32 AM, Haozhong Zhang wrote:

> +static void nvdimm_check_dax(HostMemoryBackend *hostmem)
> +{
> +    char *mem_path =
> +        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
> +    char *dev_name = NULL, *sysfs_path = NULL;
> +    bool is_dax = false;
> +
> +    if (!mem_path) {
> +        goto out;
> +    }
> +
> +    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
> +        goto out;
> +    }
> +
> +    dev_name = mem_path + strlen("/dev/");
> +    sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name);
> +    if (access(sysfs_path, F_OK)) {
> +        goto out;
> +    }
> +
> +    is_dax = true;
> +

So only dax raw disk has write-persistence guaranty, a pre-allocated
file which locates on a DAX-enabled filesystem can not?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
  2017-06-01 12:00 ` Xiao Guangrong
@ 2017-06-01 13:53     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2017-06-01 13:53 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Haozhong Zhang, qemu-devel, Michael S. Tsirkin, Igor Mammedov,
	Stefan Hajnoczi, linux-fsdevel

[ adding linux-fsdevel ]

On Thu, Jun 1, 2017 at 5:00 AM, Xiao Guangrong <guangrong.xiao@gmail.com> wrote:
>
>
> On 05/26/2017 10:32 AM, Haozhong Zhang wrote:
>
>> +static void nvdimm_check_dax(HostMemoryBackend *hostmem)
>> +{
>> +    char *mem_path =
>> +        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
>> +    char *dev_name = NULL, *sysfs_path = NULL;
>> +    bool is_dax = false;
>> +
>> +    if (!mem_path) {
>> +        goto out;
>> +    }
>> +
>> +    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
>> +        goto out;
>> +    }
>> +
>> +    dev_name = mem_path + strlen("/dev/");
>> +    sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name);
>> +    if (access(sysfs_path, F_OK)) {
>> +        goto out;
>> +    }
>> +
>> +    is_dax = true;
>> +
>
>
> So only dax raw disk has write-persistence guaranty, a pre-allocated
> file which locates on a DAX-enabled filesystem can not?

Correct, it is not guaranteed by any existing filesystem. It may work
by accident today on ext4 with a pre-allocated file, but a filesystem
is otherwise permitted to drop any writes that have not been synced.
Until we get an explicit MMAP_SYNC or METADATA_IMMUTABLE feature into
a filesystem the only interface that we can use to reliably pass
persistent memory through to a guest is device-dax.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device
@ 2017-06-01 13:53     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2017-06-01 13:53 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Haozhong Zhang, qemu-devel, Michael S. Tsirkin, Igor Mammedov,
	Stefan Hajnoczi, linux-fsdevel

[ adding linux-fsdevel ]

On Thu, Jun 1, 2017 at 5:00 AM, Xiao Guangrong <guangrong.xiao@gmail.com> wrote:
>
>
> On 05/26/2017 10:32 AM, Haozhong Zhang wrote:
>
>> +static void nvdimm_check_dax(HostMemoryBackend *hostmem)
>> +{
>> +    char *mem_path =
>> +        object_property_get_str(OBJECT(hostmem), "mem-path", NULL);
>> +    char *dev_name = NULL, *sysfs_path = NULL;
>> +    bool is_dax = false;
>> +
>> +    if (!mem_path) {
>> +        goto out;
>> +    }
>> +
>> +    if (!g_str_has_prefix(mem_path, "/dev/dax")) {
>> +        goto out;
>> +    }
>> +
>> +    dev_name = mem_path + strlen("/dev/");
>> +    sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name);
>> +    if (access(sysfs_path, F_OK)) {
>> +        goto out;
>> +    }
>> +
>> +    is_dax = true;
>> +
>
>
> So only dax raw disk has write-persistence guaranty, a pre-allocated
> file which locates on a DAX-enabled filesystem can not?

Correct, it is not guaranteed by any existing filesystem. It may work
by accident today on ext4 with a pre-allocated file, but a filesystem
is otherwise permitted to drop any writes that have not been synced.
Until we get an explicit MMAP_SYNC or METADATA_IMMUTABLE feature into
a filesystem the only interface that we can use to reliably pass
persistent memory through to a guest is device-dax.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-06-01 13:53 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-26  2:32 [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Haozhong Zhang
2017-05-26  2:32 ` [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment Haozhong Zhang
2017-05-26  6:39   ` Marc-André Lureau
     [not found]     ` <20170526065132.ayapfebbft5lyigd@hz-desktop>
2017-05-26  7:05       ` Marc-André Lureau
     [not found]         ` <20170526071654.kxcqflvpo2tvcrvu@hz-desktop>
2017-05-26 14:24           ` Dan Williams
2017-05-26 18:55             ` Eduardo Habkost
2017-05-26 20:50               ` Dan Williams
2017-05-27  1:04               ` Haozhong Zhang
2017-05-30 12:17             ` Paolo Bonzini
2017-05-30 19:16               ` Dan Williams
2017-05-26  3:34 ` [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Dan Williams
2017-05-26  4:30   ` Haozhong Zhang
2017-05-26 14:28     ` Dan Williams
2017-05-26 14:38   ` Daniel P. Berrange
2017-05-26 15:25     ` Dan Williams
2017-05-27  1:13       ` Haozhong Zhang
2017-05-30  9:20       ` Daniel P. Berrange
2017-05-30 11:41         ` Dan Williams
2017-06-01 12:00 ` Xiao Guangrong
2017-06-01 13:53   ` Dan Williams
2017-06-01 13:53     ` [Qemu-devel] " Dan Williams

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.