* [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device @ 2017-05-26 2:32 Haozhong Zhang 2017-05-26 2:32 ` [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment Haozhong Zhang ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: Haozhong Zhang @ 2017-05-26 2:32 UTC (permalink / raw) To: qemu-devel Cc: Haozhong Zhang, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong, Stefan Hajnoczi, Dan Williams Applications in Linux guest that use device-dax never trigger flush that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not device-dax, QEMU cannot guarantee the persistence of guest writes. Before solving this flushing problem, QEMU should warn users if the host backend is not device-dax. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com --- Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Cc: Stefan Hajnoczi <stefanha@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> --- Resend because the wrong maintainer email address was used. --- hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c index db896b0bb6..c7bb407f33 100644 --- a/hw/mem/nvdimm.c +++ b/hw/mem/nvdimm.c @@ -26,6 +26,7 @@ #include "qapi/error.h" #include "qapi/visitor.h" #include "hw/mem/nvdimm.h" +#include "qemu/error-report.h" static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name, void *opaque, Error **errp) @@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm) return &nvdimm->nvdimm_mr; } +static void nvdimm_check_dax(HostMemoryBackend *hostmem) +{ + char *mem_path = + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); + char *dev_name = NULL, *sysfs_path = NULL; + bool is_dax = false; + + if (!mem_path) { + goto out; + } + + if (!g_str_has_prefix(mem_path, "/dev/dax")) { + goto out; + } + + dev_name = mem_path + strlen("/dev/"); + sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name); + if (access(sysfs_path, F_OK)) { + goto out; + } + + is_dax = true; + + out: + if (!is_dax) { + error_report("warning: nvdimm backend %s is not DAX device, " + "unable to guarantee persistence of guest writes", + mem_path ?: "RAM"); + } + + g_free(sysfs_path); + g_free(mem_path); +} + static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp) { MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem, errp); NVDIMMDevice *nvdimm = NVDIMM(dimm); uint64_t align, pmem_size, size = memory_region_size(mr); + nvdimm_check_dax(dimm->hostmem); + align = memory_region_get_alignment(mr); pmem_size = size - nvdimm->label_size; -- 2.11.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment 2017-05-26 2:32 [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Haozhong Zhang @ 2017-05-26 2:32 ` Haozhong Zhang 2017-05-26 6:39 ` Marc-André Lureau 2017-05-26 3:34 ` [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Dan Williams 2017-06-01 12:00 ` Xiao Guangrong 2 siblings, 1 reply; 21+ messages in thread From: Haozhong Zhang @ 2017-05-26 2:32 UTC (permalink / raw) To: qemu-devel Cc: Haozhong Zhang, Eduardo Habkost, Igor Mammedov, Paolo Bonzini, Peter Crosthwaite, Richard Henderson, Xiao Guangrong, Stefan Hajnoczi, Dan Williams file_ram_alloc() currently maps the backend file via mmap to a virtual address aligned to the value returned by qemu_fd_getpagesize(). When a DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel mmap implementation may require an alignment larger than what qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail. This commit adds an attribute 'align' to hostmem-file, so that users can specify a proper alignment that satisfies the kernel requirement. If 'align' is not specified or is 0, the value returned by qemu_fd_get_pagesize() will be used as before. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> --- Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Crosthwaite <crosthwaite.peter@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Cc: Stefan Hajnoczi <stefanha@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> --- Resend because the wrong maintainer email address was used. --- backends/hostmem-file.c | 41 ++++++++++++++++++++++++++++++++++++++++- exec.c | 8 +++++++- include/exec/memory.h | 2 ++ memory.c | 2 ++ numa.c | 2 +- 5 files changed, 52 insertions(+), 3 deletions(-) diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c index fc4ef46d11..d44fb41b55 100644 --- a/backends/hostmem-file.c +++ b/backends/hostmem-file.c @@ -33,6 +33,7 @@ struct HostMemoryBackendFile { bool share; char *mem_path; + uint64_t align; }; static void @@ -57,7 +58,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) path = object_get_canonical_path(OBJECT(backend)); memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), path, - backend->size, fb->share, + backend->size, fb->align, fb->share, fb->mem_path, errp); g_free(path); } @@ -104,6 +105,40 @@ static void file_memory_backend_set_share(Object *o, bool value, Error **errp) } static void +file_memory_backend_get_align(Object *o, Visitor *v, const char *name, + void *opaque, Error **errp) +{ + HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o); + uint64_t val = fb->align; + + visit_type_size(v, name, &val, errp); +} + +static void +file_memory_backend_set_align(Object *o, Visitor *v, const char *name, + void *opaque, Error **errp) +{ + HostMemoryBackend *backend = MEMORY_BACKEND(o); + HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o); + Error *local_err = NULL; + uint64_t val; + + if (host_memory_backend_mr_inited(backend)) { + error_setg(&local_err, "cannot change property value"); + goto out; + } + + visit_type_size(v, name, &val, &local_err); + if (local_err) { + goto out; + } + fb->align = val; + + out: + error_propagate(errp, local_err); +} + +static void file_backend_class_init(ObjectClass *oc, void *data) { HostMemoryBackendClass *bc = MEMORY_BACKEND_CLASS(oc); @@ -116,6 +151,10 @@ file_backend_class_init(ObjectClass *oc, void *data) object_class_property_add_str(oc, "mem-path", get_mem_path, set_mem_path, &error_abort); + object_class_property_add(oc, "align", "int", + file_memory_backend_get_align, + file_memory_backend_set_align, + NULL, NULL, &error_abort); } static void file_backend_instance_finalize(Object *o) diff --git a/exec.c b/exec.c index ff16f04f2b..5bb62e2e98 100644 --- a/exec.c +++ b/exec.c @@ -1549,7 +1549,13 @@ static void *file_ram_alloc(RAMBlock *block, } block->page_size = qemu_fd_getpagesize(fd); - block->mr->align = block->page_size; + if (block->mr->align % block->page_size) { + error_setg(errp, "alignment 0x%" PRIx64 " must be " + "multiple of page size 0x%" PRIx64, + block->mr->align, block->page_size); + goto error; + } + block->mr->align = MAX(block->page_size, block->mr->align); #if defined(__s390x__) if (kvm_enabled()) { block->mr->align = MAX(block->mr->align, QEMU_VMALLOC_ALIGN); diff --git a/include/exec/memory.h b/include/exec/memory.h index 99e0f54d86..05d3d0da3b 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -441,6 +441,7 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr, * @name: Region name, becomes part of RAMBlock name used in migration stream * must be unique within any device * @size: size of the region. + * @align: alignment of the region. * @share: %true if memory must be mmaped with the MAP_SHARED flag * @path: the path in which to allocate the RAM. * @errp: pointer to Error*, to store an error if it happens. @@ -449,6 +450,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr, struct Object *owner, const char *name, uint64_t size, + uint64_t align, bool share, const char *path, Error **errp); diff --git a/memory.c b/memory.c index b727f5ec0e..5165b9aa08 100644 --- a/memory.c +++ b/memory.c @@ -1386,6 +1386,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr, struct Object *owner, const char *name, uint64_t size, + uint64_t align, bool share, const char *path, Error **errp) @@ -1394,6 +1395,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr, mr->ram = true; mr->terminates = true; mr->destructor = memory_region_destructor_ram; + mr->align = align; mr->ram_block = qemu_ram_alloc_from_file(size, mr, share, path, errp); mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0; } diff --git a/numa.c b/numa.c index ca731455e9..39a25aa1d2 100644 --- a/numa.c +++ b/numa.c @@ -541,7 +541,7 @@ static void allocate_system_memory_nonnuma(MemoryRegion *mr, Object *owner, if (mem_path) { #ifdef __linux__ Error *err = NULL; - memory_region_init_ram_from_file(mr, owner, name, ram_size, false, + memory_region_init_ram_from_file(mr, owner, name, ram_size, 0, false, mem_path, &err); if (err) { error_report_err(err); -- 2.11.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment 2017-05-26 2:32 ` [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment Haozhong Zhang @ 2017-05-26 6:39 ` Marc-André Lureau [not found] ` <20170526065132.ayapfebbft5lyigd@hz-desktop> 0 siblings, 1 reply; 21+ messages in thread From: Marc-André Lureau @ 2017-05-26 6:39 UTC (permalink / raw) To: Haozhong Zhang, qemu-devel Cc: Eduardo Habkost, Peter Crosthwaite, Stefan Hajnoczi, Xiao Guangrong, Igor Mammedov, Paolo Bonzini, Dan Williams, Richard Henderson Hi On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com> wrote: > file_ram_alloc() currently maps the backend file via mmap to a virtual > address aligned to the value returned by qemu_fd_getpagesize(). When a > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel > mmap implementation may require an alignment larger than what > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail. > How is the accepted value queried? Any chance to configure it automatically? > This commit adds an attribute 'align' to hostmem-file, so that users > can specify a proper alignment that satisfies the kernel requirement. > > If 'align' is not specified or is 0, the value returned by > qemu_fd_get_pagesize() will be used as before. > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> > --- > Cc: Eduardo Habkost <ehabkost@redhat.com> > Cc: Igor Mammedov <imammedo@redhat.com> > Cc: Paolo Bonzini <pbonzini@redhat.com> > Cc: Peter Crosthwaite <crosthwaite.peter@gmail.com> > Cc: Richard Henderson <rth@twiddle.net> > Cc: Xiao Guangrong <guangrong.xiao@gmail.com> > Cc: Stefan Hajnoczi <stefanha@gmail.com> > Cc: Dan Williams <dan.j.williams@intel.com> > --- > Resend because the wrong maintainer email address was used. > --- > backends/hostmem-file.c | 41 ++++++++++++++++++++++++++++++++++++++++- > exec.c | 8 +++++++- > include/exec/memory.h | 2 ++ > memory.c | 2 ++ > numa.c | 2 +- > 5 files changed, 52 insertions(+), 3 deletions(-) > > diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c > index fc4ef46d11..d44fb41b55 100644 > --- a/backends/hostmem-file.c > +++ b/backends/hostmem-file.c > @@ -33,6 +33,7 @@ struct HostMemoryBackendFile { > > bool share; > char *mem_path; > + uint64_t align; > }; > > static void > @@ -57,7 +58,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, > Error **errp) > path = object_get_canonical_path(OBJECT(backend)); > memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), > path, > - backend->size, fb->share, > + backend->size, fb->align, fb->share, > fb->mem_path, errp); > g_free(path); > } > @@ -104,6 +105,40 @@ static void file_memory_backend_set_share(Object *o, > bool value, Error **errp) > } > > static void > +file_memory_backend_get_align(Object *o, Visitor *v, const char *name, > + void *opaque, Error **errp) > +{ > + HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o); > + uint64_t val = fb->align; > + > + visit_type_size(v, name, &val, errp); > +} > + > +static void > +file_memory_backend_set_align(Object *o, Visitor *v, const char *name, > + void *opaque, Error **errp) > +{ > + HostMemoryBackend *backend = MEMORY_BACKEND(o); > + HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o); > + Error *local_err = NULL; > + uint64_t val; > + > + if (host_memory_backend_mr_inited(backend)) { > + error_setg(&local_err, "cannot change property value"); > + goto out; > + } > + > + visit_type_size(v, name, &val, &local_err); > + if (local_err) { > + goto out; > + } > + fb->align = val; > + > + out: > + error_propagate(errp, local_err); > +} > + > +static void > file_backend_class_init(ObjectClass *oc, void *data) > { > HostMemoryBackendClass *bc = MEMORY_BACKEND_CLASS(oc); > @@ -116,6 +151,10 @@ file_backend_class_init(ObjectClass *oc, void *data) > object_class_property_add_str(oc, "mem-path", > get_mem_path, set_mem_path, > &error_abort); > + object_class_property_add(oc, "align", "int", > It would make sense to make it "size" instead of "int" > + file_memory_backend_get_align, > + file_memory_backend_set_align, > + NULL, NULL, &error_abort); > } > > static void file_backend_instance_finalize(Object *o) > diff --git a/exec.c b/exec.c > index ff16f04f2b..5bb62e2e98 100644 > --- a/exec.c > +++ b/exec.c > @@ -1549,7 +1549,13 @@ static void *file_ram_alloc(RAMBlock *block, > } > > block->page_size = qemu_fd_getpagesize(fd); > - block->mr->align = block->page_size; > + if (block->mr->align % block->page_size) { > + error_setg(errp, "alignment 0x%" PRIx64 " must be " > + "multiple of page size 0x%" PRIx64, > + block->mr->align, block->page_size); > + goto error; > + } > + block->mr->align = MAX(block->page_size, block->mr->align); > #if defined(__s390x__) > if (kvm_enabled()) { > block->mr->align = MAX(block->mr->align, QEMU_VMALLOC_ALIGN); > diff --git a/include/exec/memory.h b/include/exec/memory.h > index 99e0f54d86..05d3d0da3b 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -441,6 +441,7 @@ void memory_region_init_resizeable_ram(MemoryRegion > *mr, > * @name: Region name, becomes part of RAMBlock name used in migration > stream > * must be unique within any device > * @size: size of the region. > + * @align: alignment of the region. > * @share: %true if memory must be mmaped with the MAP_SHARED flag > * @path: the path in which to allocate the RAM. > * @errp: pointer to Error*, to store an error if it happens. > @@ -449,6 +450,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr, > struct Object *owner, > const char *name, > uint64_t size, > + uint64_t align, > bool share, > const char *path, > Error **errp); > diff --git a/memory.c b/memory.c > index b727f5ec0e..5165b9aa08 100644 > --- a/memory.c > +++ b/memory.c > @@ -1386,6 +1386,7 @@ void memory_region_init_ram_from_file(MemoryRegion > *mr, > struct Object *owner, > const char *name, > uint64_t size, > + uint64_t align, > bool share, > const char *path, > Error **errp) > @@ -1394,6 +1395,7 @@ void memory_region_init_ram_from_file(MemoryRegion > *mr, > mr->ram = true; > mr->terminates = true; > mr->destructor = memory_region_destructor_ram; > + mr->align = align; > mr->ram_block = qemu_ram_alloc_from_file(size, mr, share, path, errp); > mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0; > } > diff --git a/numa.c b/numa.c > index ca731455e9..39a25aa1d2 100644 > --- a/numa.c > +++ b/numa.c > @@ -541,7 +541,7 @@ static void > allocate_system_memory_nonnuma(MemoryRegion *mr, Object *owner, > if (mem_path) { > #ifdef __linux__ > Error *err = NULL; > - memory_region_init_ram_from_file(mr, owner, name, ram_size, false, > + memory_region_init_ram_from_file(mr, owner, name, ram_size, 0, > false, > mem_path, &err); > if (err) { > error_report_err(err); > -- > 2.11.0 > > otherwise, looks good to me -- Marc-André Lureau ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20170526065132.ayapfebbft5lyigd@hz-desktop>]
* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment [not found] ` <20170526065132.ayapfebbft5lyigd@hz-desktop> @ 2017-05-26 7:05 ` Marc-André Lureau [not found] ` <20170526071654.kxcqflvpo2tvcrvu@hz-desktop> 0 siblings, 1 reply; 21+ messages in thread From: Marc-André Lureau @ 2017-05-26 7:05 UTC (permalink / raw) To: qemu-devel, Eduardo Habkost, Peter Crosthwaite, Stefan Hajnoczi, Xiao Guangrong, Igor Mammedov, Paolo Bonzini, Dan Williams, Richard Henderson Hi On Fri, May 26, 2017 at 10:51 AM Haozhong Zhang <haozhong.zhang@intel.com> wrote: > On 05/26/17 06:39 +0000, Marc-André Lureau wrote: > > Hi > > > > On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com > > > > wrote: > > > > > file_ram_alloc() currently maps the backend file via mmap to a virtual > > > address aligned to the value returned by qemu_fd_getpagesize(). When a > > > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel > > > mmap implementation may require an alignment larger than what > > > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail. > > > > > > > How is the accepted value queried? Any chance to configure it > > automatically? > > Take /dev/dax0.0 for example. The value can be read from > /sys/class/dax/dax0.0/device/dax_region/align. > Should this work be left to management layer, or could qemu figure it out by itself (using udev?) > [..] > > > +static void > > > file_backend_class_init(ObjectClass *oc, void *data) > > > { > > > HostMemoryBackendClass *bc = MEMORY_BACKEND_CLASS(oc); > > > @@ -116,6 +151,10 @@ file_backend_class_init(ObjectClass *oc, void > *data) > > > object_class_property_add_str(oc, "mem-path", > > > get_mem_path, set_mem_path, > > > &error_abort); > > > + object_class_property_add(oc, "align", "int", > > > > > > > It would make sense to make it "size" instead of "int" > > > > will change in the next version > thanks -- Marc-André Lureau ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20170526071654.kxcqflvpo2tvcrvu@hz-desktop>]
* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment [not found] ` <20170526071654.kxcqflvpo2tvcrvu@hz-desktop> @ 2017-05-26 14:24 ` Dan Williams 2017-05-26 18:55 ` Eduardo Habkost 2017-05-30 12:17 ` Paolo Bonzini 0 siblings, 2 replies; 21+ messages in thread From: Dan Williams @ 2017-05-26 14:24 UTC (permalink / raw) To: qemu-devel, Eduardo Habkost, Peter Crosthwaite, Stefan Hajnoczi, Xiao Guangrong, Igor Mammedov, Paolo Bonzini, Dan Williams, Richard Henderson On Fri, May 26, 2017 at 12:16 AM, Haozhong Zhang <haozhong.zhang@intel.com> wrote: > On 05/26/17 07:05 +0000, Marc-André Lureau wrote: >> Hi >> >> On Fri, May 26, 2017 at 10:51 AM Haozhong Zhang <haozhong.zhang@intel.com> >> wrote: >> >> > On 05/26/17 06:39 +0000, Marc-André Lureau wrote: >> > > Hi >> > > >> > > On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com >> > > >> > > wrote: >> > > >> > > > file_ram_alloc() currently maps the backend file via mmap to a virtual >> > > > address aligned to the value returned by qemu_fd_getpagesize(). When a >> > > > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel >> > > > mmap implementation may require an alignment larger than what >> > > > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail. >> > > > >> > > >> > > How is the accepted value queried? Any chance to configure it >> > > automatically? >> > >> > Take /dev/dax0.0 for example. The value can be read from >> > /sys/class/dax/dax0.0/device/dax_region/align. >> > >> >> Should this work be left to management layer, or could qemu figure it out >> by itself (using udev?) >> > > For DAX device only, QEMU can figure out the proper alignment by > itself. However, I'm not sure whether there are other non-DAX cases > requiring non-default alignment, so I think it's better to just add an > interface (i.e. align attribute) in QEMU and let other management > tools (e.g. libvirt?) fill a proper value. I can't imagine any cases where you would want to specify an alignment. If it's regular file mmap any alignment is fine, and if it's device-dax only the configured alignment of the device instance is allowed. So, I don't think this should be a configurable option, just read it from the device instance and you're done. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment 2017-05-26 14:24 ` Dan Williams @ 2017-05-26 18:55 ` Eduardo Habkost 2017-05-26 20:50 ` Dan Williams 2017-05-27 1:04 ` Haozhong Zhang 2017-05-30 12:17 ` Paolo Bonzini 1 sibling, 2 replies; 21+ messages in thread From: Eduardo Habkost @ 2017-05-26 18:55 UTC (permalink / raw) To: Dan Williams Cc: qemu-devel, Peter Crosthwaite, Stefan Hajnoczi, Xiao Guangrong, Igor Mammedov, Paolo Bonzini, Richard Henderson On Fri, May 26, 2017 at 07:24:26AM -0700, Dan Williams wrote: > On Fri, May 26, 2017 at 12:16 AM, Haozhong Zhang > <haozhong.zhang@intel.com> wrote: > > On 05/26/17 07:05 +0000, Marc-André Lureau wrote: > >> Hi > >> > >> On Fri, May 26, 2017 at 10:51 AM Haozhong Zhang <haozhong.zhang@intel.com> > >> wrote: > >> > >> > On 05/26/17 06:39 +0000, Marc-André Lureau wrote: > >> > > Hi > >> > > > >> > > On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com > >> > > > >> > > wrote: > >> > > > >> > > > file_ram_alloc() currently maps the backend file via mmap to a virtual > >> > > > address aligned to the value returned by qemu_fd_getpagesize(). When a > >> > > > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel > >> > > > mmap implementation may require an alignment larger than what > >> > > > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail. > >> > > > > >> > > > >> > > How is the accepted value queried? Any chance to configure it > >> > > automatically? > >> > > >> > Take /dev/dax0.0 for example. The value can be read from > >> > /sys/class/dax/dax0.0/device/dax_region/align. > >> > > >> > >> Should this work be left to management layer, or could qemu figure it out > >> by itself (using udev?) > >> > > > > For DAX device only, QEMU can figure out the proper alignment by > > itself. However, I'm not sure whether there are other non-DAX cases > > requiring non-default alignment, so I think it's better to just add an > > interface (i.e. align attribute) in QEMU and let other management > > tools (e.g. libvirt?) fill a proper value. > > I can't imagine any cases where you would want to specify an > alignment. If it's regular file mmap any alignment is fine, and if > it's device-dax only the configured alignment of the device instance > is allowed. So, I don't think this should be a configurable option, > just read it from the device instance and you're done. Agreed. BTW, there's no generic interface to ask the kernel what's the required mmap() alignment for a file? -- Eduardo ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment 2017-05-26 18:55 ` Eduardo Habkost @ 2017-05-26 20:50 ` Dan Williams 2017-05-27 1:04 ` Haozhong Zhang 1 sibling, 0 replies; 21+ messages in thread From: Dan Williams @ 2017-05-26 20:50 UTC (permalink / raw) To: Eduardo Habkost Cc: qemu-devel, Peter Crosthwaite, Stefan Hajnoczi, Xiao Guangrong, Igor Mammedov, Paolo Bonzini, Richard Henderson On Fri, May 26, 2017 at 11:55 AM, Eduardo Habkost <ehabkost@redhat.com> wrote: > On Fri, May 26, 2017 at 07:24:26AM -0700, Dan Williams wrote: >> On Fri, May 26, 2017 at 12:16 AM, Haozhong Zhang >> <haozhong.zhang@intel.com> wrote: >> > On 05/26/17 07:05 +0000, Marc-André Lureau wrote: >> >> Hi >> >> >> >> On Fri, May 26, 2017 at 10:51 AM Haozhong Zhang <haozhong.zhang@intel.com> >> >> wrote: >> >> >> >> > On 05/26/17 06:39 +0000, Marc-André Lureau wrote: >> >> > > Hi >> >> > > >> >> > > On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com >> >> > > >> >> > > wrote: >> >> > > >> >> > > > file_ram_alloc() currently maps the backend file via mmap to a virtual >> >> > > > address aligned to the value returned by qemu_fd_getpagesize(). When a >> >> > > > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel >> >> > > > mmap implementation may require an alignment larger than what >> >> > > > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail. >> >> > > > >> >> > > >> >> > > How is the accepted value queried? Any chance to configure it >> >> > > automatically? >> >> > >> >> > Take /dev/dax0.0 for example. The value can be read from >> >> > /sys/class/dax/dax0.0/device/dax_region/align. >> >> > >> >> >> >> Should this work be left to management layer, or could qemu figure it out >> >> by itself (using udev?) >> >> >> > >> > For DAX device only, QEMU can figure out the proper alignment by >> > itself. However, I'm not sure whether there are other non-DAX cases >> > requiring non-default alignment, so I think it's better to just add an >> > interface (i.e. align attribute) in QEMU and let other management >> > tools (e.g. libvirt?) fill a proper value. >> >> I can't imagine any cases where you would want to specify an >> alignment. If it's regular file mmap any alignment is fine, and if >> it's device-dax only the configured alignment of the device instance >> is allowed. So, I don't think this should be a configurable option, >> just read it from the device instance and you're done. > > Agreed. > > BTW, there's no generic interface to ask the kernel what's the > required mmap() alignment for a file? It's only this /dev/dax device-file case where the alignment is strictly enforced. The required alignment is advertised in sysfs, for example: # ls -l /dev/dax0.0 crw------- 1 root root 253, 0 May 26 06:41 /dev/dax0.0 # cat /sys/dev/char/253\:0/device/align 2097152 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment 2017-05-26 18:55 ` Eduardo Habkost 2017-05-26 20:50 ` Dan Williams @ 2017-05-27 1:04 ` Haozhong Zhang 1 sibling, 0 replies; 21+ messages in thread From: Haozhong Zhang @ 2017-05-27 1:04 UTC (permalink / raw) To: Eduardo Habkost Cc: Dan Williams, Peter Crosthwaite, Stefan Hajnoczi, qemu-devel, Xiao Guangrong, Paolo Bonzini, Igor Mammedov, Richard Henderson On 05/26/17 15:55 -0300, Eduardo Habkost wrote: > On Fri, May 26, 2017 at 07:24:26AM -0700, Dan Williams wrote: > > On Fri, May 26, 2017 at 12:16 AM, Haozhong Zhang > > <haozhong.zhang@intel.com> wrote: > > > On 05/26/17 07:05 +0000, Marc-André Lureau wrote: > > >> Hi > > >> > > >> On Fri, May 26, 2017 at 10:51 AM Haozhong Zhang <haozhong.zhang@intel.com> > > >> wrote: > > >> > > >> > On 05/26/17 06:39 +0000, Marc-André Lureau wrote: > > >> > > Hi > > >> > > > > >> > > On Fri, May 26, 2017 at 6:34 AM Haozhong Zhang <haozhong.zhang@intel.com > > >> > > > > >> > > wrote: > > >> > > > > >> > > > file_ram_alloc() currently maps the backend file via mmap to a virtual > > >> > > > address aligned to the value returned by qemu_fd_getpagesize(). When a > > >> > > > DAX device (e.g. /dev/dax0.0) is used as the backend file, its kernel > > >> > > > mmap implementation may require an alignment larger than what > > >> > > > qemu_fd_get_pagesize() returns (e.g. 2MB vs. 4KB), and mmap may fail. > > >> > > > > > >> > > > > >> > > How is the accepted value queried? Any chance to configure it > > >> > > automatically? > > >> > > > >> > Take /dev/dax0.0 for example. The value can be read from > > >> > /sys/class/dax/dax0.0/device/dax_region/align. > > >> > > > >> > > >> Should this work be left to management layer, or could qemu figure it out > > >> by itself (using udev?) > > >> > > > > > > For DAX device only, QEMU can figure out the proper alignment by > > > itself. However, I'm not sure whether there are other non-DAX cases > > > requiring non-default alignment, so I think it's better to just add an > > > interface (i.e. align attribute) in QEMU and let other management > > > tools (e.g. libvirt?) fill a proper value. > > > > I can't imagine any cases where you would want to specify an > > alignment. If it's regular file mmap any alignment is fine, and if > > it's device-dax only the configured alignment of the device instance > > is allowed. So, I don't think this should be a configurable option, > > just read it from the device instance and you're done. > > Agreed. > Ok, I'll drop this attribute and let QEMU figure out the alignment. Thanks, Haozhong > BTW, there's no generic interface to ask the kernel what's the > required mmap() alignment for a file? > > -- > Eduardo > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment 2017-05-26 14:24 ` Dan Williams 2017-05-26 18:55 ` Eduardo Habkost @ 2017-05-30 12:17 ` Paolo Bonzini 2017-05-30 19:16 ` Dan Williams 1 sibling, 1 reply; 21+ messages in thread From: Paolo Bonzini @ 2017-05-30 12:17 UTC (permalink / raw) To: Dan Williams, qemu-devel, Eduardo Habkost, Peter Crosthwaite, Stefan Hajnoczi, Xiao Guangrong, Igor Mammedov, Richard Henderson On 26/05/2017 16:24, Dan Williams wrote: >> For DAX device only, QEMU can figure out the proper alignment by >> itself. However, I'm not sure whether there are other non-DAX cases >> requiring non-default alignment, so I think it's better to just add an >> interface (i.e. align attribute) in QEMU and let other management >> tools (e.g. libvirt?) fill a proper value. > I can't imagine any cases where you would want to specify an > alignment. If it's regular file mmap any alignment is fine, and if > it's device-dax only the configured alignment of the device instance > is allowed. So, I don't think this should be a configurable option, > just read it from the device instance and you're done. A 2M or 1G alignment lets KVM use EPT hugepages if the host physical addresses are contiguous and 2M- or 1G-aligned. QEMU only does this for hugetlbfs currently, where the requirement on the host physical addresses is always satisfied. Would the same apply to NVDIMM device DAX? Thanks, Paolo ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment 2017-05-30 12:17 ` Paolo Bonzini @ 2017-05-30 19:16 ` Dan Williams 0 siblings, 0 replies; 21+ messages in thread From: Dan Williams @ 2017-05-30 19:16 UTC (permalink / raw) To: Paolo Bonzini Cc: qemu-devel, Eduardo Habkost, Peter Crosthwaite, Stefan Hajnoczi, Xiao Guangrong, Igor Mammedov, Richard Henderson On Tue, May 30, 2017 at 5:17 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > On 26/05/2017 16:24, Dan Williams wrote: >>> For DAX device only, QEMU can figure out the proper alignment by >>> itself. However, I'm not sure whether there are other non-DAX cases >>> requiring non-default alignment, so I think it's better to just add an >>> interface (i.e. align attribute) in QEMU and let other management >>> tools (e.g. libvirt?) fill a proper value. >> I can't imagine any cases where you would want to specify an >> alignment. If it's regular file mmap any alignment is fine, and if >> it's device-dax only the configured alignment of the device instance >> is allowed. So, I don't think this should be a configurable option, >> just read it from the device instance and you're done. > > A 2M or 1G alignment lets KVM use EPT hugepages if the host physical > addresses are contiguous and 2M- or 1G-aligned. > > QEMU only does this for hugetlbfs currently, where the requirement on > the host physical addresses is always satisfied. Would the same apply > to NVDIMM device DAX? > Yes, I believe it is similar. Device-dax guarantees and enforces (mmap area must be aligned) a given fault granularity. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device 2017-05-26 2:32 [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Haozhong Zhang 2017-05-26 2:32 ` [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment Haozhong Zhang @ 2017-05-26 3:34 ` Dan Williams 2017-05-26 4:30 ` Haozhong Zhang 2017-05-26 14:38 ` Daniel P. Berrange 2017-06-01 12:00 ` Xiao Guangrong 2 siblings, 2 replies; 21+ messages in thread From: Dan Williams @ 2017-05-26 3:34 UTC (permalink / raw) To: Haozhong Zhang Cc: qemu-devel, Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong, Stefan Hajnoczi On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang <haozhong.zhang@intel.com> wrote: > Applications in Linux guest that use device-dax never trigger flush > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not > device-dax, QEMU cannot guarantee the persistence of guest writes. > Before solving this flushing problem, QEMU should warn users if the > host backend is not device-dax. I think this needs to be stronger than a "warn" it needs to be explicitly forbidden when it is known to be unsafe. > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> > Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com > --- > Cc: "Michael S. Tsirkin" <mst@redhat.com> > Cc: Igor Mammedov <imammedo@redhat.com> > Cc: Xiao Guangrong <guangrong.xiao@gmail.com> > Cc: Stefan Hajnoczi <stefanha@gmail.com> > Cc: Dan Williams <dan.j.williams@intel.com> > --- > Resend because the wrong maintainer email address was used. > --- > hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++ > 1 file changed, 37 insertions(+) > > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c > index db896b0bb6..c7bb407f33 100644 > --- a/hw/mem/nvdimm.c > +++ b/hw/mem/nvdimm.c > @@ -26,6 +26,7 @@ > #include "qapi/error.h" > #include "qapi/visitor.h" > #include "hw/mem/nvdimm.h" > +#include "qemu/error-report.h" > > static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name, > void *opaque, Error **errp) > @@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm) > return &nvdimm->nvdimm_mr; > } > > +static void nvdimm_check_dax(HostMemoryBackend *hostmem) > +{ > + char *mem_path = > + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); > + char *dev_name = NULL, *sysfs_path = NULL; > + bool is_dax = false; > + > + if (!mem_path) { > + goto out; > + } > + > + if (!g_str_has_prefix(mem_path, "/dev/dax")) { > + goto out; > + } What if we're using a symlink to a device-dax instance? The should explicitly be looking up the major / minor number of the device (after following any symlinks) and verifying that it is device-dax by checking /sys/dev/char/$major:$minor refers to a device-dax instance. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device 2017-05-26 3:34 ` [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Dan Williams @ 2017-05-26 4:30 ` Haozhong Zhang 2017-05-26 14:28 ` Dan Williams 2017-05-26 14:38 ` Daniel P. Berrange 1 sibling, 1 reply; 21+ messages in thread From: Haozhong Zhang @ 2017-05-26 4:30 UTC (permalink / raw) To: Dan Williams Cc: Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong, qemu-devel, Michael S. Tsirkin On 05/25/17 20:34 -0700, Dan Williams wrote: > On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang > <haozhong.zhang@intel.com> wrote: > > Applications in Linux guest that use device-dax never trigger flush > > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not > > device-dax, QEMU cannot guarantee the persistence of guest writes. > > Before solving this flushing problem, QEMU should warn users if the > > host backend is not device-dax. > > I think this needs to be stronger than a "warn" it needs to be > explicitly forbidden when it is known to be unsafe. > I understand your worry and am not object to your suggestion, but forbidden will change the existing behavior that allows such unsafe usage. Let's wait for other maintainers' comments. > > > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> > > Message-id: CAPcyv4hV2-ZW8SMCRtD0P_86KgR3DHOvNe+6T5SY2u7wXg3gEg@mail.gmail.com > > --- > > Cc: "Michael S. Tsirkin" <mst@redhat.com> > > Cc: Igor Mammedov <imammedo@redhat.com> > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com> > > Cc: Stefan Hajnoczi <stefanha@gmail.com> > > Cc: Dan Williams <dan.j.williams@intel.com> > > --- > > Resend because the wrong maintainer email address was used. > > --- > > hw/mem/nvdimm.c | 37 +++++++++++++++++++++++++++++++++++++ > > 1 file changed, 37 insertions(+) > > > > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c > > index db896b0bb6..c7bb407f33 100644 > > --- a/hw/mem/nvdimm.c > > +++ b/hw/mem/nvdimm.c > > @@ -26,6 +26,7 @@ > > #include "qapi/error.h" > > #include "qapi/visitor.h" > > #include "hw/mem/nvdimm.h" > > +#include "qemu/error-report.h" > > > > static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name, > > void *opaque, Error **errp) > > @@ -78,12 +79,48 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm) > > return &nvdimm->nvdimm_mr; > > } > > > > +static void nvdimm_check_dax(HostMemoryBackend *hostmem) > > +{ > > + char *mem_path = > > + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); > > + char *dev_name = NULL, *sysfs_path = NULL; > > + bool is_dax = false; > > + > > + if (!mem_path) { > > + goto out; > > + } > > + > > + if (!g_str_has_prefix(mem_path, "/dev/dax")) { > > + goto out; > > + } > > What if we're using a symlink to a device-dax instance? The should > explicitly be looking up the major / minor number of the device (after > following any symlinks) and verifying that it is device-dax by > checking /sys/dev/char/$major:$minor refers to a device-dax instance. > I'll follow to your suggestion in the next version. Thanks, Haozhong ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device 2017-05-26 4:30 ` Haozhong Zhang @ 2017-05-26 14:28 ` Dan Williams 0 siblings, 0 replies; 21+ messages in thread From: Dan Williams @ 2017-05-26 14:28 UTC (permalink / raw) To: Dan Williams, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong, qemu-devel, Michael S. Tsirkin On Thu, May 25, 2017 at 9:30 PM, Haozhong Zhang <haozhong.zhang@intel.com> wrote: > On 05/25/17 20:34 -0700, Dan Williams wrote: >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang >> <haozhong.zhang@intel.com> wrote: >> > Applications in Linux guest that use device-dax never trigger flush >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not >> > device-dax, QEMU cannot guarantee the persistence of guest writes. >> > Before solving this flushing problem, QEMU should warn users if the >> > host backend is not device-dax. >> >> I think this needs to be stronger than a "warn" it needs to be >> explicitly forbidden when it is known to be unsafe. >> > > I understand your worry and am not object to your suggestion, but > forbidden will change the existing behavior that allows such unsafe > usage. Let's wait for other maintainers' comments. Changing existing behavior is the point. Even if a qemu maintainer thought it was acceptable to allow the dangerous behavior with a warning I would continue to assert that we need to block it by default. I'm not opposed to adding a new configuration option like "allow dangerous pmem" to override the new default behavior, but I think this patch is incorrect to just emit a message. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device 2017-05-26 3:34 ` [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Dan Williams 2017-05-26 4:30 ` Haozhong Zhang @ 2017-05-26 14:38 ` Daniel P. Berrange 2017-05-26 15:25 ` Dan Williams 1 sibling, 1 reply; 21+ messages in thread From: Daniel P. Berrange @ 2017-05-26 14:38 UTC (permalink / raw) To: Dan Williams Cc: Haozhong Zhang, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong, qemu-devel, Michael S. Tsirkin On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote: > On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang > <haozhong.zhang@intel.com> wrote: > > Applications in Linux guest that use device-dax never trigger flush > > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not > > device-dax, QEMU cannot guarantee the persistence of guest writes. > > Before solving this flushing problem, QEMU should warn users if the > > host backend is not device-dax. > > I think this needs to be stronger than a "warn" it needs to be > explicitly forbidden when it is known to be unsafe. I think users should have the choice in what they want to do - QEMU should not artifically block it. There are plenty of things in QEMU that are potentially unsafe in some usage scenarios, but we just document how to use them in a safe manner & any caveats that apply. Higher level applications above QEMU can then consider how they want to apply a usage policy to meet the needs of their usage scenario. Having an emulated DAX device that doesn't guarantee persistence is no different to having an emulated disk device that never flushes to underlying host storage. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device 2017-05-26 14:38 ` Daniel P. Berrange @ 2017-05-26 15:25 ` Dan Williams 2017-05-27 1:13 ` Haozhong Zhang 2017-05-30 9:20 ` Daniel P. Berrange 0 siblings, 2 replies; 21+ messages in thread From: Dan Williams @ 2017-05-26 15:25 UTC (permalink / raw) To: Daniel P. Berrange Cc: Haozhong Zhang, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong, qemu-devel, Michael S. Tsirkin On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote: > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote: >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang >> <haozhong.zhang@intel.com> wrote: >> > Applications in Linux guest that use device-dax never trigger flush >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not >> > device-dax, QEMU cannot guarantee the persistence of guest writes. >> > Before solving this flushing problem, QEMU should warn users if the >> > host backend is not device-dax. >> >> I think this needs to be stronger than a "warn" it needs to be >> explicitly forbidden when it is known to be unsafe. > > I think users should have the choice in what they want to do - > QEMU should not artifically block it. There are plenty of things > in QEMU that are potentially unsafe in some usage scenarios, but > we just document how to use them in a safe manner & any caveats > that apply. Higher level applications above QEMU can then consider > how they want to apply a usage policy to meet the needs of their > usage scenario. > > Having an emulated DAX device that doesn't guarantee persistence > is no different to having an emulated disk device that never flushes > to underlying host storage. > It is different in the sense that the contract of when the guest should assume persistence is specified by when the write completes to the virtual disk. In the case of the virtual NFIT we are currently lying to the guest about that platform persistence guarantee even if the hypervisor is emulating pmem with volatile memory. In other words, I agree that it should be possible to tell the guest to assume it is pmem when it is not, but we need more granularity in the configuration to communicate the capabilities correctly to the guest. It seems the NFIT memory device state flag ACPI_NFIT_MEM_NOT_ARMED is a good way to communicate pmem safety to the guest. Can we add a new knob to control the polarity of that flag and make ACPI_NFIT_MEM_NOT_ARMED being set the default case when using regular file mmap and clear the flag by default in the device-dax case? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device 2017-05-26 15:25 ` Dan Williams @ 2017-05-27 1:13 ` Haozhong Zhang 2017-05-30 9:20 ` Daniel P. Berrange 1 sibling, 0 replies; 21+ messages in thread From: Haozhong Zhang @ 2017-05-27 1:13 UTC (permalink / raw) To: Dan Williams Cc: Daniel P. Berrange, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong, qemu-devel, Michael S. Tsirkin On 05/26/17 08:25 -0700, Dan Williams wrote: > On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote: > > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote: > >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang > >> <haozhong.zhang@intel.com> wrote: > >> > Applications in Linux guest that use device-dax never trigger flush > >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not > >> > device-dax, QEMU cannot guarantee the persistence of guest writes. > >> > Before solving this flushing problem, QEMU should warn users if the > >> > host backend is not device-dax. > >> > >> I think this needs to be stronger than a "warn" it needs to be > >> explicitly forbidden when it is known to be unsafe. > > > > I think users should have the choice in what they want to do - > > QEMU should not artifically block it. There are plenty of things > > in QEMU that are potentially unsafe in some usage scenarios, but > > we just document how to use them in a safe manner & any caveats > > that apply. Higher level applications above QEMU can then consider > > how they want to apply a usage policy to meet the needs of their > > usage scenario. > > > > Having an emulated DAX device that doesn't guarantee persistence > > is no different to having an emulated disk device that never flushes > > to underlying host storage. > > > > It is different in the sense that the contract of when the guest > should assume persistence is specified by when the write completes to > the virtual disk. In the case of the virtual NFIT we are currently > lying to the guest about that platform persistence guarantee even if > the hypervisor is emulating pmem with volatile memory. > > In other words, I agree that it should be possible to tell the guest > to assume it is pmem when it is not, but we need more granularity in > the configuration to communicate the capabilities correctly to the > guest. It seems the NFIT memory device state flag > ACPI_NFIT_MEM_NOT_ARMED is a good way to communicate pmem safety to > the guest. Can we add a new knob to control the polarity of that flag > and make ACPI_NFIT_MEM_NOT_ARMED being set the default case when using > regular file mmap and clear the flag by default in the device-dax > case? Yes, we can set ACPI_NFIT_MEM_NOT_ARMED if the host backend is not device-dax. Thanks, Haozhong ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device 2017-05-26 15:25 ` Dan Williams 2017-05-27 1:13 ` Haozhong Zhang @ 2017-05-30 9:20 ` Daniel P. Berrange 2017-05-30 11:41 ` Dan Williams 1 sibling, 1 reply; 21+ messages in thread From: Daniel P. Berrange @ 2017-05-30 9:20 UTC (permalink / raw) To: Dan Williams Cc: Haozhong Zhang, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong, qemu-devel, Michael S. Tsirkin On Fri, May 26, 2017 at 08:25:20AM -0700, Dan Williams wrote: > On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote: > > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote: > >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang > >> <haozhong.zhang@intel.com> wrote: > >> > Applications in Linux guest that use device-dax never trigger flush > >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not > >> > device-dax, QEMU cannot guarantee the persistence of guest writes. > >> > Before solving this flushing problem, QEMU should warn users if the > >> > host backend is not device-dax. > >> > >> I think this needs to be stronger than a "warn" it needs to be > >> explicitly forbidden when it is known to be unsafe. > > > > I think users should have the choice in what they want to do - > > QEMU should not artifically block it. There are plenty of things > > in QEMU that are potentially unsafe in some usage scenarios, but > > we just document how to use them in a safe manner & any caveats > > that apply. Higher level applications above QEMU can then consider > > how they want to apply a usage policy to meet the needs of their > > usage scenario. > > > > Having an emulated DAX device that doesn't guarantee persistence > > is no different to having an emulated disk device that never flushes > > to underlying host storage. > > > > It is different in the sense that the contract of when the guest > should assume persistence is specified by when the write completes to > the virtual disk. In the case of the virtual NFIT we are currently > lying to the guest about that platform persistence guarantee even if > the hypervisor is emulating pmem with volatile memory. We equally lie to the guest about persistence of disks, when the disk is run with certain cache= settings, or when the disk backing file is on tmpfs. It is simply a choice the mgmt application makes about whether to provide backing storage that is capable of satsifying the guarantees implied by the guest device model. So I'm still not seeing a compelling reason to artifically block this with DAX. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device 2017-05-30 9:20 ` Daniel P. Berrange @ 2017-05-30 11:41 ` Dan Williams 0 siblings, 0 replies; 21+ messages in thread From: Dan Williams @ 2017-05-30 11:41 UTC (permalink / raw) To: Daniel P. Berrange Cc: Haozhong Zhang, Stefan Hajnoczi, Igor Mammedov, Xiao Guangrong, qemu-devel, Michael S. Tsirkin On Tue, May 30, 2017 at 2:20 AM, Daniel P. Berrange <berrange@redhat.com> wrote: > On Fri, May 26, 2017 at 08:25:20AM -0700, Dan Williams wrote: >> On Fri, May 26, 2017 at 7:38 AM, Daniel P. Berrange <berrange@redhat.com> wrote: >> > On Thu, May 25, 2017 at 08:34:23PM -0700, Dan Williams wrote: >> >> On Thu, May 25, 2017 at 7:32 PM, Haozhong Zhang >> >> <haozhong.zhang@intel.com> wrote: >> >> > Applications in Linux guest that use device-dax never trigger flush >> >> > that can be trapped by KVM/QEMU. Meanwhile, if the host backend is not >> >> > device-dax, QEMU cannot guarantee the persistence of guest writes. >> >> > Before solving this flushing problem, QEMU should warn users if the >> >> > host backend is not device-dax. >> >> >> >> I think this needs to be stronger than a "warn" it needs to be >> >> explicitly forbidden when it is known to be unsafe. >> > >> > I think users should have the choice in what they want to do - >> > QEMU should not artifically block it. There are plenty of things >> > in QEMU that are potentially unsafe in some usage scenarios, but >> > we just document how to use them in a safe manner & any caveats >> > that apply. Higher level applications above QEMU can then consider >> > how they want to apply a usage policy to meet the needs of their >> > usage scenario. >> > >> > Having an emulated DAX device that doesn't guarantee persistence >> > is no different to having an emulated disk device that never flushes >> > to underlying host storage. >> > >> >> It is different in the sense that the contract of when the guest >> should assume persistence is specified by when the write completes to >> the virtual disk. In the case of the virtual NFIT we are currently >> lying to the guest about that platform persistence guarantee even if >> the hypervisor is emulating pmem with volatile memory. > > We equally lie to the guest about persistence of disks, when the > disk is run with certain cache= settings, or when the disk backing file > is on tmpfs. It is simply a choice the mgmt application makes about > whether to provide backing storage that is capable of satsifying the > guarantees implied by the guest device model. So I'm still not seeing > a compelling reason to artifically block this with DAX. Agreed, artificially blocking is not a good path, but for completeness we still need a way to control the ACPI "not armed" health state flag and a sane default for that parameter. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device 2017-05-26 2:32 [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Haozhong Zhang 2017-05-26 2:32 ` [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment Haozhong Zhang 2017-05-26 3:34 ` [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Dan Williams @ 2017-06-01 12:00 ` Xiao Guangrong 2017-06-01 13:53 ` [Qemu-devel] " Dan Williams 2 siblings, 1 reply; 21+ messages in thread From: Xiao Guangrong @ 2017-06-01 12:00 UTC (permalink / raw) To: Haozhong Zhang, qemu-devel Cc: Michael S. Tsirkin, Igor Mammedov, Stefan Hajnoczi, Dan Williams On 05/26/2017 10:32 AM, Haozhong Zhang wrote: > +static void nvdimm_check_dax(HostMemoryBackend *hostmem) > +{ > + char *mem_path = > + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); > + char *dev_name = NULL, *sysfs_path = NULL; > + bool is_dax = false; > + > + if (!mem_path) { > + goto out; > + } > + > + if (!g_str_has_prefix(mem_path, "/dev/dax")) { > + goto out; > + } > + > + dev_name = mem_path + strlen("/dev/"); > + sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name); > + if (access(sysfs_path, F_OK)) { > + goto out; > + } > + > + is_dax = true; > + So only dax raw disk has write-persistence guaranty, a pre-allocated file which locates on a DAX-enabled filesystem can not? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device 2017-06-01 12:00 ` Xiao Guangrong @ 2017-06-01 13:53 ` Dan Williams 0 siblings, 0 replies; 21+ messages in thread From: Dan Williams @ 2017-06-01 13:53 UTC (permalink / raw) To: Xiao Guangrong Cc: Haozhong Zhang, qemu-devel, Michael S. Tsirkin, Igor Mammedov, Stefan Hajnoczi, linux-fsdevel [ adding linux-fsdevel ] On Thu, Jun 1, 2017 at 5:00 AM, Xiao Guangrong <guangrong.xiao@gmail.com> wrote: > > > On 05/26/2017 10:32 AM, Haozhong Zhang wrote: > >> +static void nvdimm_check_dax(HostMemoryBackend *hostmem) >> +{ >> + char *mem_path = >> + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); >> + char *dev_name = NULL, *sysfs_path = NULL; >> + bool is_dax = false; >> + >> + if (!mem_path) { >> + goto out; >> + } >> + >> + if (!g_str_has_prefix(mem_path, "/dev/dax")) { >> + goto out; >> + } >> + >> + dev_name = mem_path + strlen("/dev/"); >> + sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name); >> + if (access(sysfs_path, F_OK)) { >> + goto out; >> + } >> + >> + is_dax = true; >> + > > > So only dax raw disk has write-persistence guaranty, a pre-allocated > file which locates on a DAX-enabled filesystem can not? Correct, it is not guaranteed by any existing filesystem. It may work by accident today on ext4 with a pre-allocated file, but a filesystem is otherwise permitted to drop any writes that have not been synced. Until we get an explicit MMAP_SYNC or METADATA_IMMUTABLE feature into a filesystem the only interface that we can use to reliably pass persistent memory through to a guest is device-dax. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device @ 2017-06-01 13:53 ` Dan Williams 0 siblings, 0 replies; 21+ messages in thread From: Dan Williams @ 2017-06-01 13:53 UTC (permalink / raw) To: Xiao Guangrong Cc: Haozhong Zhang, qemu-devel, Michael S. Tsirkin, Igor Mammedov, Stefan Hajnoczi, linux-fsdevel [ adding linux-fsdevel ] On Thu, Jun 1, 2017 at 5:00 AM, Xiao Guangrong <guangrong.xiao@gmail.com> wrote: > > > On 05/26/2017 10:32 AM, Haozhong Zhang wrote: > >> +static void nvdimm_check_dax(HostMemoryBackend *hostmem) >> +{ >> + char *mem_path = >> + object_property_get_str(OBJECT(hostmem), "mem-path", NULL); >> + char *dev_name = NULL, *sysfs_path = NULL; >> + bool is_dax = false; >> + >> + if (!mem_path) { >> + goto out; >> + } >> + >> + if (!g_str_has_prefix(mem_path, "/dev/dax")) { >> + goto out; >> + } >> + >> + dev_name = mem_path + strlen("/dev/"); >> + sysfs_path = g_strdup_printf("/sys/class/dax/%s", dev_name); >> + if (access(sysfs_path, F_OK)) { >> + goto out; >> + } >> + >> + is_dax = true; >> + > > > So only dax raw disk has write-persistence guaranty, a pre-allocated > file which locates on a DAX-enabled filesystem can not? Correct, it is not guaranteed by any existing filesystem. It may work by accident today on ext4 with a pre-allocated file, but a filesystem is otherwise permitted to drop any writes that have not been synced. Until we get an explicit MMAP_SYNC or METADATA_IMMUTABLE feature into a filesystem the only interface that we can use to reliably pass persistent memory through to a guest is device-dax. ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2017-06-01 13:53 UTC | newest] Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-05-26 2:32 [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Haozhong Zhang 2017-05-26 2:32 ` [Qemu-devel] [RESEND PATCH 2/2] hostmem-file: add an attribute 'align' to set its alignment Haozhong Zhang 2017-05-26 6:39 ` Marc-André Lureau [not found] ` <20170526065132.ayapfebbft5lyigd@hz-desktop> 2017-05-26 7:05 ` Marc-André Lureau [not found] ` <20170526071654.kxcqflvpo2tvcrvu@hz-desktop> 2017-05-26 14:24 ` Dan Williams 2017-05-26 18:55 ` Eduardo Habkost 2017-05-26 20:50 ` Dan Williams 2017-05-27 1:04 ` Haozhong Zhang 2017-05-30 12:17 ` Paolo Bonzini 2017-05-30 19:16 ` Dan Williams 2017-05-26 3:34 ` [Qemu-devel] [RESEND PATCH 1/2] nvdimm: warn if the backend is not a DAX device Dan Williams 2017-05-26 4:30 ` Haozhong Zhang 2017-05-26 14:28 ` Dan Williams 2017-05-26 14:38 ` Daniel P. Berrange 2017-05-26 15:25 ` Dan Williams 2017-05-27 1:13 ` Haozhong Zhang 2017-05-30 9:20 ` Daniel P. Berrange 2017-05-30 11:41 ` Dan Williams 2017-06-01 12:00 ` Xiao Guangrong 2017-06-01 13:53 ` Dan Williams 2017-06-01 13:53 ` [Qemu-devel] " Dan Williams
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.