On Thu, Apr 20, 2017 at 12:49:21PM -0700, Dan Williams wrote:
> On Tue, Apr 11, 2017 at 7:56 AM, Dan Williams wrote:
> > [ adding Christoph ]
> >
> > On Tue, Apr 11, 2017 at 1:41 AM, Haozhong Zhang wrote:
> >> On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
> >>> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> >>> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> >>> > > This patch series constructs the flush hint address structures
> >>> > > for nvdimm devices in QEMU.
> >>> > >
> >>> > > It's of course not for 2.9. I send it out early in order to get
> >>> > > comments on one point I'm uncertain about (see the detailed
> >>> > > explanation below). Thanks for any comments in advance!
> >>> > >
> >>> > > Background
> >>> > > ---------------
> >>> >
> >>> > Extra background:
> >>> >
> >>> > Flush Hint Addresses are necessary because:
> >>> >
> >>> > 1. Some hardware configurations may require them. In other words,
> >>> >    a cache flush instruction is not enough to persist data.
> >>> >
> >>> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >>> >    metadata changes).
> >>> >
> >>> > Without Flush Hint Addresses only some NVDIMM configurations
> >>> > actually guarantee data persistence.
> >>> >
> >>> > > The flush hint address structure is a substructure of NFIT and
> >>> > > specifies one or more addresses, namely Flush Hint Addresses.
> >>> > > Software can write to any one of these flush hint addresses to
> >>> > > cause any preceding writes to the NVDIMM region to be flushed
> >>> > > out of the intervening platform buffers to the targeted NVDIMM.
> >>> > > More details can be found in ACPI Spec 6.1, Section 5.2.25.8
> >>> > > "Flush Hint Address Structure".
> >>> >
> >>> > Do you have performance data? I'm concerned that the Flush Hint
> >>> > Address hardware interface is not virtualization-friendly.
> >>> >
> >>> > In Linux, drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> >>> >
> >>> >     wmb();
> >>> >     for (i = 0; i < nd_region->ndr_mappings; i++)
> >>> >         if (ndrd_get_flush_wpq(ndrd, i, 0))
> >>> >             writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >>> >     wmb();
> >>> >
> >>> > That looks pretty lightweight - it's an MMIO write between write
> >>> > barriers.
> >>> >
> >>> > This patch implements the MMIO write like this:
> >>> >
> >>> >     void nvdimm_flush(NVDIMMDevice *nvdimm)
> >>> >     {
> >>> >         if (nvdimm->backend_fd != -1) {
> >>> >             /*
> >>> >              * If the backend store is a physical NVDIMM device,
> >>> >              * fsync() will trigger the flush via the flush hint
> >>> >              * on the host device.
> >>> >              */
> >>> >             fsync(nvdimm->backend_fd);
> >>> >         }
> >>> >     }
> >>> >
> >>> > The MMIO store instruction turned into a synchronous fsync(2)
> >>> > system call plus vmexit/vmenter and a QEMU userspace context
> >>> > switch:
> >>> >
> >>> > 1. The vcpu blocks during the fsync(2) system call. The MMIO write
> >>> >    instruction has an unexpected and huge latency.
> >>> >
> >>> > 2. The vcpu thread holds the QEMU global mutex, so all other
> >>> >    threads (including the monitor) are blocked during fsync(2).
> >>> >    Other vcpu threads may block if they vmexit.
> >>> >
> >>> > It is hard to implement this efficiently in QEMU. This is why I
> >>> > said the hardware interface is not virtualization-friendly. It's
> >>> > cheap on real hardware but expensive under virtualization.
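(To put a rough number on that latency gap, a minimal host-side
microbenchmark along these lines can be used. The file path is just a
placeholder and absolute numbers vary wildly with hardware and file
system; it times plain stores against the write()+fsync() round trip
that the emulated flush hint performs:)

    /* flush_cost.c - compare a plain store with write()+fsync() */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    int main(void)
    {
        /* placeholder path for the backing file */
        int fd = open("/tmp/nvdimm-backend.img", O_RDWR | O_CREAT, 0600);
        volatile uint64_t dummy = 0;  /* stand-in for the bare MMIO store */
        uint64_t buf = 0, t0, t1;
        int i;

        if (fd < 0) {
            perror("open");
            return 1;
        }

        t0 = now_ns();
        for (i = 0; i < 1000000; i++)
            dummy = i;
        t1 = now_ns();
        printf("store:       %llu ns/op\n",
               (unsigned long long)(t1 - t0) / 1000000);

        t0 = now_ns();
        for (i = 0; i < 100; i++) {
            pwrite(fd, &buf, sizeof(buf), 0);
            fsync(fd);            /* what the emulated flush hint does */
        }
        t1 = now_ns();
        printf("write+fsync: %llu ns/op\n",
               (unsigned long long)(t1 - t0) / 100);

        close(fd);
        return 0;
    }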
> >>> > We should think about the optimal way of implementing Flush Hint
> >>> > Addresses in QEMU. But if there is no reasonable way to implement
> >>> > them, then I think it's better *not* to implement them, just like
> >>> > the Block Window feature, which is also not
> >>> > virtualization-friendly. Users who want a block device can use
> >>> > virtio-blk. I don't think the NVDIMM Block Window can achieve
> >>> > better performance than virtio-blk under virtualization (although
> >>> > I'm happy to be proven wrong).
> >>> >
> >>> > Some ideas for a faster implementation:
> >>> >
> >>> > 1. Use memory_region_clear_global_locking() to avoid taking the
> >>> >    QEMU global mutex. Little synchronization is necessary as long
> >>> >    as the NVDIMM device isn't hot unplugged (not yet supported
> >>> >    anyway).
> >>> >
> >>> > 2. Can the host kernel provide a way to mmap Address Flush Hints
> >>> >    from the physical NVDIMM in cases where the configuration does
> >>> >    not require host kernel interception? That way QEMU can map the
> >>> >    physical NVDIMM's Address Flush Hints directly into the guest.
> >>> >    The hypervisor is bypassed and performance would be good.
> >>> >
> >>> > I'm not sure there is anything we can do to make the case where
> >>> > the host kernel wants an fsync(2) fast :(.
> >>>
> >>> Good point.
> >>>
> >>> We can assume flush-CPU-cache-to-make-persistence is always
> >>> available on Intel hardware, so the flush hint table is not needed
> >>> if the vNVDIMM is based on a real Intel NVDIMM device.
> >>
> >> We can let users of QEMU (e.g. libvirt) detect whether the backend
> >> device supports ADR, and pass the 'flush-hint' option to QEMU only
> >> if ADR is not supported.
> >
> > There currently is no ACPI mechanism to detect the presence of ADR.
> > Also, you still need the flush for fs metadata management.
> >
> >>> If the vNVDIMM device is based on a regular file, I think fsync is
> >>> the bottleneck rather than this MMIO virtualization. :(
> >>
> >> Yes, fsync() on the regular file is the bottleneck. We may either
> >>
> >> 1/ perform the host-side flush in an asynchronous way which will not
> >>    block the vcpu for too long,
> >>
> >> or
> >>
> >> 2/ not provide a strong durability guarantee for non-NVDIMM backends
> >>    and not emulate the flush hint for the guest at all. (I know 1/
> >>    does not provide a strong durability guarantee either.)
> >
> > or
> >
> > 3/ Use device-dax as a stop-gap until we can get an efficient fsync()
> >    overhead reduction (or bypass) mechanism built and accepted for
> >    filesystem-dax.
>
> I didn't realize we have a bigger problem with host file system fsync
> and that WPQ exits will not save us. Applications that use device-dax
> in the guest may never trigger a WPQ flush, because userspace flushing
> with device-dax is expected to be safe. WPQ flush was never meant to
> be a persistence mechanism the way it is proposed here; it's only
> meant to minimize the fallout from a potential ADR failure. My
> apologies for insinuating that it was viable.
>
> So, until we solve this userspace flushing problem, virtualization
> must not pass through any file except a device-dax instance for any
> production workload.

Okay. That's what I've assumed up until now and I think distros will
document this limitation.

> Also these performance overheads seem prohibitive. We really want to
> take whatever fsync minimization / bypass mechanism we come up with on
> the host into a fast para-virtualized interface for the guest. Guests
> need to be able to avoid hypervisor and host syscall overhead in the
> fast path.
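(For reference, a sketch of what the asynchronous host-side flush in
Haozhong's option 1/ above could look like, written with plain pthreads
rather than QEMU's own threading primitives; the names are made up for
illustration. It has exactly the weakness already noted: the guest's
flush write completes before the data is actually durable:)

    /* async_flush.c - sketch: hand fsync() off to a flush thread */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdbool.h>
    #include <unistd.h>

    struct flush_worker {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             backend_fd;
        bool            flush_pending;
    };

    /* Runs in its own thread and performs the slow fsync() calls. */
    void *flush_thread(void *opaque)
    {
        struct flush_worker *w = opaque;

        pthread_mutex_lock(&w->lock);
        for (;;) {
            while (!w->flush_pending)
                pthread_cond_wait(&w->cond, &w->lock);
            w->flush_pending = false;
            pthread_mutex_unlock(&w->lock);

            fsync(w->backend_fd);  /* slow part, off the vcpu thread */

            pthread_mutex_lock(&w->lock);
        }
        return NULL;
    }

    /*
     * What the flush hint MMIO write handler would call.  It returns
     * immediately, so the guest's store retires quickly, but the data
     * is NOT yet durable at that point (the weakness noted above).
     */
    void flush_hint_write(struct flush_worker *w)
    {
        pthread_mutex_lock(&w->lock);
        w->flush_pending = true;
        pthread_cond_signal(&w->cond);
        pthread_mutex_unlock(&w->lock);
    }

    int main(void)
    {
        struct flush_worker w;
        pthread_t tid;

        pthread_mutex_init(&w.lock, NULL);
        pthread_cond_init(&w.cond, NULL);
        w.flush_pending = false;
        /* placeholder path for the backing file */
        w.backend_fd = open("/tmp/nvdimm-backend.img",
                            O_RDWR | O_CREAT, 0600);

        pthread_create(&tid, NULL, flush_thread, &w);
        flush_hint_write(&w);   /* guest "writes" the flush hint */
        sleep(1);               /* give the worker time to flush */
        return 0;
    }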
It's hard to avoid the hypervisor if the host kernel file system needs
an fsync() to persist everything. There should be a fast path for the
case where the host file is preallocated and no fancy file system
features (e.g. deduplication, copy-on-write snapshots) are in use, so
that the host file system doesn't need fsync().

Stefan
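P.S. The preallocation half of that fast path could look something
like the hypothetical helper below. fallocate(2) reserves every block
up front so guest writes never trigger block allocation, which removes
one common reason the file system would need fsync() for metadata; it
does not help with the deduplication or copy-on-write snapshot cases:

    /* Hypothetical helper, run once when the backing file is opened. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>

    int nvdimm_backend_preallocate(int fd, off_t size)
    {
        /*
         * Reserve all blocks now so that guest writes never cause
         * block allocation, a source of metadata updates that would
         * otherwise need fsync() to become persistent.
         */
        if (fallocate(fd, 0, 0, size) != 0) {
            perror("fallocate");
            return -1;
        }
        return 0;
    }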