From mboxrd@z Thu Jan  1 00:00:00 1970
From: Pankaj Gupta <pagupta@redhat.com>
Subject: Re: [Qemu-devel] KVM "fake DAX" device flushing
Date: Fri, 12 May 2017 02:56:14 -0400 (EDT)
Message-ID: <459420445.7146183.1494572174539.JavaMail.zimbra@redhat.com>
References: <1494431760-6455-1-git-send-email-pagupta@redhat.com> <20170511181703.GC8701@stefanha-x1.localdomain>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8BIT
Cc: kwolf@redhat.com, Haozhong Zhang <haozhong.zhang@intel.com>,
        Xiao Guangrong <xiaoguangrong.eric@gmail.com>,
        kvm@vger.kernel.org, qemu-devel@nongnu.org, pbonzini@redhat.com,
        Dan Williams <dan.j.williams@intel.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:40178 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1755908AbdELG4W (ORCPT <rfc822;kvm@vger.kernel.org>);
        Fri, 12 May 2017 02:56:22 -0400
In-Reply-To: <20170511181703.GC8701@stefanha-x1.localdomain>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>


> 
> On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
> > We are sharing an initial proposal for the
> > 'KVM "fake DAX" device flushing' project for feedback.
> > Got the idea during a discussion with Rik van Riel.
> 
> CCing NVDIMM folks.
> 
> > 
> > Also, please answer the questions in the 'Questions' section.
> > 
> > Abstract :
> > ----------
> > Project idea is to use fake persistent memory with direct
> > access(DAX) in virtual machines. Overall goal of project
> > is to increase the number of virtual machines that can be
> > run on a physical machine, in order to increase the density
> > of customer virtual machines.
> > 
> > The idea is to avoid the guest page cache, and minimize the
> > memory footprint of virtual machines. By presenting a disk
> > image as a nvdimm direct access (DAX) memory region in a
> > virtual machine, the guest OS can avoid using page cache
> > memory for most file accesses.
> > 
> > Problem Statement :
> > ------------------
> > * The guest uses page cache in memory to serve disk read/write
> >   requests quickly. This results in a big memory footprint for
> >   guests, without the host knowing much detail about the guest
> >   memory.
> > 
> > * If guests use direct access(DAX) with fake persistent
> >   storage, the host manages the page cache for guests,
> >   allowing the host to easily reclaim/evict less frequently
> >   used page cache pages without requiring guest cooperation,
> >   like ballooning would.
> > 
> > * Host manages guest cache as an mmapped disk image area in
> >   qemu address space. This region is passed to the guest as a fake
> >   persistent memory range. We need a new flushing interface
> >   to flush this cache to secondary storage to persist guest
> >   writes.
> > 
> > * New asynchronous flushing interface will allow guests to
> >   cause the host to flush the dirty data to the backing storage
> >   file. Systems with pmem storage use the CLFLUSH instruction
> >   to flush a single cache line to persistent storage, and that
> >   takes care of flushing. With fake persistent storage in the
> >   guest, we cannot depend on the CLFLUSH instruction to flush the
> >   entire dirty cache to backing storage. Even if we trap and
> >   emulate the CLFLUSH instruction, the guest vCPU has to wait until
> >   we flush all the dirty memory. Instead, we need to implement a
> >   new asynchronous guest flushing interface, which allows the
> >   guest to specify a larger range to be flushed at once, and
> >   allows the vCPU to run something else while the data is being
> >   synced to disk.
> > 
> > * New flushing interface will consist of a paravirt driver for a
> >   new fake nvdimm-like device, which will process guest flushing
> >   requests like fsync/msync etc. instead of pmem library calls
> >   like clflush. The corresponding device on the host side will be
> >   responsible for flushing requests for guest dirty pages.
> >   Guest can put the current task to sleep and the vCPU can run any
> >   other task while host-side flushing of guest pages is in progress.
> > 
> > Host controlled fake nvdimm DAX to avoid guest page cache :
> > -------------------------------------------------------------
> > * Bypass guest page cache by using a fake persistent storage
> >   like nvdimm & DAX. Guest Read/Write is directly done on
> >   fake persistent storage without involving guest kernel for
> >   caching data.
> > 
> > * Fake nvdimm device passed to guest is backed by a regular
> >   file in host stored in secondary storage.
> > 
> > * QEMU has an implementation of a fake NVDIMM/DAX device. Use this
> >   capability of passing a regular host file (disk) as an nvdimm
> >   device to the guest.
> > 
> > * Nvdimm with DAX works for ext4/xfs filesystem. Supported
> >   filesystem should be DAX compatible.
> > 
> > * As we are using the guest disk as a fake DAX/NVDIMM device, we
> >   need a mechanism for persistence of data backed by a regular
> >   host storage file.
> > 
> > * For live migration use case, if host side backing file is
> >   shared storage, we need to flush the page cache for the disk
> >   image at the destination (new fadvise interface, FADV_INVALIDATE_CACHE?)
> >   before starting execution of the guest on the destination host.
> 
> Good point.  QEMU currently only supports live migration with O_DIRECT.
> I think the problem was that userspace cannot guarantee consistency in
> the general case.  If you find a solution to this problem for fake
> NVDIMM then maybe the QEMU block layer can also begin supporting live
> migration with buffered I/O.
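
For what it is worth, the closest thing available today seems to be
posix_fadvise() with POSIX_FADV_DONTNEED. A minimal sketch follows,
assuming Linux and a local temporary file standing in for the shared
disk image. The helper names (drop_cached_pages, demo_drop) are made up
for illustration, and FADV_INVALIDATE_CACHE above remains only a
proposed name. DONTNEED is advisory and does not discard dirty pages,
so it gives no consistency guarantee on its own.

```python
import os
import tempfile


def drop_cached_pages(path):
    """Ask the kernel to drop clean page cache pages for the whole file.

    This is only a hint: dirty or in-use pages may stay cached.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, len=0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def demo_drop():
    """Write a stand-in image, drop its cache, and read the data back."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"image")
        os.fsync(fd)              # make the pages clean so they can be dropped
        os.close(fd)
        drop_cached_pages(path)
        with open(path, "rb") as f:
            return f.read()       # served from storage or re-read into cache
    finally:
        os.unlink(path)
```

The data must read back unchanged whether or not the kernel honoured
the hint, which is all this sketch checks.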
> 
> > 
> > Design :
> > ---------
> > * In order to not have page cache inside the guest, qemu would:
> > 
> >  1) mmap the guest's disk image and present that disk image to
> >     the guest as a persistent memory range.
> > 
> >  2) Present information to the guest telling it that the persistent
> >     memory range is not physical persistent memory.
> 
> Steps 1 & 2 are already supported by QEMU NVDIMM emulation today.

Yes. I have also tested a guest 'fake DAX' device using QEMU NVDIMM emulation.
> 
> >  3) Present an additional paravirt device alongside the persistent
> >     memory range, that can be used to sync (ranges of) data to disk.
> > 
> > * Guest would use the disk image mostly like a persistent memory
> >   device, with two exceptions:
> > 
> >   1) It would not tell userspace that the files on that device are
> >      persistent memory. This is done so userspace knows to call
> >      fsync/msync, instead of the pmem clflush library call.
> 
> Not sure I agree with hiding the nvdimm nature of the device.  Instead I
> think you need to build this capability into the Linux nvdimm code.
> libpmem will detect these types of devices and issue fsync/msync when
> the application wants to flush.
> 
> >   2) When userspace calls fsync/msync on files on the fake persistent
> >      memory device, issue a request through the paravirt device that
> >      causes the host to flush the device back end.
> > 
> > * When the guest uses fake persistent storage, data updates can
> >   still be in qemu memory. We need a way to flush cached data in
> >   host to backing
> 
> s/qemu memory/host memory/
> 
> I guess you mean that host userspace needs a way to reliably flush an
> address range to the underlying storage.

right.
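
A minimal sketch of that host userspace flush, assuming a shared file
mapping on Linux. The helper names (flush_range, demo) are made up for
illustration and this is not QEMU code. Python's mmap.flush() maps to
msync(), which needs a page-aligned start, so the range is rounded
down to a page boundary first.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def flush_range(mm, fd, offset, length):
    """Write back [offset, offset + length) of a shared mapping."""
    start = (offset // PAGE) * PAGE        # msync needs an aligned start
    end = min(len(mm), offset + length)
    mm.flush(start, end - start)           # msync() on the dirty range
    os.fsync(fd)                           # also flush metadata/device cache


def demo():
    """Dirty one range of a mapped file, flush it, and read it back."""
    with tempfile.NamedTemporaryFile() as f:
        os.ftruncate(f.fileno(), 4 * PAGE)
        mm = mmap.mmap(f.fileno(), 4 * PAGE)   # shared mapping by default on Unix
        mm[PAGE + 10:PAGE + 15] = b"guest"     # a "guest write" into the mapping
        flush_range(mm, f.fileno(), PAGE + 10, 5)
        mm.close()
        with open(f.name, "rb") as check:
            check.seek(PAGE + 10)
            return check.read(5)
```

The open question is still how the guest gets to trigger this on the
host side, since only the host sees the mapping.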
> 
> >   secondary storage.
> > 
> > * Once the guest receives a completion event from the host, it will
> >   allow userspace programs that were waiting on the fsync/msync to
> >   continue running.
> > 
> > * Host is responsible for paging in pages in host backing area for
> >   guest persistent memory as they are accessed by the guest, and
> >   for evicting pages as host memory fills up.
> > 
> > Questions :
> > -----------
> > * What should the flushing interface between guest and host look
> >   like?
> 
> A simple hack for prototyping is to instantiate a virtio-blk-pci for
> the mmapped host file.  The guest can send flush commands on the
> virtio-blk-pci device but will otherwise use the mapped memory directly.

okay. I will check this.
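
To make the request/completion flow I have in mind concrete, here is a
rough userspace model, not a virtio implementation. FakeFlushDevice
and submit_flush are hypothetical names, and a worker thread stands in
for the host side. The "guest" submits a range, keeps running other
work, and only blocks when it needs the completion.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE = mmap.PAGESIZE


class FakeFlushDevice:
    """Hypothetical model of the host side of the flush device."""

    def __init__(self, path, size):
        self.fd = os.open(path, os.O_RDWR)
        self.mem = mmap.mmap(self.fd, size)      # shared mapping of the image
        self.pool = ThreadPoolExecutor(max_workers=1)

    def _do_flush(self, offset, length):
        start = (offset // PAGE) * PAGE          # msync needs an aligned start
        end = min(len(self.mem), offset + length)
        self.mem.flush(start, end - start)
        os.fsync(self.fd)
        return (offset, length)                  # the "completion event"

    def submit_flush(self, offset, length):
        """Queue a flush request and return immediately with a future."""
        return self.pool.submit(self._do_flush, offset, length)

    def close(self):
        self.pool.shutdown(wait=True)
        self.mem.close()
        os.close(self.fd)


def demo_async():
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, 2 * PAGE)
    os.close(fd)
    dev = FakeFlushDevice(path, 2 * PAGE)
    dev.mem[0:4] = b"data"                 # guest write via the mapping
    pending = dev.submit_flush(0, 4)       # fsync/msync path in the guest
    other_work = sum(range(10))            # vCPU runs something else
    completed = pending.result()           # waiting task is woken up here
    dev.close()
    os.unlink(path)
    return completed, other_work
```

With the virtio-blk-pci hack, the flush command on that device would
take the place of submit_flush() here.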
> 
> > * Any suggestions to hook the IO caching code with KVM/Qemu or
> >   thoughts on how we should do it?
> > 
> > * Thinking of implementing a guest para virt driver which will send
> >   guest requests to Qemu to flush data to disk. Not sure at this
> >   point how to tell userspace to work on this device as any regular
> >   device without considering it as persistent device. Any suggestions
> >   on this?
> > 
> > * Have not yet thought about the ballooning impact, but could
> >   this solution be better than ballooning in the long term, as we
> >   will be managing all guest caches from the host side?
> > 
> > * Not sure whether this solution works for ARM and other
> >   architectures, or for Windows?
>