* KVM "fake DAX" device flushing
@ 2017-05-10 15:56 ` Pankaj Gupta
  0 siblings, 0 replies; 18+ messages in thread
From: Pankaj Gupta @ 2017-05-10 15:56 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: riel, pbonzini, kwolf, stefanha

We are sharing an initial proposal for the 
'KVM "fake DAX" device flushing' project for feedback. 
The idea came out of a discussion with Rik van Riel. 

We would also appreciate answers to the 'Questions' section.

Abstract : 
----------
The idea of this project is to use fake persistent memory with 
direct access (DAX) in virtual machines. The overall goal is to 
increase the number of virtual machines that can be run on a 
physical machine, i.e. to increase the density of customer 
virtual machines.

The approach is to avoid the guest page cache and minimize the 
memory footprint of virtual machines. By presenting a disk 
image as an nvdimm direct access (DAX) memory region to a 
virtual machine, the guest OS can avoid using page cache 
memory for most file accesses.

Problem Statement :
------------------
* The guest uses an in-memory page cache to serve disk read/write 
  requests quickly. This results in a large memory footprint for 
  guests, without the host knowing much about how that guest 
  memory is being used. 

* If guests use direct access (DAX) to fake persistent 
  storage, the host manages the page cache for them, 
  allowing the host to easily reclaim/evict less frequently 
  used page cache pages without requiring guest cooperation, 
  as ballooning would.

* The host manages the guest cache as an mmap'ed disk image area in 
  the qemu address space. This region is passed to the guest as a 
  fake persistent memory range. We need a new flushing interface 
  to flush this cache to secondary storage in order to persist 
  guest writes.

* A new asynchronous flushing interface will allow guests to 
  make the host flush the dirty data to the backing storage file. 
  Systems with real pmem storage use the CLFLUSH instruction 
  to flush a single cache line to persistent storage, and that 
  is sufficient there. With fake persistent storage in the guest 
  we cannot depend on the CLFLUSH instruction to flush the entire 
  dirty cache to backing storage. Even if we trap and emulate 
  CLFLUSH, the guest vCPU has to wait until we have flushed all 
  the dirty memory. Instead, we need to implement a new 
  asynchronous guest flushing interface, which allows the guest 
  to specify a larger range to be flushed at once, and allows 
  the vCPU to run something else while the data is being synced 
  to disk. 

* The new flushing interface will consist of a paravirt driver for a 
  new fake nvdimm-like device, which will service guest flushing
  requests such as fsync/msync instead of pmem library calls 
  like clflush. The corresponding device on the host side will be 
  responsible for handling flush requests for the guest's dirty pages. 
  The guest can put the current task to sleep, and the vCPU can run 
  another task, while the host-side flush of the guest's pages is in 
  progress. A rough sketch of such a flush request follows.
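
  Purely as an illustration (no guest/host interface has been defined
  yet, and every name below is hypothetical), a flush request for such
  a paravirt device could simply describe a byte range of the fake
  persistent memory region, with the host returning a status once that
  range has been written back:

      /* Illustrative sketch only -- hypothetical names, no real ABI. */
      #include <stdint.h>

      /* Guest -> host: flush a byte range of the fake pmem region,
       * e.g. queued over a virtio ring. */
      struct fake_dax_flush_req {
              uint64_t offset;    /* byte offset into the fake pmem range */
              uint64_t len;       /* length of the range to flush */
      };

      /* Host -> guest completion. */
      struct fake_dax_flush_resp {
              int32_t ret;        /* 0 on success, -errno on failure */
      };

  The guest driver would queue such a request when userspace calls
  fsync/msync on a file on the fake DAX device, put the calling task
  to sleep, and wake it up once the host reports completion.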

Host controlled fake nvdimm DAX to avoid guest page cache :
-------------------------------------------------------------
* Bypass the guest page cache by using fake persistent storage, 
  i.e. nvdimm with DAX. Guest reads/writes go directly to the 
  fake persistent storage without the guest kernel caching the 
  data.

* The fake nvdimm device passed to the guest is backed by a regular 
  host file stored on secondary storage.

* Qemu already implements an emulated NVDIMM/DAX device. We use this 
  capability to pass a regular host file (disk image) to the guest as 
  an nvdimm device (an example invocation follows this list).

* Nvdimm with DAX works with the ext4 and xfs filesystems; the 
  filesystem used in the guest must be DAX compatible. 

* As we are presenting the guest disk as a fake DAX/NVDIMM device, 
  we need a mechanism to persist its data to the regular host 
  storage file backing it.

* For the live migration use case, if the host-side backing file is on 
  shared storage, we need to invalidate the page cache for the disk 
  image at the destination (a new fadvise interface, FADV_INVALIDATE_CACHE?) 
  before starting execution of the guest on the destination host.
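
  For reference, recent qemu (2.6 and later) can already expose a
  regular host file as an emulated nvdimm roughly as follows (paths
  and sizes are examples only):

      qemu-system-x86_64 -machine pc,nvdimm=on \
          -m 4G,slots=4,maxmem=32G \
          -object memory-backend-file,id=mem1,share=on,mem-path=/path/to/disk.img,size=4G \
          -device nvdimm,id=nvdimm1,memdev=mem1

  share=on is what lets guest stores reach the host page cache pages
  backing the image file, so that a host-side flush can later write
  them out.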

Design :
---------
* In order not to have a page cache inside the guest, qemu would:

 1) mmap the guest's disk image and present that disk image to 
    the guest as a persistent memory range.

 2) Present information to the guest telling it that the persistent 
    memory range is not physical persistent memory.

 3) Present an additional paravirt device alongside the persistent 
    memory range, that can be used to sync (ranges of) data to disk.

* The guest would use the disk image mostly like a persistent memory 
  device, with two exceptions:

  1) It would not tell userspace that the files on that device are 
     persistent memory. This is done so userspace knows to call 
     fsync/msync, instead of the pmem clflush library call.

  2) When userspace calls fsync/msync on files on the fake persistent 
     memory device, the guest issues a request through the paravirt 
     device that causes the host to flush the device back end.

* Even though the guest uses fake persistent storage, its data updates 
  can still be sitting in qemu memory. We need a way to flush this 
  cached data in the host to the backing secondary storage (see the 
  host-side sketch after this list).

* Once the guest receives a completion event from the host, it will 
  allow userspace programs that were waiting on the fsync/msync to 
  continue running.

* The host is responsible for paging in pages of the host backing area 
  for guest persistent memory as they are accessed by the guest, and 
  for evicting pages as host memory fills up.
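
  To make the host-side responsibility concrete, here is a minimal
  sketch of how qemu could service one guest flush request, assuming
  the disk image is mmap'ed with MAP_SHARED at 'base' and also open as
  file descriptor 'fd' (names are illustrative, error handling and the
  request transport are omitted, and 'offset' is assumed to be page
  aligned):

      #include <errno.h>
      #include <stdint.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* Write one guest-requested range of the MAP_SHARED image
       * mapping back to the image file and wait for it to reach
       * storage. */
      static int flush_image_range(void *base, int fd,
                                   uint64_t offset, uint64_t len)
      {
              /* Write back the dirty pages of this range of the mapping. */
              if (msync((char *)base + offset, len, MS_SYNC) < 0)
                      return -errno;

              /* Make sure file metadata needed to reach the data
               * (e.g. newly allocated blocks) is stable as well. */
              if (fdatasync(fd) < 0)
                      return -errno;

              return 0;
      }

  Only after this returns would a completion be sent back to the
  guest, letting the fsync/msync caller in the guest continue.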

Questions :
-----------
* What should the flushing interface between guest and host look 
  like?

* Any suggestions on how to hook the I/O caching code into KVM/Qemu, 
  or thoughts on how we should do it? 

* We are thinking of implementing a guest paravirt driver which will 
  send guest requests to Qemu to flush data to disk. We are not sure 
  at this point how to tell userspace to treat this device as a regular
  device, rather than as a persistent memory device. Any suggestions
  on this?

* We have not yet thought about the interaction with ballooning, but 
  we feel this solution could be better than ballooning in the long 
  term, as we will be managing all of the guests' cache from the 
  host side.

* We are not sure whether this solution works for ARM and other 
  architectures, or for Windows guests.

* Re: KVM "fake DAX" device flushing
  2017-05-10 15:56 ` [Qemu-devel] " Pankaj Gupta
@ 2017-05-11 18:17   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2017-05-11 18:17 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: kvm, qemu-devel, riel, pbonzini, kwolf, Haozhong Zhang,
	Dan Williams, Xiao Guangrong

On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
> We are sharing initial project proposal for 
> 'KVM "fake DAX" device flushing' project for feedback. 
> Got the idea during discussion with 'Rik van Riel'. 

CCing NVDIMM folks.

> 
> Also, request answers to 'Questions' section.
> 
> Abstract : 
> ----------
> Project idea is to use fake persistent memory with direct 
> access(DAX) in virtual machines. Overall goal of project 
> is to increase the number of virtual machines that can be 
> run on a physical machine, in order to increase the density 
> of customer virtual machines.
> 
> The idea is to avoid the guest page cache, and minimize the 
> memory footprint of virtual machines. By presenting a disk 
> image as a nvdimm direct access (DAX) memory region in a 
> virtual machine, the guest OS can avoid using page cache 
> memory for most file accesses.
> 
> Problem Statement :
> ------------------
> * Guest uses page cache in memory to process fast requests 
>   for disk read/write. This results in big memory footprint 
>   of guests without host knowing much details of the guest 
>   memory. 
> 
> * If guests use direct access(DAX) with fake persistent 
>   storage, the host manages the page cache for guests, 
>   allowing the host to easily reclaim/evict less frequently 
>   used page cache pages without requiring guest cooperation, 
>   like ballooning would.
> 
> * Host manages guest cache as ‘mmaped’ disk image area in 
>   qemu address space. This region is passed to guest as fake 
>   persistent memory range. We need a new flushing interface 
>   to flush this cache to secondary storage to persist guest 
>   writes.
> 
> * New asynchronous flushing interface will allow guests to 
>   cause the host flush the dirty data to backup storage file. 
>   Systems with pmem storage make use of CLFLUSH instruction 
>   to flush single cache line to persistent storage and it 
>   takes care of flushing. With fake persistent storage in 
>   guest we cannot depend on CLFLUSH instruction to flush entire 
>   dirty cache to backing storage. Even If we trap and emulate 
>   CLFLUSH instruction guest vCPU has to wait till we flush all 
>   the dirty memory. Instead of this we need to implement a new 
>   asynchronous guest flushing interface, which allows the guest 
>   to specify a larger range to be flushed at once, and allows 
>   the vCPU to run something else while the data is being synced 
>   to disk. 
> 
> * New flushing interface will consists of a para virt driver to 
>   new fake nvdimm like device which will process guest flushing
>   requests like fsync/msync etc instead of pmem library calls 
>   like clflush. The corresponding device at host side will be 
>   responsible for flushing requests for guest dirty pages. 
>   Guest can put current task in sleep and vCPU can run any other 
>   task while host side flushing of guests pages is in progress.
> 
> Host controlled fake nvdimm DAX to avoid guest page cache :
> -------------------------------------------------------------
> * Bypass guest page cache by using a fake persistent storage 
>   like nvdimm & DAX. Guest Read/Write is directly done on 
>   fake persistent storage without involving guest kernel for 
>   caching data.
> 
> * Fake nvdimm device passed to guest is backed by a regular 
>   file in host stored in secondary storage.
> 
> * Qemu has implementation of fake NVDIMM/DAX device. Use this 
>   capability of passing regular host file(disk) as nvdimm device 
>   to guest.
> 
> * Nvdimm with DAX works for ext4/xfs filesystem. Supported 
>   filesystem should be DAX compatible. 
> 
> * As we are using guest disk as fake DAX/NVDIMM device, we 
>   need a mechanism for persistence of data backed on regular 
>   host storage file.
> 
> * For live migration use case, if host side backing file is 
>   shared storage, we need to flush the page cache for the disk 
>   image at the destination (new fadvise interface, FADV_INVALIDATE_CACHE?) 
>   before starting execution of the guest on the destination host.

Good point.  QEMU currently only supports live migration with O_DIRECT.
I think the problem was that userspace cannot guarantee consistency in
the general case.  If you find a solution to this problem for fake
NVDIMM then maybe the QEMU block layer can also begin supporting live
migration with buffered I/O.

> 
> Design :
> ---------
> * In order to not have page cache inside the guest, qemu would:
> 
>  1) mmap the guest's disk image and present that disk image to 
>     the guest as a persistent memory range.
> 
>  2) Present information to the guest telling it that the persistent 
>     memory range is not physical persistent memory.

Steps 1 & 2 are already supported by QEMU NVDIMM emulation today.

>  3) Present an additional paravirt device alongside the persistent 
>     memory range, that can be used to sync (ranges of) data to disk.
> 
> * Guest would use the disk image mostly like a persistent memory 
>   device, with two exceptions:
> 
>   1) It would not tell userspace that the files on that device are 
>      persistent memory. This is  done so userspace knows to call 
>      fsync/msync, instead of the pmem clflush library call.

Not sure I agree with hiding the nvdimm nature of the device.  Instead I
think you need to build this capability into the Linux nvdimm code.
libpmem will detect these types of devices and issue fsync/msync when
the application wants to flush.

>   2) When userspace calls fsync/msync on files on the fake persistent 
>      memory device, issue a request through the paravirt device that 
>      causes the host to flush the device back end.
> 
> * Guest uses fake persistent storage data updates can be still in 
>   qemu memory. We need a way to flush cached data in host to backed 

s/qemu memory/host memory/

I guess you mean that host userspace needs a way to reliably flush an
address range to the underlying storage.

>   secondary storage.
> 
> * Once the guest receives a completion event from the host, it will 
>   allow userspace programs that were waiting on the fsync/msync to 
>   continue running.
> 
> * Host is responsible for paging in pages in host backing area for 
>   guest persistent memory as they are accessed by the guest, and 
>   for evicting pages as host memory fills up.
> 
> Questions :
> -----------
> * What should the flushing interface between guest and host look 
>   like?

A simple hack for prototyping is to instantiate a virtio-blk-pci device
for the mmapped host file.  The guest can send flush commands on the
virtio-blk-pci device but will otherwise use the mapped memory directly.
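
(A minimal sketch of that hack, with illustrative options only: attach
the same image file both as the nvdimm backing and as a writeback
virtio-blk device, e.g.

    -object memory-backend-file,id=mem1,share=on,mem-path=/path/to/disk.img,size=4G \
    -device nvdimm,id=nvdimm1,memdev=mem1 \
    -drive file=/path/to/disk.img,format=raw,if=none,id=blk0,cache=writeback \
    -device virtio-blk-pci,drive=blk0

A guest-initiated flush on the virtio-blk device then ends up as an
fdatasync() of the image file on the host, which also writes back
pages dirtied through the shared mapping.)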

> * Any suggestions to hook the IO caching code with KVM/Qemu or 
>   thoughts on how we should do it? 
> 
> * Thinking of implementing a guest para virt driver which will send 
>   guest requests to Qemu to flush data to disk. Not sure at this 
>   point how to tell userspace to work on this device as any regular
>   device without considering it as persistent device. Any suggestions
>   on this?
> 
> * Not thought yet about ballooning impact. But feel this solution 
>   could be better than ballooning in long term? As we will be 
>   managing all guests cache from host side.
> 
> * Not sure this solution works for ARM and other architectures and 
>   Windows? 

* Re: KVM "fake DAX" device flushing
  2017-05-11 18:17   ` [Qemu-devel] " Stefan Hajnoczi
@ 2017-05-11 19:15     ` Dan Williams
  -1 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2017-05-11 19:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Pankaj Gupta, KVM list, qemu-devel, Rik van Riel, Paolo Bonzini,
	kwolf, Haozhong Zhang, Xiao Guangrong

On Thu, May 11, 2017 at 11:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
>> We are sharing initial project proposal for
>> 'KVM "fake DAX" device flushing' project for feedback.
>> Got the idea during discussion with 'Rik van Riel'.
>
> CCing NVDIMM folks.
>
>>
>> Also, request answers to 'Questions' section.
>>
>> Abstract :
>> ----------
>> Project idea is to use fake persistent memory with direct
>> access(DAX) in virtual machines. Overall goal of project
>> is to increase the number of virtual machines that can be
>> run on a physical machine, in order to increase the density
>> of customer virtual machines.
>>
>> The idea is to avoid the guest page cache, and minimize the
>> memory footprint of virtual machines. By presenting a disk
>> image as a nvdimm direct access (DAX) memory region in a
>> virtual machine, the guest OS can avoid using page cache
>> memory for most file accesses.

How is this different than the solution that Clear Containers came up with?

https://lwn.net/Articles/644675/

* Re: KVM "fake DAX" device flushing
  2017-05-11 19:15     ` [Qemu-devel] " Dan Williams
@ 2017-05-11 21:35       ` Rik van Riel
  -1 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2017-05-11 21:35 UTC (permalink / raw)
  To: Dan Williams, Stefan Hajnoczi
  Cc: Pankaj Gupta, KVM list, qemu-devel, Paolo Bonzini, kwolf,
	Haozhong Zhang, Xiao Guangrong

On Thu, 2017-05-11 at 12:15 -0700, Dan Williams wrote:
> On Thu, May 11, 2017 at 11:17 AM, Stefan Hajnoczi <stefanha@redhat.co
> m> wrote:
> > On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
> > > We are sharing initial project proposal for
> > > 'KVM "fake DAX" device flushing' project for feedback.
> > > Got the idea during discussion with 'Rik van Riel'.
> > 
> > CCing NVDIMM folks.
> > 
> > > 
> > > Also, request answers to 'Questions' section.
> > > 
> > > Abstract :
> > > ----------
> > > Project idea is to use fake persistent memory with direct
> > > access(DAX) in virtual machines. Overall goal of project
> > > is to increase the number of virtual machines that can be
> > > run on a physical machine, in order to increase the density
> > > of customer virtual machines.
> > > 
> > > The idea is to avoid the guest page cache, and minimize the
> > > memory footprint of virtual machines. By presenting a disk
> > > image as a nvdimm direct access (DAX) memory region in a
> > > virtual machine, the guest OS can avoid using page cache
> > > memory for most file accesses.
> 
> How is this different than the solution that Clear Containers came up
> with?
> 
> https://lwn.net/Articles/644675/

Clear Containers uses MAP_PRIVATE with read-only
images.

This solution is about making read-write images
work.  When a program in the guest calls fsync,
we need to ensure the data has actually hit the
disk on the host side before fsync returns.
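
To illustrate the distinction (this is not code from either project,
just the mmap semantics involved):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Illustration only. */
    static void map_image(int image_fd, size_t len)
    {
            /* Clear Containers style: private, copy-on-write mapping;
             * modifications never reach the image file. */
            void *priv = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE, image_fd, 0);

            /* This proposal: shared, writable mapping; dirty pages
             * must be flushed on the host (msync/fdatasync) before an
             * fsync in the guest is allowed to return. */
            void *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_SHARED, image_fd, 0);

            (void)priv; (void)shared;
    }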

-- 
All rights reversed

* Re: KVM "fake DAX" device flushing
  2017-05-11 18:17   ` [Qemu-devel] " Stefan Hajnoczi
@ 2017-05-11 21:38     ` Rik van Riel
  -1 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2017-05-11 21:38 UTC (permalink / raw)
  To: Stefan Hajnoczi, Pankaj Gupta
  Cc: kvm, qemu-devel, pbonzini, kwolf, Haozhong Zhang, Dan Williams,
	Xiao Guangrong

On Thu, 2017-05-11 at 14:17 -0400, Stefan Hajnoczi wrote:
> On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
> > * For live migration use case, if host side backing file is 
> >   shared storage, we need to flush the page cache for the disk 
> >   image at the destination (new fadvise interface,
> > FADV_INVALIDATE_CACHE?) 
> >   before starting execution of the guest on the destination host.
> 
> Good point.  QEMU currently only supports live migration with
> O_DIRECT.
> I think the problem was that userspace cannot guarantee consistency
> in
> the general case.  If you find a solution to this problem for fake
> NVDIMM then maybe the QEMU block layer can also begin supporting live
> migration with buffered I/O.

I'll be happy to work with you on that, independently
of Pankaj's project.

It looks like the fadvise system call could be extended
pretty easily with an FADV_INVALIDATE_CACHE command, the
other side of which can simply hook into the existing
page cache invalidation code in the kernel.

Qemu will need to know whether the invalidation succeeded,
but that is something we can test for pretty easily before
returning to userspace.
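
A hypothetical sketch of the qemu side of that call, purely for
illustration: FADV_INVALIDATE_CACHE does not exist in the kernel
today (the value below is a made-up placeholder), and the closest
existing hint, POSIX_FADV_DONTNEED, neither guarantees nor reports
whether the invalidation actually happened:

    #include <fcntl.h>

    /* Proposed, not yet existing, fadvise command. */
    #ifndef FADV_INVALIDATE_CACHE
    #define FADV_INVALIDATE_CACHE 6    /* placeholder value */
    #endif

    /* Drop cached pages of the whole image on the migration
     * destination before resuming the guest; a non-zero return
     * would mean some pages could not be invalidated. */
    static int drop_image_cache(int image_fd)
    {
            return posix_fadvise(image_fd, 0, 0, FADV_INVALIDATE_CACHE);
    }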

-- 
All rights reversed

* Re: [Qemu-devel] KVM "fake DAX" device flushing
  2017-05-10 15:56 ` [Qemu-devel] " Pankaj Gupta
@ 2017-05-11 22:06   ` Dan Williams
  -1 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2017-05-11 22:06 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: kwolf, KVM list, linux-nvdimm@lists.01.org, qemu-devel, stefanha,
	Paolo Bonzini

[ adding nvdimm mailing list ]

On Wed, May 10, 2017 at 8:56 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
> We are sharing initial project proposal for
> 'KVM "fake DAX" device flushing' project for feedback.
> Got the idea during discussion with 'Rik van Riel'.
>
> Also, request answers to 'Questions' section.
>
> Abstract :
> ----------
> Project idea is to use fake persistent memory with direct
> access(DAX) in virtual machines. Overall goal of project
> is to increase the number of virtual machines that can be
> run on a physical machine, in order to increase the density
> of customer virtual machines.
>
> The idea is to avoid the guest page cache, and minimize the
> memory footprint of virtual machines. By presenting a disk
> image as a nvdimm direct access (DAX) memory region in a
> virtual machine, the guest OS can avoid using page cache
> memory for most file accesses.
>
> Problem Statement :
> ------------------
> * Guest uses page cache in memory to process fast requests
>   for disk read/write. This results in big memory footprint
>   of guests without host knowing much details of the guest
>   memory.
>
> * If guests use direct access(DAX) with fake persistent
>   storage, the host manages the page cache for guests,
>   allowing the host to easily reclaim/evict less frequently
>   used page cache pages without requiring guest cooperation,
>   like ballooning would.
>
> * Host manages guest cache as ‘mmaped’ disk image area in
>   qemu address space. This region is passed to guest as fake
>   persistent memory range. We need a new flushing interface
>   to flush this cache to secondary storage to persist guest
>   writes.
>
> * New asynchronous flushing interface will allow guests to
>   cause the host flush the dirty data to backup storage file.
>   Systems with pmem storage make use of CLFLUSH instruction
>   to flush single cache line to persistent storage and it
>   takes care of flushing. With fake persistent storage in
>   guest we cannot depend on CLFLUSH instruction to flush entire
>   dirty cache to backing storage. Even If we trap and emulate
>   CLFLUSH instruction guest vCPU has to wait till we flush all
>   the dirty memory. Instead of this we need to implement a new
>   asynchronous guest flushing interface, which allows the guest
>   to specify a larger range to be flushed at once, and allows
>   the vCPU to run something else while the data is being synced
>   to disk.
>
> * New flushing interface will consists of a para virt driver to
>   new fake nvdimm like device which will process guest flushing
>   requests like fsync/msync etc instead of pmem library calls
>   like clflush. The corresponding device at host side will be
>   responsible for flushing requests for guest dirty pages.
>   Guest can put current task in sleep and vCPU can run any other
>   task while host side flushing of guests pages is in progress.
>
> Host controlled fake nvdimm DAX to avoid guest page cache :
> -------------------------------------------------------------
> * Bypass guest page cache by using a fake persistent storage
>   like nvdimm & DAX. Guest Read/Write is directly done on
>   fake persistent storage without involving guest kernel for
>   caching data.
>
> * Fake nvdimm device passed to guest is backed by a regular
>   file in host stored in secondary storage.
>
> * Qemu has implementation of fake NVDIMM/DAX device. Use this
>   capability of passing regular host file(disk) as nvdimm device
>   to guest.
>
> * Nvdimm with DAX works for ext4/xfs filesystem. Supported
>   filesystem should be DAX compatible.
>
> * As we are using guest disk as fake DAX/NVDIMM device, we
>   need a mechanism for persistence of data backed on regular
>   host storage file.
>
> * For live migration use case, if host side backing file is
>   shared storage, we need to flush the page cache for the disk
>   image at the destination (new fadvise interface, FADV_INVALIDATE_CACHE?)
>   before starting execution of the guest on the destination host.
>
> Design :
> ---------
> * In order to not have page cache inside the guest, qemu would:
>
>  1) mmap the guest's disk image and present that disk image to
>     the guest as a persistent memory range.
>
>  2) Present information to the guest telling it that the persistent
>     memory range is not physical persistent memory.
>
>  3) Present an additional paravirt device alongside the persistent
>     memory range, that can be used to sync (ranges of) data to disk.
>
> * Guest would use the disk image mostly like a persistent memory
>   device, with two exceptions:
>
>   1) It would not tell userspace that the files on that device are
>      persistent memory. This is  done so userspace knows to call
>      fsync/msync, instead of the pmem clflush library call.

There are no (safe) pmem applications today that can get by without
calling fsync/msync after an mmap write to a file on ext4 or xfs.
We're trying to fix that, more details below.

>   2) When userspace calls fsync/msync on files on the fake persistent
>      memory device, issue a request through the paravirt device that
>      causes the host to flush the device back end.

We need this in general for the persistent memory use case. There have
been proposals about using the "flush hint" addresses as defined by
ACPI 6, but those are awkward because it's not a queued interface and
they are not defined to be part of the persistence path, so
applications may be written to not trigger a "deep flush". Device-DAX
is an interface that bypasses the flush hint mechanism by default.

>
> * Guest uses fake persistent storage data updates can be still in
>   qemu memory. We need a way to flush cached data in host to backed
>   secondary storage.
>
> * Once the guest receives a completion event from the host, it will
>   allow userspace programs that were waiting on the fsync/msync to
>   continue running.
>
> * Host is responsible for paging in pages in host backing area for
>   guest persistent memory as they are accessed by the guest, and
>   for evicting pages as host memory fills up.
>
> Questions :
> -----------
> * What should the flushing interface between guest and host look
>   like?

I'm very interested in this because we have another need for a new
flushing interface to support passing DAX capable memory ranges
through to the guest in the general case.  Some of the background is
here:

    https://www.mail-archive.com/qemu-devel@nongnu.org/msg444473.html

...but we need an interface to implement lightweight flushing of data
updates to persistence for applications that want to persist
updates to data structures on a frequent basis, think a persistent
btree. This new flush interface needs a host side implementation, but
also a way for guests to know when a guest-fsync needs to trigger
host-fsync.


>
> * Any suggestions to hook the IO caching code with KVM/Qemu or
>   thoughts on how we should do it?
>
> * Thinking of implementing a guest para virt driver which will send
>   guest requests to Qemu to flush data to disk. Not sure at this
>   point how to tell userspace to work on this device as any regular
>   device without considering it as persistent device. Any suggestions
>   on this?
>
> * Not thought yet about ballooning impact. But feel this solution
>   could be better than ballooning in long term? As we will be
>   managing all guests cache from host side.
>
> * Not sure this solution works for ARM and other architectures and
>   Windows?
>

* Re: [Qemu-devel] KVM "fake DAX" device flushing
  2017-05-11 18:17   ` [Qemu-devel] " Stefan Hajnoczi
                     ` (2 preceding siblings ...)
  (?)
@ 2017-05-12  6:56   ` Pankaj Gupta
  -1 siblings, 0 replies; 18+ messages in thread
From: Pankaj Gupta @ 2017-05-12  6:56 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, Haozhong Zhang, Xiao Guangrong, kvm, qemu-devel, pbonzini,
	Dan Williams


> 
> On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
> > We are sharing initial project proposal for
> > 'KVM "fake DAX" device flushing' project for feedback.
> > Got the idea during discussion with 'Rik van Riel'.
> 
> CCing NVDIMM folks.
> 
> > 
> > Also, request answers to 'Questions' section.
> > 
> > Abstract :
> > ----------
> > Project idea is to use fake persistent memory with direct
> > access(DAX) in virtual machines. Overall goal of project
> > is to increase the number of virtual machines that can be
> > run on a physical machine, in order to increase the density
> > of customer virtual machines.
> > 
> > The idea is to avoid the guest page cache, and minimize the
> > memory footprint of virtual machines. By presenting a disk
> > image as a nvdimm direct access (DAX) memory region in a
> > virtual machine, the guest OS can avoid using page cache
> > memory for most file accesses.
> > 
> > Problem Statement :
> > ------------------
> > * Guest uses page cache in memory to process fast requests
> >   for disk read/write. This results in big memory footprint
> >   of guests without host knowing much details of the guest
> >   memory.
> > 
> > * If guests use direct access(DAX) with fake persistent
> >   storage, the host manages the page cache for guests,
> >   allowing the host to easily reclaim/evict less frequently
> >   used page cache pages without requiring guest cooperation,
> >   like ballooning would.
> > 
> > * Host manages guest cache as ‘mmaped’ disk image area in
> >   qemu address space. This region is passed to guest as fake
> >   persistent memory range. We need a new flushing interface
> >   to flush this cache to secondary storage to persist guest
> >   writes.
> > 
> > * New asynchronous flushing interface will allow guests to
> >   cause the host flush the dirty data to backup storage file.
> >   Systems with pmem storage make use of CLFLUSH instruction
> >   to flush single cache line to persistent storage and it
> >   takes care of flushing. With fake persistent storage in
> >   guest we cannot depend on CLFLUSH instruction to flush entire
> >   dirty cache to backing storage. Even If we trap and emulate
> >   CLFLUSH instruction guest vCPU has to wait till we flush all
> >   the dirty memory. Instead of this we need to implement a new
> >   asynchronous guest flushing interface, which allows the guest
> >   to specify a larger range to be flushed at once, and allows
> >   the vCPU to run something else while the data is being synced
> >   to disk.
> > 
> > * New flushing interface will consists of a para virt driver to
> >   new fake nvdimm like device which will process guest flushing
> >   requests like fsync/msync etc instead of pmem library calls
> >   like clflush. The corresponding device at host side will be
> >   responsible for flushing requests for guest dirty pages.
> >   Guest can put current task in sleep and vCPU can run any other
> >   task while host side flushing of guests pages is in progress.
> > 
> > Host controlled fake nvdimm DAX to avoid guest page cache :
> > -------------------------------------------------------------
> > * Bypass guest page cache by using a fake persistent storage
> >   like nvdimm & DAX. Guest Read/Write is directly done on
> >   fake persistent storage without involving guest kernel for
> >   caching data.
> > 
> > * Fake nvdimm device passed to guest is backed by a regular
> >   file in host stored in secondary storage.
> > 
> > * Qemu has implementation of fake NVDIMM/DAX device. Use this
> >   capability of passing regular host file(disk) as nvdimm device
> >   to guest.
> > 
> > * Nvdimm with DAX works for ext4/xfs filesystem. Supported
> >   filesystem should be DAX compatible.
> > 
> > * As we are using guest disk as fake DAX/NVDIMM device, we
> >   need a mechanism for persistence of data backed on regular
> >   host storage file.
> > 
> > * For live migration use case, if host side backing file is
> >   shared storage, we need to flush the page cache for the disk
> >   image at the destination (new fadvise interface, FADV_INVALIDATE_CACHE?)
> >   before starting execution of the guest on the destination host.
> 
> Good point.  QEMU currently only supports live migration with O_DIRECT.
> I think the problem was that userspace cannot guarantee consistency in
> the general case.  If you find a solution to this problem for fake
> NVDIMM then maybe the QEMU block layer can also begin supporting live
> migration with buffered I/O.
> 
> > 
> > Design :
> > ---------
> > * In order to not have page cache inside the guest, qemu would:
> > 
> >  1) mmap the guest's disk image and present that disk image to
> >     the guest as a persistent memory range.
> > 
> >  2) Present information to the guest telling it that the persistent
> >     memory range is not physical persistent memory.
> 
> Steps 1 & 2 are already supported by QEMU NVDIMM emulation today.

Yes. I have also tested a guest 'fake DAX' device using QEMU NVDIMM emulation.
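
A setup along the lines below exposes a regular host file to the guest
as an emulated NVDIMM (paths and sizes are placeholders, and option
spellings may differ between QEMU versions):

    qemu-system-x86_64 -machine pc,nvdimm=on \
        -m 4G,slots=2,maxmem=8G \
        -object memory-backend-file,id=mem1,share=on,mem-path=/path/to/disk.img,size=4G \
        -device nvdimm,id=nvdimm1,memdev=mem1

Inside the guest the region shows up as a pmem block device (e.g.
/dev/pmem0) that can be formatted with ext4/xfs and mounted with the
dax option.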
> 
> >  3) Present an additional paravirt device alongside the persistent
> >     memory range, that can be used to sync (ranges of) data to disk.
> > 
> > * Guest would use the disk image mostly like a persistent memory
> >   device, with two exceptions:
> > 
> >   1) It would not tell userspace that the files on that device are
> >      persistent memory. This is  done so userspace knows to call
> >      fsync/msync, instead of the pmem clflush library call.
> 
> Not sure I agree with hiding the nvdimm nature of the device.  Instead I
> think you need to build this capability into the Linux nvdimm code.
> libpmem will detect these types of devices and issue fsync/msync when
> the application wants to flush.
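
A sketch of how an application (or libpmem itself) can pick the flush
method based on the mapping type, using the existing libpmem calls
(pmem_is_pmem() is relatively expensive, so its result would normally
be cached once per mapping):

    #include <libpmem.h>

    /* Flush 'len' bytes at 'addr' in whichever way the mapping
     * supports: CPU cache flushes on real pmem, msync() otherwise.
     * On the proposed fake DAX device the msync() path is what the
     * paravirt flush request would hook into. */
    static void flush_update(const void *addr, size_t len)
    {
            if (pmem_is_pmem(addr, len))
                    pmem_persist(addr, len);
            else
                    pmem_msync(addr, len);
    }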
> 
> >   2) When userspace calls fsync/msync on files on the fake persistent
> >      memory device, issue a request through the paravirt device that
> >      causes the host to flush the device back end.
> > 
> > * Guest uses fake persistent storage data updates can be still in
> >   qemu memory. We need a way to flush cached data in host to backed
> 
> s/qemu memory/host memory/
> 
> I guess you mean that host userspace needs a way to reliably flush an
> address range to the underlying storage.

right.
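
As a minimal sketch of that host side (assuming qemu keeps the image
mmapped with MAP_SHARED and still has its file descriptor around),
flushing one requested range boils down to an msync() over the range
plus an fsync() so the file metadata is made durable as well:

    #include <errno.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* base: start of the shared mapping of the image, fd: the image
     * file.  offset/len come from the guest's flush request.  Page
     * alignment of the range and partial-failure handling are glossed
     * over here. */
    static int flush_range(void *base, int fd, uint64_t offset, uint64_t len)
    {
            if (msync((char *)base + offset, len, MS_SYNC) < 0)
                    return -errno;
            if (fsync(fd) < 0)
                    return -errno;
            return 0;
    }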
> 
> >   secondary storage.
> > 
> > * Once the guest receives a completion event from the host, it will
> >   allow userspace programs that were waiting on the fsync/msync to
> >   continue running.
> > 
> > * Host is responsible for paging in pages in host backing area for
> >   guest persistent memory as they are accessed by the guest, and
> >   for evicting pages as host memory fills up.
> > 
> > Questions :
> > -----------
> > * What should the flushing interface between guest and host look
> >   like?
> 
> A simple hack for prototyping is to instantiate a virtio-blk-pci device
> for the mmapped host file.  The guest can send flush commands on the
> virtio-blk-pci device but will otherwise use the mapped memory directly.

okay. I will check this.
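
If I understand the idea, that means attaching the same image twice,
once as the fake NVDIMM backing and once as a flush-only virtio-blk
device, roughly like the fragment below (illustrative only; how the two
backends safely share one image is part of the open question):

    -object memory-backend-file,id=mem1,share=on,mem-path=/path/to/disk.img,size=4G \
    -device nvdimm,id=nvdimm1,memdev=mem1 \
    -drive file=/path/to/disk.img,format=raw,if=none,id=flushdrv,cache=writeback \
    -device virtio-blk-pci,drive=flushdrv

The guest would then do its data accesses through the DAX mapping and
only send flush requests down the virtio-blk queue.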
> 
> > * Any suggestions to hook the IO caching code with KVM/Qemu or
> >   thoughts on how we should do it?
> > 
> > * Thinking of implementing a guest para virt driver which will send
> >   guest requests to Qemu to flush data to disk. Not sure at this
> >   point how to tell userspace to work on this device as any regular
> >   device without considering it as persistent device. Any suggestions
> >   on this?
> > 
> > * Not thought yet about ballooning impact. But feel this solution
> >   could be better than ballooning in long term? As we will be
> >   managing all guests cache from host side.
> > 
> > * Not sure this solution works for ARM and other architectures and
> >   Windows?
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: KVM "fake DAX" device flushing
  2017-05-11 21:38     ` [Qemu-devel] " Rik van Riel
@ 2017-05-12 13:42       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2017-05-12 13:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Pankaj Gupta, kvm, qemu-devel, pbonzini, kwolf, Haozhong Zhang,
	Dan Williams, Xiao Guangrong

On Thu, May 11, 2017 at 05:38:40PM -0400, Rik van Riel wrote:
> On Thu, 2017-05-11 at 14:17 -0400, Stefan Hajnoczi wrote:
> > On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
> > > * For live migration use case, if host side backing file is 
> > >   shared storage, we need to flush the page cache for the disk 
> > >   image at the destination (new fadvise interface,
> > > FADV_INVALIDATE_CACHE?) 
> > >   before starting execution of the guest on the destination host.
> > 
> > Good point.  QEMU currently only supports live migration with
> > O_DIRECT.
> > I think the problem was that userspace cannot guarantee consistency
> > in
> > the general case.  If you find a solution to this problem for fake
> > NVDIMM then maybe the QEMU block layer can also begin supporting live
> > migration with buffered I/O.
> 
> I'll be happy to work with you on that, independently
> of Pankaj's project.
> 
> It looks like the fadvise system call could be extended
> pretty easily with an FADV_INVALIDATE_CACHE command, the
> other side of which can simply hook into the existing
> page cache invalidation code in the kernel.
> 
> Qemu will need to know whether the invalidation succeeded,
> but that is something we can test for pretty easily before
> returning to userspace.

Sounds great.  I will review the long discussions that took place on
qemu-devel about cache invalidation for live migration - just want to
make sure there were no other reasons why only O_DIRECT is supported :).

Stefan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: KVM "fake DAX" device flushing
  2017-05-12 13:42       ` [Qemu-devel] " Stefan Hajnoczi
@ 2017-05-12 16:53         ` Kevin Wolf
  -1 siblings, 0 replies; 18+ messages in thread
From: Kevin Wolf @ 2017-05-12 16:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Rik van Riel, Pankaj Gupta, kvm, qemu-devel, pbonzini,
	Haozhong Zhang, Dan Williams, Xiao Guangrong

Am 12.05.2017 um 15:42 hat Stefan Hajnoczi geschrieben:
> On Thu, May 11, 2017 at 05:38:40PM -0400, Rik van Riel wrote:
> > On Thu, 2017-05-11 at 14:17 -0400, Stefan Hajnoczi wrote:
> > > On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
> > > > * For live migration use case, if host side backing file is 
> > > >   shared storage, we need to flush the page cache for the disk 
> > > >   image at the destination (new fadvise interface,
> > > > FADV_INVALIDATE_CACHE?) 
> > > >   before starting execution of the guest on the destination host.
> > > 
> > > Good point.  QEMU currently only supports live migration with
> > > O_DIRECT.
> > > I think the problem was that userspace cannot guarantee consistency
> > > in
> > > the general case.  If you find a solution to this problem for fake
> > > NVDIMM then maybe the QEMU block layer can also begin supporting live
> > > migration with buffered I/O.
> > 
> > I'll be happy to work with you on that, independently
> > of Pankaj's project.
> > 
> > It looks like the fadvise system call could be extended
> > pretty easily with an FADV_INVALIDATE_CACHE command, the
> > other side of which can simply hook into the existing
> > page cache invalidation code in the kernel.
> > 
> > Qemu will need to know whether the invalidation succeeded,
> > but that is something we can test for pretty easily before
> > returning to userspace.
> 
> Sounds great.  I will review the long discussions that took place on
> qemu-devel about cache invalidation for live migration - just want to
> make sure there were no other reasons why only O_DIRECT is supported
> :).

There are other reasons why we recommend against using non-O_DIRECT
modes in production (including the error handling), but with respect to
live migration, this is the only one I'm aware of.

As I already said in the private email thread, an FADV_INVALIDATE_CACHE
should do the trick and I'd be happy to work with you guys on that.
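
For comparison, POSIX_FADV_DONTNEED is the closest existing advice, but
it is only a hint and cannot report whether the cache was actually
dropped.  What the destination qemu wants is something like the sketch
below, where FADV_INVALIDATE_CACHE is assumed to fail if any page of
the image could not be invalidated (the constant is a placeholder, the
advice does not exist yet):

    #include <fcntl.h>

    #ifndef FADV_INVALIDATE_CACHE
    #define FADV_INVALIDATE_CACHE 42   /* placeholder, not a real ABI value */
    #endif

    /* Called on the destination for the shared image before the guest
     * is resumed; refuse to run the guest on top of stale page cache.
     * posix_fadvise() returns the error number directly. */
    static int invalidate_image_cache(int fd)
    {
            return posix_fadvise(fd, 0, 0, FADV_INVALIDATE_CACHE);
    }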

Kevin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" device flushing
  2017-05-12 16:53         ` [Qemu-devel] " Kevin Wolf
  (?)
@ 2017-05-15  9:12         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2017-05-15  9:12 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, Pankaj Gupta, Xiao Guangrong, kvm,
	Haozhong Zhang, qemu-devel, pbonzini, Dan Williams

On Fri, May 12, 2017 at 06:53:44PM +0200, Kevin Wolf wrote:
> Am 12.05.2017 um 15:42 hat Stefan Hajnoczi geschrieben:
> > On Thu, May 11, 2017 at 05:38:40PM -0400, Rik van Riel wrote:
> > > On Thu, 2017-05-11 at 14:17 -0400, Stefan Hajnoczi wrote:
> > > > On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
> > > > > * For live migration use case, if host side backing file is 
> > > > >   shared storage, we need to flush the page cache for the disk 
> > > > >   image at the destination (new fadvise interface,
> > > > > FADV_INVALIDATE_CACHE?) 
> > > > >   before starting execution of the guest on the destination host.
> > > > 
> > > > Good point.  QEMU currently only supports live migration with
> > > > O_DIRECT.
> > > > I think the problem was that userspace cannot guarantee consistency
> > > > in
> > > > the general case.  If you find a solution to this problem for fake
> > > > NVDIMM then maybe the QEMU block layer can also begin supporting live
> > > > migration with buffered I/O.
> > > 
> > > I'll be happy to work with you on that, independently
> > > of Pankaj's project.
> > > 
> > > It looks like the fadvise system call could be extended
> > > pretty easily with an FADV_INVALIDATE_CACHE command, the
> > > other side of which can simply hook into the existing
> > > page cache invalidation code in the kernel.
> > > 
> > > Qemu will need to know whether the invalidation succeeded,
> > > but that is something we can test for pretty easily before
> > > returning to userspace.
> > 
> > Sounds great.  I will review the long discussions that took place on
> > qemu-devel about cache invalidation for live migration - just want to
> > make sure there were no other reasons why only O_DIRECT is supported
> > :).
> 
> There are other reasons why we recommend against using non-O_DIRECT
> modes in production (including the error handling), but with respect to
> live migration, this is the only one I'm aware of.
> 
> As I already said in the private email thread, an FADV_INVALIDATE_CACHE
> should do the trick and I'd be happy to work with you guys on that.

Okay, I didn't know you and Rik had already discussed this in private.
The QEMU change is probably not difficult.

Stefan

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-05-15  9:32 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-10 15:56 KVM "fake DAX" device flushing Pankaj Gupta
2017-05-10 15:56 ` [Qemu-devel] " Pankaj Gupta
2017-05-11 18:17 ` Stefan Hajnoczi
2017-05-11 18:17   ` [Qemu-devel] " Stefan Hajnoczi
2017-05-11 19:15   ` Dan Williams
2017-05-11 19:15     ` [Qemu-devel] " Dan Williams
2017-05-11 21:35     ` Rik van Riel
2017-05-11 21:35       ` [Qemu-devel] " Rik van Riel
2017-05-11 21:38   ` Rik van Riel
2017-05-11 21:38     ` [Qemu-devel] " Rik van Riel
2017-05-12 13:42     ` Stefan Hajnoczi
2017-05-12 13:42       ` [Qemu-devel] " Stefan Hajnoczi
2017-05-12 16:53       ` Kevin Wolf
2017-05-12 16:53         ` [Qemu-devel] " Kevin Wolf
2017-05-15  9:12         ` Stefan Hajnoczi
2017-05-12  6:56   ` Pankaj Gupta
2017-05-11 22:06 ` Dan Williams
2017-05-11 22:06   ` Dan Williams
