* KVM "fake DAX" flushing interface - discussion
@ 2017-07-21  6:56   ` Pankaj Gupta
  0 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-07-21  6:56 UTC (permalink / raw)
  To: kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org
  Cc: Kevin Wolf, Rik van Riel, xiaoguangrong.eric, Stefan Hajnoczi,
	Paolo Bonzini, Nitesh Narayan Lal


Hello,

We shared a proposal for a 'KVM fake DAX flushing interface':

https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html

We did an initial POC in which we used a 'virtio-blk' device to perform
a device flush on pmem fsync on an ext4 filesystem. There are a few hacks
to make things work. We need suggestions on the points below before we
start the actual implementation.

A] Problems to solve:
------------------

1] We are considering two approaches for the 'fake DAX flushing interface'.

 1.1] Fake DAX with NVDIMM flush hints & KVM async page fault

     - Existing interface.

     - The approach of using the flush hint address has already been NACKed upstream.

     - The flush hint is not a queued interface for flushing, so applications might 
       avoid using it.

     - A write to the flush hint address traps from guest to host and triggers an 
       entire fsync on the backing file, which is itself costly.

     - It could be used to flush specific pages on the host backing disk: we could 
       send data (page information) up to the cache-line size (a limitation) 
       and tell the host to sync the corresponding pages instead of syncing the entire disk.

     - This will be an asynchronous operation and vCPU control is returned 
       quickly. (A rough guest-side sketch of the flush hint write follows below.)
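
For reference, a rough guest-side sketch of what "writing to a flush hint address"
amounts to, loosely modeled on how the Linux NVDIMM driver kicks WPQ flushes; the
function name is illustrative and this is not the actual driver code. In the fake
DAX case this MMIO write is what traps to the host:

    /*
     * Sketch only (not the actual driver code): relies on <linux/io.h> for
     * writeq()/wmb(); 'flush_hint' is assumed to have been ioremapped from
     * the NFIT flush hint table.
     */
    static void flush_hint_kick(void __iomem *flush_hint)
    {
            wmb();                 /* make prior stores to pmem globally visible */
            writeq(1, flush_hint); /* value is ignored; the write itself is the hint */
            wmb();
    }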


 1.2] Using an additional paravirt device alongside the pmem device (fake DAX with device flush)

     - New interface

     - The guest maintains information about dirty DAX pages as exceptional entries 
       in the radix tree.

     - If we want to flush specific pages from guest to host, we need to send the 
       list of dirty pages corresponding to the file on which we are doing fsync.

     - This will require implementing a new interface: a new paravirt device 
       for sending flush requests.

     - The host side will perform fsync/fdatasync on the list of dirty pages or on 
       the entire file backing the block device. (A rough sketch of such a flush 
       request follows below.)
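
To make the shape of this new interface concrete, below is a minimal sketch of the
request such a paravirt flush device could carry from guest to host. The structure,
field names and limits are assumptions for illustration only; no such interface
exists yet:

    /* Hypothetical guest->host flush request for the proposed paravirt device. */
    #include <stdint.h>

    #define FAKE_DAX_MAX_RANGES 32          /* arbitrary illustrative limit */

    struct fake_dax_flush_range {
            uint64_t start_pfn;             /* first dirty guest PFN in the pmem region */
            uint64_t npages;                /* number of contiguous dirty pages */
    };

    struct fake_dax_flush_req {
            uint32_t nranges;               /* 0 could mean "flush the whole device" */
            uint32_t flags;
            struct fake_dax_flush_range ranges[FAKE_DAX_MAX_RANGES];
    };

On the host side the device implementation would translate each (start_pfn, npages)
pair into an offset/length in the backing file and either fsync()/fdatasync() the
whole file or sync just those ranges.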

2] Questions:
-----------

 2.1] We are not sure why the WPQ flush is not a queued interface. Can we force 
      applications to call it? Also, device DAX does not call fsync/msync, does it?

 2.2] Depending on the interface we decide on, we need an optimal solution for 
      syncing a range of pages (a host-side sketch follows this list):

     - Send a range of pages from guest to host and sync it asynchronously, instead 
       of syncing the entire block device?

     - The other option is to sync the entire disk backing file to make sure all the 
       writes are persistent. In our case the backing file is a regular file on a 
       non-NVDIMM device, so the host page cache holds the dirty pages, which can be 
       flushed with fsync or a similar interface.
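
For reference, a minimal host-side sketch of the two options above, assuming the
backing file is already open as backing_fd (names are illustrative):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Option A: flush everything backing the device (data + metadata). */
    static int flush_whole_backing_file(int backing_fd)
    {
            return fsync(backing_fd);
    }

    /*
     * Option B: push out only one dirty range. sync_file_range() starts and
     * waits for writeback of these page-cache pages, but does NOT flush file
     * metadata or the disk write cache, so on its own it is an optimization
     * rather than a full durability guarantee.
     */
    static int flush_backing_range(int backing_fd, off_t offset, off_t len)
    {
            return sync_file_range(backing_fd, offset, len,
                                   SYNC_FILE_RANGE_WAIT_BEFORE |
                                   SYNC_FILE_RANGE_WRITE |
                                   SYNC_FILE_RANGE_WAIT_AFTER);
    }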

 2.3] If we do a host fsync on the entire disk we will flush all dirty data to the 
      backing file. Which would be the better approach: flushing only the pages of 
      the guest file being fsynced, or flushing the entire block device?

 2.4] Whichever of the above approaches we choose, we need to consider all 
      DAX-supporting filesystems (ext4/xfs). Does hooking into the corresponding 
      fsync code of each filesystem seem reasonable? This mainly concerns the flush 
      hint address use case: how would flush hint addresses be invoked from fsync 
      or a similar API?

 2.5] With filesystem journalling and other mount options such as barriers, ordered 
      mode, etc., how do we decide between the page flush hint and a regular fsync 
      on the file?
 
 2.6] If, on the guest side, we have the PFNs of all the dirty pages in the radix 
      tree and we send these to the host, would the host side be able to find the 
      corresponding pages and flush them all? (A sketch of the PFN-to-file-offset 
      mapping follows below.)
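
As one way to think about the host side of 2.6, here is a small sketch of how a
guest pmem PFN could be translated into an offset in the backing file, assuming the
host knows the guest PFN base of the pmem region and the file offset at which it is
mapped (all names are illustrative):

    #include <stdint.h>

    /* Illustrative description of one fake DAX region on the host side. */
    struct fake_dax_region {
            uint64_t guest_base_pfn;   /* first guest PFN of the pmem region */
            uint64_t file_offset;      /* where the region starts in the backing file */
            uint64_t page_size;        /* typically 4096 */
    };

    /* Map a dirty guest PFN reported by the guest to a backing file offset. */
    static uint64_t guest_pfn_to_file_offset(const struct fake_dax_region *r,
                                             uint64_t guest_pfn)
    {
            return r->file_offset + (guest_pfn - r->guest_base_pfn) * r->page_size;
    }

The resulting offset/length pairs could then be handed to fsync()/sync_file_range()
or a similar host interface, as in the earlier sketch.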

Suggestions & ideas are welcome.

Thanks,
Pankaj

* KVM "fake DAX" flushing interface - discussion
@ 2017-07-21  6:56   ` Pankaj Gupta
  0 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-07-21  6:56 UTC (permalink / raw)
  To: kvm-devel, Qemu Developers,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org
  Cc: Kevin Wolf, Rik van Riel,
	xiaoguangrong.eric-Re5JQEeQqe8AvxtiuMwx3w, Stefan Hajnoczi,
	Paolo Bonzini, Nitesh Narayan Lal


Hello,

We shared a proposal for 'KVM fake DAX flushing interface'.

https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html

We did initial POC in which we used 'virtio-blk' device to perform 
a device flush on pmem fsync on ext4 filesystem. They are few hacks 
to make things work. We need suggestions on below points before we 
start actual implementation.

A] Problems to solve:
------------------

1] We are considering two approaches for 'fake DAX flushing interface'.
    
 1.1] fake dax with NVDIMM flush hints & KVM async page fault

     - Existing interface.

     - The approach to use flush hint address is already nacked upstream.

     - Flush hint not queued interface for flushing. Applications might 
       avoid to use it.

     - Flush hint address traps from guest to host and do an entire fsync 
       on backing file which itself is costly.

     - Can be used to flush specific pages on host backing disk. We can 
       send data(pages information) equal to cache-line size(limitation) 
       and tell host to sync corresponding pages instead of entire disk sync.

     - This will be an asynchronous operation and vCPU control is returned 
       quickly.


 1.2] Using additional para virt device in addition to pmem device(fake dax with device flush)

     - New interface

     - Guest maintains information of DAX dirty pages as exceptional entries in 
       radix tree.

     - If we want to flush specific pages from guest to host, we need to send 
       list of the dirty pages corresponding to file on which we are doing fsync.

     - This will require implementation of new interface, a new paravirt device 
       for sending flush requests.

     - Host side will perform fsync/fdatasync on list of dirty pages or entire 
       block device backed file.

2] Questions:
-----------

 2.1] Not sure why WPQ flush is not a queued interface? We can force applications 
      to call this? device DAX neither calls fsync/msync?

 2.2] Depending upon interface we decide, we need optimal solution to sync 
      range of pages?

     - Send range of pages from guest to host to sync asynchronously instead 
       of syncing entire block device?

     - Other option is to sync entire disk backing file to make sure all the 
       writes are persistent. In our case, backing file is a regular file on 
       non NVDIMM device so host page cache has list of dirty pages which
       can be used either with fsync or similar interface.

 2.3] If we do host fsync on entire disk we will be flushing all the dirty data
      to backend file. Just thinking what would be better approach, flushing 
      pages on corresponding guest file fsync or entire block device?

 2.4] If we decide to choose one of the above approaches, we need to consider 
      all DAX supporting filesystems(ext4/xfs). Would hooking code to corresponding
      fsync code of fs seems reasonable? Just thinking for flush hint address use-case?
      Or how flush hint addresses would be invoked with fsync or similar api?

 2.5] Also with filesystem journalling and other mount options like barriers, 
      ordered etc, how we decide to use page flush hint or regular fsync on file?
 
 2.6] If at guest side we have PFN of all the dirty pages in radixtree? and we send 
      these to to host? At host side would we able to find corresponding page and flush 
      them all?

Suggestions & ideas are welcome.

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-21  6:56   ` Pankaj Gupta
@ 2017-07-21  9:51     ` Haozhong Zhang
  -1 siblings, 0 replies; 176+ messages in thread
From: Haozhong Zhang @ 2017-07-21  9:51 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Rik van Riel, xiaoguangrong.eric, kvm-devel,
	linux-nvdimm@lists.01.org, Qemu Developers, Stefan Hajnoczi,
	Paolo Bonzini, Nitesh Narayan Lal

On 07/21/17 02:56 -0400, Pankaj Gupta wrote:
> 
> Hello,
> 
> We shared a proposal for 'KVM fake DAX flushing interface'.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html
>

In the above link,
  "Overall goal of project 
   is to increase the number of virtual machines that can be 
   run on a physical machine, in order to *increase the density*
   of customer virtual machines"

Is the fake persistent memory used as normal RAM in the guest? If not, how
is it expected to be used in the guest?

> We did initial POC in which we used 'virtio-blk' device to perform 
> a device flush on pmem fsync on ext4 filesystem. They are few hacks 
> to make things work. We need suggestions on below points before we 
> start actual implementation.
>
> A] Problems to solve:
> ------------------
> 
> 1] We are considering two approaches for 'fake DAX flushing interface'.
>     
>  1.1] fake dax with NVDIMM flush hints & KVM async page fault
> 
>      - Existing interface.
> 
>      - The approach to use flush hint address is already nacked upstream.
> 
>      - Flush hint not queued interface for flushing. Applications might 
>        avoid to use it.
> 
>      - Flush hint address traps from guest to host and do an entire fsync 
>        on backing file which itself is costly.
> 
>      - Can be used to flush specific pages on host backing disk. We can 
>        send data(pages information) equal to cache-line size(limitation) 
>        and tell host to sync corresponding pages instead of entire disk sync.
> 
>      - This will be an asynchronous operation and vCPU control is returned 
>        quickly.
> 
> 
>  1.2] Using additional para virt device in addition to pmem device(fake dax with device flush)
> 
>      - New interface
> 
>      - Guest maintains information of DAX dirty pages as exceptional entries in 
>        radix tree.
> 
>      - If we want to flush specific pages from guest to host, we need to send 
>        list of the dirty pages corresponding to file on which we are doing fsync.
> 
>      - This will require implementation of new interface, a new paravirt device 
>        for sending flush requests.
> 
>      - Host side will perform fsync/fdatasync on list of dirty pages or entire 
>        block device backed file.
> 
> 2] Questions:
> -----------
> 
>  2.1] Not sure why WPQ flush is not a queued interface? We can force applications 
>       to call this? device DAX neither calls fsync/msync?
> 
>  2.2] Depending upon interface we decide, we need optimal solution to sync 
>       range of pages?
> 
>      - Send range of pages from guest to host to sync asynchronously instead 
>        of syncing entire block device?

e.g. a new virtio device to deliver sync requests to the host?

> 
>      - Other option is to sync entire disk backing file to make sure all the 
>        writes are persistent. In our case, backing file is a regular file on 
>        non NVDIMM device so host page cache has list of dirty pages which
>        can be used either with fsync or similar interface.

As the amount of dirty pages can vary, the latency of each host
fsync is likely to vary over a large range.

> 
>  2.3] If we do host fsync on entire disk we will be flushing all the dirty data
>       to backend file. Just thinking what would be better approach, flushing 
>       pages on corresponding guest file fsync or entire block device?
> 
>  2.4] If we decide to choose one of the above approaches, we need to consider 
>       all DAX supporting filesystems(ext4/xfs). Would hooking code to corresponding
>       fsync code of fs seems reasonable? Just thinking for flush hint address use-case?
>       Or how flush hint addresses would be invoked with fsync or similar api?
> 
>  2.5] Also with filesystem journalling and other mount options like barriers, 
>       ordered etc, how we decide to use page flush hint or regular fsync on file?
>  
>  2.6] If at guest side we have PFN of all the dirty pages in radixtree? and we send 
>       these to to host? At host side would we able to find corresponding page and flush 
>       them all?

That may require the host file system to provide an API to flush specified
blocks/extents and their metadata. I'm not familiar with this part and
don't know whether such an API exists.

Haozhong

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-21  9:51     ` Haozhong Zhang
@ 2017-07-21 10:21       ` Pankaj Gupta
  -1 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-07-21 10:21 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Kevin Wolf, Rik van Riel, xiaoguangrong eric, kvm-devel,
	linux-nvdimm@lists.01.org, Qemu Developers, Stefan Hajnoczi,
	Paolo Bonzini, Nitesh Narayan Lal


> > 
> > Hello,
> > 
> > We shared a proposal for 'KVM fake DAX flushing interface'.
> > 
> > https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html
> >
> 
> In above link,
>   "Overall goal of project
>    is to increase the number of virtual machines that can be
>    run on a physical machine, in order to *increase the density*
>    of customer virtual machines"
> 
> Is the fake persistent memory used as normal RAM in guest? If no, how
> is it expected to be used in guest?

Yes, the guest will have an NVDIMM DAX device and will not use the page cache for 
most operations. The host will manage the memory requirements of all the guests.
  
> 
> > We did initial POC in which we used 'virtio-blk' device to perform
> > a device flush on pmem fsync on ext4 filesystem. They are few hacks
> > to make things work. We need suggestions on below points before we
> > start actual implementation.
> >
> > A] Problems to solve:
> > ------------------
> > 
> > 1] We are considering two approaches for 'fake DAX flushing interface'.
> >     
> >  1.1] fake dax with NVDIMM flush hints & KVM async page fault
> > 
> >      - Existing interface.
> > 
> >      - The approach to use flush hint address is already nacked upstream.
> > 
> >      - Flush hint not queued interface for flushing. Applications might
> >        avoid to use it.
> > 
> >      - Flush hint address traps from guest to host and do an entire fsync
> >        on backing file which itself is costly.
> > 
> >      - Can be used to flush specific pages on host backing disk. We can
> >        send data(pages information) equal to cache-line size(limitation)
> >        and tell host to sync corresponding pages instead of entire disk
> >        sync.
> > 
> >      - This will be an asynchronous operation and vCPU control is returned
> >        quickly.
> > 
> > 
> >  1.2] Using additional para virt device in addition to pmem device(fake dax
> >  with device flush)
> > 
> >      - New interface
> > 
> >      - Guest maintains information of DAX dirty pages as exceptional
> >      entries in
> >        radix tree.
> > 
> >      - If we want to flush specific pages from guest to host, we need to
> >      send
> >        list of the dirty pages corresponding to file on which we are doing
> >        fsync.
> > 
> >      - This will require implementation of new interface, a new paravirt
> >      device
> >        for sending flush requests.
> > 
> >      - Host side will perform fsync/fdatasync on list of dirty pages or
> >      entire
> >        block device backed file.
> > 
> > 2] Questions:
> > -----------
> > 
> >  2.1] Not sure why WPQ flush is not a queued interface? We can force
> >  applications
> >       to call this? device DAX neither calls fsync/msync?
> > 
> >  2.2] Depending upon interface we decide, we need optimal solution to sync
> >       range of pages?
> > 
> >      - Send range of pages from guest to host to sync asynchronously
> >      instead
> >        of syncing entire block device?
> 
> e.g. a new virtio device to deliver sync requests to host?
> 
> > 
> >      - Other option is to sync entire disk backing file to make sure all
> >      the
> >        writes are persistent. In our case, backing file is a regular file
> >        on
> >        non NVDIMM device so host page cache has list of dirty pages which
> >        can be used either with fsync or similar interface.
> 
> As the amount of dirty pages can be variant, the latency of each host
> fsync is likely to vary in a large range.
> 
> > 
> >  2.3] If we do host fsync on entire disk we will be flushing all the dirty
> >  data
> >       to backend file. Just thinking what would be better approach,
> >       flushing
> >       pages on corresponding guest file fsync or entire block device?
> > 
> >  2.4] If we decide to choose one of the above approaches, we need to
> >  consider
> >       all DAX supporting filesystems(ext4/xfs). Would hooking code to
> >       corresponding
> >       fsync code of fs seems reasonable? Just thinking for flush hint
> >       address use-case?
> >       Or how flush hint addresses would be invoked with fsync or similar
> >       api?
> > 
> >  2.5] Also with filesystem journalling and other mount options like
> >  barriers,
> >       ordered etc, how we decide to use page flush hint or regular fsync on
> >       file?
> >  
> >  2.6] If at guest side we have PFN of all the dirty pages in radixtree? and
> >  we send
> >       these to to host? At host side would we able to find corresponding
> >       page and flush
> >       them all?
> 
> That may require the host file system provides API to flush specified
> blocks/extents and their meta data in the file system. I'm not
> familiar with this part and don't know whether such API exists.
> 
> Haozhong
> 

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-21  6:56   ` Pankaj Gupta
@ 2017-07-21 12:12     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 176+ messages in thread
From: Stefan Hajnoczi @ 2017-07-21 12:12 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	Rik van Riel, Dan Williams, Stefan Hajnoczi, ross.zwisler,
	Paolo Bonzini, Kevin Wolf, Nitesh Narayan Lal,
	xiaoguangrong.eric, Haozhong Zhang

On Fri, Jul 21, 2017 at 02:56:34AM -0400, Pankaj Gupta wrote:
> A] Problems to solve:
> ------------------
> 
> 1] We are considering two approaches for 'fake DAX flushing interface'.
>     
>  1.1] fake dax with NVDIMM flush hints & KVM async page fault
> 
>      - Existing interface.
> 
>      - The approach to use flush hint address is already nacked upstream.
>
>      - Flush hint not queued interface for flushing. Applications might 
>        avoid to use it.

This doesn't contradict the last point about async operation and vCPU
control.  KVM async page faults turn the Address Flush Hints write into
an async operation, so the guest can get other work done while waiting
for completion.

> 
>      - Flush hint address traps from guest to host and do an entire fsync 
>        on backing file which itself is costly.
> 
>      - Can be used to flush specific pages on host backing disk. We can 
>        send data(pages information) equal to cache-line size(limitation) 
>        and tell host to sync corresponding pages instead of entire disk sync.

Are you sure?  Your previous point says only the entire device can be
synced.  The NVDIMM Address Flush Hints interface does not involve
address range information.

> 
>      - This will be an asynchronous operation and vCPU control is returned 
>        quickly.
> 
> 
>  1.2] Using additional para virt device in addition to pmem device(fake dax with device flush)

Perhaps this can be exposed via ACPI as part of the NVDIMM standards
instead of a separate KVM-only paravirt device.

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-21 12:12     ` [Qemu-devel] " Stefan Hajnoczi
@ 2017-07-21 13:29       ` Pankaj Gupta
  -1 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-07-21 13:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Rik van Riel, xiaoguangrong eric, kvm-devel,
	linux-nvdimm@lists.01.org, Qemu Developers, Stefan Hajnoczi,
	Paolo Bonzini, Nitesh Narayan Lal


> > A] Problems to solve:
> > ------------------
> > 
> > 1] We are considering two approaches for 'fake DAX flushing interface'.
> >     
> >  1.1] fake dax with NVDIMM flush hints & KVM async page fault
> > 
> >      - Existing interface.
> > 
> >      - The approach to use flush hint address is already nacked upstream.
> >
> >      - Flush hint not queued interface for flushing. Applications might
> >        avoid to use it.
> 
> This doesn't contradicts the last point about async operation and vcpu
> control.  KVM async page faults turn the Address Flush Hints write into
> an async operation so the guest can get other work done while waiting
> for completion.
> 
> > 
> >      - Flush hint address traps from guest to host and do an entire fsync
> >        on backing file which itself is costly.
> > 
> >      - Can be used to flush specific pages on host backing disk. We can
> >        send data(pages information) equal to cache-line size(limitation)
> >        and tell host to sync corresponding pages instead of entire disk
> >        sync.
> 
> Are you sure?  Your previous point says only the entire device can be
> synced.  The NVDIMM Adress Flush Hints interface does not involve
> address range information.

Just syncing the entire block device should be simple but costly. Using the flush 
hint address to write data that contains the list/info of dirty pages to flush 
requires more thought; it invokes the MMIO write callback on the QEMU side.
As per the ACPI spec 6.1 (Table 5-135), there is a limit on the maximum length 
of data the guest can write, equal to the cache line size. (A hypothetical sketch 
of such a cache-line-sized descriptor follows below.)
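
Purely to illustrate the cache-line constraint (assuming a 64-byte cache line), a
descriptor written through the flush hint could only carry a handful of ranges.
Everything below is hypothetical, since ACPI assigns no meaning to the data written
to a flush hint address:

    #include <stdint.h>

    /* Hypothetical descriptor sized to fit in one 64-byte cache line. */
    struct flush_hint_desc {
            uint32_t nranges;          /* how many entries below are valid */
            uint32_t flags;            /* e.g. 0 = sync listed ranges, 1 = sync all */
            struct {
                    uint64_t start_pfn;
                    uint64_t npages;
            } range[3];                /* 8 + 3 * 16 = 56 bytes used so far */
            uint64_t reserved;         /* pad to exactly 64 bytes */
    };

    _Static_assert(sizeof(struct flush_hint_desc) == 64,
                   "descriptor must fit in one cache line");

This is part of why a separate paravirt device, which is not limited to a single
cache-line-sized write, looks attractive for sending dirty-page lists.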
 
> 
> > 
> >      - This will be an asynchronous operation and vCPU control is returned
> >        quickly.
> > 
> > 
> >  1.2] Using additional para virt device in addition to pmem device(fake dax
> >  with device flush)
> 
> Perhaps this can be exposed via ACPI as part of the NVDIMM standards
> instead of a separate KVM-only paravirt device.

Same reason as above: if we decide on sending a list of dirty pages, there is a 
limit on the maximum amount of data we can send to the host using the flush hint 
address.
> 

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-21 14:00         ` Rik van Riel
  0 siblings, 0 replies; 176+ messages in thread
From: Rik van Riel @ 2017-07-21 14:00 UTC (permalink / raw)
  To: Pankaj Gupta, Stefan Hajnoczi
  Cc: Kevin Wolf, xiaoguangrong eric, kvm-devel,
	linux-nvdimm@lists.01.org, Qemu Developers, Stefan Hajnoczi,
	Paolo Bonzini, Nitesh Narayan Lal

On Fri, 2017-07-21 at 09:29 -0400, Pankaj Gupta wrote:
> > > 
> > >      - Flush hint address traps from guest to host and do an
> > > entire fsync
> > >        on backing file which itself is costly.
> > > 
> > >      - Can be used to flush specific pages on host backing disk.
> > > We can
> > >        send data(pages information) equal to cache-line
> > > size(limitation)
> > >        and tell host to sync corresponding pages instead of
> > > entire disk
> > >        sync.
> > 
> > Are you sure?  Your previous point says only the entire device can
> > be
> > synced.  The NVDIMM Adress Flush Hints interface does not involve
> > address range information.
> 
> Just syncing entire block device should be simple but costly.

How costly it is depends on just how fast the backing IO device is.

If the backing IO is a spinning disk, doing targeted range
syncs will certainly be faster.

On the other hand, if the backing IO is one of the latest
generation SSD devices, it may be faster to have just one
hypercall and flush everything, than it would be to have
separate sync calls for each range that we want flushed.

Should we design our interfaces for yesterday's storage
devices, or for tomorrow's storage devices?

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
@ 2017-07-21 14:00         ` Rik van Riel
  0 siblings, 0 replies; 176+ messages in thread
From: Rik van Riel @ 2017-07-21 14:00 UTC (permalink / raw)
  To: Pankaj Gupta, Stefan Hajnoczi
  Cc: kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	Dan Williams, Stefan Hajnoczi, ross zwisler, Paolo Bonzini,
	Kevin Wolf, Nitesh Narayan Lal, xiaoguangrong eric,
	Haozhong Zhang

On Fri, 2017-07-21 at 09:29 -0400, Pankaj Gupta wrote:
> > > 
> > >      - Flush hint address traps from guest to host and do an
> > > entire fsync
> > >        on backing file which itself is costly.
> > > 
> > >      - Can be used to flush specific pages on host backing disk.
> > > We can
> > >        send data(pages information) equal to cache-line
> > > size(limitation)
> > >        and tell host to sync corresponding pages instead of
> > > entire disk
> > >        sync.
> > 
> > Are you sure?  Your previous point says only the entire device can
> > be
> > synced.  The NVDIMM Adress Flush Hints interface does not involve
> > address range information.
> 
> Just syncing entire block device should be simple but costly.

Costly depends on just how fast the backing IO device is.

If the backing IO is a spinning disk, doing targeted range
syncs will certainly be faster.

On the other hand, if the backing IO is one of the latest
generation SSD devices, it may be faster to have just one
hypercall and flush everything, than it would be to have
separate sync calls for each range that we want flushed.

Should we design our interfaces for yesterday's storage
devices, or for tomorrow's storage devices?

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-21 13:29       ` Pankaj Gupta
@ 2017-07-21 15:58         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 176+ messages in thread
From: Stefan Hajnoczi @ 2017-07-21 15:58 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Stefan Hajnoczi, kvm-devel, Qemu Developers,
	linux-nvdimm@lists.01.org, Rik van Riel, Dan Williams,
	ross zwisler, Paolo Bonzini, Kevin Wolf, Nitesh Narayan Lal,
	xiaoguangrong eric, Haozhong Zhang

On Fri, Jul 21, 2017 at 09:29:15AM -0400, Pankaj Gupta wrote:
> 
> > > A] Problems to solve:
> > > ------------------
> > > 
> > > 1] We are considering two approaches for 'fake DAX flushing interface'.
> > >     
> > >  1.1] fake dax with NVDIMM flush hints & KVM async page fault
> > > 
> > >      - Existing interface.
> > > 
> > >      - The approach to use flush hint address is already nacked upstream.
> > >
> > >      - Flush hint not queued interface for flushing. Applications might
> > >        avoid to use it.
> > 
> > This doesn't contradicts the last point about async operation and vcpu
> > control.  KVM async page faults turn the Address Flush Hints write into
> > an async operation so the guest can get other work done while waiting
> > for completion.
> > 
> > > 
> > >      - Flush hint address traps from guest to host and do an entire fsync
> > >        on backing file which itself is costly.
> > > 
> > >      - Can be used to flush specific pages on host backing disk. We can
> > >        send data(pages information) equal to cache-line size(limitation)
> > >        and tell host to sync corresponding pages instead of entire disk
> > >        sync.
> > 
> > Are you sure?  Your previous point says only the entire device can be
> > synced.  The NVDIMM Adress Flush Hints interface does not involve
> > address range information.
> 
> Just syncing entire block device should be simple but costly. Using flush 
> hint address to write data which contains list/info of dirty pages to 
> flush requires more thought. This calls mmio write callback at Qemu side.
> As per Intel (ACPI spec 6.1, Table 5-135) there is limit to max length 
> of data guest can write and is equal to cache line size.
>  
> > 
> > > 
> > >      - This will be an asynchronous operation and vCPU control is returned
> > >        quickly.
> > > 
> > > 
> > >  1.2] Using additional para virt device in addition to pmem device(fake dax
> > >  with device flush)
> > 
> > Perhaps this can be exposed via ACPI as part of the NVDIMM standards
> > instead of a separate KVM-only paravirt device.
> 
> Same reason as above. If we decide on sending list of dirty pages there is
> limit to send max size of data to host using flush hint address.  

I understand now: you are proposing to change the semantics of the
Address Flush Hints interface.  You want the value written to have
meaning (the address range that needs to be flushed).

Today the spec says:

  The content of the data is not relevant to the functioning of the
  flush hint mechanism.

Maybe the NVDIMM folks can comment on this idea.
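
Concretely, giving the written value meaning would mean the guest packs a
range description into the at-most-cache-line-sized write, for example
something like the following purely hypothetical layout (nothing like this
exists in the spec today):

  #include <stdint.h>

  /* hypothetical payload for a flush hint write; fits easily in a
   * 64-byte cache line */
  struct flush_hint_payload {
      uint64_t start_pfn;   /* first dirty guest page frame */
      uint32_t nr_pages;    /* length of the dirty range */
      uint32_t flags;       /* e.g. "more ranges follow" */
  } __attribute__((packed));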

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-22 19:34           ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-22 19:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, xiaoguangrong eric,
	kvm-devel, linux-nvdimm@lists.01.org, Qemu Developers,
	Stefan Hajnoczi, Paolo Bonzini, Nitesh Narayan Lal

On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Fri, Jul 21, 2017 at 09:29:15AM -0400, Pankaj Gupta wrote:
>>
>> > > A] Problems to solve:
>> > > ------------------
>> > >
>> > > 1] We are considering two approaches for 'fake DAX flushing interface'.
>> > >
>> > >  1.1] fake dax with NVDIMM flush hints & KVM async page fault
>> > >
>> > >      - Existing interface.
>> > >
>> > >      - The approach to use flush hint address is already nacked upstream.
>> > >
>> > >      - Flush hint not queued interface for flushing. Applications might
>> > >        avoid to use it.
>> >
>> > This doesn't contradicts the last point about async operation and vcpu
>> > control.  KVM async page faults turn the Address Flush Hints write into
>> > an async operation so the guest can get other work done while waiting
>> > for completion.
>> >
>> > >
>> > >      - Flush hint address traps from guest to host and do an entire fsync
>> > >        on backing file which itself is costly.
>> > >
>> > >      - Can be used to flush specific pages on host backing disk. We can
>> > >        send data(pages information) equal to cache-line size(limitation)
>> > >        and tell host to sync corresponding pages instead of entire disk
>> > >        sync.
>> >
>> > Are you sure?  Your previous point says only the entire device can be
>> > synced.  The NVDIMM Adress Flush Hints interface does not involve
>> > address range information.
>>
>> Just syncing entire block device should be simple but costly. Using flush
>> hint address to write data which contains list/info of dirty pages to
>> flush requires more thought. This calls mmio write callback at Qemu side.
>> As per Intel (ACPI spec 6.1, Table 5-135) there is limit to max length
>> of data guest can write and is equal to cache line size.
>>
>> >
>> > >
>> > >      - This will be an asynchronous operation and vCPU control is returned
>> > >        quickly.
>> > >
>> > >
>> > >  1.2] Using additional para virt device in addition to pmem device(fake dax
>> > >  with device flush)
>> >
>> > Perhaps this can be exposed via ACPI as part of the NVDIMM standards
>> > instead of a separate KVM-only paravirt device.
>>
>> Same reason as above. If we decide on sending list of dirty pages there is
>> limit to send max size of data to host using flush hint address.
>
> I understand now: you are proposing to change the semantics of the
> Address Flush Hints interface.  You want the value written to have
> meaning (the address range that needs to be flushed).
>
> Today the spec says:
>
>   The content of the data is not relevant to the functioning of the
>   flush hint mechanism.
>
> Maybe the NVDIMM folks can comment on this idea.

I think it's unworkable to use the flush hints as a guest-to-host
fsync mechanism. That mechanism was designed to flush small memory
controller buffers, not large swaths of dirty memory. What about
running the guests in a writethrough cache mode to avoid needing dirty
cache management altogether? Either way I think you need to use
device-dax on the host, or one of the two work-in-progress filesystem
mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid needing any
metadata coordination between guests and the host.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-23 14:04             ` Rik van Riel
  0 siblings, 0 replies; 176+ messages in thread
From: Rik van Riel @ 2017-07-23 14:04 UTC (permalink / raw)
  To: Dan Williams, Stefan Hajnoczi
  Cc: Kevin Wolf, Pankaj Gupta, xiaoguangrong eric, kvm-devel,
	linux-nvdimm@lists.01.org, Qemu Developers, Stefan Hajnoczi,
	Paolo Bonzini, Nitesh Narayan Lal

On Sat, 2017-07-22 at 12:34 -0700, Dan Williams wrote:
> On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi <stefanha@redhat.com
> > wrote:
> >
> > Maybe the NVDIMM folks can comment on this idea.
> 
> I think it's unworkable to use the flush hints as a guest-to-host
> fsync mechanism. That mechanism was designed to flush small memory
> controller buffers, not large swaths of dirty memory. What about
> running the guests in a writethrough cache mode to avoid needing
> dirty
> cache management altogether? Either way I think you need to use
> device-dax on the host, or one of the two work-in-progress filesystem
> mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid need any
> metadata coordination between guests and the host.

The thing Pankaj is looking at is to use the DAX mechanisms
inside the guest (disk image as memory mapped nvdimm area),
with that disk image backed by a regular storage device on
the host.

The goal is to increase density of guests, by moving page
cache into the host (where it can be easily reclaimed).

If we assume the guests will be backed by relatively fast
SSDs, a "whole device flush" from filesystem journaling
code (issued where the filesystem issues a barrier or
disk cache flush today) may be just what we need to make
that work.
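
For reference, the setup being described is essentially Qemu's existing
emulated NVDIMM backed by a regular file, along the lines of the following
(paths and sizes are only illustrative):

  qemu-system-x86_64 -machine pc,nvdimm=on \
      -m 4G,slots=2,maxmem=16G \
      -object memory-backend-file,id=mem1,share=on,mem-path=/tmp/pmem.img,size=8G \
      -device nvdimm,id=nvdimm1,memdev=mem1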

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-23 16:01               ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-23 16:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kevin Wolf, Pankaj Gupta, Jan Kara, xiaoguangrong eric,
	kvm-devel, linux-nvdimm@lists.01.org, Zwisler,
	Ross  <ross.zwisler@intel.com>,
	Qemu Developers <qemu-devel@nongnu.org>,
	Stefan Hajnoczi, Stefan Hajnoczi, Paolo Bonzini,
	Nitesh Narayan Lal

[ adding Ross and Jan ]

On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com> wrote:
> On Sat, 2017-07-22 at 12:34 -0700, Dan Williams wrote:
>> On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi <stefanha@redhat.com
>> > wrote:
>> >
>> > Maybe the NVDIMM folks can comment on this idea.
>>
>> I think it's unworkable to use the flush hints as a guest-to-host
>> fsync mechanism. That mechanism was designed to flush small memory
>> controller buffers, not large swaths of dirty memory. What about
>> running the guests in a writethrough cache mode to avoid needing
>> dirty
>> cache management altogether? Either way I think you need to use
>> device-dax on the host, or one of the two work-in-progress filesystem
>> mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid need any
>> metadata coordination between guests and the host.
>
> The thing Pankaj is looking at is to use the DAX mechanisms
> inside the guest (disk image as memory mapped nvdimm area),
> with that disk image backed by a regular storage device on
> the host.
>
> The goal is to increase density of guests, by moving page
> cache into the host (where it can be easily reclaimed).
>
> If we assume the guests will be backed by relatively fast
> SSDs, a "whole device flush" from filesystem journaling
> code (issued where the filesystem issues a barrier or
> disk cache flush today) may be just what we need to make
> that work.

Ok, apologies, I indeed had some pieces of the proposal confused.

However, it still seems like the storage interface is not capable of
expressing what is needed, because the operation that is needed is a
range flush. In the guest you want the DAX page dirty tracking to
communicate range flush information to the host, but there's no
readily available block i/o semantic that software running on top of
the fake pmem device can use to communicate with the host. Instead you
want to intercept the dax_flush() operation and turn it into a queued
request on the host.

In 4.13 we have turned this dax_flush() operation into an explicit
driver call. That seems a better interface to modify than trying to
map block-storage flush-cache / force-unit-access commands to this
host request.

The additional piece you would need to consider is whether to track
all writes in addition to mmap writes in the guest as DAX-page-cache
dirtying events, or arrange for every dax_copy_from_iter() operation
to also queue a sync on the host, but that essentially turns the host
page cache into a pseudo write-through mode.
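
A rough sketch of that interception point, written against the 4.13-era
dax_operations mentioned above (the fake_pmem_* names and the host-flush
upcall are made up for illustration):

  #include <linux/dax.h>

  /* hypothetical: queue a (pgoff, size) flush request to the host and
   * wait for completion, e.g. over a virtqueue */
  static void fake_pmem_queue_host_flush(struct dax_device *dax_dev,
                                         pgoff_t pgoff, size_t size);

  static void fake_pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
                                  void *addr, size_t size)
  {
      /* cache flush instructions are pointless for a file-backed region;
       * forward the range to the host instead */
      fake_pmem_queue_host_flush(dax_dev, pgoff, size);
  }

  static const struct dax_operations fake_pmem_dax_ops = {
      /* .direct_access and .copy_from_iter much like drivers/nvdimm/pmem.c */
      .flush = fake_pmem_dax_flush,
  };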

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-23 18:10                 ` Rik van Riel
  0 siblings, 0 replies; 176+ messages in thread
From: Rik van Riel @ 2017-07-23 18:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Jan Kara, xiaoguangrong eric,
	kvm-devel, linux-nvdimm@lists.01.org, Zwisler,
	Ross  <ross.zwisler@intel.com>,
	Qemu Developers <qemu-devel@nongnu.org>,
	Stefan Hajnoczi, Stefan Hajnoczi, Paolo Bonzini,
	Nitesh Narayan Lal

On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> [ adding Ross and Jan ]
> 
> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
> wrote:
> > 
> > The goal is to increase density of guests, by moving page
> > cache into the host (where it can be easily reclaimed).
> > 
> > If we assume the guests will be backed by relatively fast
> > SSDs, a "whole device flush" from filesystem journaling
> > code (issued where the filesystem issues a barrier or
> > disk cache flush today) may be just what we need to make
> > that work.
> 
> Ok, apologies, I indeed had some pieces of the proposal confused.
> 
> However, it still seems like the storage interface is not capable of
> expressing what is needed, because the operation that is needed is a
> range flush. In the guest you want the DAX page dirty tracking to
> communicate range flush information to the host, but there's no
> readily available block i/o semantic that software running on top of
> the fake pmem device can use to communicate with the host. Instead
> you
> want to intercept the dax_flush() operation and turn it into a queued
> request on the host.
> 
> In 4.13 we have turned this dax_flush() operation into an explicit
> driver call. That seems a better interface to modify than trying to
> map block-storage flush-cache / force-unit-access commands to this
> host request.
> 
> The additional piece you would need to consider is whether to track
> all writes in addition to mmap writes in the guest as DAX-page-cache
> dirtying events, or arrange for every dax_copy_from_iter()
> operation()
> to also queue a sync on the host, but that essentially turns the host
> page cache into a pseudo write-through mode.

I suspect initially it will be fine to not offer DAX
semantics to applications using these "fake DAX" devices
from a virtual machine, because the DAX APIs are designed
for a much higher performance device than these fake DAX
setups could ever give.

Having userspace call fsync/msync like done normally, and
having those coarser calls be turned into somewhat efficient
backend flushes would be perfectly acceptable.

The big question is, what should that kind of interface look
like?
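
One possible shape, purely as a strawman: a request/completion pair carried
over a paravirt queue, roughly like the following (nothing here is
standardized):

  #include <stdint.h>

  /* strawman guest->host flush request for a paravirt flush device */
  struct fake_dax_flush_req {
      uint64_t start;   /* byte offset into the pmem region */
      uint64_t len;     /* 0 could mean "flush everything" */
  };

  struct fake_dax_flush_resp {
      int32_t ret;      /* 0 on success, -errno from the host fsync */
  };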

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-23 18:10                 ` Rik van Riel
  (?)
@ 2017-07-23 20:10                   ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-23 20:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kevin Wolf, Pankaj Gupta, Jan Kara, xiaoguangrong eric,
	kvm-devel, linux-nvdimm@lists.01.org, Zwisler,
	Ross  <ross.zwisler@intel.com>,
	Qemu Developers <qemu-devel@nongnu.org>,
	Stefan Hajnoczi, Stefan Hajnoczi, Paolo Bonzini,
	Nitesh Narayan Lal

On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
> On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
>> [ adding Ross and Jan ]
>>
>> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
>> wrote:
>> >
>> > The goal is to increase density of guests, by moving page
>> > cache into the host (where it can be easily reclaimed).
>> >
>> > If we assume the guests will be backed by relatively fast
>> > SSDs, a "whole device flush" from filesystem journaling
>> > code (issued where the filesystem issues a barrier or
>> > disk cache flush today) may be just what we need to make
>> > that work.
>>
>> Ok, apologies, I indeed had some pieces of the proposal confused.
>>
>> However, it still seems like the storage interface is not capable of
>> expressing what is needed, because the operation that is needed is a
>> range flush. In the guest you want the DAX page dirty tracking to
>> communicate range flush information to the host, but there's no
>> readily available block i/o semantic that software running on top of
>> the fake pmem device can use to communicate with the host. Instead
>> you
>> want to intercept the dax_flush() operation and turn it into a queued
>> request on the host.
>>
>> In 4.13 we have turned this dax_flush() operation into an explicit
>> driver call. That seems a better interface to modify than trying to
>> map block-storage flush-cache / force-unit-access commands to this
>> host request.
>>
>> The additional piece you would need to consider is whether to track
>> all writes in addition to mmap writes in the guest as DAX-page-cache
>> dirtying events, or arrange for every dax_copy_from_iter()
>> operation()
>> to also queue a sync on the host, but that essentially turns the host
>> page cache into a pseudo write-through mode.
>
> I suspect initially it will be fine to not offer DAX
> semantics to applications using these "fake DAX" devices
> from a virtual machine, because the DAX APIs are designed
> for a much higher performance device than these fake DAX
> setups could ever give.

Right, we don't need DAX, per se, in the guest.

>
> Having userspace call fsync/msync like done normally, and
> having those coarser calls be turned into somewhat efficient
> backend flushes would be perfectly acceptable.
>
> The big question is, what should that kind of interface look
> like?

To me, this looks much like the dirty cache tracking that is done in
the address_space radix tree for the DAX case, but modified to coordinate
queued / page-based flushing when the guest wants to persist data.
The similarity to DAX is that the radix tree stores not guest-allocated
pages but entries that track dirty guest physical addresses.
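
As a sketch of that idea against the 4.13-era radix tree API (names and the
missing locking are illustrative only):

  #include <linux/radix-tree.h>
  #include <linux/slab.h>

  static RADIX_TREE(dirty_gfns, GFP_ATOMIC);
  #define DIRTY_TAG 0

  /* record a dirty guest page frame; the stored value is just a non-NULL
   * cookie, the information lives in the index and the tag */
  static void mark_gfn_dirty(unsigned long gfn)
  {
      void *entry = radix_tree_lookup(&dirty_gfns, gfn);

      if (!entry) {
          entry = kmalloc(sizeof(long), GFP_ATOMIC);
          if (!entry || radix_tree_insert(&dirty_gfns, gfn, entry)) {
              kfree(entry);   /* allocation failed or we raced; sketch only */
              return;
          }
      }
      radix_tree_tag_set(&dirty_gfns, gfn, DIRTY_TAG);
  }

  /* walk the dirty entries when the guest asks to persist data */
  static void flush_dirty_gfns_to_host(void)
  {
      struct radix_tree_iter iter;
      void **slot;

      radix_tree_for_each_tagged(slot, &dirty_gfns, &iter, 0, DIRTY_TAG) {
          /* iter.index is the dirty guest pfn; hand it to the host
           * flush queue here */
          radix_tree_tag_clear(&dirty_gfns, iter.index, DIRTY_TAG);
      }
  }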

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-24 10:23                     ` Jan Kara
  0 siblings, 0 replies; 176+ messages in thread
From: Jan Kara @ 2017-07-24 10:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara,
	xiaoguangrong eric, kvm-devel, Stefan Hajnoczi, Zwisler, Ross,
	Qemu Developers, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal

On Sun 23-07-17 13:10:34, Dan Williams wrote:
> On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
> > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> >> [ adding Ross and Jan ]
> >>
> >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
> >> wrote:
> >> >
> >> > The goal is to increase density of guests, by moving page
> >> > cache into the host (where it can be easily reclaimed).
> >> >
> >> > If we assume the guests will be backed by relatively fast
> >> > SSDs, a "whole device flush" from filesystem journaling
> >> > code (issued where the filesystem issues a barrier or
> >> > disk cache flush today) may be just what we need to make
> >> > that work.
> >>
> >> Ok, apologies, I indeed had some pieces of the proposal confused.
> >>
> >> However, it still seems like the storage interface is not capable of
> >> expressing what is needed, because the operation that is needed is a
> >> range flush. In the guest you want the DAX page dirty tracking to
> >> communicate range flush information to the host, but there's no
> >> readily available block i/o semantic that software running on top of
> >> the fake pmem device can use to communicate with the host. Instead
> >> you
> >> want to intercept the dax_flush() operation and turn it into a queued
> >> request on the host.
> >>
> >> In 4.13 we have turned this dax_flush() operation into an explicit
> >> driver call. That seems a better interface to modify than trying to
> >> map block-storage flush-cache / force-unit-access commands to this
> >> host request.
> >>
> >> The additional piece you would need to consider is whether to track
> >> all writes in addition to mmap writes in the guest as DAX-page-cache
> >> dirtying events, or arrange for every dax_copy_from_iter()
> >> operation()
> >> to also queue a sync on the host, but that essentially turns the host
> >> page cache into a pseudo write-through mode.
> >
> > I suspect initially it will be fine to not offer DAX
> > semantics to applications using these "fake DAX" devices
> > from a virtual machine, because the DAX APIs are designed
> > for a much higher performance device than these fake DAX
> > setups could ever give.
> 
> Right, we don't need DAX, per se, in the guest.
> 
> >
> > Having userspace call fsync/msync like done normally, and
> > having those coarser calls be turned into somewhat efficient
> > backend flushes would be perfectly acceptable.
> >
> > The big question is, what should that kind of interface look
> > like?
> 
> To me, this looks much like the dirty cache tracking that is done in
> the address_space radix for the DAX case, but modified to coordinate
> queued / page-based flushing when the guest  wants to persist data.
> The similarity to DAX is not storing guest allocated pages in the
> radix but entries that track dirty guest physical addresses.

Let me check whether I understand the problem correctly. So we want to
export a block device (essentially the page cache of this block device) to a
guest as PMEM and use DAX in the guest to save guest page cache. The
natural way to make the persistence work would be to have the ->flush
callback of the PMEM device do an upcall to the host, which could then
fdatasync() the appropriate image file range; however, the performance would
suck in such a case since ->flush gets called for ranges of at most one page
from DAX.
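
Just to spell out the cost model, a toy sketch of that per-range upcall
(flush_upcall() is a made-up stand-in for the guest/host trap, not a real
interface):

/* Hypothetical sketch of the per-range ->flush upcall, only to show the
 * cost model: every small dirty range pays one guest->host transition.
 * flush_upcall() is a stand-in for the trap to the host, not a real API. */
#include <stdint.h>
#include <stdio.h>

static void flush_upcall(uint64_t off, uint64_t len)
{
    /* In the real thing this would exit to the host, which would then
     * fdatasync()/msync() the matching range of the image file. */
    printf("upcall: off=0x%llx len=0x%llx\n",
           (unsigned long long)off, (unsigned long long)len);
}

int main(void)
{
    /* ->flush is called for ranges of at most one page, so flushing 256
     * dirty pages means 256 separate exits. */
    for (int i = 0; i < 256; i++)
        flush_upcall(i * 4096ull, 4096);
    return 0;
}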

So what you could do instead is to completely ignore ->flush calls for the
PMEM device and instead catch bios with the REQ_PREFLUSH flag set on the
PMEM device (generated by blkdev_issue_flush() or the journalling
machinery) and fdatasync() the whole image file at that moment - in fact
you must do that anyway for metadata IO to hit persistent storage in your
setting. This would very closely follow how exporting block devices with a
volatile cache works with KVM these days AFAIU, and the performance would be
the same.
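
On the host side, handling that flush is then nothing more than syncing
the backing file; a minimal POSIX sketch (not actual QEMU code, image
path is made up):

/* Sketch of the host side: when the guest's virtual PMEM block device
 * sends a flush (REQ_PREFLUSH in the guest), sync the whole image file.
 * Plain POSIX on an image file descriptor, not actual QEMU code. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int handle_guest_flush(int img_fd)
{
    /* Push all dirty host page cache for the image file to stable
     * storage; this covers guest data as well as metadata writes. */
    if (fdatasync(img_fd) < 0) {
        perror("fdatasync");
        return -1;
    }
    return 0;
}

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : "disk.img", O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    handle_guest_flush(fd);
    close(fd);
    return 0;
}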

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-24 12:06                       ` Pankaj Gupta
  0 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-07-24 12:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kevin Wolf, Rik van Riel, xiaoguangrong eric, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal


> On Sun 23-07-17 13:10:34, Dan Williams wrote:
> > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
> > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> > >> [ adding Ross and Jan ]
> > >>
> > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
> > >> wrote:
> > >> >
> > >> > The goal is to increase density of guests, by moving page
> > >> > cache into the host (where it can be easily reclaimed).
> > >> >
> > >> > If we assume the guests will be backed by relatively fast
> > >> > SSDs, a "whole device flush" from filesystem journaling
> > >> > code (issued where the filesystem issues a barrier or
> > >> > disk cache flush today) may be just what we need to make
> > >> > that work.
> > >>
> > >> Ok, apologies, I indeed had some pieces of the proposal confused.
> > >>
> > >> However, it still seems like the storage interface is not capable of
> > >> expressing what is needed, because the operation that is needed is a
> > >> range flush. In the guest you want the DAX page dirty tracking to
> > >> communicate range flush information to the host, but there's no
> > >> readily available block i/o semantic that software running on top of
> > >> the fake pmem device can use to communicate with the host. Instead
> > >> you
> > >> want to intercept the dax_flush() operation and turn it into a queued
> > >> request on the host.
> > >>
> > >> In 4.13 we have turned this dax_flush() operation into an explicit
> > >> driver call. That seems a better interface to modify than trying to
> > >> map block-storage flush-cache / force-unit-access commands to this
> > >> host request.
> > >>
> > >> The additional piece you would need to consider is whether to track
> > >> all writes in addition to mmap writes in the guest as DAX-page-cache
> > >> dirtying events, or arrange for every dax_copy_from_iter()
> > >> operation()
> > >> to also queue a sync on the host, but that essentially turns the host
> > >> page cache into a pseudo write-through mode.
> > >
> > > I suspect initially it will be fine to not offer DAX
> > > semantics to applications using these "fake DAX" devices
> > > from a virtual machine, because the DAX APIs are designed
> > > for a much higher performance device than these fake DAX
> > > setups could ever give.
> > 
> > Right, we don't need DAX, per se, in the guest.
> > 
> > >
> > > Having userspace call fsync/msync like done normally, and
> > > having those coarser calls be turned into somewhat efficient
> > > backend flushes would be perfectly acceptable.
> > >
> > > The big question is, what should that kind of interface look
> > > like?
> > 
> > To me, this looks much like the dirty cache tracking that is done in
> > the address_space radix for the DAX case, but modified to coordinate
> > queued / page-based flushing when the guest  wants to persist data.
> > The similarity to DAX is not storing guest allocated pages in the
> > radix but entries that track dirty guest physical addresses.
> 
> Let me check whether I understand the problem correctly. So we want to
> export a block device (essentially a page cache of this block device) to a
> guest as PMEM and use DAX in the guest to save guest's page cache. The

That's correct.

> natural way to make the persistence work would be to make ->flush callback
> of the PMEM device to do an upcall to the host which could then fdatasync()
> appropriate image file range however the performance would suck in such
> case since ->flush gets called for at most one page ranges from DAX.

The discussion is: sync a range using a paravirt device or flush hint
addresses, vs. a block device flush.

> 
> So what you could do instead is to completely ignore ->flush calls for the
> PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
> PMEM device (generated by blkdev_issue_flush() or the journalling
> machinery) and fdatasync() the whole image file at that moment - in fact
> you must do that for metadata IO to hit persistent storage anyway in your
> setting. This would very closely follow how exporting block devices with
> volatile cache works with KVM these days AFAIU and the performance will be
> the same.

Yes, 'blkdev_issue_flush' does set the 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
As per the suggestions, the block device flush looks like the way ahead.

Would an asynchronous block flush on the guest side (putting the current task
on a wait queue until the host-side fdatasync completes) solve the purpose?
Or do we need another paravirt device for this?
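
Something like the following userspace analogy is what I have in mind,
with pthreads standing in for the guest wait queue and all names made up:

/* Userspace analogy for the asynchronous flush: the issuing task sleeps
 * on a completion until the "host" has finished fdatasync().  pthreads
 * stand in for the guest wait queue; names are made up, not kernel code. */
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

struct flush_req {
    int img_fd;
    int done;
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

/* "Host" side: perform the sync, then wake up the waiter. */
static void *host_worker(void *arg)
{
    struct flush_req *req = arg;

    fdatasync(req->img_fd);
    pthread_mutex_lock(&req->lock);
    req->done = 1;
    pthread_cond_signal(&req->cond);
    pthread_mutex_unlock(&req->lock);
    return NULL;
}

/* "Guest" side: queue the flush request and sleep until it completes. */
static void guest_flush(int img_fd)
{
    struct flush_req req = {
        .img_fd = img_fd,
        .lock = PTHREAD_MUTEX_INITIALIZER,
        .cond = PTHREAD_COND_INITIALIZER,
    };
    pthread_t t;

    pthread_create(&t, NULL, host_worker, &req);
    pthread_mutex_lock(&req.lock);
    while (!req.done)
        pthread_cond_wait(&req.cond, &req.lock);
    pthread_mutex_unlock(&req.lock);
    pthread_join(t, NULL);
}

int main(void)
{
    int fd = open("disk.img", O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return 1;
    guest_flush(fd);
    close(fd);
    return 0;
}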

> 
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-24 12:06                       ` Pankaj Gupta
@ 2017-07-24 12:37                         ` Jan Kara
  -1 siblings, 0 replies; 176+ messages in thread
From: Jan Kara @ 2017-07-24 12:37 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, xiaoguangrong eric,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi, linux-nvdimm@lists.01.org, Paolo Bonzini,
	Nitesh Narayan Lal

On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
> 
> > On Sun 23-07-17 13:10:34, Dan Williams wrote:
> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> > > >> [ adding Ross and Jan ]
> > > >>
> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
> > > >> wrote:
> > > >> >
> > > >> > The goal is to increase density of guests, by moving page
> > > >> > cache into the host (where it can be easily reclaimed).
> > > >> >
> > > >> > If we assume the guests will be backed by relatively fast
> > > >> > SSDs, a "whole device flush" from filesystem journaling
> > > >> > code (issued where the filesystem issues a barrier or
> > > >> > disk cache flush today) may be just what we need to make
> > > >> > that work.
> > > >>
> > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
> > > >>
> > > >> However, it still seems like the storage interface is not capable of
> > > >> expressing what is needed, because the operation that is needed is a
> > > >> range flush. In the guest you want the DAX page dirty tracking to
> > > >> communicate range flush information to the host, but there's no
> > > >> readily available block i/o semantic that software running on top of
> > > >> the fake pmem device can use to communicate with the host. Instead
> > > >> you
> > > >> want to intercept the dax_flush() operation and turn it into a queued
> > > >> request on the host.
> > > >>
> > > >> In 4.13 we have turned this dax_flush() operation into an explicit
> > > >> driver call. That seems a better interface to modify than trying to
> > > >> map block-storage flush-cache / force-unit-access commands to this
> > > >> host request.
> > > >>
> > > >> The additional piece you would need to consider is whether to track
> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
> > > >> dirtying events, or arrange for every dax_copy_from_iter()
> > > >> operation()
> > > >> to also queue a sync on the host, but that essentially turns the host
> > > >> page cache into a pseudo write-through mode.
> > > >
> > > > I suspect initially it will be fine to not offer DAX
> > > > semantics to applications using these "fake DAX" devices
> > > > from a virtual machine, because the DAX APIs are designed
> > > > for a much higher performance device than these fake DAX
> > > > setups could ever give.
> > > 
> > > Right, we don't need DAX, per se, in the guest.
> > > 
> > > >
> > > > Having userspace call fsync/msync like done normally, and
> > > > having those coarser calls be turned into somewhat efficient
> > > > backend flushes would be perfectly acceptable.
> > > >
> > > > The big question is, what should that kind of interface look
> > > > like?
> > > 
> > > To me, this looks much like the dirty cache tracking that is done in
> > > the address_space radix for the DAX case, but modified to coordinate
> > > queued / page-based flushing when the guest  wants to persist data.
> > > The similarity to DAX is not storing guest allocated pages in the
> > > radix but entries that track dirty guest physical addresses.
> > 
> > Let me check whether I understand the problem correctly. So we want to
> > export a block device (essentially a page cache of this block device) to a
> > guest as PMEM and use DAX in the guest to save guest's page cache. The
> 
> that's correct.
> 
> > natural way to make the persistence work would be to make ->flush callback
> > of the PMEM device to do an upcall to the host which could then fdatasync()
> > appropriate image file range however the performance would suck in such
> > case since ->flush gets called for at most one page ranges from DAX.
> 
> Discussion is : sync a range using paravirt device or flush hit addresses 
> vs block device flush.
> 
> > 
> > So what you could do instead is to completely ignore ->flush calls for the
> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
> > PMEM device (generated by blkdev_issue_flush() or the journalling
> > machinery) and fdatasync() the whole image file at that moment - in fact
> > you must do that for metadata IO to hit persistent storage anyway in your
> > setting. This would very closely follow how exporting block devices with
> > volatile cache works with KVM these days AFAIU and the performance will be
> > the same.
> 
> yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
> As per suggestions looks like block flushing device is way ahead. 
> 
> If we do an asynchronous block flush at guest side(put current task in
> wait queue till host side fdatasync completes) can solve the purpose? Or
> do we need another paravirt device for this?

Well, even currently if you have a PMEM device, you still also have a block
device and a request queue associated with it, and metadata IO goes through
that path. So in your case you will have the same in the guest as a result
of exposing a virtual PMEM device to the guest, and you just need to make
sure this virtual block device behaves the same way as traditional
virtualized block devices in KVM in response to 'REQ_OP_WRITE |
REQ_PREFLUSH' requests.
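
I.e. roughly the same dispatch a traditional virtualized block backend
already does; a sketch with made-up request structures (not the real
virtio-blk protocol):

/* Sketch of a host block backend dispatch: ordinary writes go into the
 * host page cache of the image file, and a flush request becomes
 * fdatasync(), exactly as for a traditional virtualized block device
 * with a volatile (writeback) cache.  Request structure and constants
 * are made up for illustration, not the virtio-blk protocol. */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

enum vblk_type { VBLK_WRITE, VBLK_FLUSH };

struct vblk_req {
    enum vblk_type type;
    uint64_t offset;        /* byte offset into the image file */
    const void *buf;
    size_t len;
};

static int handle_req(int img_fd, const struct vblk_req *req)
{
    switch (req->type) {
    case VBLK_WRITE:
        /* Data lands in the host page cache first (volatile). */
        return pwrite(img_fd, req->buf, req->len, req->offset) < 0 ? -1 : 0;
    case VBLK_FLUSH:
        /* Guest REQ_PREFLUSH: make everything written so far durable. */
        return fdatasync(img_fd);
    }
    return -1;
}

int main(void)
{
    int fd = open("disk.img", O_RDWR | O_CREAT, 0644);
    char data[512] = "hello";
    struct vblk_req w = { VBLK_WRITE, 0, data, sizeof(data) };
    struct vblk_req f = { VBLK_FLUSH, 0, NULL, 0 };

    if (fd < 0)
        return 1;
    handle_req(fd, &w);
    handle_req(fd, &f);
    close(fd);
    return 0;
}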

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-24 15:10                           ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-24 15:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, xiaoguangrong eric,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi, linux-nvdimm@lists.01.org, Paolo Bonzini,
	Nitesh Narayan Lal

On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
>>
>> > On Sun 23-07-17 13:10:34, Dan Williams wrote:
>> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
>> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
>> > > >> [ adding Ross and Jan ]
>> > > >>
>> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
>> > > >> wrote:
>> > > >> >
>> > > >> > The goal is to increase density of guests, by moving page
>> > > >> > cache into the host (where it can be easily reclaimed).
>> > > >> >
>> > > >> > If we assume the guests will be backed by relatively fast
>> > > >> > SSDs, a "whole device flush" from filesystem journaling
>> > > >> > code (issued where the filesystem issues a barrier or
>> > > >> > disk cache flush today) may be just what we need to make
>> > > >> > that work.
>> > > >>
>> > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
>> > > >>
>> > > >> However, it still seems like the storage interface is not capable of
>> > > >> expressing what is needed, because the operation that is needed is a
>> > > >> range flush. In the guest you want the DAX page dirty tracking to
>> > > >> communicate range flush information to the host, but there's no
>> > > >> readily available block i/o semantic that software running on top of
>> > > >> the fake pmem device can use to communicate with the host. Instead
>> > > >> you
>> > > >> want to intercept the dax_flush() operation and turn it into a queued
>> > > >> request on the host.
>> > > >>
>> > > >> In 4.13 we have turned this dax_flush() operation into an explicit
>> > > >> driver call. That seems a better interface to modify than trying to
>> > > >> map block-storage flush-cache / force-unit-access commands to this
>> > > >> host request.
>> > > >>
>> > > >> The additional piece you would need to consider is whether to track
>> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
>> > > >> dirtying events, or arrange for every dax_copy_from_iter()
>> > > >> operation()
>> > > >> to also queue a sync on the host, but that essentially turns the host
>> > > >> page cache into a pseudo write-through mode.
>> > > >
>> > > > I suspect initially it will be fine to not offer DAX
>> > > > semantics to applications using these "fake DAX" devices
>> > > > from a virtual machine, because the DAX APIs are designed
>> > > > for a much higher performance device than these fake DAX
>> > > > setups could ever give.
>> > >
>> > > Right, we don't need DAX, per se, in the guest.
>> > >
>> > > >
>> > > > Having userspace call fsync/msync like done normally, and
>> > > > having those coarser calls be turned into somewhat efficient
>> > > > backend flushes would be perfectly acceptable.
>> > > >
>> > > > The big question is, what should that kind of interface look
>> > > > like?
>> > >
>> > > To me, this looks much like the dirty cache tracking that is done in
>> > > the address_space radix for the DAX case, but modified to coordinate
>> > > queued / page-based flushing when the guest  wants to persist data.
>> > > The similarity to DAX is not storing guest allocated pages in the
>> > > radix but entries that track dirty guest physical addresses.
>> >
>> > Let me check whether I understand the problem correctly. So we want to
>> > export a block device (essentially a page cache of this block device) to a
>> > guest as PMEM and use DAX in the guest to save guest's page cache. The
>>
>> that's correct.
>>
>> > natural way to make the persistence work would be to make ->flush callback
>> > of the PMEM device to do an upcall to the host which could then fdatasync()
>> > appropriate image file range however the performance would suck in such
>> > case since ->flush gets called for at most one page ranges from DAX.
>>
>> Discussion is : sync a range using paravirt device or flush hit addresses
>> vs block device flush.
>>
>> >
>> > So what you could do instead is to completely ignore ->flush calls for the
>> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
>> > PMEM device (generated by blkdev_issue_flush() or the journalling
>> > machinery) and fdatasync() the whole image file at that moment - in fact
>> > you must do that for metadata IO to hit persistent storage anyway in your
>> > setting. This would very closely follow how exporting block devices with
>> > volatile cache works with KVM these days AFAIU and the performance will be
>> > the same.
>>
>> yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
>> As per suggestions looks like block flushing device is way ahead.
>>
>> If we do an asynchronous block flush at guest side(put current task in
>> wait queue till host side fdatasync completes) can solve the purpose? Or
>> do we need another paravirt device for this?
>
> Well, even currently if you have PMEM device, you still have also a block
> device and a request queue associated with it and metadata IO goes through
> that path. So in your case you will have the same in the guest as a result
> of exposing virtual PMEM device to the guest and you just need to make sure
> this virtual block device behaves the same way as traditional virtualized
> block devices in KVM in respose to 'REQ_OP_WRITE | REQ_PREFLUSH' requests.

This approach would turn into a full fsync on the host. The question
in my mind is whether there is any optimization to be had by trapping
dax_flush() and calling msync() on host ranges, but Jan is right that
trapping blkdev_issue_flush() and turning around and calling host
fsync() is the most straightforward approach that does not need driver
interface changes. The dax_flush() approach would need to be modified
into an async completion interface.
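
For reference, the two host-side primitives being weighed would look
roughly like this (minimal POSIX sketch; the image name, size, offsets
and lengths are all made up):

/* Sketch of the two host-side options being compared:
 *  - whole-image fdatasync() when the guest issues a block-layer flush;
 *  - range msync() of the mmap'ed image for a trapped dax_flush() range.
 * Plain POSIX; the image name, size, offsets and lengths are made up. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t img_len = 16 << 20;                  /* 16 MiB image */
    int fd = open("disk.img", O_RDWR | O_CREAT, 0644);

    if (fd < 0 || ftruncate(fd, img_len) < 0) {
        perror("open/ftruncate");
        return 1;
    }

    void *img = mmap(NULL, img_len, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (img == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Option 1: guest blkdev_issue_flush() -> sync the whole image file. */
    fdatasync(fd);

    /* Option 2: trapped dax_flush(offset, len) -> sync just that range.
     * msync() requires a page-aligned start address. */
    size_t off = 4096, len = 8192;
    msync((char *)img + off, len, MS_SYNC);

    munmap(img, img_len);
    close(fd);
    return 0;
}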

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-24 15:10                           ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-24 15:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, xiaoguangrong eric,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Paolo Bonzini, Nitesh Narayan Lal

On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote:
> On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
>>
>> > On Sun 23-07-17 13:10:34, Dan Williams wrote:
>> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
>> > > >> [ adding Ross and Jan ]
>> > > >>
>> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> > > >> wrote:
>> > > >> >
>> > > >> > The goal is to increase density of guests, by moving page
>> > > >> > cache into the host (where it can be easily reclaimed).
>> > > >> >
>> > > >> > If we assume the guests will be backed by relatively fast
>> > > >> > SSDs, a "whole device flush" from filesystem journaling
>> > > >> > code (issued where the filesystem issues a barrier or
>> > > >> > disk cache flush today) may be just what we need to make
>> > > >> > that work.
>> > > >>
>> > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
>> > > >>
>> > > >> However, it still seems like the storage interface is not capable of
>> > > >> expressing what is needed, because the operation that is needed is a
>> > > >> range flush. In the guest you want the DAX page dirty tracking to
>> > > >> communicate range flush information to the host, but there's no
>> > > >> readily available block i/o semantic that software running on top of
>> > > >> the fake pmem device can use to communicate with the host. Instead
>> > > >> you
>> > > >> want to intercept the dax_flush() operation and turn it into a queued
>> > > >> request on the host.
>> > > >>
>> > > >> In 4.13 we have turned this dax_flush() operation into an explicit
>> > > >> driver call. That seems a better interface to modify than trying to
>> > > >> map block-storage flush-cache / force-unit-access commands to this
>> > > >> host request.
>> > > >>
>> > > >> The additional piece you would need to consider is whether to track
>> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
>> > > >> dirtying events, or arrange for every dax_copy_from_iter()
>> > > >> operation()
>> > > >> to also queue a sync on the host, but that essentially turns the host
>> > > >> page cache into a pseudo write-through mode.
>> > > >
>> > > > I suspect initially it will be fine to not offer DAX
>> > > > semantics to applications using these "fake DAX" devices
>> > > > from a virtual machine, because the DAX APIs are designed
>> > > > for a much higher performance device than these fake DAX
>> > > > setups could ever give.
>> > >
>> > > Right, we don't need DAX, per se, in the guest.
>> > >
>> > > >
>> > > > Having userspace call fsync/msync like done normally, and
>> > > > having those coarser calls be turned into somewhat efficient
>> > > > backend flushes would be perfectly acceptable.
>> > > >
>> > > > The big question is, what should that kind of interface look
>> > > > like?
>> > >
>> > > To me, this looks much like the dirty cache tracking that is done in
>> > > the address_space radix for the DAX case, but modified to coordinate
>> > > queued / page-based flushing when the guest  wants to persist data.
>> > > The similarity to DAX is not storing guest allocated pages in the
>> > > radix but entries that track dirty guest physical addresses.
>> >
>> > Let me check whether I understand the problem correctly. So we want to
>> > export a block device (essentially a page cache of this block device) to a
>> > guest as PMEM and use DAX in the guest to save guest's page cache. The
>>
>> that's correct.
>>
>> > natural way to make the persistence work would be to make ->flush callback
>> > of the PMEM device to do an upcall to the host which could then fdatasync()
>> > appropriate image file range however the performance would suck in such
>> > case since ->flush gets called for at most one page ranges from DAX.
>>
>> Discussion is : sync a range using paravirt device or flush hit addresses
>> vs block device flush.
>>
>> >
>> > So what you could do instead is to completely ignore ->flush calls for the
>> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
>> > PMEM device (generated by blkdev_issue_flush() or the journalling
>> > machinery) and fdatasync() the whole image file at that moment - in fact
>> > you must do that for metadata IO to hit persistent storage anyway in your
>> > setting. This would very closely follow how exporting block devices with
>> > volatile cache works with KVM these days AFAIU and the performance will be
>> > the same.
>>
>> yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
>> As per suggestions looks like block flushing device is way ahead.
>>
>> If we do an asynchronous block flush at guest side(put current task in
>> wait queue till host side fdatasync completes) can solve the purpose? Or
>> do we need another paravirt device for this?
>
> Well, even currently if you have PMEM device, you still have also a block
> device and a request queue associated with it and metadata IO goes through
> that path. So in your case you will have the same in the guest as a result
> of exposing virtual PMEM device to the guest and you just need to make sure
> this virtual block device behaves the same way as traditional virtualized
> block devices in KVM in respose to 'REQ_OP_WRITE | REQ_PREFLUSH' requests.

This approach would turn into a full fsync on the host. The question
in my mind is whether there is any optimization to be had by trapping
dax_flush() and calling msync() on host ranges, but Jan is right
trapping blkdev_issue_flush() and turning around and calling host
fsync() is the most straightforward approach that does not need driver
interface changes. The dax_flush() approach would need to modify it
into a async completion interface.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-24 15:10                           ` Dan Williams
  (?)
@ 2017-07-24 15:48                             ` Jan Kara
  -1 siblings, 0 replies; 176+ messages in thread
From: Jan Kara @ 2017-07-24 15:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara,
	xiaoguangrong eric, kvm-devel, Stefan Hajnoczi, Ross Zwisler,
	Qemu Developers, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal

On Mon 24-07-17 08:10:05, Dan Williams wrote:
> On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
> >>
> >> > On Sun 23-07-17 13:10:34, Dan Williams wrote:
> >> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
> >> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> >> > > >> [ adding Ross and Jan ]
> >> > > >>
> >> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
> >> > > >> wrote:
> >> > > >> >
> >> > > >> > The goal is to increase density of guests, by moving page
> >> > > >> > cache into the host (where it can be easily reclaimed).
> >> > > >> >
> >> > > >> > If we assume the guests will be backed by relatively fast
> >> > > >> > SSDs, a "whole device flush" from filesystem journaling
> >> > > >> > code (issued where the filesystem issues a barrier or
> >> > > >> > disk cache flush today) may be just what we need to make
> >> > > >> > that work.
> >> > > >>
> >> > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
> >> > > >>
> >> > > >> However, it still seems like the storage interface is not capable of
> >> > > >> expressing what is needed, because the operation that is needed is a
> >> > > >> range flush. In the guest you want the DAX page dirty tracking to
> >> > > >> communicate range flush information to the host, but there's no
> >> > > >> readily available block i/o semantic that software running on top of
> >> > > >> the fake pmem device can use to communicate with the host. Instead
> >> > > >> you
> >> > > >> want to intercept the dax_flush() operation and turn it into a queued
> >> > > >> request on the host.
> >> > > >>
> >> > > >> In 4.13 we have turned this dax_flush() operation into an explicit
> >> > > >> driver call. That seems a better interface to modify than trying to
> >> > > >> map block-storage flush-cache / force-unit-access commands to this
> >> > > >> host request.
> >> > > >>
> >> > > >> The additional piece you would need to consider is whether to track
> >> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
> >> > > >> dirtying events, or arrange for every dax_copy_from_iter()
> >> > > >> operation()
> >> > > >> to also queue a sync on the host, but that essentially turns the host
> >> > > >> page cache into a pseudo write-through mode.
> >> > > >
> >> > > > I suspect initially it will be fine to not offer DAX
> >> > > > semantics to applications using these "fake DAX" devices
> >> > > > from a virtual machine, because the DAX APIs are designed
> >> > > > for a much higher performance device than these fake DAX
> >> > > > setups could ever give.
> >> > >
> >> > > Right, we don't need DAX, per se, in the guest.
> >> > >
> >> > > >
> >> > > > Having userspace call fsync/msync like done normally, and
> >> > > > having those coarser calls be turned into somewhat efficient
> >> > > > backend flushes would be perfectly acceptable.
> >> > > >
> >> > > > The big question is, what should that kind of interface look
> >> > > > like?
> >> > >
> >> > > To me, this looks much like the dirty cache tracking that is done in
> >> > > the address_space radix for the DAX case, but modified to coordinate
> >> > > queued / page-based flushing when the guest  wants to persist data.
> >> > > The similarity to DAX is not storing guest allocated pages in the
> >> > > radix but entries that track dirty guest physical addresses.
> >> >
> >> > Let me check whether I understand the problem correctly. So we want to
> >> > export a block device (essentially a page cache of this block device) to a
> >> > guest as PMEM and use DAX in the guest to save guest's page cache. The
> >>
> >> that's correct.
> >>
> >> > natural way to make the persistence work would be to make ->flush callback
> >> > of the PMEM device to do an upcall to the host which could then fdatasync()
> >> > appropriate image file range however the performance would suck in such
> >> > case since ->flush gets called for at most one page ranges from DAX.
> >>
> >> Discussion is : sync a range using paravirt device or flush hit addresses
> >> vs block device flush.
> >>
> >> >
> >> > So what you could do instead is to completely ignore ->flush calls for the
> >> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
> >> > PMEM device (generated by blkdev_issue_flush() or the journalling
> >> > machinery) and fdatasync() the whole image file at that moment - in fact
> >> > you must do that for metadata IO to hit persistent storage anyway in your
> >> > setting. This would very closely follow how exporting block devices with
> >> > volatile cache works with KVM these days AFAIU and the performance will be
> >> > the same.
> >>
> >> yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
> >> As per suggestions looks like block flushing device is way ahead.
> >>
> >> If we do an asynchronous block flush at guest side(put current task in
> >> wait queue till host side fdatasync completes) can solve the purpose? Or
> >> do we need another paravirt device for this?
> >
> > Well, even currently if you have PMEM device, you still have also a block
> > device and a request queue associated with it and metadata IO goes through
> > that path. So in your case you will have the same in the guest as a result
> > of exposing virtual PMEM device to the guest and you just need to make sure
> > this virtual block device behaves the same way as traditional virtualized
> > block devices in KVM in respose to 'REQ_OP_WRITE | REQ_PREFLUSH' requests.
> 
> This approach would turn into a full fsync on the host. The question
> in my mind is whether there is any optimization to be had by trapping
> dax_flush() and calling msync() on host ranges, but Jan is right
> trapping blkdev_issue_flush() and turning around and calling host
> fsync() is the most straightforward approach that does not need driver
> interface changes. The dax_flush() approach would need to modify it
> into a async completion interface.

If the backing device on the host is actually a normal block device or an
image file, doing a full fsync() is the most efficient implementation
anyway...
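
As a minimal sketch of that host side (handle_guest_flush() and
image_fd are assumptions for illustration, not QEMU interfaces),
handling a guest flush request could be as simple as:

#include <stdio.h>
#include <unistd.h>

/* sketch only: image_fd stands for the already-open backing image file
 * or block device */
static int handle_guest_flush(int image_fd)
{
        /* a single whole-file sync covers image files and raw block devices */
        if (fdatasync(image_fd) < 0) {
                perror("fdatasync");
                return -1;
        }
        return 0;    /* then signal completion back to the guest */
}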

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-24 15:48                             ` Jan Kara
  0 siblings, 0 replies; 176+ messages in thread
From: Jan Kara @ 2017-07-24 15:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Pankaj Gupta, Rik van Riel, Stefan Hajnoczi,
	Stefan Hajnoczi, kvm-devel, Qemu Developers,
	linux-nvdimm@lists.01.org, ross zwisler, Paolo Bonzini,
	Kevin Wolf, Nitesh Narayan Lal, xiaoguangrong eric,
	Haozhong Zhang, Ross Zwisler

On Mon 24-07-17 08:10:05, Dan Williams wrote:
> On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
> >>
> >> > On Sun 23-07-17 13:10:34, Dan Williams wrote:
> >> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
> >> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> >> > > >> [ adding Ross and Jan ]
> >> > > >>
> >> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
> >> > > >> wrote:
> >> > > >> >
> >> > > >> > The goal is to increase density of guests, by moving page
> >> > > >> > cache into the host (where it can be easily reclaimed).
> >> > > >> >
> >> > > >> > If we assume the guests will be backed by relatively fast
> >> > > >> > SSDs, a "whole device flush" from filesystem journaling
> >> > > >> > code (issued where the filesystem issues a barrier or
> >> > > >> > disk cache flush today) may be just what we need to make
> >> > > >> > that work.
> >> > > >>
> >> > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
> >> > > >>
> >> > > >> However, it still seems like the storage interface is not capable of
> >> > > >> expressing what is needed, because the operation that is needed is a
> >> > > >> range flush. In the guest you want the DAX page dirty tracking to
> >> > > >> communicate range flush information to the host, but there's no
> >> > > >> readily available block i/o semantic that software running on top of
> >> > > >> the fake pmem device can use to communicate with the host. Instead
> >> > > >> you
> >> > > >> want to intercept the dax_flush() operation and turn it into a queued
> >> > > >> request on the host.
> >> > > >>
> >> > > >> In 4.13 we have turned this dax_flush() operation into an explicit
> >> > > >> driver call. That seems a better interface to modify than trying to
> >> > > >> map block-storage flush-cache / force-unit-access commands to this
> >> > > >> host request.
> >> > > >>
> >> > > >> The additional piece you would need to consider is whether to track
> >> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
> >> > > >> dirtying events, or arrange for every dax_copy_from_iter()
> >> > > >> operation()
> >> > > >> to also queue a sync on the host, but that essentially turns the host
> >> > > >> page cache into a pseudo write-through mode.
> >> > > >
> >> > > > I suspect initially it will be fine to not offer DAX
> >> > > > semantics to applications using these "fake DAX" devices
> >> > > > from a virtual machine, because the DAX APIs are designed
> >> > > > for a much higher performance device than these fake DAX
> >> > > > setups could ever give.
> >> > >
> >> > > Right, we don't need DAX, per se, in the guest.
> >> > >
> >> > > >
> >> > > > Having userspace call fsync/msync like done normally, and
> >> > > > having those coarser calls be turned into somewhat efficient
> >> > > > backend flushes would be perfectly acceptable.
> >> > > >
> >> > > > The big question is, what should that kind of interface look
> >> > > > like?
> >> > >
> >> > > To me, this looks much like the dirty cache tracking that is done in
> >> > > the address_space radix for the DAX case, but modified to coordinate
> >> > > queued / page-based flushing when the guest  wants to persist data.
> >> > > The similarity to DAX is not storing guest allocated pages in the
> >> > > radix but entries that track dirty guest physical addresses.
> >> >
> >> > Let me check whether I understand the problem correctly. So we want to
> >> > export a block device (essentially a page cache of this block device) to a
> >> > guest as PMEM and use DAX in the guest to save guest's page cache. The
> >>
> >> that's correct.
> >>
> >> > natural way to make the persistence work would be to make ->flush callback
> >> > of the PMEM device to do an upcall to the host which could then fdatasync()
> >> > appropriate image file range however the performance would suck in such
> >> > case since ->flush gets called for at most one page ranges from DAX.
> >>
> >> Discussion is : sync a range using paravirt device or flush hit addresses
> >> vs block device flush.
> >>
> >> >
> >> > So what you could do instead is to completely ignore ->flush calls for the
> >> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
> >> > PMEM device (generated by blkdev_issue_flush() or the journalling
> >> > machinery) and fdatasync() the whole image file at that moment - in fact
> >> > you must do that for metadata IO to hit persistent storage anyway in your
> >> > setting. This would very closely follow how exporting block devices with
> >> > volatile cache works with KVM these days AFAIU and the performance will be
> >> > the same.
> >>
> >> yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
> >> As per suggestions looks like block flushing device is way ahead.
> >>
> >> If we do an asynchronous block flush at guest side(put current task in
> >> wait queue till host side fdatasync completes) can solve the purpose? Or
> >> do we need another paravirt device for this?
> >
> > Well, even currently if you have PMEM device, you still have also a block
> > device and a request queue associated with it and metadata IO goes through
> > that path. So in your case you will have the same in the guest as a result
> > of exposing virtual PMEM device to the guest and you just need to make sure
> > this virtual block device behaves the same way as traditional virtualized
> > block devices in KVM in respose to 'REQ_OP_WRITE | REQ_PREFLUSH' requests.
> 
> This approach would turn into a full fsync on the host. The question
> in my mind is whether there is any optimization to be had by trapping
> dax_flush() and calling msync() on host ranges, but Jan is right
> trapping blkdev_issue_flush() and turning around and calling host
> fsync() is the most straightforward approach that does not need driver
> interface changes. The dax_flush() approach would need to modify it
> into a async completion interface.

If the backing device on the host is actually a normal block device or an
image file, doing full fsync() is the most efficient implementation
anyway...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
@ 2017-07-24 15:48                             ` Jan Kara
  0 siblings, 0 replies; 176+ messages in thread
From: Jan Kara @ 2017-07-24 15:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Pankaj Gupta, Rik van Riel, Stefan Hajnoczi,
	Stefan Hajnoczi, kvm-devel, Qemu Developers,
	linux-nvdimm@lists.01.org, ross zwisler, Paolo Bonzini,
	Kevin Wolf, Nitesh Narayan Lal, xiaoguangrong eric,
	Haozhong Zhang, Ross Zwisler

On Mon 24-07-17 08:10:05, Dan Williams wrote:
> On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
> >>
> >> > On Sun 23-07-17 13:10:34, Dan Williams wrote:
> >> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
> >> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> >> > > >> [ adding Ross and Jan ]
> >> > > >>
> >> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
> >> > > >> wrote:
> >> > > >> >
> >> > > >> > The goal is to increase density of guests, by moving page
> >> > > >> > cache into the host (where it can be easily reclaimed).
> >> > > >> >
> >> > > >> > If we assume the guests will be backed by relatively fast
> >> > > >> > SSDs, a "whole device flush" from filesystem journaling
> >> > > >> > code (issued where the filesystem issues a barrier or
> >> > > >> > disk cache flush today) may be just what we need to make
> >> > > >> > that work.
> >> > > >>
> >> > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
> >> > > >>
> >> > > >> However, it still seems like the storage interface is not capable of
> >> > > >> expressing what is needed, because the operation that is needed is a
> >> > > >> range flush. In the guest you want the DAX page dirty tracking to
> >> > > >> communicate range flush information to the host, but there's no
> >> > > >> readily available block i/o semantic that software running on top of
> >> > > >> the fake pmem device can use to communicate with the host. Instead
> >> > > >> you
> >> > > >> want to intercept the dax_flush() operation and turn it into a queued
> >> > > >> request on the host.
> >> > > >>
> >> > > >> In 4.13 we have turned this dax_flush() operation into an explicit
> >> > > >> driver call. That seems a better interface to modify than trying to
> >> > > >> map block-storage flush-cache / force-unit-access commands to this
> >> > > >> host request.
> >> > > >>
> >> > > >> The additional piece you would need to consider is whether to track
> >> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
> >> > > >> dirtying events, or arrange for every dax_copy_from_iter()
> >> > > >> operation()
> >> > > >> to also queue a sync on the host, but that essentially turns the host
> >> > > >> page cache into a pseudo write-through mode.
> >> > > >
> >> > > > I suspect initially it will be fine to not offer DAX
> >> > > > semantics to applications using these "fake DAX" devices
> >> > > > from a virtual machine, because the DAX APIs are designed
> >> > > > for a much higher performance device than these fake DAX
> >> > > > setups could ever give.
> >> > >
> >> > > Right, we don't need DAX, per se, in the guest.
> >> > >
> >> > > >
> >> > > > Having userspace call fsync/msync like done normally, and
> >> > > > having those coarser calls be turned into somewhat efficient
> >> > > > backend flushes would be perfectly acceptable.
> >> > > >
> >> > > > The big question is, what should that kind of interface look
> >> > > > like?
> >> > >
> >> > > To me, this looks much like the dirty cache tracking that is done in
> >> > > the address_space radix for the DAX case, but modified to coordinate
> >> > > queued / page-based flushing when the guest  wants to persist data.
> >> > > The similarity to DAX is not storing guest allocated pages in the
> >> > > radix but entries that track dirty guest physical addresses.
> >> >
> >> > Let me check whether I understand the problem correctly. So we want to
> >> > export a block device (essentially a page cache of this block device) to a
> >> > guest as PMEM and use DAX in the guest to save guest's page cache. The
> >>
> >> that's correct.
> >>
> >> > natural way to make the persistence work would be to make ->flush callback
> >> > of the PMEM device to do an upcall to the host which could then fdatasync()
> >> > appropriate image file range however the performance would suck in such
> >> > case since ->flush gets called for at most one page ranges from DAX.
> >>
> >> Discussion is : sync a range using paravirt device or flush hit addresses
> >> vs block device flush.
> >>
> >> >
> >> > So what you could do instead is to completely ignore ->flush calls for the
> >> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
> >> > PMEM device (generated by blkdev_issue_flush() or the journalling
> >> > machinery) and fdatasync() the whole image file at that moment - in fact
> >> > you must do that for metadata IO to hit persistent storage anyway in your
> >> > setting. This would very closely follow how exporting block devices with
> >> > volatile cache works with KVM these days AFAIU and the performance will be
> >> > the same.
> >>
> >> yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
> >> As per suggestions looks like block flushing device is way ahead.
> >>
> >> If we do an asynchronous block flush at guest side(put current task in
> >> wait queue till host side fdatasync completes) can solve the purpose? Or
> >> do we need another paravirt device for this?
> >
> > Well, even currently if you have PMEM device, you still have also a block
> > device and a request queue associated with it and metadata IO goes through
> > that path. So in your case you will have the same in the guest as a result
> > of exposing virtual PMEM device to the guest and you just need to make sure
> > this virtual block device behaves the same way as traditional virtualized
> > block devices in KVM in respose to 'REQ_OP_WRITE | REQ_PREFLUSH' requests.
> 
> This approach would turn into a full fsync on the host. The question
> in my mind is whether there is any optimization to be had by trapping
> dax_flush() and calling msync() on host ranges, but Jan is right
> trapping blkdev_issue_flush() and turning around and calling host
> fsync() is the most straightforward approach that does not need driver
> interface changes. The dax_flush() approach would need to modify it
> into a async completion interface.

If the backing device on the host is actually a normal block device or an
image file, doing full fsync() is the most efficient implementation
anyway...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-24 15:48                             ` Jan Kara
  (?)
@ 2017-07-24 16:19                               ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-24 16:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, xiaoguangrong eric,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi, linux-nvdimm@lists.01.org, Paolo Bonzini,
	Nitesh Narayan Lal

On Mon, Jul 24, 2017 at 8:48 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 24-07-17 08:10:05, Dan Williams wrote:
>> On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara <jack@suse.cz> wrote:
[..]
>> This approach would turn into a full fsync on the host. The question
>> in my mind is whether there is any optimization to be had by trapping
>> dax_flush() and calling msync() on host ranges, but Jan is right
>> trapping blkdev_issue_flush() and turning around and calling host
>> fsync() is the most straightforward approach that does not need driver
>> interface changes. The dax_flush() approach would need to modify it
>> into a async completion interface.
>
> If the backing device on the host is actually a normal block device or an
> image file, doing full fsync() is the most efficient implementation
> anyway...

Ah, ok, great. That was the gap in my understanding.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-24 12:37                         ` Jan Kara
  (?)
@ 2017-07-25 14:27                           ` Pankaj Gupta
  -1 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-07-25 14:27 UTC (permalink / raw)
  To: Jan Kara, Dan Williams
  Cc: Kevin Wolf, Rik van Riel, xiaoguangrong eric, kvm-devel,
	linux-nvdimm@lists.01.org, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi, Stefan Hajnoczi, Paolo Bonzini,
	Nitesh Narayan Lal


> Subject: Re: KVM "fake DAX" flushing interface - discussion
> 
> On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
> > 
> > > On Sun 23-07-17 13:10:34, Dan Williams wrote:
> > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
> > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> > > > >> [ adding Ross and Jan ]
> > > > >>
> > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
> > > > >> wrote:
> > > > >> >
> > > > >> > The goal is to increase density of guests, by moving page
> > > > >> > cache into the host (where it can be easily reclaimed).
> > > > >> >
> > > > >> > If we assume the guests will be backed by relatively fast
> > > > >> > SSDs, a "whole device flush" from filesystem journaling
> > > > >> > code (issued where the filesystem issues a barrier or
> > > > >> > disk cache flush today) may be just what we need to make
> > > > >> > that work.
> > > > >>
> > > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
> > > > >>
> > > > >> However, it still seems like the storage interface is not capable of
> > > > >> expressing what is needed, because the operation that is needed is a
> > > > >> range flush. In the guest you want the DAX page dirty tracking to
> > > > >> communicate range flush information to the host, but there's no
> > > > >> readily available block i/o semantic that software running on top of
> > > > >> the fake pmem device can use to communicate with the host. Instead
> > > > >> you
> > > > >> want to intercept the dax_flush() operation and turn it into a
> > > > >> queued
> > > > >> request on the host.
> > > > >>
> > > > >> In 4.13 we have turned this dax_flush() operation into an explicit
> > > > >> driver call. That seems a better interface to modify than trying to
> > > > >> map block-storage flush-cache / force-unit-access commands to this
> > > > >> host request.
> > > > >>
> > > > >> The additional piece you would need to consider is whether to track
> > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
> > > > >> dirtying events, or arrange for every dax_copy_from_iter()
> > > > >> operation()
> > > > >> to also queue a sync on the host, but that essentially turns the
> > > > >> host
> > > > >> page cache into a pseudo write-through mode.
> > > > >
> > > > > I suspect initially it will be fine to not offer DAX
> > > > > semantics to applications using these "fake DAX" devices
> > > > > from a virtual machine, because the DAX APIs are designed
> > > > > for a much higher performance device than these fake DAX
> > > > > setups could ever give.
> > > > 
> > > > Right, we don't need DAX, per se, in the guest.
> > > > 
> > > > >
> > > > > Having userspace call fsync/msync like done normally, and
> > > > > having those coarser calls be turned into somewhat efficient
> > > > > backend flushes would be perfectly acceptable.
> > > > >
> > > > > The big question is, what should that kind of interface look
> > > > > like?
> > > > 
> > > > To me, this looks much like the dirty cache tracking that is done in
> > > > the address_space radix for the DAX case, but modified to coordinate
> > > > queued / page-based flushing when the guest  wants to persist data.
> > > > The similarity to DAX is not storing guest allocated pages in the
> > > > radix but entries that track dirty guest physical addresses.
> > > 
> > > Let me check whether I understand the problem correctly. So we want to
> > > export a block device (essentially a page cache of this block device) to
> > > a
> > > guest as PMEM and use DAX in the guest to save guest's page cache. The
> > 
> > that's correct.
> > 
> > > natural way to make the persistence work would be to make ->flush
> > > callback
> > > of the PMEM device to do an upcall to the host which could then
> > > fdatasync()
> > > appropriate image file range however the performance would suck in such
> > > case since ->flush gets called for at most one page ranges from DAX.
> > 
> > Discussion is : sync a range using paravirt device or flush hit addresses
> > vs block device flush.
> > 
> > > 
> > > So what you could do instead is to completely ignore ->flush calls for
> > > the
> > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
> > > PMEM device (generated by blkdev_issue_flush() or the journalling
> > > machinery) and fdatasync() the whole image file at that moment - in fact
> > > you must do that for metadata IO to hit persistent storage anyway in your
> > > setting. This would very closely follow how exporting block devices with
> > > volatile cache works with KVM these days AFAIU and the performance will
> > > be
> > > the same.
> > 
> > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
> > As per suggestions looks like block flushing device is way ahead.
> > 
> > If we do an asynchronous block flush at guest side(put current task in
> > wait queue till host side fdatasync completes) can solve the purpose? Or
> > do we need another paravirt device for this?
> 
> Well, even currently if you have PMEM device, you still have also a block
> device and a request queue associated with it and metadata IO goes through
> that path. So in your case you will have the same in the guest as a result
> of exposing virtual PMEM device to the guest and you just need to make sure
> this virtual block device behaves the same way as traditional virtualized
> block devices in KVM in respose to 'REQ_OP_WRITE | REQ_PREFLUSH' requests.

Looks like the only way to send a block device flush from the guest to the
host with an NVDIMM is by using flush hint addresses. Is this the correct
interface I should be looking at?

blkdev_issue_flush
 submit_bio_wait
  submit_bio
    generic_make_request
      pmem_make_request
      ...
           if (bio->bi_opf & REQ_FLUSH)
                nvdimm_flush(nd_region);

      ...

Thanks,
Pankaj
> 
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-25 14:46                             ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-25 14:46 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, xiaoguangrong eric,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi, linux-nvdimm@lists.01.org, Paolo Bonzini,
	Nitesh Narayan Lal

On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
>
>> Subject: Re: KVM "fake DAX" flushing interface - discussion
>>
>> On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
>> >
>> > > On Sun 23-07-17 13:10:34, Dan Williams wrote:
>> > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
>> > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
>> > > > >> [ adding Ross and Jan ]
>> > > > >>
>> > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
>> > > > >> wrote:
>> > > > >> >
>> > > > >> > The goal is to increase density of guests, by moving page
>> > > > >> > cache into the host (where it can be easily reclaimed).
>> > > > >> >
>> > > > >> > If we assume the guests will be backed by relatively fast
>> > > > >> > SSDs, a "whole device flush" from filesystem journaling
>> > > > >> > code (issued where the filesystem issues a barrier or
>> > > > >> > disk cache flush today) may be just what we need to make
>> > > > >> > that work.
>> > > > >>
>> > > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
>> > > > >>
>> > > > >> However, it still seems like the storage interface is not capable of
>> > > > >> expressing what is needed, because the operation that is needed is a
>> > > > >> range flush. In the guest you want the DAX page dirty tracking to
>> > > > >> communicate range flush information to the host, but there's no
>> > > > >> readily available block i/o semantic that software running on top of
>> > > > >> the fake pmem device can use to communicate with the host. Instead
>> > > > >> you
>> > > > >> want to intercept the dax_flush() operation and turn it into a
>> > > > >> queued
>> > > > >> request on the host.
>> > > > >>
>> > > > >> In 4.13 we have turned this dax_flush() operation into an explicit
>> > > > >> driver call. That seems a better interface to modify than trying to
>> > > > >> map block-storage flush-cache / force-unit-access commands to this
>> > > > >> host request.
>> > > > >>
>> > > > >> The additional piece you would need to consider is whether to track
>> > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
>> > > > >> dirtying events, or arrange for every dax_copy_from_iter()
>> > > > >> operation()
>> > > > >> to also queue a sync on the host, but that essentially turns the
>> > > > >> host
>> > > > >> page cache into a pseudo write-through mode.
>> > > > >
>> > > > > I suspect initially it will be fine to not offer DAX
>> > > > > semantics to applications using these "fake DAX" devices
>> > > > > from a virtual machine, because the DAX APIs are designed
>> > > > > for a much higher performance device than these fake DAX
>> > > > > setups could ever give.
>> > > >
>> > > > Right, we don't need DAX, per se, in the guest.
>> > > >
>> > > > >
>> > > > > Having userspace call fsync/msync like done normally, and
>> > > > > having those coarser calls be turned into somewhat efficient
>> > > > > backend flushes would be perfectly acceptable.
>> > > > >
>> > > > > The big question is, what should that kind of interface look
>> > > > > like?
>> > > >
>> > > > To me, this looks much like the dirty cache tracking that is done in
>> > > > the address_space radix for the DAX case, but modified to coordinate
>> > > > queued / page-based flushing when the guest  wants to persist data.
>> > > > The similarity to DAX is not storing guest allocated pages in the
>> > > > radix but entries that track dirty guest physical addresses.
>> > >
>> > > Let me check whether I understand the problem correctly. So we want to
>> > > export a block device (essentially a page cache of this block device) to
>> > > a
>> > > guest as PMEM and use DAX in the guest to save guest's page cache. The
>> >
>> > that's correct.
>> >
>> > > natural way to make the persistence work would be to make ->flush
>> > > callback
>> > > of the PMEM device to do an upcall to the host which could then
>> > > fdatasync()
>> > > appropriate image file range however the performance would suck in such
>> > > case since ->flush gets called for at most one page ranges from DAX.
>> >
>> > Discussion is : sync a range using paravirt device or flush hit addresses
>> > vs block device flush.
>> >
>> > >
>> > > So what you could do instead is to completely ignore ->flush calls for
>> > > the
>> > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
>> > > PMEM device (generated by blkdev_issue_flush() or the journalling
>> > > machinery) and fdatasync() the whole image file at that moment - in fact
>> > > you must do that for metadata IO to hit persistent storage anyway in your
>> > > setting. This would very closely follow how exporting block devices with
>> > > volatile cache works with KVM these days AFAIU and the performance will
>> > > be
>> > > the same.
>> >
>> > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
>> > As per suggestions looks like block flushing device is way ahead.
>> >
>> > If we do an asynchronous block flush at guest side(put current task in
>> > wait queue till host side fdatasync completes) can solve the purpose? Or
>> > do we need another paravirt device for this?
>>
>> Well, even currently if you have a PMEM device, you still also have a block
>> device and a request queue associated with it, and metadata IO goes through
>> that path. So in your case you will have the same in the guest as a result
>> of exposing a virtual PMEM device to the guest, and you just need to make sure
>> this virtual block device behaves the same way as traditional virtualized
>> block devices in KVM in response to 'REQ_OP_WRITE | REQ_PREFLUSH' requests.
>
> Looks like the only way to send a flush (blk dev) from the guest to the host
> with nvdimm is using flush hint addresses. Is this the correct interface I am
> looking at?
>
> blkdev_issue_flush
>  submit_bio_wait
>   submit_bio
>     generic_make_request
>       pmem_make_request
>       ...
>            if (bio->bi_opf & REQ_FLUSH)
>                 nvdimm_flush(nd_region);

I would inject a paravirtualized version of pmem_make_request() that
sends an async flush operation over virtio to the host. Don't try to
use flush hint addresses for this; they don't have the proper
semantics. The guest should be allowed to issue the flush and receive
the completion asynchronously rather than taking a VM exit and
blocking on that request.
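
A minimal sketch of what such a paravirtualized request handler could look
like is below. The virtio_pmem name and the request-queueing and read/write
helpers are assumptions made up for illustration, not an existing driver or
virtio interface; only the bio flags and block-layer calls (REQ_PREFLUSH,
bio_endio(), bio_io_error(), BLK_QC_T_NONE) are existing 4.13-era kernel
interfaces.

/* Hedged sketch: hypothetical virtio-pmem make_request() that turns an
 * empty REQ_PREFLUSH bio (as sent by blkdev_issue_flush()) into an async
 * virtio request instead of a write to a flush hint address. */
struct virtio_pmem;					/* assumed driver state */
int virtio_pmem_queue_flush(struct virtio_pmem *vpmem, struct bio *bio);
void virtio_pmem_do_rw(struct virtio_pmem *vpmem, struct bio *bio);

static blk_qc_t virtio_pmem_make_request(struct request_queue *q,
					 struct bio *bio)
{
	struct virtio_pmem *vpmem = q->queuedata;

	if (bio->bi_opf & REQ_PREFLUSH) {
		/*
		 * Queue the flush on the virtqueue and return.  The host
		 * fdatasync()s the backing file and the bio is completed
		 * from the virtqueue callback, so the guest task sleeps in
		 * normal bio completion instead of blocking in a VM exit.
		 * A complete driver would also handle REQ_PREFLUSH/REQ_FUA
		 * set on bios that carry data.
		 */
		if (virtio_pmem_queue_flush(vpmem, bio))
			bio_io_error(bio);
		return BLK_QC_T_NONE;
	}

	/* Reads and writes still go straight to the DAX-mapped region. */
	virtio_pmem_do_rw(vpmem, bio);
	bio_endio(bio);
	return BLK_QC_T_NONE;
}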

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-25 14:46                             ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-25 14:46 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, xiaoguangrong eric,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Paolo Bonzini, Nitesh Narayan Lal

On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta <pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
>> Subject: Re: KVM "fake DAX" flushing interface - discussion
>>
>> On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
>> >
>> > > On Sun 23-07-17 13:10:34, Dan Williams wrote:
>> > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
>> > > > >> [ adding Ross and Jan ]
>> > > > >>
>> > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> > > > >> wrote:
>> > > > >> >
>> > > > >> > The goal is to increase density of guests, by moving page
>> > > > >> > cache into the host (where it can be easily reclaimed).
>> > > > >> >
>> > > > >> > If we assume the guests will be backed by relatively fast
>> > > > >> > SSDs, a "whole device flush" from filesystem journaling
>> > > > >> > code (issued where the filesystem issues a barrier or
>> > > > >> > disk cache flush today) may be just what we need to make
>> > > > >> > that work.
>> > > > >>
>> > > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
>> > > > >>
>> > > > >> However, it still seems like the storage interface is not capable of
>> > > > >> expressing what is needed, because the operation that is needed is a
>> > > > >> range flush. In the guest you want the DAX page dirty tracking to
>> > > > >> communicate range flush information to the host, but there's no
>> > > > >> readily available block i/o semantic that software running on top of
>> > > > >> the fake pmem device can use to communicate with the host. Instead
>> > > > >> you
>> > > > >> want to intercept the dax_flush() operation and turn it into a
>> > > > >> queued
>> > > > >> request on the host.
>> > > > >>
>> > > > >> In 4.13 we have turned this dax_flush() operation into an explicit
>> > > > >> driver call. That seems a better interface to modify than trying to
>> > > > >> map block-storage flush-cache / force-unit-access commands to this
>> > > > >> host request.
>> > > > >>
>> > > > >> The additional piece you would need to consider is whether to track
>> > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
>> > > > >> dirtying events, or arrange for every dax_copy_from_iter()
>> > > > >> operation()
>> > > > >> to also queue a sync on the host, but that essentially turns the
>> > > > >> host
>> > > > >> page cache into a pseudo write-through mode.
>> > > > >
>> > > > > I suspect initially it will be fine to not offer DAX
>> > > > > semantics to applications using these "fake DAX" devices
>> > > > > from a virtual machine, because the DAX APIs are designed
>> > > > > for a much higher performance device than these fake DAX
>> > > > > setups could ever give.
>> > > >
>> > > > Right, we don't need DAX, per se, in the guest.
>> > > >
>> > > > >
>> > > > > Having userspace call fsync/msync like done normally, and
>> > > > > having those coarser calls be turned into somewhat efficient
>> > > > > backend flushes would be perfectly acceptable.
>> > > > >
>> > > > > The big question is, what should that kind of interface look
>> > > > > like?
>> > > >
>> > > > To me, this looks much like the dirty cache tracking that is done in
>> > > > the address_space radix for the DAX case, but modified to coordinate
>> > > > queued / page-based flushing when the guest  wants to persist data.
>> > > > The similarity to DAX is not storing guest allocated pages in the
>> > > > radix but entries that track dirty guest physical addresses.
>> > >
>> > > Let me check whether I understand the problem correctly. So we want to
>> > > export a block device (essentially a page cache of this block device) to
>> > > a
>> > > guest as PMEM and use DAX in the guest to save guest's page cache. The
>> >
>> > that's correct.
>> >
>> > > natural way to make the persistence work would be to make ->flush
>> > > callback
>> > > of the PMEM device to do an upcall to the host which could then
>> > > fdatasync()
>> > > appropriate image file range however the performance would suck in such
>> > > case since ->flush gets called for at most one page ranges from DAX.
>> >
>> > Discussion is : sync a range using paravirt device or flush hit addresses
>> > vs block device flush.
>> >
>> > >
>> > > So what you could do instead is to completely ignore ->flush calls for
>> > > the
>> > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
>> > > PMEM device (generated by blkdev_issue_flush() or the journalling
>> > > machinery) and fdatasync() the whole image file at that moment - in fact
>> > > you must do that for metadata IO to hit persistent storage anyway in your
>> > > setting. This would very closely follow how exporting block devices with
>> > > volatile cache works with KVM these days AFAIU and the performance will
>> > > be
>> > > the same.
>> >
>> > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
>> > As per suggestions looks like block flushing device is way ahead.
>> >
>> > If we do an asynchronous block flush at guest side(put current task in
>> > wait queue till host side fdatasync completes) can solve the purpose? Or
>> > do we need another paravirt device for this?
>>
>> Well, even currently if you have PMEM device, you still have also a block
>> device and a request queue associated with it and metadata IO goes through
>> that path. So in your case you will have the same in the guest as a result
>> of exposing virtual PMEM device to the guest and you just need to make sure
>> this virtual block device behaves the same way as traditional virtualized
>> block devices in KVM in respose to 'REQ_OP_WRITE | REQ_PREFLUSH' requests.
>
> Looks like only way to send flush(blk dev) from guest to host with nvdimm
> is using flush hint addresses. Is this the correct interface I am looking?
>
> blkdev_issue_flush
>  submit_bio_wait
>   submit_bio
>     generic_make_request
>       pmem_make_request
>       ...
>            if (bio->bi_opf & REQ_FLUSH)
>                 nvdimm_flush(nd_region);

I would inject a paravirtualized version of pmem_make_request() that
sends an async flush operation over virtio to the host. Don't try to
use flush hint addresses for this, they don't have the proper
semantics. The guest should be allowed to issue the flush and receive
the completion asynchronously rather than taking a vm exist and
blocking on that request.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
@ 2017-07-25 14:46                             ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-25 14:46 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Jan Kara, Rik van Riel, Stefan Hajnoczi, Stefan Hajnoczi,
	kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	ross zwisler, Paolo Bonzini, Kevin Wolf, Nitesh Narayan Lal,
	xiaoguangrong eric, Haozhong Zhang, Ross Zwisler

On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
>
>> Subject: Re: KVM "fake DAX" flushing interface - discussion
>>
>> On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
>> >
>> > > On Sun 23-07-17 13:10:34, Dan Williams wrote:
>> > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@redhat.com> wrote:
>> > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
>> > > > >> [ adding Ross and Jan ]
>> > > > >>
>> > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@redhat.com>
>> > > > >> wrote:
>> > > > >> >
>> > > > >> > The goal is to increase density of guests, by moving page
>> > > > >> > cache into the host (where it can be easily reclaimed).
>> > > > >> >
>> > > > >> > If we assume the guests will be backed by relatively fast
>> > > > >> > SSDs, a "whole device flush" from filesystem journaling
>> > > > >> > code (issued where the filesystem issues a barrier or
>> > > > >> > disk cache flush today) may be just what we need to make
>> > > > >> > that work.
>> > > > >>
>> > > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
>> > > > >>
>> > > > >> However, it still seems like the storage interface is not capable of
>> > > > >> expressing what is needed, because the operation that is needed is a
>> > > > >> range flush. In the guest you want the DAX page dirty tracking to
>> > > > >> communicate range flush information to the host, but there's no
>> > > > >> readily available block i/o semantic that software running on top of
>> > > > >> the fake pmem device can use to communicate with the host. Instead
>> > > > >> you
>> > > > >> want to intercept the dax_flush() operation and turn it into a
>> > > > >> queued
>> > > > >> request on the host.
>> > > > >>
>> > > > >> In 4.13 we have turned this dax_flush() operation into an explicit
>> > > > >> driver call. That seems a better interface to modify than trying to
>> > > > >> map block-storage flush-cache / force-unit-access commands to this
>> > > > >> host request.
>> > > > >>
>> > > > >> The additional piece you would need to consider is whether to track
>> > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
>> > > > >> dirtying events, or arrange for every dax_copy_from_iter()
>> > > > >> operation()
>> > > > >> to also queue a sync on the host, but that essentially turns the
>> > > > >> host
>> > > > >> page cache into a pseudo write-through mode.
>> > > > >
>> > > > > I suspect initially it will be fine to not offer DAX
>> > > > > semantics to applications using these "fake DAX" devices
>> > > > > from a virtual machine, because the DAX APIs are designed
>> > > > > for a much higher performance device than these fake DAX
>> > > > > setups could ever give.
>> > > >
>> > > > Right, we don't need DAX, per se, in the guest.
>> > > >
>> > > > >
>> > > > > Having userspace call fsync/msync like done normally, and
>> > > > > having those coarser calls be turned into somewhat efficient
>> > > > > backend flushes would be perfectly acceptable.
>> > > > >
>> > > > > The big question is, what should that kind of interface look
>> > > > > like?
>> > > >
>> > > > To me, this looks much like the dirty cache tracking that is done in
>> > > > the address_space radix for the DAX case, but modified to coordinate
>> > > > queued / page-based flushing when the guest  wants to persist data.
>> > > > The similarity to DAX is not storing guest allocated pages in the
>> > > > radix but entries that track dirty guest physical addresses.
>> > >
>> > > Let me check whether I understand the problem correctly. So we want to
>> > > export a block device (essentially a page cache of this block device) to
>> > > a
>> > > guest as PMEM and use DAX in the guest to save guest's page cache. The
>> >
>> > that's correct.
>> >
>> > > natural way to make the persistence work would be to make ->flush
>> > > callback
>> > > of the PMEM device to do an upcall to the host which could then
>> > > fdatasync()
>> > > appropriate image file range however the performance would suck in such
>> > > case since ->flush gets called for at most one page ranges from DAX.
>> >
>> > Discussion is : sync a range using paravirt device or flush hit addresses
>> > vs block device flush.
>> >
>> > >
>> > > So what you could do instead is to completely ignore ->flush calls for
>> > > the
>> > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
>> > > PMEM device (generated by blkdev_issue_flush() or the journalling
>> > > machinery) and fdatasync() the whole image file at that moment - in fact
>> > > you must do that for metadata IO to hit persistent storage anyway in your
>> > > setting. This would very closely follow how exporting block devices with
>> > > volatile cache works with KVM these days AFAIU and the performance will
>> > > be
>> > > the same.
>> >
>> > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
>> > As per suggestions looks like block flushing device is way ahead.
>> >
>> > If we do an asynchronous block flush at guest side(put current task in
>> > wait queue till host side fdatasync completes) can solve the purpose? Or
>> > do we need another paravirt device for this?
>>
>> Well, even currently if you have PMEM device, you still have also a block
>> device and a request queue associated with it and metadata IO goes through
>> that path. So in your case you will have the same in the guest as a result
>> of exposing virtual PMEM device to the guest and you just need to make sure
>> this virtual block device behaves the same way as traditional virtualized
>> block devices in KVM in respose to 'REQ_OP_WRITE | REQ_PREFLUSH' requests.
>
> Looks like only way to send flush(blk dev) from guest to host with nvdimm
> is using flush hint addresses. Is this the correct interface I am looking?
>
> blkdev_issue_flush
>  submit_bio_wait
>   submit_bio
>     generic_make_request
>       pmem_make_request
>       ...
>            if (bio->bi_opf & REQ_FLUSH)
>                 nvdimm_flush(nd_region);

I would inject a paravirtualized version of pmem_make_request() that
sends an async flush operation over virtio to the host. Don't try to
use flush hint addresses for this, they don't have the proper
semantics. The guest should be allowed to issue the flush and receive
the completion asynchronously rather than taking a vm exist and
blocking on that request.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-25 14:46                             ` Dan Williams
@ 2017-07-25 20:59                               ` Rik van Riel
  -1 siblings, 0 replies; 176+ messages in thread
From: Rik van Riel @ 2017-07-25 20:59 UTC (permalink / raw)
  To: Dan Williams, Pankaj Gupta
  Cc: Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi, kvm-devel,
	Qemu Developers, linux-nvdimm@lists.01.org, ross zwisler,
	Paolo Bonzini, Kevin Wolf, Nitesh Narayan Lal,
	xiaoguangrong eric, Haozhong Zhang, Ross Zwisler

[-- Attachment #1: Type: text/plain, Size: 1102 bytes --]

On Tue, 2017-07-25 at 07:46 -0700, Dan Williams wrote:
> On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta <pagupta@redhat.com>
> wrote:
> > 
> > Looks like the only way to send a flush (blk dev) from the guest to the
> > host with nvdimm is using flush hint addresses. Is this the correct
> > interface I am looking at?
> > 
> > blkdev_issue_flush
> >  submit_bio_wait
> >   submit_bio
> >     generic_make_request
> >       pmem_make_request
> >       ...
> >            if (bio->bi_opf & REQ_FLUSH)
> >                 nvdimm_flush(nd_region);
> 
> I would inject a paravirtualized version of pmem_make_request() that
> sends an async flush operation over virtio to the host. Don't try to
> use flush hint addresses for this, they don't have the proper
> semantics. The guest should be allowed to issue the flush and receive
> the completion asynchronously rather than taking a VM exit and
> blocking on that request.

That is my feeling, too. A slower IO device benefits
greatly from an asynchronous flush mechanism.

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-26 13:47                                 ` Pankaj Gupta
  0 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-07-26 13:47 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kevin Wolf, Jan Kara, xiaoguangrong eric, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal


> 
> On Tue, 2017-07-25 at 07:46 -0700, Dan Williams wrote:
> > On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta <pagupta@redhat.com>
> > wrote:
> > > 
> > > Looks like the only way to send a flush (blk dev) from the guest to the
> > > host with nvdimm is using flush hint addresses. Is this the correct
> > > interface I am looking at?
> > > 
> > > blkdev_issue_flush
> > >  submit_bio_wait
> > >   submit_bio
> > >     generic_make_request
> > >       pmem_make_request
> > >       ...
> > >            if (bio->bi_opf & REQ_FLUSH)
> > >                 nvdimm_flush(nd_region);
> > 
> > I would inject a paravirtualized version of pmem_make_request() that
> > sends an async flush operation over virtio to the host. Don't try to
> > use flush hint addresses for this, they don't have the proper
> > semantics. The guest should be allowed to issue the flush and receive
> > the completion asynchronously rather than taking a VM exit and
> > blocking on that request.
> 
> That is my feeling, too. A slower IO device benefits
> greatly from an asynchronous flush mechanism.

Thanks for all the suggestions!

Just want to summarize here (high level):

This will require implementing a new 'virtio-pmem' device which presents
a DAX address range (like pmem) to the guest with read/write (direct access)
and device flush functionality. QEMU should also implement the corresponding
support for flush using virtio.

Thanks,
Pankaj
> 
> --
> All rights reversed
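
For reference, a hedged sketch of the host-side half follows, assuming some
worker that pops flush requests off the virtqueue. None of the names below
are QEMU or virtio API; the point is only that servicing a guest flush
reduces to an fdatasync() of the file backing the fake-DAX region, with the
completion pushed back to the guest asynchronously.

#include <errno.h>
#include <unistd.h>

/* Assumed plumbing for the sketch, not a real QEMU/virtio interface. */
struct fake_dax_flush_req {
	int backing_fd;		/* image file backing the fake-DAX region */
	void (*complete)(struct fake_dax_flush_req *req, int err);
};

/* Runs in a worker thread for each flush request popped off the
 * virtqueue, so the vCPU thread is never blocked on the sync. */
static void handle_guest_flush(struct fake_dax_flush_req *req)
{
	int err = 0;

	/* Persist all dirty host page-cache pages of the backing file. */
	if (fdatasync(req->backing_fd) < 0)
		err = errno;

	/* Complete the request back to the guest; its flush bio ends here. */
	req->complete(req, err);
}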

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-26 13:47                                 ` Pankaj Gupta
@ 2017-07-26 21:27                                   ` Rik van Riel
  -1 siblings, 0 replies; 176+ messages in thread
From: Rik van Riel @ 2017-07-26 21:27 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Dan Williams, Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi,
	kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	ross zwisler, Paolo Bonzini, Kevin Wolf, Nitesh Narayan Lal,
	xiaoguangrong eric, Haozhong Zhang, Ross Zwisler

[-- Attachment #1: Type: text/plain, Size: 662 bytes --]

On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote:
> > 
> Just want to summarize here(high level):
> 
> This will require implementing new 'virtio-pmem' device which
> presents 
> a DAX address range(like pmem) to guest with read/write(direct
> access)
> & device flush functionality. Also, qemu should implement
> corresponding
> support for flush using virtio.
> 
Alternatively, we could keep the existing pmem code, with
a flush-only block device on the side that is somehow
associated with the pmem device.

I wonder which alternative leads to the least
code duplication, and the least maintenance
hassle going forward.

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-26 21:27                                   ` [Qemu-devel] " Rik van Riel
  (?)
@ 2017-07-26 21:40                                     ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-26 21:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kevin Wolf, Pankaj Gupta, Jan Kara, xiaoguangrong eric,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi, linux-nvdimm@lists.01.org, Paolo Bonzini,
	Nitesh Narayan Lal

On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel <riel@redhat.com> wrote:
> On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote:
>> >
>> Just want to summarize here(high level):
>>
>> This will require implementing new 'virtio-pmem' device which
>> presents
>> a DAX address range(like pmem) to guest with read/write(direct
>> access)
>> & device flush functionality. Also, qemu should implement
>> corresponding
>> support for flush using virtio.
>>
> Alternatively, the existing pmem code, with
> a flush-only block device on the side, which
> is somehow associated with the pmem device.
>
> I wonder which alternative leads to the least
> code duplication, and the least maintenance
> hassle going forward.

I'd much prefer to have another driver. I.e. a driver that refactors
out some common pmem details into a shared object and can attach to
ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems like
a recipe for confusion.

With a $new_driver in hand you can just do:

   modprobe $new_driver
   echo $namespace > /sys/bus/nd/drivers/nd_pmem/unbind
   echo $namespace > /sys/bus/nd/drivers/$new_driver/new_id
   echo $namespace > /sys/bus/nd/drivers/$new_driver/bind

...and the guest can arrange for $new_driver to be the default, so you
don't need to do those steps each boot of the VM, by doing:

    echo "blacklist nd_pmem" > /etc/modprobe.d/virt-dax-flush.conf
    echo "alias nd:t4* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf
    echo "alias nd:t5* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-07-26 21:40                                     ` Dan Williams
  (?)
@ 2017-07-26 23:46                                       ` Rik van Riel
  -1 siblings, 0 replies; 176+ messages in thread
From: Rik van Riel @ 2017-07-26 23:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Jan Kara, xiaoguangrong eric,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi, linux-nvdimm@lists.01.org, Paolo Bonzini,
	Nitesh Narayan Lal

On Wed, 2017-07-26 at 14:40 -0700, Dan Williams wrote:
> On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel <riel@redhat.com>
> wrote:
> > On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote:
> > > > 
> > > 
> > > Just want to summarize here(high level):
> > > 
> > > This will require implementing new 'virtio-pmem' device which
> > > presents
> > > a DAX address range(like pmem) to guest with read/write(direct
> > > access)
> > > & device flush functionality. Also, qemu should implement
> > > corresponding
> > > support for flush using virtio.
> > > 
> > 
> > Alternatively, the existing pmem code, with
> > a flush-only block device on the side, which
> > is somehow associated with the pmem device.
> > 
> > I wonder which alternative leads to the least
> > code duplication, and the least maintenance
> > hassle going forward.
> 
> I'd much prefer to have another driver. I.e. a driver that refactors
> out some common pmem details into a shared object and can attach to
> ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems
> like
> a recipe for confusion.

At that point, would it make sense to expose these special
virtio-pmem areas to the guest in a slightly different way,
so the regions that need virtio flushing are not bound by
the regular driver, and the regular driver can continue to
work for memory regions that are backed by actual pmem in
the host?

> With a $new_driver in hand you can just do:
> 
>    modprobe $new_driver
>    echo $namespace > /sys/bus/nd/drivers/nd_pmem/unbind
>    echo $namespace > /sys/bus/nd/drivers/$new_driver/new_id
>    echo $namespace > /sys/bus/nd/drivers/$new_driver/bind
> 
> ...and the guest can arrange for $new_driver to be the default, so
> you
> don't need to do those steps each boot of the VM, by doing:
> 
>     echo "blacklist nd_pmem" > /etc/modprobe.d/virt-dax-flush.conf
>     echo "alias nd:t4* $new_driver" >> /etc/modprobe.d/virt-dax-
> flush.conf
>     echo "alias nd:t5* $new_driver" >> /etc/modprobe.d/virt-dax-
> flush.conf

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-07-27  0:54                                         ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-07-27  0:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kevin Wolf, Pankaj Gupta, Jan Kara, xiaoguangrong eric,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi, linux-nvdimm@lists.01.org, Paolo Bonzini,
	Nitesh Narayan Lal

On Wed, Jul 26, 2017 at 4:46 PM, Rik van Riel <riel@redhat.com> wrote:
> On Wed, 2017-07-26 at 14:40 -0700, Dan Williams wrote:
>> On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel <riel@redhat.com>
>> wrote:
>> > On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote:
>> > > >
>> > >
>> > > Just want to summarize here(high level):
>> > >
>> > > This will require implementing new 'virtio-pmem' device which
>> > > presents
>> > > a DAX address range(like pmem) to guest with read/write(direct
>> > > access)
>> > > & device flush functionality. Also, qemu should implement
>> > > corresponding
>> > > support for flush using virtio.
>> > >
>> >
>> > Alternatively, the existing pmem code, with
>> > a flush-only block device on the side, which
>> > is somehow associated with the pmem device.
>> >
>> > I wonder which alternative leads to the least
>> > code duplication, and the least maintenance
>> > hassle going forward.
>>
>> I'd much prefer to have another driver. I.e. a driver that refactors
>> out some common pmem details into a shared object and can attach to
>> ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems
>> like
>> a recipe for confusion.
>
> At that point, would it make sense to expose these special
> virtio-pmem areas to the guest in a slightly different way,
> so the regions that need virtio flushing are not bound by
> the regular driver, and the regular driver can continue to
> work for memory regions that are backed by actual pmem in
> the host?

Hmm, yes, that could be feasible, especially if it uses the ACPI NFIT
mechanism. It would basically involve defining a new SPA (System
Physical Address) range GUID type, and then teaching libnvdimm to
treat that as a new pmem device type.

See usage of UUID_PERSISTENT_MEMORY in drivers/acpi/nfit/ and the
eventual region description sent to nvdimm_pmem_region_create(). We
would then need to plumb a new flag so that nd_region_to_nstype() in
libnvdimm returns a different namespace type number for this virtio
use case, but otherwise the rest of libnvdimm should treat the region
as pmem.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-10-31  7:13                                           ` Xiao Guangrong
  0 siblings, 0 replies; 176+ messages in thread
From: Xiao Guangrong @ 2017-10-31  7:13 UTC (permalink / raw)
  To: Dan Williams, Rik van Riel
  Cc: Kevin Wolf, Pankaj Gupta, Jan Kara, kvm-devel, Stefan Hajnoczi,
	Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal



On 07/27/2017 08:54 AM, Dan Williams wrote:

>> At that point, would it make sense to expose these special
>> virtio-pmem areas to the guest in a slightly different way,
>> so the regions that need virtio flushing are not bound by
>> the regular driver, and the regular driver can continue to
>> work for memory regions that are backed by actual pmem in
>> the host?
> 
> Hmm, yes that could be feasible especially if it uses the ACPI NFIT
> mechanism. It would basically involve defining a new SPA (System
> Phyiscal Address) range GUID type, and then teaching libnvdimm to
> treat that as a new pmem device type.

I would prefer a new flush mechanism over a new memory type introduced
to NFIT; e.g., in that mechanism we can define request queues,
completion queues and any other features that make it virtualization
friendly. That would be much simpler.
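
Purely as an illustration of that idea, the descriptors for such a
queue-based flush mechanism could be as small as the sketch below; the
layout, field names and status convention are assumptions, not part of
any spec.

#include <stdint.h>

/* Hypothetical flush request the guest places on a request queue. */
struct vpmem_flush_req {
	uint32_t type;		/* e.g. 0 = flush the whole region */
	uint32_t flags;
	uint64_t addr;		/* optional start of a range to flush */
	uint64_t len;		/* optional length; 0 = entire region */
};

/* Hypothetical completion the host writes back on the completion queue. */
struct vpmem_flush_resp {
	uint32_t status;	/* 0 = success, otherwise an errno value */
	uint32_t reserved;
};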


^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-10-31  7:13                                           ` Xiao Guangrong
  (?)
@ 2017-10-31 14:20                                             ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-10-31 14:20 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal

On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong
<xiaoguangrong.eric@gmail.com> wrote:
>
>
> On 07/27/2017 08:54 AM, Dan Williams wrote:
>
>>> At that point, would it make sense to expose these special
>>> virtio-pmem areas to the guest in a slightly different way,
>>> so the regions that need virtio flushing are not bound by
>>> the regular driver, and the regular driver can continue to
>>> work for memory regions that are backed by actual pmem in
>>> the host?
>>
>>
>> Hmm, yes that could be feasible especially if it uses the ACPI NFIT
>> mechanism. It would basically involve defining a new SPA (System
>> Phyiscal Address) range GUID type, and then teaching libnvdimm to
>> treat that as a new pmem device type.
>
>
> I would prefer a new flush mechanism to a new memory type introduced
> to NFIT, e.g, in that mechanism we can define request queues and
> completion queues and any other features to make virtualization
> friendly. That would be much simpler.
>

No, that's more confusing because now we are overloading the definition
of persistent memory. I want this memory type identified from the top
of the stack so it can appear differently in /proc/iomem and also
implement this alternate flush communication.

In what way is this "more complicated"? It was trivial to add support
for the "volatile" NFIT range, this will not be any more complicated
than that.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-11-01  3:43                                               ` Xiao Guangrong
  0 siblings, 0 replies; 176+ messages in thread
From: Xiao Guangrong @ 2017-11-01  3:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal



On 10/31/2017 10:20 PM, Dan Williams wrote:
> On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong
> <xiaoguangrong.eric@gmail.com> wrote:
>>
>>
>> On 07/27/2017 08:54 AM, Dan Williams wrote:
>>
>>>> At that point, would it make sense to expose these special
>>>> virtio-pmem areas to the guest in a slightly different way,
>>>> so the regions that need virtio flushing are not bound by
>>>> the regular driver, and the regular driver can continue to
>>>> work for memory regions that are backed by actual pmem in
>>>> the host?
>>>
>>>
>>> Hmm, yes that could be feasible especially if it uses the ACPI NFIT
>>> mechanism. It would basically involve defining a new SPA (System
>>> Physical Address) range GUID type, and then teaching libnvdimm to
>>> treat that as a new pmem device type.
>>
>>
>> I would prefer a new flush mechanism to a new memory type introduced
>> to NFIT, e.g, in that mechanism we can define request queues and
>> completion queues and any other features to make virtualization
>> friendly. That would be much simpler.
>>
> 
> No that's more confusing because now we are overloading the definition
> of persistent memory. I want this memory type identified from the top
> of the stack so it can appear differently in /proc/iomem and also
> implement this alternate flush communication.
> 

As far as the memory characteristics go, I see no reason why the VM
should know about this difference. It can be completely transparent to
the VM; that is, the VM does not need to know where this virtual PMEM
comes from (a real NVDIMM backend or normal storage). The only
discrepancy is the flush interface.

> In what way is this "more complicated"? It was trivial to add support
> for the "volatile" NFIT range, this will not be any more complicated
> than that.
> 

Introducing a memory type is easy indeed; however, a new flush interface
definition is inevitable, i.e., we need a standard way to discover the
MMIO regions used to communicate with the host.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-11-01  3:43                                               ` Xiao Guangrong
@ 2017-11-01  4:25                                                 ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-11-01  4:25 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal

On Tue, Oct 31, 2017 at 8:43 PM, Xiao Guangrong
<xiaoguangrong.eric@gmail.com> wrote:
>
>
> On 10/31/2017 10:20 PM, Dan Williams wrote:
>>
>> On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong
>> <xiaoguangrong.eric@gmail.com> wrote:
>>>
>>>
>>>
>>> On 07/27/2017 08:54 AM, Dan Williams wrote:
>>>
>>>>> At that point, would it make sense to expose these special
>>>>> virtio-pmem areas to the guest in a slightly different way,
>>>>> so the regions that need virtio flushing are not bound by
>>>>> the regular driver, and the regular driver can continue to
>>>>> work for memory regions that are backed by actual pmem in
>>>>> the host?
>>>>
>>>>
>>>>
>>>> Hmm, yes that could be feasible especially if it uses the ACPI NFIT
>>>> mechanism. It would basically involve defining a new SPA (System
>>>> Physical Address) range GUID type, and then teaching libnvdimm to
>>>> treat that as a new pmem device type.
>>>
>>>
>>>
>>> I would prefer a new flush mechanism to a new memory type introduced
>>> to NFIT, e.g, in that mechanism we can define request queues and
>>> completion queues and any other features to make virtualization
>>> friendly. That would be much simpler.
>>>
>>
>> No that's more confusing because now we are overloading the definition
>> of persistent memory. I want this memory type identified from the top
>> of the stack so it can appear differently in /proc/iomem and also
>> implement this alternate flush communication.
>>
>
> For the characteristic of memory, I have no idea why VM should know this
> difference. It can be completely transparent to VM, that means, VM
> does not need to know where this virtual PMEM comes from (for a really
> nvdimm backend or a normal storage). The only discrepancy is the flush
> interface.

It's not persistent memory if it requires a hypercall to make it
persistent. Unless memory writes can be made durable purely with CPU
instructions, it's dangerous for it to be treated as a PMEM range.
Consider a guest that tried to map it with device-dax, which has no
facility to route requests to a special flushing interface.

>
>> In what way is this "more complicated"? It was trivial to add support
>> for the "volatile" NFIT range, this will not be any more complicated
>> than that.
>>
>
> Introducing memory type is easy indeed, however, a new flush interface
> definition is inevitable, i.e, we need a standard way to discover the
> MMIOs to communicate with host.

Right, the proposed way to do that for x86 platforms is a new SPA
Range GUID type in the NFIT.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-11-01  6:46                                                   ` Xiao Guangrong
  0 siblings, 0 replies; 176+ messages in thread
From: Xiao Guangrong @ 2017-11-01  6:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal



On 11/01/2017 12:25 PM, Dan Williams wrote:
> On Tue, Oct 31, 2017 at 8:43 PM, Xiao Guangrong
> <xiaoguangrong.eric@gmail.com> wrote:
>>
>>
>> On 10/31/2017 10:20 PM, Dan Williams wrote:
>>>
>>> On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong
>>> <xiaoguangrong.eric@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 07/27/2017 08:54 AM, Dan Williams wrote:
>>>>
>>>>>> At that point, would it make sense to expose these special
>>>>>> virtio-pmem areas to the guest in a slightly different way,
>>>>>> so the regions that need virtio flushing are not bound by
>>>>>> the regular driver, and the regular driver can continue to
>>>>>> work for memory regions that are backed by actual pmem in
>>>>>> the host?
>>>>>
>>>>>
>>>>>
>>>>> Hmm, yes that could be feasible especially if it uses the ACPI NFIT
>>>>> mechanism. It would basically involve defining a new SPA (System
>>>>> Physical Address) range GUID type, and then teaching libnvdimm to
>>>>> treat that as a new pmem device type.
>>>>
>>>>
>>>>
>>>> I would prefer a new flush mechanism to a new memory type introduced
>>>> to NFIT, e.g, in that mechanism we can define request queues and
>>>> completion queues and any other features to make virtualization
>>>> friendly. That would be much simpler.
>>>>
>>>
>>> No that's more confusing because now we are overloading the definition
>>> of persistent memory. I want this memory type identified from the top
>>> of the stack so it can appear differently in /proc/iomem and also
>>> implement this alternate flush communication.
>>>
>>
>> For the characteristic of memory, I have no idea why VM should know this
>> difference. It can be completely transparent to VM, that means, VM
>> does not need to know where this virtual PMEM comes from (for a really
>> nvdimm backend or a normal storage). The only discrepancy is the flush
>> interface.
> 
> It's not persistent memory if it requires a hypercall to make it
> persistent. Unless memory writes can be made durable purely with cpu
> instructions it's dangerous for it to be treated as a PMEM range.
> Consider a guest that tried to map it with device-dax which has no
> facility to route requests to a special flushing interface.
> 

Can we separate the concept of the flush interface from persistent
memory? Say there are two APIs: one indicates the memory type (i.e.,
/proc/iomem) and the other indicates the flush interface (a rough
sketch of this split follows below).

So for existing NVDIMM hardware:
1: persistent memory + CLFLUSH
2: persistent memory + flush-hint table (I know Intel does not use it)

and for a virtual NVDIMM backed by normal storage:
persistent memory + virtual flush interface
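
In code form, the split could look roughly like this (names made up):

/* Describe the memory type and the flush mechanism independently. */
enum region_type {
        REGION_VOLATILE,
        REGION_PMEM,
};

enum flush_method {
        FLUSH_CPU_CACHE,        /* CLFLUSH/CLWB plus a fence             */
        FLUSH_HINT_ADDRESSES,   /* writes to ACPI flush hint addresses   */
        FLUSH_PARAVIRT,         /* flush request sent to the hypervisor  */
};

struct region_desc {
        enum region_type  type;
        enum flush_method flush;
};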

>>
>>> In what way is this "more complicated"? It was trivial to add support
>>> for the "volatile" NFIT range, this will not be any more complicated
>>> than that.
>>>
>>
>> Introducing memory type is easy indeed, however, a new flush interface
>> definition is inevitable, i.e, we need a standard way to discover the
>> MMIOs to communicate with host.
> 
> Right, the proposed way to do that for x86 platforms is a new SPA
> Range GUID type. in the NFIT.
> 

So this SPA range is used for both the persistent memory region and the
flush interface? Maybe I missed it in previous mails; could you please
detail how to do it?

BTW, please note that a hypercall is not acceptable for a standard;
MMIO/PIO regions are. (Oh, yes, it depends on Paolo. :))




^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-11-01 15:20                                                     ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-11-01 15:20 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal

> On 11/01/2017 12:25 PM, Dan Williams wrote:
[..]
>> It's not persistent memory if it requires a hypercall to make it
>> persistent. Unless memory writes can be made durable purely with cpu
>> instructions it's dangerous for it to be treated as a PMEM range.
>> Consider a guest that tried to map it with device-dax which has no
>> facility to route requests to a special flushing interface.
>>
>
> Can we separate the concept of flush interface from persistent memory?
> Say there are two APIs, one is used to indicate the memory type (i.e,
> /proc/iomem) and another one indicates the flush interface.
>
> So for existing nvdimm hardwares:
> 1: Persist-memory + CLFLUSH
> 2: Persiste-memory + flush-hint-table (I know Intel does not use it)
>
> and for the virtual nvdimm which backended on normal storage:
> Persist-memory + virtual flush interface

I see the flush interface as fundamental to identifying the media
properties. It's not byte-addressable persistent memory if the
application needs to call a sideband interface to manage writes. This
is why we have pushed for something like the MAP_SYNC interface to
make filesystem-dax actually behave in a way that applications can
safely treat it as persistent memory, and this is also the guarantee
that device-dax provides. Changing the flush interface makes it
distinct and unusable for applications that want to manage data
persistence in userspace.
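
For illustration, this is the userspace model that MAP_SYNC enables on
real pmem: after a successful MAP_SYNC mapping, CPU cache flushes alone
make the data durable, with no fsync() and no side channel to call. A
minimal sketch, assuming a DAX-mounted filesystem, a file of at least
one page, and a CPU with CLWB (compile with -mclwb):

#define _GNU_SOURCE
#include <fcntl.h>
#include <immintrin.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SYNC                      /* older libc headers lack these */
#define MAP_SYNC            0x080000
#define MAP_SHARED_VALIDATE 0x03
#endif

int persist_hello(const char *path)
{
        int fd = open(path, O_RDWR);
        if (fd < 0)
                return -1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED) {
                close(fd);
                return -1;            /* fs/device cannot honour MAP_SYNC */
        }

        strcpy(p, "hello, pmem");     /* store directly into the mapping */
        _mm_clwb(p);                  /* write back the dirty cache line */
        _mm_sfence();                 /* order the write-back            */

        munmap(p, 4096);
        close(fd);
        return 0;
}

If mmap() fails here, the application knows it must fall back to
fsync()-based persistence; if it succeeds, there is no hook where a
hypercall-style flush could be inserted.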

>>>
>>>> In what way is this "more complicated"? It was trivial to add support
>>>> for the "volatile" NFIT range, this will not be any more complicated
>>>> than that.
>>>>
>>>
>>> Introducing memory type is easy indeed, however, a new flush interface
>>> definition is inevitable, i.e, we need a standard way to discover the
>>> MMIOs to communicate with host.
>>
>>
>> Right, the proposed way to do that for x86 platforms is a new SPA
>> Range GUID type. in the NFIT.
>>
>
> So this SPA is used for both persistent memory region and flush interface?
> Maybe i missed it in previous mails, could you please detail how to do
> it?

Yes, the GUID will specifically identify this range as "Virtio Shared
Memory" (or whatever name survives after a bikeshed debate). The
libnvdimm core then needs to grow a new region type that mostly
behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a
new flush interface to perform the host communication. Device-dax
would be disallowed from attaching to this region type, or we could
grow a new device-dax type that does not allow the raw device to be
mapped, but allows a filesystem mounted on top to manage the flush
interface.
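
The shape of that per-region hook might be something like the following;
this is only a sketch of the idea, not the actual libnvdimm interfaces:

#include <stddef.h>

struct pv_pmem_region;

struct pv_pmem_flush_ops {
        /* Make [offset, offset + len) durable on the host backing file. */
        int (*flush)(struct pv_pmem_region *region, size_t offset, size_t len);
};

struct pv_pmem_region {
        const struct pv_pmem_flush_ops *ops;  /* NULL for ordinary pmem     */
        void *channel;                        /* e.g. the MMIO/queue used   */
                                              /* to reach the host          */
};

/* Called where an ordinary pmem region would only flush CPU caches. */
static int pv_pmem_flush(struct pv_pmem_region *r, size_t off, size_t len)
{
        if (!r->ops || !r->ops->flush)
                return 0;                    /* real pmem: caches suffice   */
        return r->ops->flush(r, off, len);   /* paravirt pmem: ask the host */
}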

> BTW, please note hypercall is not acceptable for standard, MMIO/PIO regions
> are. (Oh, yes, it depends on Paolo. :))

MMIO/PIO regions work for me; that's not the part of the proposal I'm
concerned about.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-11-02  8:50                                                       ` Xiao Guangrong
  0 siblings, 0 replies; 176+ messages in thread
From: Xiao Guangrong @ 2017-11-02  8:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal



On 11/01/2017 11:20 PM, Dan Williams wrote:
>> On 11/01/2017 12:25 PM, Dan Williams wrote:
> [..]
>>> It's not persistent memory if it requires a hypercall to make it
>>> persistent. Unless memory writes can be made durable purely with cpu
>>> instructions it's dangerous for it to be treated as a PMEM range.
>>> Consider a guest that tried to map it with device-dax which has no
>>> facility to route requests to a special flushing interface.
>>>
>>
>> Can we separate the concept of flush interface from persistent memory?
>> Say there are two APIs, one is used to indicate the memory type (i.e,
>> /proc/iomem) and another one indicates the flush interface.
>>
>> So for existing nvdimm hardwares:
>> 1: Persist-memory + CLFLUSH
>> 2: Persiste-memory + flush-hint-table (I know Intel does not use it)
>>
>> and for the virtual nvdimm which backended on normal storage:
>> Persist-memory + virtual flush interface
> 
> I see the flush interface as fundamental to identifying the media
> properties. It's not byte-addressable persistent memory if the
> application needs to call a sideband interface to manage writes. This
> is why we have pushed for something like the MAP_SYNC interface to
> make filesystem-dax actually behave in a way that applications can
> safely treat it as persistent memory, and this is also the guarantee
> that device-dax provides. Changing the flush interface makes it
> distinct and unusable for applications that want to manage data
> persistence in userspace.
> 

I was thinking that, from the device's perspective, neither of them is
persistent until a flush operation is issued (CLFLUSH or the virtual
flush interface). But you are right that, from the user/software
perspective, their fundamentals are different.

So for a virtual NVDIMM that is backed by normal storage, we should
refuse MAP_SYNC, and the only way to guarantee persistence is
fsync/fdatasync (a sketch of such a check is below).
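
A sketch of that check, roughly at the point where an mmap() asking for
MAP_SYNC would be accepted or rejected (names invented):

#include <errno.h>
#include <stdbool.h>

struct dax_backend_caps {
        bool synchronous;   /* durable with CPU instructions alone? */
};

/* Reject MAP_SYNC when writes still need a host-side flush to be durable. */
static int check_map_sync(const struct dax_backend_caps *caps, bool want_map_sync)
{
        if (want_map_sync && !caps->synchronous)
                return -EOPNOTSUPP;   /* fall back to fsync()/fdatasync() */
        return 0;
}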

Actually, we can mark an SPA region that is associated with a specific
flush interface with a special GUID, as in your proposal; please see
more in the comment below...

>>>>
>>>>> In what way is this "more complicated"? It was trivial to add support
>>>>> for the "volatile" NFIT range, this will not be any more complicated
>>>>> than that.
>>>>>
>>>>
>>>> Introducing memory type is easy indeed, however, a new flush interface
>>>> definition is inevitable, i.e, we need a standard way to discover the
>>>> MMIOs to communicate with host.
>>>
>>>
>>> Right, the proposed way to do that for x86 platforms is a new SPA
>>> Range GUID type. in the NFIT.
>>>
>>
>> So this SPA is used for both persistent memory region and flush interface?
>> Maybe i missed it in previous mails, could you please detail how to do
>> it?
> 
> Yes, the GUID will specifically identify this range as "Virtio Shared
> Memory" (or whatever name survives after a bikeshed debate). The
> libnvdimm core then needs to grow a new region type that mostly
> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a
> new flush interface to perform the host communication. Device-dax
> would be disallowed from attaching to this region type, or we could
> grow a new device-dax type that does not allow the raw device to be
> mapped, but allows a filesystem mounted on top to manage the flush
> interface.

I am afraid it is not a good idea for a single SPA range to be used for
multiple purposes. The region used as "pmem" is directly mapped into the
VM so that the guest can freely access it without the host's assistance;
however, the region used for "host communication" is not mapped into the
VM, so that accessing it causes a VM-exit and the host gets the chance
to do the specific operation, e.g., flush the cache. So we had better
define these two regions distinctly to avoid unnecessary complexity in
the hypervisor.
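
To make the split concrete, here is a sketch of the control side: the
data range stays directly mapped, and only this small, separate region
traps to the host. The register layout is entirely invented:

#include <stdint.h>

/* Hypothetical registers in a small MMIO control window. */
enum {
        CTRL_REG_FLUSH_ADDR = 0x00,   /* guest-physical start of the range  */
        CTRL_REG_FLUSH_LEN  = 0x08,   /* length in bytes                    */
        CTRL_REG_DOORBELL   = 0x10,   /* write causes a VM-exit; host syncs */
        CTRL_REG_STATUS     = 0x18,   /* 0 = ok, negative errno otherwise   */
};

static inline void mmio_write64(volatile void *base, unsigned long off,
                                uint64_t val)
{
        *(volatile uint64_t *)((volatile char *)base + off) = val;
}

static inline int64_t mmio_read64(volatile void *base, unsigned long off)
{
        return *(volatile int64_t *)((volatile char *)base + off);
}

/* Ask the host to flush [gpa, gpa + len) of the directly mapped data range. */
static int64_t host_flush(volatile void *ctrl, uint64_t gpa, uint64_t len)
{
        mmio_write64(ctrl, CTRL_REG_FLUSH_ADDR, gpa);
        mmio_write64(ctrl, CTRL_REG_FLUSH_LEN, len);
        mmio_write64(ctrl, CTRL_REG_DOORBELL, 1);
        return mmio_read64(ctrl, CTRL_REG_STATUS);
}

Whether this window is discovered via the NFIT, via PCI, or via a virtio
device is the part still being debated in this thread.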




^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-11-02  8:50                                                       ` Xiao Guangrong
@ 2017-11-02 16:30                                                         ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-11-02 16:30 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal

On Thu, Nov 2, 2017 at 1:50 AM, Xiao Guangrong
<xiaoguangrong.eric@gmail.com> wrote:
[..]
>> Yes, the GUID will specifically identify this range as "Virtio Shared
>> Memory" (or whatever name survives after a bikeshed debate). The
>> libnvdimm core then needs to grow a new region type that mostly
>> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a
>> new flush interface to perform the host communication. Device-dax
>> would be disallowed from attaching to this region type, or we could
>> grow a new device-dax type that does not allow the raw device to be
>> mapped, but allows a filesystem mounted on top to manage the flush
>> interface.
>
>
> I am afraid it is not a good idea that a single SPA is used for multiple
> purposes. For the region used as "pmem" is directly mapped to the VM so
> that guest can freely access it without host's assistance, however, for
> the region used as "host communication" is not mapped to VM, so that
> it causes VM-exit and host gets the chance to do specific operations,
> e.g, flush cache. So we'd better distinctly define these two regions to
> avoid the unnecessary complexity in hypervisor.

Good point, I was assuming that the mmio flush interface would be
discovered separately from the NFIT-defined memory range. Perhaps via
PCI in the guest? This piece of the proposal  needs a bit more
thought...

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-11-03  6:21                                                           ` Xiao Guangrong
  0 siblings, 0 replies; 176+ messages in thread
From: Xiao Guangrong @ 2017-11-03  6:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal



On 11/03/2017 12:30 AM, Dan Williams wrote:
> On Thu, Nov 2, 2017 at 1:50 AM, Xiao Guangrong
> <xiaoguangrong.eric@gmail.com> wrote:
> [..]
>>> Yes, the GUID will specifically identify this range as "Virtio Shared
>>> Memory" (or whatever name survives after a bikeshed debate). The
>>> libnvdimm core then needs to grow a new region type that mostly
>>> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a
>>> new flush interface to perform the host communication. Device-dax
>>> would be disallowed from attaching to this region type, or we could
>>> grow a new device-dax type that does not allow the raw device to be
>>> mapped, but allows a filesystem mounted on top to manage the flush
>>> interface.
>>
>>
>> I am afraid it is not a good idea that a single SPA is used for multiple
>> purposes. For the region used as "pmem" is directly mapped to the VM so
>> that guest can freely access it without host's assistance, however, for
>> the region used as "host communication" is not mapped to VM, so that
>> it causes VM-exit and host gets the chance to do specific operations,
>> e.g, flush cache. So we'd better distinctly define these two regions to
>> avoid the unnecessary complexity in hypervisor.
> 
> Good point, I was assuming that the mmio flush interface would be
> discovered separately from the NFIT-defined memory range. Perhaps via
> PCI in the guest? This piece of the proposal  needs a bit more
> thought...
> 

Consider the case where a vNVDIMM device backed by normal storage and
a vNVDIMM device backed by real nvdimm hardware both exist in the same
VM: the flush interface should be able to associate with each SPA
region respectively. That's why I'd like to integrate the flush
interface into NFIT/ACPI by using a separate table. Could it become
part of the ACPI specification? :)
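
Purely as a strawman (every field name below is invented for
illustration, not taken from any spec), such a per-SPA-region flush
description could be as small as:

struct virt_flush_hint_entry {
        uint16_t type;             /* hypothetical new NFIT sub-table type */
        uint16_t length;
        uint16_t spa_range_index;  /* ties the entry to an existing SPA range */
        uint16_t reserved;
        uint64_t flush_mmio_addr;  /* doorbell address used to kick the host */
};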

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
  2017-11-02 16:30                                                         ` Dan Williams
@ 2017-11-06  7:57                                                           ` Pankaj Gupta
  -1 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-11-06  7:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Jan Kara, Xiao Guangrong, kvm-devel, Stefan Hajnoczi,
	Ross Zwisler, Qemu Developers, Christoph Hellwig,
	Stefan Hajnoczi, linux-nvdimm@lists.01.org, Paolo Bonzini,
	Nitesh Narayan Lal



> [..]
> >> Yes, the GUID will specifically identify this range as "Virtio Shared
> >> Memory" (or whatever name survives after a bikeshed debate). The
> >> libnvdimm core then needs to grow a new region type that mostly
> >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a
> >> new flush interface to perform the host communication. Device-dax
> >> would be disallowed from attaching to this region type, or we could
> >> grow a new device-dax type that does not allow the raw device to be
> >> mapped, but allows a filesystem mounted on top to manage the flush
> >> interface.
> >
> >
> > I am afraid it is not a good idea that a single SPA is used for multiple
> > purposes. For the region used as "pmem" is directly mapped to the VM so
> > that guest can freely access it without host's assistance, however, for
> > the region used as "host communication" is not mapped to VM, so that
> > it causes VM-exit and host gets the chance to do specific operations,
> > e.g, flush cache. So we'd better distinctly define these two regions to
> > avoid the unnecessary complexity in hypervisor.
> 
> Good point, I was assuming that the mmio flush interface would be
> discovered separately from the NFIT-defined memory range. Perhaps via
> PCI in the guest? This piece of the proposal  needs a bit more
> thought...

Also, in earlier discussions we agreed on an entire-device flush whenever the
guest performs an fsync on a DAX file. If we do an MMIO call for this, the
guest CPU would be trapped for as long as the device flush takes to complete.

Instead, if we perform an asynchronous flush, couldn't the guest CPUs be used
by other tasks until the flush completes?

Thanks,
Pankaj   

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
  2017-11-06  7:57                                                           ` Pankaj Gupta
@ 2017-11-06 16:57                                                             ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-11-06 16:57 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Jan Kara, Xiao Guangrong, kvm-devel, Stefan Hajnoczi,
	Ross Zwisler, Qemu Developers, Christoph Hellwig,
	Stefan Hajnoczi, linux-nvdimm@lists.01.org, Paolo Bonzini,
	Nitesh Narayan Lal

On Sun, Nov 5, 2017 at 11:57 PM, Pankaj Gupta <pagupta@redhat.com> wrote:
>
>
>> [..]
>> >> Yes, the GUID will specifically identify this range as "Virtio Shared
>> >> Memory" (or whatever name survives after a bikeshed debate). The
>> >> libnvdimm core then needs to grow a new region type that mostly
>> >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a
>> >> new flush interface to perform the host communication. Device-dax
>> >> would be disallowed from attaching to this region type, or we could
>> >> grow a new device-dax type that does not allow the raw device to be
>> >> mapped, but allows a filesystem mounted on top to manage the flush
>> >> interface.
>> >
>> >
>> > I am afraid it is not a good idea that a single SPA is used for multiple
>> > purposes. For the region used as "pmem" is directly mapped to the VM so
>> > that guest can freely access it without host's assistance, however, for
>> > the region used as "host communication" is not mapped to VM, so that
>> > it causes VM-exit and host gets the chance to do specific operations,
>> > e.g, flush cache. So we'd better distinctly define these two regions to
>> > avoid the unnecessary complexity in hypervisor.
>>
>> Good point, I was assuming that the mmio flush interface would be
>> discovered separately from the NFIT-defined memory range. Perhaps via
>> PCI in the guest? This piece of the proposal  needs a bit more
>> thought...
>
> Also, in earlier discussions we agreed for entire device flush whenever guest
> performs a fsync on DAX file. If we do a MMIO call for this, guest CPU would be
> trapped for the duration device flush is completed.
>
> Instead, if we do perform an asynchronous flush guest CPU's can be utilized by
> some other tasks till flush completes?

Yes, the interface for the guest to trigger and wait for flush
requests should be asynchronous, just like a storage "flush-cache"
command.
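
For example (sketch only; the pv_flush_* names are made up, only the
completion API is real):

struct pv_flush_req {
        struct completion done;   /* <linux/completion.h> */
        int status;
};

/* Submit a flush and sleep; the vCPU is free to run other tasks until
 * the host finishes its fsync and the device interrupt handler calls
 * complete(&req->done). */
static int pv_flush_and_wait(struct pv_flush_dev *dev)
{
        struct pv_flush_req req = { .status = 0 };

        init_completion(&req.done);
        pv_flush_submit(dev, &req);     /* made up: queue request to host */
        wait_for_completion(&req.done);
        return req.status;
}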

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
@ 2017-11-07 11:21                                                               ` Pankaj Gupta
  0 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-11-07 11:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Jan Kara, Xiao Guangrong, kvm-devel, Amit Shah,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal, Aams


> >
> >
> >> [..]
> >> >> Yes, the GUID will specifically identify this range as "Virtio Shared
> >> >> Memory" (or whatever name survives after a bikeshed debate). The
> >> >> libnvdimm core then needs to grow a new region type that mostly
> >> >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a
> >> >> new flush interface to perform the host communication. Device-dax
> >> >> would be disallowed from attaching to this region type, or we could
> >> >> grow a new device-dax type that does not allow the raw device to be
> >> >> mapped, but allows a filesystem mounted on top to manage the flush
> >> >> interface.
> >> >
> >> >
> >> > I am afraid it is not a good idea that a single SPA is used for multiple
> >> > purposes. For the region used as "pmem" is directly mapped to the VM so
> >> > that guest can freely access it without host's assistance, however, for
> >> > the region used as "host communication" is not mapped to VM, so that
> >> > it causes VM-exit and host gets the chance to do specific operations,
> >> > e.g, flush cache. So we'd better distinctly define these two regions to
> >> > avoid the unnecessary complexity in hypervisor.
> >>
> >> Good point, I was assuming that the mmio flush interface would be
> >> discovered separately from the NFIT-defined memory range. Perhaps via
> >> PCI in the guest? This piece of the proposal  needs a bit more
> >> thought...
> >
> > Also, in earlier discussions we agreed for entire device flush whenever
> > guest
> > performs a fsync on DAX file. If we do a MMIO call for this, guest CPU
> > would be
> > trapped for the duration device flush is completed.
> >
> > Instead, if we do perform an asynchronous flush guest CPU's can be utilized
> > by
> > some other tasks till flush completes?
> 
> Yes, the interface for the guest to trigger and wait for flush
> requests should be asynchronous, just like a storage "flush-cache"
> command.

One idea that came up while discussing this with Rik & Amit during KVM Forum is
to use something similar to the Hyper-V key-value pair mechanism for sharing
commands between guest <=> host. I don't think such a thing exists yet for KVM?
Or how could we utilize existing KVM features to achieve this?

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-11-03  6:21                                                           ` Xiao Guangrong
@ 2017-11-21 18:19                                                             ` Rik van Riel
  -1 siblings, 0 replies; 176+ messages in thread
From: Rik van Riel @ 2017-11-21 18:19 UTC (permalink / raw)
  To: Xiao Guangrong, Dan Williams
  Cc: Pankaj Gupta, Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi,
	kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	ross zwisler, Paolo Bonzini, Kevin Wolf, Nitesh Narayan Lal,
	Haozhong Zhang, Ross Zwisler

On Fri, 2017-11-03 at 14:21 +0800, Xiao Guangrong wrote:
> On 11/03/2017 12:30 AM, Dan Williams wrote:
> > 
> > Good point, I was assuming that the mmio flush interface would be
> > discovered separately from the NFIT-defined memory range. Perhaps
> > via
> > PCI in the guest? This piece of the proposal  needs a bit more
> > thought...
> > 
> 
> Consider the case that the vNVDIMM device on normal storage and
> vNVDIMM device on real nvdimm hardware can both exist in VM, the
> flush interface should be able to associate with the SPA region
> respectively. That's why I'd like to integrate the flush interface
> into NFIT/ACPI by using a separate table. Is it possible to be a
> part of ACPI specification? :)

It would also be perfectly fine to have the
virtio PCI device indicate which vNVDIMM
range it flushes.

Since the guest OS needs to support that kind
of device anyway, does it really matter which
direction the device association points?

We can go with the "best" interface for what
could be a relatively slow flush (fsync on a
file on ssd/disk on the host), which requires
that the flushing task wait on completion
asynchronously.

If that kind of interface cannot be advertised
through NFIT/ACPI, wouldn't it be perfectly fine
to have only the virtio PCI device indicate which
vNVDIMM range it flushes?
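
For example (a purely illustrative config layout, not an existing
virtio definition):

struct virtio_pvflush_config {
        __le64 spa_start;   /* guest-physical base of the vNVDIMM range it flushes */
        __le64 spa_len;     /* length of that range in bytes */
};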

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-11-21 18:26                                                               ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-11-21 18:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kevin Wolf, Pankaj Gupta, Jan Kara, Xiao Guangrong, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal

On Tue, Nov 21, 2017 at 10:19 AM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 2017-11-03 at 14:21 +0800, Xiao Guangrong wrote:
>> On 11/03/2017 12:30 AM, Dan Williams wrote:
>> >
>> > Good point, I was assuming that the mmio flush interface would be
>> > discovered separately from the NFIT-defined memory range. Perhaps
>> > via
>> > PCI in the guest? This piece of the proposal  needs a bit more
>> > thought...
>> >
>>
>> Consider the case that the vNVDIMM device on normal storage and
>> vNVDIMM device on real nvdimm hardware can both exist in VM, the
>> flush interface should be able to associate with the SPA region
>> respectively. That's why I'd like to integrate the flush interface
>> into NFIT/ACPI by using a separate table. Is it possible to be a
>> part of ACPI specification? :)
>
> It would also be perfectly fine to have the
> virtio PCI device indicate which vNVDIMM
> range it flushes.
>
> Since the guest OS needs to support that kind
> of device anyway, does it really matter which
> direction the device association points?
>
> We can go with the "best" interface for what
> could be a relatively slow flush (fsync on a
> file on ssd/disk on the host), which requires
> that the flushing task wait on completion
> asynchronously.
>
> If that kind of interface cannot be advertised
> through NFIT/ACPI, wouldn't it be perfectly fine
> to have only the virtio PCI device indicate which
> vNVDIMM range it flushes?
>

Yes, we could do this with a custom PCI device; however, the NFIT is
frustratingly close to being able to define something like this. At
the very least we can start with a Linux-specific "SPA Range GUID"
that indicates "call this virtio flush interface on FUA / flush-cache
requests", as a stopgap until a standardized flush interface can be
defined.
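
Guest-side, the check could be as simple as (sketch with a placeholder
GUID value; the real value would still need to be allocated):

#include <linux/uuid.h>

static const guid_t virtio_flush_spa_guid =     /* placeholder value */
        GUID_INIT(0x00000000, 0x0000, 0x0000,
                  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00);

static bool spa_uses_virtio_flush(const guid_t *spa_range_type)
{
        return guid_equal(spa_range_type, &virtio_flush_spa_guid);
}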

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-11-21 18:26                                                               ` Dan Williams
@ 2017-11-21 18:35                                                                 ` Rik van Riel
  -1 siblings, 0 replies; 176+ messages in thread
From: Rik van Riel @ 2017-11-21 18:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Xiao Guangrong, Pankaj Gupta, Jan Kara, Stefan Hajnoczi,
	Stefan Hajnoczi, kvm-devel, Qemu Developers,
	linux-nvdimm@lists.01.org, ross zwisler, Paolo Bonzini,
	Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang, Ross Zwisler

On Tue, 2017-11-21 at 10:26 -0800, Dan Williams wrote:
> On Tue, Nov 21, 2017 at 10:19 AM, Rik van Riel <riel@redhat.com>
> wrote:
> > On Fri, 2017-11-03 at 14:21 +0800, Xiao Guangrong wrote:
> > > On 11/03/2017 12:30 AM, Dan Williams wrote:
> > > > 
> > > > Good point, I was assuming that the mmio flush interface would
> > > > be
> > > > discovered separately from the NFIT-defined memory range.
> > > > Perhaps
> > > > via
> > > > PCI in the guest? This piece of the proposal  needs a bit more
> > > > thought...
> > > > 
> > > 
> > > Consider the case that the vNVDIMM device on normal storage and
> > > vNVDIMM device on real nvdimm hardware can both exist in VM, the
> > > flush interface should be able to associate with the SPA region
> > > respectively. That's why I'd like to integrate the flush
> > > interface
> > > into NFIT/ACPI by using a separate table. Is it possible to be a
> > > part of ACPI specification? :)
> > 
> > It would also be perfectly fine to have the
> > virtio PCI device indicate which vNVDIMM
> > range it flushes.
> > 
> > Since the guest OS needs to support that kind
> > of device anyway, does it really matter which
> > direction the device association points?
> > 
> > We can go with the "best" interface for what
> > could be a relatively slow flush (fsync on a
> > file on ssd/disk on the host), which requires
> > that the flushing task wait on completion
> > asynchronously.
> > 
> > If that kind of interface cannot be advertised
> > through NFIT/ACPI, wouldn't it be perfectly fine
> > to have only the virtio PCI device indicate which
> > vNVDIMM range it flushes?
> > 
> 
> Yes, we could do this with a custom PCI device, however the NFIT is
> frustratingly close to being able to define something like this. At
> the very least we can start with a "SPA Range GUID" that is Linux
> specific to indicate "call this virtio flush interface on FUA / flush
> cache requests" as a stop gap until a standardized flush interface
> can
> be defined.

Ahh, is that a "look for a device with this GUID"
NFIT hint?

That would be enough to tip off OSes that do not
support that device that they found a vNVDIMM
device that they cannot safely flush, which could
help them report such errors to userspace...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-11-21 18:19                                                             ` [Qemu-devel] " Rik van Riel
@ 2017-11-23  4:05                                                               ` Xiao Guangrong
  -1 siblings, 0 replies; 176+ messages in thread
From: Xiao Guangrong @ 2017-11-23  4:05 UTC (permalink / raw)
  To: Rik van Riel, Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Jan Kara, kvm-devel, Stefan Hajnoczi,
	Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal



On 11/22/2017 02:19 AM, Rik van Riel wrote:

> We can go with the "best" interface for what
> could be a relatively slow flush (fsync on a
> file on ssd/disk on the host), which requires
> that the flushing task wait on completion
> asynchronously.

I'd like to clarify the "wait on completion asynchronously" interface
and KVM async page fault a bit more.

The current design of async page fault only works on RAM, not on MMIO;
i.e., if the page fault is caused by accessing the device memory of an
emulated device, it needs to go to userspace (QEMU), which emulates the
operation in the vCPU's thread.

As I mentioned before, the memory region used for the vNVDIMM flush
interface should be MMIO, and considering its support on other
hypervisors, we had better push this async mechanism into the flush
interface design itself rather than depend on KVM async page fault.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-11-23  4:05                                                               ` Xiao Guangrong
@ 2017-11-23 16:14                                                                 ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-11-23 16:14 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal

On Wed, Nov 22, 2017 at 8:05 PM, Xiao Guangrong
<xiaoguangrong.eric@gmail.com> wrote:
>
>
> On 11/22/2017 02:19 AM, Rik van Riel wrote:
>
>> We can go with the "best" interface for what
>> could be a relatively slow flush (fsync on a
>> file on ssd/disk on the host), which requires
>> that the flushing task wait on completion
>> asynchronously.
>
>
> I'd like to clarify the interface of "wait on completion
> asynchronously" and KVM async page fault a bit more.
>
> Current design of async-page-fault only works on RAM rather
> than MMIO, i.e, if the page fault caused by accessing the
> device memory of a emulated device, it needs to go to
> userspace (QEMU) which emulates the operation in vCPU's
> thread.
>
> As i mentioned before the memory region used for vNVDIMM
> flush interface should be MMIO and consider its support
> on other hypervisors, so we do better push this async
> mechanism into the flush interface design itself rather
> than depends on kvm async-page-fault.

I would expect this interface to be virtio-ring based to queue flush
requests asynchronously to the host.
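
Roughly, on the guest side (sketch only, assuming a dedicated flush
virtqueue; only the core virtio calls are existing API, the flush_req
structure is made up):

static int queue_flush_request(struct virtqueue *vq, struct flush_req *req)
{
        struct scatterlist sg;
        int err;

        sg_init_one(&sg, &req->status, sizeof(req->status));
        err = virtqueue_add_inbuf(vq, &sg, 1, req, GFP_KERNEL);
        if (err)
                return err;

        /* The host performs the fsync and signals completion through
         * the used ring; the guest task can sleep or poll meanwhile. */
        virtqueue_kick(vq);
        return 0;
}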

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-11-23 16:14                                                                 ` Dan Williams
  (?)
@ 2017-11-23 16:28                                                                   ` Paolo Bonzini
  -1 siblings, 0 replies; 176+ messages in thread
From: Paolo Bonzini @ 2017-11-23 16:28 UTC (permalink / raw)
  To: Dan Williams, Xiao Guangrong
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Nitesh Narayan Lal

On 23/11/2017 17:14, Dan Williams wrote:
> On Wed, Nov 22, 2017 at 8:05 PM, Xiao Guangrong
> <xiaoguangrong.eric@gmail.com> wrote:
>>
>>
>> On 11/22/2017 02:19 AM, Rik van Riel wrote:
>>
>>> We can go with the "best" interface for what
>>> could be a relatively slow flush (fsync on a
>>> file on ssd/disk on the host), which requires
>>> that the flushing task wait on completion
>>> asynchronously.
>>
>>
>> I'd like to clarify the interface of "wait on completion
>> asynchronously" and KVM async page fault a bit more.
>>
>> Current design of async-page-fault only works on RAM rather
>> than MMIO, i.e, if the page fault caused by accessing the
>> device memory of a emulated device, it needs to go to
>> userspace (QEMU) which emulates the operation in vCPU's
>> thread.
>>
>> As i mentioned before the memory region used for vNVDIMM
>> flush interface should be MMIO and consider its support
>> on other hypervisors, so we do better push this async
>> mechanism into the flush interface design itself rather
>> than depends on kvm async-page-fault.
> 
> I would expect this interface to be virtio-ring based to queue flush
> requests asynchronously to the host.

Could we reuse the virtio-blk device, only with a different device id?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-11-23 16:28                                                                   ` Paolo Bonzini
  (?)
@ 2017-11-24 12:40                                                                     ` Pankaj Gupta
  -1 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-11-24 12:40 UTC (permalink / raw)
  To: Paolo Bonzini, Dan Williams, Rik van Riel, Xiao Guangrong,
	Christoph Hellwig
  Cc: Kevin Wolf, Jan Kara, kvm-devel, linux-nvdimm@lists.01.org,
	Ross Zwisler, Qemu Developers, Stefan Hajnoczi, Stefan Hajnoczi,
	Nitesh Narayan Lal


Hello,

Thank you all for all the useful suggestions.
I want to summarize the discussions so far in the
thread. Please see below:

> >>
> >>> We can go with the "best" interface for what
> >>> could be a relatively slow flush (fsync on a
> >>> file on ssd/disk on the host), which requires
> >>> that the flushing task wait on completion
> >>> asynchronously.
> >>
> >>
> >> I'd like to clarify the interface of "wait on completion
> >> asynchronously" and KVM async page fault a bit more.
> >>
> >> Current design of async-page-fault only works on RAM rather
> >> than MMIO, i.e, if the page fault caused by accessing the
> >> device memory of a emulated device, it needs to go to
> >> userspace (QEMU) which emulates the operation in vCPU's
> >> thread.
> >>
> >> As i mentioned before the memory region used for vNVDIMM
> >> flush interface should be MMIO and consider its support
> >> on other hypervisors, so we do better push this async
> >> mechanism into the flush interface design itself rather
> >> than depends on kvm async-page-fault.
> > 
> > I would expect this interface to be virtio-ring based to queue flush
> > requests asynchronously to the host.
> 
> Could we reuse the virtio-blk device, only with a different device id?

As per previous discussions, there were suggestions on the two main parts of the project:

1] Expose vNVDIMM memory range to KVM guest.

   - Add a flag in the ACPI NFIT table for this new memory type. Do we need NVDIMM spec
     changes for this?

   - Guest should be able to add this memory to the system memory map. The name of the added
     memory in '/proc/iomem' should be different (shared memory?) from persistent memory, as it
     does not satisfy the exact definition of persistent memory (it requires an explicit flush).
     (See the sketch further below in this mail.)

   - Guest should not allow 'device-dax' and other fancy features which are not
     virtualization friendly.

2] Flushing interface to persist guest changes.

   - As per the suggestion by ChristophH (CCed), we explored options other than virtio, like
     MMIO etc. Most of these options are not use-case friendly, as we want to do fsync on a
     file on ssd/disk on the host and cannot make guest vCPUs wait for that time.

   - Adding a new driver (virtio-pmem) looks like repeated work and is not needed, so we can
     go with the existing pmem driver and add a flush path specific to this new memory type.

   - The suggestion by Paolo & Stefan (previously) to use virtio-blk makes sense if we just
     want a flush vehicle to send guest commands to the host and get a reply after asynchronous
     execution. There was a previous discussion [1] with Rik & Dan on this.

    [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html 

Is my understanding correct here?
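
For the '/proc/iomem' naming point above, a minimal sketch of registering the range under
its own name (the resource name and helper below are placeholders, not decided names):

#include <linux/ioport.h>

/* Sketch: make the fake-DAX range show up in /proc/iomem under a distinct
 * name instead of "Persistent Memory"; name and helper are placeholders. */
static struct resource fake_dax_res = {
        .name  = "Virtio Shared Memory (fake DAX)",
        .flags = IORESOURCE_MEM,
};

static int register_fake_dax_range(resource_size_t start, resource_size_t size)
{
        fake_dax_res.start = start;
        fake_dax_res.end   = start + size - 1;
        return insert_resource(&iomem_resource, &fake_dax_res);
}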

Thanks,
Pankaj  
 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2017-11-24 12:44                                                                       ` Paolo Bonzini
  0 siblings, 0 replies; 176+ messages in thread
From: Paolo Bonzini @ 2017-11-24 12:44 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams, Rik van Riel, Xiao Guangrong,
	Christoph Hellwig
  Cc: Kevin Wolf, Jan Kara, kvm-devel, linux-nvdimm@lists.01.org,
	Ross Zwisler, Qemu Developers, Stefan Hajnoczi, Stefan Hajnoczi,
	Nitesh Narayan Lal

On 24/11/2017 13:40, Pankaj Gupta wrote:
>    - Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense if just 
>      want a flush vehicle to send guest commands to host and get reply after asynchronous
>      execution. There was previous discussion [1] with Rik & Dan on this.
> 
>     [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html 

... in fact, the virtio-blk device _could_ actually accept regular I/O
too.  That would make it easier to boot from pmem.  Is there anything
similar in regular hardware?

Paolo

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
  2017-11-24 12:44                                                                       ` Paolo Bonzini
@ 2017-11-24 13:02                                                                         ` Pankaj Gupta
  -1 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2017-11-24 13:02 UTC (permalink / raw)
  To: Paolo Bonzini, Christoph Hellwig
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	linux-nvdimm@lists.01.org, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi, Stefan Hajnoczi, Nitesh Narayan Lal


> >    - Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense
> >    if just
> >      want a flush vehicle to send guest commands to host and get reply
> >      after asynchronous
> >      execution. There was previous discussion [1] with Rik & Dan on this.
> > 
> >     [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html
> 
> ... in fact, the virtio-blk device _could_ actually accept regular I/O
> too.  That would make it easier to boot from pmem.  Is there anything
> similar in regular hardware?

There is an existing block device associated (hard bound) with the pmem range.
Also, there is the comment by Christoph [1] about removing the block device with
DAX support. I am still not clear about this. Am I missing anything here?

[1] https://marc.info/?l=kvm&m=150822740332536&w=2

Pankaj

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
  2017-11-24 13:02                                                                         ` Pankaj Gupta
@ 2017-11-24 13:20                                                                           ` Paolo Bonzini
  -1 siblings, 0 replies; 176+ messages in thread
From: Paolo Bonzini @ 2017-11-24 13:20 UTC (permalink / raw)
  To: Pankaj Gupta, Christoph Hellwig
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	linux-nvdimm@lists.01.org, Ross Zwisler, Qemu Developers,
	Stefan Hajnoczi, Stefan Hajnoczi, Nitesh Narayan Lal

On 24/11/2017 14:02, Pankaj Gupta wrote:
> 
>>>    - Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense
>>>    if just
>>>      want a flush vehicle to send guest commands to host and get reply
>>>      after asynchronous
>>>      execution. There was previous discussion [1] with Rik & Dan on this.
>>>
>>>     [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html
>>
>> ... in fact, the virtio-blk device _could_ actually accept regular I/O
>> too.  That would make it easier to boot from pmem.  Is there anything
>> similar in regular hardware?
> 
> there is existing block device associated(hard bind) with the pmem range.
> Also, comment by Christoph [1], about removing block device with DAX support.
> Still I am not clear about this. Am I missing anything here?

The I/O part of the blk device would only be used by the firmware.  In
Linux, the different device id would bind the device to a different
driver that would only be used for flushing.

But maybe this idea makes no sense. :)
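
For illustration, in a Linux guest the binding really is just the virtio device id;
a flush-only driver skeleton could look like this (the id value and names are
placeholders, no id has been allocated):

#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_ids.h>

#define VIRTIO_ID_PMEM_SKETCH   25      /* placeholder device id */

static const struct virtio_device_id vpmem_id_table[] = {
        { VIRTIO_ID_PMEM_SKETCH, VIRTIO_DEV_ANY_ID },
        { 0 },
};

static int vpmem_probe(struct virtio_device *vdev)
{
        /* set up the flush virtqueue here; no block I/O path in the guest */
        return 0;
}

static void vpmem_remove(struct virtio_device *vdev)
{
}

static struct virtio_driver vpmem_driver = {
        .driver.name    = "virtio_pmem_flush",
        .id_table       = vpmem_id_table,
        .probe          = vpmem_probe,
        .remove         = vpmem_remove,
};
module_virtio_driver(vpmem_driver);
MODULE_DEVICE_TABLE(virtio, vpmem_id_table);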

Paolo

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-11-24 12:40                                                                     ` Pankaj Gupta
  (?)
@ 2017-11-28 18:03                                                                       ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2017-11-28 18:03 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal

On Fri, Nov 24, 2017 at 4:40 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
[..]
> 1] Expose vNVDIMM memory range to KVM guest.
>
>    - Add flag in ACPI NFIT table for this new memory type. Do we need NVDIMM spec
>      changes for this?

Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
System Physical Address (SPA) Range Structure" in the ACPI 6.2A
specification. Since it is a GUID we could define a Linux specific
type for this case, but spec changes would allow non-Linux hypervisors
to advertise a standard interface to guests.
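
For illustration, matching such a range in the guest comes down to comparing the
16-byte GUID in each SPA structure; a sketch with a made-up GUID value (a real one
would come from a spec change or a Linux-specific definition):

#include <linux/acpi.h>
#include <linux/uuid.h>

/* Placeholder GUID only, chosen here purely for illustration. */
static const guid_t fake_dax_spa_guid =
        GUID_INIT(0x12345678, 0x9abc, 0xdef0,
                  0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0);

/* Does this NFIT SPA range describe the new "needs explicit flush" type? */
static bool is_fake_dax_spa(struct acpi_nfit_system_address *spa)
{
        return guid_equal((guid_t *)spa->range_guid, &fake_dax_spa_guid);
}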

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-13  6:23                                                                         ` Pankaj Gupta
  0 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2018-01-13  6:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal


Hello Dan,

> Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
> System Physical Address (SPA) Range Structure" in the ACPI 6.2A
> specification. Since it is a GUID we could define a Linux specific
> type for this case, but spec changes would allow non-Linux hypervisors
> to advertise a standard interface to guests.
> 

I have added a new SPA range with a GUID for this memory type, and I could add
this new memory type to the system memory map. I need help with the namespace
handling for this new type, as mentioned in the [1] discussion:

- Create a new namespace for this new memory type
- Teach libnvdimm how to handle this new namespace

I have some queries on this:

1] How would namespace handling of this new memory type work?

2] There are existing namespace types:
  ND_DEVICE_NAMESPACE_IO, ND_DEVICE_NAMESPACE_PMEM, ND_DEVICE_NAMESPACE_BLK

  How will libnvdimm handle this new namespace type in conjunction with the existing
  memory types, regions & namespaces?

3] For sending guest-to-host flush commands, we still have to think about some
   async way?
    
[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08404.html 

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-17 16:17                                                                           ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-17 16:17 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal

On Fri, Jan 12, 2018 at 10:23 PM, Pankaj Gupta <pagupta@redhat.com> wrote:
>
> Hello Dan,
>
>> Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
>> System Physical Address (SPA) Range Structure" in the ACPI 6.2A
>> specification. Since it is a GUID we could define a Linux specific
>> type for this case, but spec changes would allow non-Linux hypervisors
>> to advertise a standard interface to guests.
>>
>
> I have added new SPA with a GUUID for this memory type and I could add
> this new memory type in System memory map. I need help with the namespace
> handling for this new type As mentioned in [1] discussion:
>
> - Create a new namespace for this new memory type
> - Teach libnvdimm how to handle this new namespace
>
> I have some queries on this:
>
> 1] How namespace handling of this new memory type would be?

This would be a namespace that creates a pmem device, but does not allow DAX.

>
> 2] There are existing namespace types:
>   ND_DEVICE_NAMESPACE_IO, ND_DEVICE_NAMESPACE_PMEM, ND_DEVICE_NAMESPACE_BLK
>
>   How libnvdimm will handle this new name-space type in conjuction with existing
>   memory type, region & namespaces?

The type will be either ND_DEVICE_NAMESPACE_IO or
ND_DEVICE_NAMESPACE_PMEM depending on whether you configure KVM to
provide a virtual NVDIMM and label space. In other words the only
difference between this range and a typical persistent memory range is
that we will have a flag to disable DAX operation.

See the usage of nvdimm_has_cache() in pmem_attach_disk() as an
example of how to pass attributes about the "region" to the pmem
driver.

>
> 3] For sending guest to host flush commands we still have to think about some
>    async way?

I thought we discussed this being a paravirtualized virtio command ring?
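
Following the nvdimm_has_cache() pattern above, a sketch of how pmem_attach_disk()
could consume such a region attribute (region_requires_flush() is a hypothetical
helper, not an existing API):

/* modeled on how nvdimm_has_cache() is consumed by the pmem driver today */
static void pmem_setup_dax(struct pmem_device *pmem, struct gendisk *disk,
                           struct request_queue *q, struct nd_region *nd_region)
{
        if (region_requires_flush(nd_region)) {
                /* fake-DAX range: no dax_device and no QUEUE_FLAG_DAX */
                pmem->dax_dev = NULL;
                return;
        }

        queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
        pmem->dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
        if (pmem->dax_dev)
                dax_write_cache(pmem->dax_dev, nvdimm_has_cache(nd_region));
}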

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2018-01-17 16:17                                                                           ` Dan Williams
  (?)
@ 2018-01-17 17:31                                                                             ` Pankaj Gupta
  -1 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2018-01-17 17:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	Rik van Riel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal


Hi Dan,

Thanks for your reply.

> 
> On Fri, Jan 12, 2018 at 10:23 PM, Pankaj Gupta <pagupta@redhat.com> wrote:
> >
> > Hello Dan,
> >
> >> Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
> >> System Physical Address (SPA) Range Structure" in the ACPI 6.2A
> >> specification. Since it is a GUID we could define a Linux specific
> >> type for this case, but spec changes would allow non-Linux hypervisors
> >> to advertise a standard interface to guests.
> >>
> >
> > I have added new SPA with a GUUID for this memory type and I could add
> > this new memory type in System memory map. I need help with the namespace
> > handling for this new type As mentioned in [1] discussion:
> >
> > - Create a new namespace for this new memory type
> > - Teach libnvdimm how to handle this new namespace
> >
> > I have some queries on this:
> >
> > 1] How namespace handling of this new memory type would be?
> 
> This would be a namespace that creates a pmem device, but does not allow DAX.

o.k

> 
> >
> > 2] There are existing namespace types:
> >   ND_DEVICE_NAMESPACE_IO, ND_DEVICE_NAMESPACE_PMEM, ND_DEVICE_NAMESPACE_BLK
> >
> >   How libnvdimm will handle this new name-space type in conjuction with
> >   existing
> >   memory type, region & namespaces?
> 
> The type will be either ND_DEVICE_NAMESPACE_IO or
> ND_DEVICE_NAMESPACE_PMEM depending on whether you configure KVM to
> provide a virtual NVDIMM and label space. In other words the only
> difference between this range and a typical persistent memory range is
> that we will have a flag to disable DAX operation.

o.k. In short, we have to disable the 'QUEUE_FLAG_DAX' flag for this
namespace & region, and not execute the code below for this new type?

pmem_attach_disk()
...
...
        /* allocate the dax_device that backs DAX mappings for this disk */
        dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
        if (!dax_dev) {
                put_disk(disk);
                return -ENOMEM;
        }
        /* wbc comes from nvdimm_has_cache(nd_region); record whether writes
         * need an explicit cache flush */
        dax_write_cache(dax_dev, wbc);
        pmem->dax_dev = dax_dev;

> 
> See the usage of nvdimm_has_cache() in pmem_attach_disk() as an
> example of how to pass attributes about the "region" to the the pmem
> driver.

sure.

> 
> >
> > 3] For sending guest to host flush commands we still have to think about
> > some
> >    async way?
> 
> I thought we discussed this being a paravirtualized virtio command ring?

o.k. will implement this. 

> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2017-11-24 12:40                                                                     ` Pankaj Gupta
  (?)
@ 2018-01-18 16:53                                                                       ` David Hildenbrand
  -1 siblings, 0 replies; 176+ messages in thread
From: David Hildenbrand @ 2018-01-18 16:53 UTC (permalink / raw)
  To: Pankaj Gupta, Paolo Bonzini, Dan Williams, Rik van Riel,
	Xiao Guangrong, Christoph Hellwig
  Cc: Kevin Wolf, Jan Kara, kvm-devel, linux-nvdimm@lists.01.org,
	Ross Zwisler, Qemu Developers, Stefan Hajnoczi, Stefan Hajnoczi,
	Nitesh Narayan Lal

On 24.11.2017 13:40, Pankaj Gupta wrote:
> 
> Hello,
> 
> Thank you all for all the useful suggestions.
> I want to summarize the discussions so far in the
> thread. Please see below:
> 
>>>>
>>>>> We can go with the "best" interface for what
>>>>> could be a relatively slow flush (fsync on a
>>>>> file on ssd/disk on the host), which requires
>>>>> that the flushing task wait on completion
>>>>> asynchronously.
>>>>
>>>>
>>>> I'd like to clarify the interface of "wait on completion
>>>> asynchronously" and KVM async page fault a bit more.
>>>>
>>>> Current design of async-page-fault only works on RAM rather
>>>> than MMIO, i.e, if the page fault caused by accessing the
>>>> device memory of a emulated device, it needs to go to
>>>> userspace (QEMU) which emulates the operation in vCPU's
>>>> thread.
>>>>
>>>> As i mentioned before the memory region used for vNVDIMM
>>>> flush interface should be MMIO and consider its support
>>>> on other hypervisors, so we do better push this async
>>>> mechanism into the flush interface design itself rather
>>>> than depends on kvm async-page-fault.
>>>
>>> I would expect this interface to be virtio-ring based to queue flush
>>> requests asynchronously to the host.
>>
>> Could we reuse the virtio-blk device, only with a different device id?
> 
> As per previous discussions, there were suggestions on main two parts of the project:
> 
> 1] Expose vNVDIMM memory range to KVM guest.
> 
>    - Add flag in ACPI NFIT table for this new memory type. Do we need NVDIMM spec 
>      changes for this? 
> 
>    - Guest should be able to add this memory in system memory map. Name of the added memory in
>      '/proc/iomem' should be different(shared memory?) than persistent memory as it 
>      does not satisfy exact definition of persistent memory (requires an explicit flush).
> 
>    - Guest should not allow 'device-dax' and other fancy features which are not 
>      virtualization friendly.
> 
> 2] Flushing interface to persist guest changes.
> 
>    - As per suggestion by ChristophH (CCed), we explored options other then virtio like MMIO etc.
>      Looks like most of these options are not use-case friendly. As we want to do fsync on a
>      file on ssd/disk on the host and we cannot make guest vCPU's wait for that time. 
> 
>    - Though adding new driver(virtio-pmem) looks like repeated work and not needed so we can 
>      go with the existing pmem driver and add flush specific to this new memory type.

I'd like to emphasize again, that I would prefer a virtio-pmem only
solution.

There are architectures out there (e.g. s390x) that don't support
NVDIMMs - there is no HW interface to expose any such stuff.

However, with virtio-pmem, we could make it work also on architectures
not having ACPI and friends.
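
A pure virtio-pmem device could, for example, advertise its memory range through
virtio config space instead of an NFIT; a sketch, where the config layout is an
assumption and not a specified interface:

#include <linux/virtio.h>
#include <linux/virtio_config.h>

/* Assumed config layout for a hypothetical virtio-pmem device. */
struct vpmem_config {
        __u64 start;    /* guest-physical base of the shared range */
        __u64 size;     /* length of the range in bytes */
};

static void vpmem_read_range(struct virtio_device *vdev, u64 *start, u64 *size)
{
        /* virtio_cread() hides config-space access and endianness, so no
         * ACPI/NFIT parsing is needed on e.g. s390x */
        virtio_cread(vdev, struct vpmem_config, start, start);
        virtio_cread(vdev, struct vpmem_config, size, size);
}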

> 
>    - Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense if just 
>      want a flush vehicle to send guest commands to host and get reply after asynchronous
>      execution. There was previous discussion [1] with Rik & Dan on this.
> 
>     [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html 
> 
> Is my understanding correct here?
> 
> Thanks,
> Pankaj  
>  
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 16:53                                                                       ` David Hildenbrand
  0 siblings, 0 replies; 176+ messages in thread
From: David Hildenbrand @ 2018-01-18 16:53 UTC (permalink / raw)
  To: Pankaj Gupta, Paolo Bonzini, Dan Williams, Rik van Riel,
	Xiao Guangrong, Christoph Hellwig
  Cc: Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi, kvm-devel,
	Qemu Developers, linux-nvdimm@lists.01.org, ross zwisler,
	Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang, Ross Zwisler

On 24.11.2017 13:40, Pankaj Gupta wrote:
> 
> Hello,
> 
> Thank you all for all the useful suggestions.
> I want to summarize the discussions so far in the
> thread. Please see below:
> 
>>>>
>>>>> We can go with the "best" interface for what
>>>>> could be a relatively slow flush (fsync on a
>>>>> file on ssd/disk on the host), which requires
>>>>> that the flushing task wait on completion
>>>>> asynchronously.
>>>>
>>>>
>>>> I'd like to clarify the interface of "wait on completion
>>>> asynchronously" and KVM async page fault a bit more.
>>>>
>>>> Current design of async-page-fault only works on RAM rather
>>>> than MMIO, i.e, if the page fault caused by accessing the
>>>> device memory of a emulated device, it needs to go to
>>>> userspace (QEMU) which emulates the operation in vCPU's
>>>> thread.
>>>>
>>>> As i mentioned before the memory region used for vNVDIMM
>>>> flush interface should be MMIO and consider its support
>>>> on other hypervisors, so we do better push this async
>>>> mechanism into the flush interface design itself rather
>>>> than depends on kvm async-page-fault.
>>>
>>> I would expect this interface to be virtio-ring based to queue flush
>>> requests asynchronously to the host.
>>
>> Could we reuse the virtio-blk device, only with a different device id?
> 
> As per previous discussions, there were suggestions on main two parts of the project:
> 
> 1] Expose vNVDIMM memory range to KVM guest.
> 
>    - Add flag in ACPI NFIT table for this new memory type. Do we need NVDIMM spec 
>      changes for this? 
> 
>    - Guest should be able to add this memory in system memory map. Name of the added memory in
>      '/proc/iomem' should be different(shared memory?) than persistent memory as it 
>      does not satisfy exact definition of persistent memory (requires an explicit flush).
> 
>    - Guest should not allow 'device-dax' and other fancy features which are not 
>      virtualization friendly.
> 
> 2] Flushing interface to persist guest changes.
> 
>    - As per suggestion by ChristophH (CCed), we explored options other then virtio like MMIO etc.
>      Looks like most of these options are not use-case friendly. As we want to do fsync on a
>      file on ssd/disk on the host and we cannot make guest vCPU's wait for that time. 
> 
>    - Though adding new driver(virtio-pmem) looks like repeated work and not needed so we can 
>      go with the existing pmem driver and add flush specific to this new memory type.

I'd like to emphasize again, that I would prefer a virtio-pmem only
solution.

There are architectures out there (e.g. s390x) that don't support
NVDIMMs - there is no HW interface to expose any such stuff.

However, with virtio-pmem, we could make it work also on architectures
not having ACPI and friends.

> 
>    - Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense if just 
>      want a flush vehicle to send guest commands to host and get reply after asynchronous
>      execution. There was previous discussion [1] with Rik & Dan on this.
> 
>     [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html 
> 
> Is my understanding correct here?
> 
> Thanks,
> Pankaj  
>  
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 16:53                                                                       ` David Hildenbrand
  0 siblings, 0 replies; 176+ messages in thread
From: David Hildenbrand @ 2018-01-18 16:53 UTC (permalink / raw)
  To: Pankaj Gupta, Paolo Bonzini, Dan Williams, Rik van Riel,
	Xiao Guangrong, Christoph Hellwig
  Cc: Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi, kvm-devel,
	Qemu Developers, linux-nvdimm@lists.01.org, ross zwisler,
	Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang, Ross Zwisler

On 24.11.2017 13:40, Pankaj Gupta wrote:
> 
> Hello,
> 
> Thank you all for all the useful suggestions.
> I want to summarize the discussions so far in the
> thread. Please see below:
> 
>>>>
>>>>> We can go with the "best" interface for what
>>>>> could be a relatively slow flush (fsync on a
>>>>> file on ssd/disk on the host), which requires
>>>>> that the flushing task wait on completion
>>>>> asynchronously.
>>>>
>>>>
>>>> I'd like to clarify the interface of "wait on completion
>>>> asynchronously" and KVM async page fault a bit more.
>>>>
>>>> Current design of async-page-fault only works on RAM rather
>>>> than MMIO, i.e, if the page fault caused by accessing the
>>>> device memory of a emulated device, it needs to go to
>>>> userspace (QEMU) which emulates the operation in vCPU's
>>>> thread.
>>>>
>>>> As i mentioned before the memory region used for vNVDIMM
>>>> flush interface should be MMIO and consider its support
>>>> on other hypervisors, so we do better push this async
>>>> mechanism into the flush interface design itself rather
>>>> than depends on kvm async-page-fault.
>>>
>>> I would expect this interface to be virtio-ring based to queue flush
>>> requests asynchronously to the host.
>>
>> Could we reuse the virtio-blk device, only with a different device id?
> 
> As per previous discussions, there were suggestions on main two parts of the project:
> 
> 1] Expose vNVDIMM memory range to KVM guest.
> 
>    - Add flag in ACPI NFIT table for this new memory type. Do we need NVDIMM spec 
>      changes for this? 
> 
>    - Guest should be able to add this memory in system memory map. Name of the added memory in
>      '/proc/iomem' should be different(shared memory?) than persistent memory as it 
>      does not satisfy exact definition of persistent memory (requires an explicit flush).
> 
>    - Guest should not allow 'device-dax' and other fancy features which are not 
>      virtualization friendly.
> 
> 2] Flushing interface to persist guest changes.
> 
>    - As per suggestion by ChristophH (CCed), we explored options other then virtio like MMIO etc.
>      Looks like most of these options are not use-case friendly. As we want to do fsync on a
>      file on ssd/disk on the host and we cannot make guest vCPU's wait for that time. 
> 
>    - Though adding new driver(virtio-pmem) looks like repeated work and not needed so we can 
>      go with the existing pmem driver and add flush specific to this new memory type.

I'd like to emphasize again, that I would prefer a virtio-pmem only
solution.

There are architectures out there (e.g. s390x) that don't support
NVDIMMs - there is no HW interface to expose any such stuff.

However, with virtio-pmem, we could make it work also on architectures
not having ACPI and friends.

> 
>    - Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense if just 
>      want a flush vehicle to send guest commands to host and get reply after asynchronous
>      execution. There was previous discussion [1] with Rik & Dan on this.
> 
>     [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html 
> 
> Is my understanding correct here?
> 
> Thanks,
> Pankaj  
>  
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 17:38                                                                         ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 17:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, Xiao Guangrong,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal

On Thu, Jan 18, 2018 at 8:53 AM, David Hildenbrand <david@redhat.com> wrote:
> On 24.11.2017 13:40, Pankaj Gupta wrote:
>>
>> Hello,
>>
>> Thank you all for all the useful suggestions.
>> I want to summarize the discussions so far in the
>> thread. Please see below:
>>
>>>>>
>>>>>> We can go with the "best" interface for what
>>>>>> could be a relatively slow flush (fsync on a
>>>>>> file on ssd/disk on the host), which requires
>>>>>> that the flushing task wait on completion
>>>>>> asynchronously.
>>>>>
>>>>>
>>>>> I'd like to clarify the interface of "wait on completion
>>>>> asynchronously" and KVM async page fault a bit more.
>>>>>
>>>>> Current design of async-page-fault only works on RAM rather
>>>>> than MMIO, i.e, if the page fault caused by accessing the
>>>>> device memory of a emulated device, it needs to go to
>>>>> userspace (QEMU) which emulates the operation in vCPU's
>>>>> thread.
>>>>>
>>>>> As i mentioned before the memory region used for vNVDIMM
>>>>> flush interface should be MMIO and consider its support
>>>>> on other hypervisors, so we do better push this async
>>>>> mechanism into the flush interface design itself rather
>>>>> than depends on kvm async-page-fault.
>>>>
>>>> I would expect this interface to be virtio-ring based to queue flush
>>>> requests asynchronously to the host.
>>>
>>> Could we reuse the virtio-blk device, only with a different device id?
>>
>> As per previous discussions, there were suggestions on main two parts of the project:
>>
>> 1] Expose vNVDIMM memory range to KVM guest.
>>
>>    - Add flag in ACPI NFIT table for this new memory type. Do we need NVDIMM spec
>>      changes for this?
>>
>>    - Guest should be able to add this memory in system memory map. Name of the added memory in
>>      '/proc/iomem' should be different(shared memory?) than persistent memory as it
>>      does not satisfy exact definition of persistent memory (requires an explicit flush).
>>
>>    - Guest should not allow 'device-dax' and other fancy features which are not
>>      virtualization friendly.
>>
>> 2] Flushing interface to persist guest changes.
>>
>>    - As per suggestion by ChristophH (CCed), we explored options other then virtio like MMIO etc.
>>      Looks like most of these options are not use-case friendly. As we want to do fsync on a
>>      file on ssd/disk on the host and we cannot make guest vCPU's wait for that time.
>>
>>    - Though adding new driver(virtio-pmem) looks like repeated work and not needed so we can
>>      go with the existing pmem driver and add flush specific to this new memory type.
>
> I'd like to emphasize again, that I would prefer a virtio-pmem only
> solution.
>
> There are architectures out there (e.g. s390x) that don't support
> NVDIMMs - there is no HW interface to expose any such stuff.
>
> However, with virtio-pmem, we could make it work also on architectures
> not having ACPI and friends.

ACPI and virtio-only can share the same pmem driver. There are two
parts to this: region discovery and setting up the pmem driver. For
discovery you can either have an NFIT-bus-defined range, or have a new
virtio-pmem bus define it. The pmem driver itself is agnostic to how
the range is discovered.

In other words, pmem consumes 'regions' from libnvdimm, and a bus
provider like nfit, e820, or a new virtio mechanism produces 'regions'.
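
As a rough sketch of what such a virtio provider could look like in the
guest (assuming the current libnvdimm API; error handling, teardown and
all virtio-specific names are made up / omitted):

    #include <linux/libnvdimm.h>
    #include <linux/module.h>
    #include <linux/numa.h>

    /* Hypothetical virtio-pmem "bus provider": discover the address
     * range (e.g. from virtio config space), then hand it to libnvdimm
     * as a region, much like the nfit and e820 providers do. The
     * generic pmem driver then attaches to that region.
     */
    static int virtio_pmem_register_region(struct device *dev,
                                           struct resource *res)
    {
            static struct nvdimm_bus_descriptor nd_desc = {
                    .provider_name = "virtio-pmem",
                    .module = THIS_MODULE,
            };
            struct nd_region_desc ndr_desc = {
                    .res = res,            /* range discovered via virtio */
                    .numa_node = NUMA_NO_NODE,
            };
            struct nvdimm_bus *bus = nvdimm_bus_register(dev, &nd_desc);

            if (!bus)
                    return -ENXIO;
            if (!nvdimm_pmem_region_create(bus, &ndr_desc))
                    return -ENXIO;
            return 0;
    }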

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 17:38                                                                         ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 17:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, Xiao Guangrong,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Paolo Bonzini, Nitesh Narayan Lal

On Thu, Jan 18, 2018 at 8:53 AM, David Hildenbrand <david-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 24.11.2017 13:40, Pankaj Gupta wrote:
>>
>> Hello,
>>
>> Thank you all for all the useful suggestions.
>> I want to summarize the discussions so far in the
>> thread. Please see below:
>>
>>>>>
>>>>>> We can go with the "best" interface for what
>>>>>> could be a relatively slow flush (fsync on a
>>>>>> file on ssd/disk on the host), which requires
>>>>>> that the flushing task wait on completion
>>>>>> asynchronously.
>>>>>
>>>>>
>>>>> I'd like to clarify the interface of "wait on completion
>>>>> asynchronously" and KVM async page fault a bit more.
>>>>>
>>>>> Current design of async-page-fault only works on RAM rather
>>>>> than MMIO, i.e, if the page fault caused by accessing the
>>>>> device memory of a emulated device, it needs to go to
>>>>> userspace (QEMU) which emulates the operation in vCPU's
>>>>> thread.
>>>>>
>>>>> As i mentioned before the memory region used for vNVDIMM
>>>>> flush interface should be MMIO and consider its support
>>>>> on other hypervisors, so we do better push this async
>>>>> mechanism into the flush interface design itself rather
>>>>> than depends on kvm async-page-fault.
>>>>
>>>> I would expect this interface to be virtio-ring based to queue flush
>>>> requests asynchronously to the host.
>>>
>>> Could we reuse the virtio-blk device, only with a different device id?
>>
>> As per previous discussions, there were suggestions on main two parts of the project:
>>
>> 1] Expose vNVDIMM memory range to KVM guest.
>>
>>    - Add flag in ACPI NFIT table for this new memory type. Do we need NVDIMM spec
>>      changes for this?
>>
>>    - Guest should be able to add this memory in system memory map. Name of the added memory in
>>      '/proc/iomem' should be different(shared memory?) than persistent memory as it
>>      does not satisfy exact definition of persistent memory (requires an explicit flush).
>>
>>    - Guest should not allow 'device-dax' and other fancy features which are not
>>      virtualization friendly.
>>
>> 2] Flushing interface to persist guest changes.
>>
>>    - As per suggestion by ChristophH (CCed), we explored options other then virtio like MMIO etc.
>>      Looks like most of these options are not use-case friendly. As we want to do fsync on a
>>      file on ssd/disk on the host and we cannot make guest vCPU's wait for that time.
>>
>>    - Though adding new driver(virtio-pmem) looks like repeated work and not needed so we can
>>      go with the existing pmem driver and add flush specific to this new memory type.
>
> I'd like to emphasize again, that I would prefer a virtio-pmem only
> solution.
>
> There are architectures out there (e.g. s390x) that don't support
> NVDIMMs - there is no HW interface to expose any such stuff.
>
> However, with virtio-pmem, we could make it work also on architectures
> not having ACPI and friends.

ACPI and virtio-only can share the same pmem driver. There are two
parts to this, region discovery and setting up the pmem driver. For
discovery you can either have an NFIT-bus defined range, or a new
virtio-pmem-bus define it. As far as the pmem driver itself it's
agnostic to how the range is discovered.

In other words, pmem consumes 'regions' from libnvdimm and the a bus
provider like nfit, e820, or a new virtio-mechansim produce 'regions'.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 17:38                                                                         ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 17:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Paolo Bonzini, Rik van Riel, Xiao Guangrong,
	Christoph Hellwig, Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi,
	kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	ross zwisler, Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang,
	Ross Zwisler

On Thu, Jan 18, 2018 at 8:53 AM, David Hildenbrand <david@redhat.com> wrote:
> On 24.11.2017 13:40, Pankaj Gupta wrote:
>>
>> Hello,
>>
>> Thank you all for all the useful suggestions.
>> I want to summarize the discussions so far in the
>> thread. Please see below:
>>
>>>>>
>>>>>> We can go with the "best" interface for what
>>>>>> could be a relatively slow flush (fsync on a
>>>>>> file on ssd/disk on the host), which requires
>>>>>> that the flushing task wait on completion
>>>>>> asynchronously.
>>>>>
>>>>>
>>>>> I'd like to clarify the interface of "wait on completion
>>>>> asynchronously" and KVM async page fault a bit more.
>>>>>
>>>>> Current design of async-page-fault only works on RAM rather
>>>>> than MMIO, i.e, if the page fault caused by accessing the
>>>>> device memory of a emulated device, it needs to go to
>>>>> userspace (QEMU) which emulates the operation in vCPU's
>>>>> thread.
>>>>>
>>>>> As i mentioned before the memory region used for vNVDIMM
>>>>> flush interface should be MMIO and consider its support
>>>>> on other hypervisors, so we do better push this async
>>>>> mechanism into the flush interface design itself rather
>>>>> than depends on kvm async-page-fault.
>>>>
>>>> I would expect this interface to be virtio-ring based to queue flush
>>>> requests asynchronously to the host.
>>>
>>> Could we reuse the virtio-blk device, only with a different device id?
>>
>> As per previous discussions, there were suggestions on main two parts of the project:
>>
>> 1] Expose vNVDIMM memory range to KVM guest.
>>
>>    - Add flag in ACPI NFIT table for this new memory type. Do we need NVDIMM spec
>>      changes for this?
>>
>>    - Guest should be able to add this memory in system memory map. Name of the added memory in
>>      '/proc/iomem' should be different(shared memory?) than persistent memory as it
>>      does not satisfy exact definition of persistent memory (requires an explicit flush).
>>
>>    - Guest should not allow 'device-dax' and other fancy features which are not
>>      virtualization friendly.
>>
>> 2] Flushing interface to persist guest changes.
>>
>>    - As per suggestion by ChristophH (CCed), we explored options other then virtio like MMIO etc.
>>      Looks like most of these options are not use-case friendly. As we want to do fsync on a
>>      file on ssd/disk on the host and we cannot make guest vCPU's wait for that time.
>>
>>    - Though adding new driver(virtio-pmem) looks like repeated work and not needed so we can
>>      go with the existing pmem driver and add flush specific to this new memory type.
>
> I'd like to emphasize again, that I would prefer a virtio-pmem only
> solution.
>
> There are architectures out there (e.g. s390x) that don't support
> NVDIMMs - there is no HW interface to expose any such stuff.
>
> However, with virtio-pmem, we could make it work also on architectures
> not having ACPI and friends.

ACPI and virtio-only can share the same pmem driver. There are two
parts to this, region discovery and setting up the pmem driver. For
discovery you can either have an NFIT-bus defined range, or a new
virtio-pmem-bus define it. As far as the pmem driver itself it's
agnostic to how the range is discovered.

In other words, pmem consumes 'regions' from libnvdimm and the a bus
provider like nfit, e820, or a new virtio-mechansim produce 'regions'.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2018-01-18 17:38                                                                         ` Dan Williams
@ 2018-01-18 17:48                                                                           ` David Hildenbrand
  -1 siblings, 0 replies; 176+ messages in thread
From: David Hildenbrand @ 2018-01-18 17:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, Xiao Guangrong,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal

>> I'd like to emphasize again, that I would prefer a virtio-pmem only
>> solution.
>>
>> There are architectures out there (e.g. s390x) that don't support
>> NVDIMMs - there is no HW interface to expose any such stuff.
>>
>> However, with virtio-pmem, we could make it work also on architectures
>> not having ACPI and friends.
> 
> ACPI and virtio-only can share the same pmem driver. There are two
> parts to this, region discovery and setting up the pmem driver. For
> discovery you can either have an NFIT-bus defined range, or a new
> virtio-pmem-bus define it. As far as the pmem driver itself it's
> agnostic to how the range is discovered.
> 

And in addition to discovery + setup, we need the flush via virtio.

> In other words, pmem consumes 'regions' from libnvdimm and the a bus
> provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
> 

That sounds good to me. I would like to see how the ACPI discovery
variant connects to a virtio ring.

The natural way for me would be:

A virtio-X device supplies a memory region ("discovery") and also the
flush interface for this device. So one virtio-X device corresponds to
one pmem device. No ACPI needs to be involved (not even on architectures
that have ACPI).
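
For the flush part, a minimal sketch of what the guest side of that
interface might do per request (hypothetical request/response layout;
the host side would do the actual fsync and then complete the buffer):

    #include <linux/completion.h>
    #include <linux/scatterlist.h>
    #include <linux/virtio.h>

    struct virtio_pmem_req {
            __le32 type;            /* hypothetical VIRTIO_PMEM_REQ_FLUSH */
    };

    struct virtio_pmem_flush {
            struct virtio_pmem_req req;
            u8 resp;                /* 0 on success, non-zero on host error */
            struct completion done;
    };

    /* Queue one flush request and sleep until the host has synced the
     * backing file; called from the guest's fsync/REQ_FLUSH path.
     * 'f' must live in DMA-able (heap) memory, not on the stack.
     */
    static int virtio_pmem_flush(struct virtqueue *vq,
                                 struct virtio_pmem_flush *f)
    {
            struct scatterlist out, in;
            struct scatterlist *sgs[2] = { &out, &in };
            int err;

            init_completion(&f->done);
            sg_init_one(&out, &f->req, sizeof(f->req));
            sg_init_one(&in, &f->resp, sizeof(f->resp));

            err = virtqueue_add_sgs(vq, sgs, 1, 1, f, GFP_ATOMIC);
            if (err)
                    return err;
            virtqueue_kick(vq);

            /* The vq callback calls complete(&f->done) on host reply. */
            wait_for_completion(&f->done);
            return f->resp ? -EIO : 0;
    }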

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 17:48                                                                           ` David Hildenbrand
  0 siblings, 0 replies; 176+ messages in thread
From: David Hildenbrand @ 2018-01-18 17:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Pankaj Gupta, Paolo Bonzini, Rik van Riel, Xiao Guangrong,
	Christoph Hellwig, Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi,
	kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	ross zwisler, Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang,
	Ross Zwisler

>> I'd like to emphasize again, that I would prefer a virtio-pmem only
>> solution.
>>
>> There are architectures out there (e.g. s390x) that don't support
>> NVDIMMs - there is no HW interface to expose any such stuff.
>>
>> However, with virtio-pmem, we could make it work also on architectures
>> not having ACPI and friends.
> 
> ACPI and virtio-only can share the same pmem driver. There are two
> parts to this, region discovery and setting up the pmem driver. For
> discovery you can either have an NFIT-bus defined range, or a new
> virtio-pmem-bus define it. As far as the pmem driver itself it's
> agnostic to how the range is discovered.
> 

And in addition to discovery + setup, we need the flush via virtio.

> In other words, pmem consumes 'regions' from libnvdimm and the a bus
> provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
> 

That sounds good to me. I would like to see how the ACPI discovery
variant connects to a virtio ring.

The natural way for me would be:

A virtio-X device supplies a memory region ("discovery") and also the
interface for flushes for this device. So one virtio-X corresponds to
one pmem device. No ACPI to be involved (also not on architectures that
have ACPI)

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 17:48                                                                           ` David Hildenbrand
  0 siblings, 0 replies; 176+ messages in thread
From: David Hildenbrand @ 2018-01-18 17:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Pankaj Gupta, Paolo Bonzini, Rik van Riel, Xiao Guangrong,
	Christoph Hellwig, Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi,
	kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	ross zwisler, Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang,
	Ross Zwisler

>> I'd like to emphasize again, that I would prefer a virtio-pmem only
>> solution.
>>
>> There are architectures out there (e.g. s390x) that don't support
>> NVDIMMs - there is no HW interface to expose any such stuff.
>>
>> However, with virtio-pmem, we could make it work also on architectures
>> not having ACPI and friends.
> 
> ACPI and virtio-only can share the same pmem driver. There are two
> parts to this, region discovery and setting up the pmem driver. For
> discovery you can either have an NFIT-bus defined range, or a new
> virtio-pmem-bus define it. As far as the pmem driver itself it's
> agnostic to how the range is discovered.
> 

And in addition to discovery + setup, we need the flush via virtio.

> In other words, pmem consumes 'regions' from libnvdimm and the a bus
> provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
> 

That sounds good to me. I would like to see how the ACPI discovery
variant connects to a virtio ring.

The natural way for me would be:

A virtio-X device supplies a memory region ("discovery") and also the
interface for flushes for this device. So one virtio-X corresponds to
one pmem device. No ACPI to be involved (also not on architectures that
have ACPI)

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 18:45                                                                             ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 18:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, Xiao Guangrong,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal

On Thu, Jan 18, 2018 at 9:48 AM, David Hildenbrand <david@redhat.com> wrote:
>>> I'd like to emphasize again, that I would prefer a virtio-pmem only
>>> solution.
>>>
>>> There are architectures out there (e.g. s390x) that don't support
>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>
>>> However, with virtio-pmem, we could make it work also on architectures
>>> not having ACPI and friends.
>>
>> ACPI and virtio-only can share the same pmem driver. There are two
>> parts to this, region discovery and setting up the pmem driver. For
>> discovery you can either have an NFIT-bus defined range, or a new
>> virtio-pmem-bus define it. As far as the pmem driver itself it's
>> agnostic to how the range is discovered.
>>
>
> And in addition to discovery + setup, we need the flush via virtio.
>
>> In other words, pmem consumes 'regions' from libnvdimm and the a bus
>> provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
>>
>
> That sounds good to me. I would like to see how the ACPI discovery
> variant connects to a virtio ring.
>
> The natural way for me would be:
>
> A virtio-X device supplies a memory region ("discovery") and also the
> interface for flushes for this device. So one virtio-X corresponds to
> one pmem device. No ACPI to be involved (also not on architectures that
> have ACPI)

Hmm, yes, it seems that if ACPI is just going to be used as a trigger for
"go find the virtio-X interface for this range", we could have started
from a virtio device in the first place.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 18:45                                                                             ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 18:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, Xiao Guangrong,
	kvm-devel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Paolo Bonzini, Nitesh Narayan Lal

On Thu, Jan 18, 2018 at 9:48 AM, David Hildenbrand <david-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>> I'd like to emphasize again, that I would prefer a virtio-pmem only
>>> solution.
>>>
>>> There are architectures out there (e.g. s390x) that don't support
>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>
>>> However, with virtio-pmem, we could make it work also on architectures
>>> not having ACPI and friends.
>>
>> ACPI and virtio-only can share the same pmem driver. There are two
>> parts to this, region discovery and setting up the pmem driver. For
>> discovery you can either have an NFIT-bus defined range, or a new
>> virtio-pmem-bus define it. As far as the pmem driver itself it's
>> agnostic to how the range is discovered.
>>
>
> And in addition to discovery + setup, we need the flush via virtio.
>
>> In other words, pmem consumes 'regions' from libnvdimm and the a bus
>> provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
>>
>
> That sounds good to me. I would like to see how the ACPI discovery
> variant connects to a virtio ring.
>
> The natural way for me would be:
>
> A virtio-X device supplies a memory region ("discovery") and also the
> interface for flushes for this device. So one virtio-X corresponds to
> one pmem device. No ACPI to be involved (also not on architectures that
> have ACPI)

Hmm, yes, it seems if ACPI is just going to be used as a trigger for
"go find the virtio-X interface for this range" we could have started
from a virtio device in the first place.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 18:45                                                                             ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 18:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Paolo Bonzini, Rik van Riel, Xiao Guangrong,
	Christoph Hellwig, Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi,
	kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	ross zwisler, Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang,
	Ross Zwisler

On Thu, Jan 18, 2018 at 9:48 AM, David Hildenbrand <david@redhat.com> wrote:
>>> I'd like to emphasize again, that I would prefer a virtio-pmem only
>>> solution.
>>>
>>> There are architectures out there (e.g. s390x) that don't support
>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>
>>> However, with virtio-pmem, we could make it work also on architectures
>>> not having ACPI and friends.
>>
>> ACPI and virtio-only can share the same pmem driver. There are two
>> parts to this, region discovery and setting up the pmem driver. For
>> discovery you can either have an NFIT-bus defined range, or a new
>> virtio-pmem-bus define it. As far as the pmem driver itself it's
>> agnostic to how the range is discovered.
>>
>
> And in addition to discovery + setup, we need the flush via virtio.
>
>> In other words, pmem consumes 'regions' from libnvdimm and the a bus
>> provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
>>
>
> That sounds good to me. I would like to see how the ACPI discovery
> variant connects to a virtio ring.
>
> The natural way for me would be:
>
> A virtio-X device supplies a memory region ("discovery") and also the
> interface for flushes for this device. So one virtio-X corresponds to
> one pmem device. No ACPI to be involved (also not on architectures that
> have ACPI)

Hmm, yes, it seems if ACPI is just going to be used as a trigger for
"go find the virtio-X interface for this range" we could have started
from a virtio device in the first place.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2018-01-18 17:48                                                                           ` David Hildenbrand
@ 2018-01-18 18:54                                                                             ` Pankaj Gupta
  -1 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2018-01-18 18:54 UTC (permalink / raw)
  To: David Hildenbrand, Dan Williams
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	Rik van Riel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal


> 
> >> I'd like to emphasize again, that I would prefer a virtio-pmem only
> >> solution.
> >>
> >> There are architectures out there (e.g. s390x) that don't support
> >> NVDIMMs - there is no HW interface to expose any such stuff.
> >>
> >> However, with virtio-pmem, we could make it work also on architectures
> >> not having ACPI and friends.
> > 
> > ACPI and virtio-only can share the same pmem driver. There are two
> > parts to this, region discovery and setting up the pmem driver. For
> > discovery you can either have an NFIT-bus defined range, or a new
> > virtio-pmem-bus define it. As far as the pmem driver itself it's
> > agnostic to how the range is discovered.
> > 
> 
> And in addition to discovery + setup, we need the flush via virtio.
> 
> > In other words, pmem consumes 'regions' from libnvdimm and the a bus
> > provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
> > 
> 
> That sounds good to me. I would like to see how the ACPI discovery
> variant connects to a virtio ring.
> 
> The natural way for me would be:
> 
> A virtio-X device supplies a memory region ("discovery") and also the
> interface for flushes for this device. So one virtio-X corresponds to
> one pmem device. No ACPI to be involved (also not on architectures that
> have ACPI)

I agree: if we discover regions with virtio-X, we don't need to worry about
ACPI NFIT. Actually, there are three ways to do it, each with its own pros
and cons:

1] Existing pmem driver & virtio for region discovery:
  -----------------------------------------------------
  Use the existing pmem driver, which is tightly coupled with the concepts of namespaces, labels etc.
  from ACPI region discovery, and re-implement these concepts with virtio so that the existing
  pmem driver can understand them. In addition to this, the pmem driver is tasked with sending the
  flush command using virtio.

2] Existing pmem driver & ACPI NFIT for region discovery:
  ---------------------------------------------------------
  If we use ACPI NFIT, we need to teach the existing ACPI driver about this new memory
  type and teach the existing pmem driver to handle it. We still need
  an asynchronous (virtio) way to send flush commands: a virtio device/driver,
  or an arbitrary key/value-like pair, just to send commands from guest to host using virtio.

3] New virtio-pmem driver & paravirt device:
  ------------------------------------------
  The third way is a new virtio-pmem driver, with less work to support the existing features of
  different protocols, and with an asynchronous way of sending flush commands.

  This does duplicate some of the work which the existing pmem driver does, but as discussed
  previously we can separate out common code from the existing pmem driver and reuse it.

Among these approaches I also prefer 3]. A rough outline of 3] is sketched below.
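
The outline for the guest side of 3] could look roughly like this
(hypothetical device ID and names; region registration and the flush
helper would plug into the probe path and the vq callback):

    #include <linux/err.h>
    #include <linux/module.h>
    #include <linux/virtio.h>

    #define VIRTIO_ID_PMEM 25   /* hypothetical, not assigned in the spec yet */

    static int virtio_pmem_probe(struct virtio_device *vdev)
    {
            /* Single queue, used only for flush requests; a completion
             * callback would go where NULL is passed.
             */
            struct virtqueue *flush_vq =
                    virtio_find_single_vq(vdev, NULL, "flush");

            if (IS_ERR(flush_vq))
                    return PTR_ERR(flush_vq);

            /* Read the memory range from config space and register it
             * (e.g. with libnvdimm, or with a small pmem-like front end).
             */
            return 0;
    }

    static const struct virtio_device_id id_table[] = {
            { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
            { 0 },
    };

    static struct virtio_driver virtio_pmem_driver = {
            .driver.name  = KBUILD_MODNAME,
            .driver.owner = THIS_MODULE,
            .id_table     = id_table,
            .probe        = virtio_pmem_probe,
    };
    module_virtio_driver(virtio_pmem_driver);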

> 
> --
> 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 18:54                                                                             ` Pankaj Gupta
  0 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2018-01-18 18:54 UTC (permalink / raw)
  To: David Hildenbrand, Dan Williams
  Cc: Paolo Bonzini, Rik van Riel, Xiao Guangrong, Christoph Hellwig,
	Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi, kvm-devel,
	Qemu Developers, linux-nvdimm@lists.01.org, ross zwisler,
	Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang, Ross Zwisler,
	Rik van Riel


> 
> >> I'd like to emphasize again, that I would prefer a virtio-pmem only
> >> solution.
> >>
> >> There are architectures out there (e.g. s390x) that don't support
> >> NVDIMMs - there is no HW interface to expose any such stuff.
> >>
> >> However, with virtio-pmem, we could make it work also on architectures
> >> not having ACPI and friends.
> > 
> > ACPI and virtio-only can share the same pmem driver. There are two
> > parts to this, region discovery and setting up the pmem driver. For
> > discovery you can either have an NFIT-bus defined range, or a new
> > virtio-pmem-bus define it. As far as the pmem driver itself it's
> > agnostic to how the range is discovered.
> > 
> 
> And in addition to discovery + setup, we need the flush via virtio.
> 
> > In other words, pmem consumes 'regions' from libnvdimm and the a bus
> > provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
> > 
> 
> That sounds good to me. I would like to see how the ACPI discovery
> variant connects to a virtio ring.
> 
> The natural way for me would be:
> 
> A virtio-X device supplies a memory region ("discovery") and also the
> interface for flushes for this device. So one virtio-X corresponds to
> one pmem device. No ACPI to be involved (also not on architectures that
> have ACPI)

I agree here if we discover regions with virtio-X we don't need to worry about
NFIT ACPI. Actually, there are three ways to do it with pros and cons of these 
approaches: 

1] Existing pmem driver & virtio for region discovery:
  -----------------------------------------------------
  Use existing pmem driver which is tightly coupled with concepts of namespaces, labels etc 
  from ACPI region discovery and re-implement these concepts with virtio so that existing
  pmem driver can understand it. In addition to this, task of pmem driver to send flush command
  using virtio.
  
2] Existing pmem driver & ACPI NFIT for region discovery:
  ----------------------------------------------------------------
- If we use NFIT ACPI, we need to teach existing ACPI driver to add this new memory
  type and teach existing pmem driver to handle this new memory type. Still we need 
  an asynchronous(virtio) way to send flush commands. We need virtio device/driver
  or arbitrary key/value like pair just to send commands from guest to host using virtio. 

3] New Virtio pmem driver & paravirt device:
 ----------------------------------------
  Third way is new virtio pmem driver with less work to support existing features of different protocols, 
  and with asynchronous way of sending flush commands.

  But this needs to duplicate some of the work which existing pmem driver does but as discussed 
  previously we can separate common code from existing pmem driver and reuse it.

Among these approaches I also prefer 3].

> 
> --
> 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 18:54                                                                             ` Pankaj Gupta
  0 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2018-01-18 18:54 UTC (permalink / raw)
  To: David Hildenbrand, Dan Williams
  Cc: Paolo Bonzini, Rik van Riel, Xiao Guangrong, Christoph Hellwig,
	Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi, kvm-devel,
	Qemu Developers, linux-nvdimm@lists.01.org, ross zwisler,
	Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang, Ross Zwisler,
	Rik van Riel


> 
> >> I'd like to emphasize again, that I would prefer a virtio-pmem only
> >> solution.
> >>
> >> There are architectures out there (e.g. s390x) that don't support
> >> NVDIMMs - there is no HW interface to expose any such stuff.
> >>
> >> However, with virtio-pmem, we could make it work also on architectures
> >> not having ACPI and friends.
> > 
> > ACPI and virtio-only can share the same pmem driver. There are two
> > parts to this, region discovery and setting up the pmem driver. For
> > discovery you can either have an NFIT-bus defined range, or a new
> > virtio-pmem-bus define it. As far as the pmem driver itself it's
> > agnostic to how the range is discovered.
> > 
> 
> And in addition to discovery + setup, we need the flush via virtio.
> 
> > In other words, pmem consumes 'regions' from libnvdimm and the a bus
> > provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
> > 
> 
> That sounds good to me. I would like to see how the ACPI discovery
> variant connects to a virtio ring.
> 
> The natural way for me would be:
> 
> A virtio-X device supplies a memory region ("discovery") and also the
> interface for flushes for this device. So one virtio-X corresponds to
> one pmem device. No ACPI to be involved (also not on architectures that
> have ACPI)

I agree here if we discover regions with virtio-X we don't need to worry about
NFIT ACPI. Actually, there are three ways to do it with pros and cons of these 
approaches: 

1] Existing pmem driver & virtio for region discovery:
  -----------------------------------------------------
  Use existing pmem driver which is tightly coupled with concepts of namespaces, labels etc 
  from ACPI region discovery and re-implement these concepts with virtio so that existing
  pmem driver can understand it. In addition to this, task of pmem driver to send flush command
  using virtio.
  
2] Existing pmem driver & ACPI NFIT for region discovery:
  ----------------------------------------------------------------
- If we use NFIT ACPI, we need to teach existing ACPI driver to add this new memory
  type and teach existing pmem driver to handle this new memory type. Still we need 
  an asynchronous(virtio) way to send flush commands. We need virtio device/driver
  or arbitrary key/value like pair just to send commands from guest to host using virtio. 

3] New Virtio pmem driver & paravirt device:
 ----------------------------------------
  Third way is new virtio pmem driver with less work to support existing features of different protocols, 
  and with asynchronous way of sending flush commands.

  But this needs to duplicate some of the work which existing pmem driver does but as discussed 
  previously we can separate common code from existing pmem driver and reuse it.

Among these approaches I also prefer 3].

> 
> --
> 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2018-01-18 18:54                                                                             ` Pankaj Gupta
@ 2018-01-18 18:59                                                                               ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 18:59 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	David Hildenbrand, Stefan Hajnoczi, Rik van Riel, Ross Zwisler,
	Qemu Developers, Christoph Hellwig, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal

On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
>
>>
>> >> I'd like to emphasize again, that I would prefer a virtio-pmem only
>> >> solution.
>> >>
>> >> There are architectures out there (e.g. s390x) that don't support
>> >> NVDIMMs - there is no HW interface to expose any such stuff.
>> >>
>> >> However, with virtio-pmem, we could make it work also on architectures
>> >> not having ACPI and friends.
>> >
>> > ACPI and virtio-only can share the same pmem driver. There are two
>> > parts to this, region discovery and setting up the pmem driver. For
>> > discovery you can either have an NFIT-bus defined range, or a new
>> > virtio-pmem-bus define it. As far as the pmem driver itself it's
>> > agnostic to how the range is discovered.
>> >
>>
>> And in addition to discovery + setup, we need the flush via virtio.
>>
>> > In other words, pmem consumes 'regions' from libnvdimm and the a bus
>> > provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
>> >
>>
>> That sounds good to me. I would like to see how the ACPI discovery
>> variant connects to a virtio ring.
>>
>> The natural way for me would be:
>>
>> A virtio-X device supplies a memory region ("discovery") and also the
>> interface for flushes for this device. So one virtio-X corresponds to
>> one pmem device. No ACPI to be involved (also not on architectures that
>> have ACPI)
>
> I agree here if we discover regions with virtio-X we don't need to worry about
> NFIT ACPI. Actually, there are three ways to do it with pros and cons of these
> approaches:
>
> 1] Existing pmem driver & virtio for region discovery:
>   -----------------------------------------------------
>   Use existing pmem driver which is tightly coupled with concepts of namespaces, labels etc
>   from ACPI region discovery and re-implement these concepts with virtio so that existing
>   pmem driver can understand it. In addition to this, task of pmem driver to send flush command
>   using virtio.

It's not tightly coupled. The whole point of libnvdimm is to be
agnostic to ACPI, e820, or any other range-discovery mechanism. The only
work to do beyond identifying the address range is teaching libnvdimm to
pass along a flush control interface to the pmem driver.
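
Concretely, that flush control interface could be little more than a
callback handed over together with the region description, which the
pmem driver invokes on fsync/REQ_FLUSH instead of assuming CPU cache
flushes are sufficient. A hypothetical sketch (such a field does not
exist in nd_region_desc today; all names are made up):

    #include <linux/libnvdimm.h>

    /* Hypothetical extension: the bus provider (nfit, e820, or a future
     * virtio provider) supplies a flush callback along with the region,
     * e.g. one that kicks a virtqueue and waits for the host's fsync.
     */
    struct nd_region_desc_ext {
            struct nd_region_desc desc;        /* existing description */
            int (*flush)(void *provider_data); /* made-up flush hook   */
    };

    /* In the pmem driver's flush path (sketch): */
    static int pmem_do_flush(struct nd_region_desc_ext *ext)
    {
            if (ext->flush)
                    return ext->flush(ext->desc.provider_data);
            /* otherwise fall back to normal persistent-memory semantics */
            return 0;
    }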

>
> 2] Existing pmem driver & ACPI NFIT for region discovery:
>   ----------------------------------------------------------------
> - If we use NFIT ACPI, we need to teach existing ACPI driver to add this new memory
>   type and teach existing pmem driver to handle this new memory type. Still we need
>   an asynchronous(virtio) way to send flush commands. We need virtio device/driver
>   or arbitrary key/value like pair just to send commands from guest to host using virtio.
>
> 3] New Virtio pmem driver & paravirt device:
>  ----------------------------------------
>   Third way is new virtio pmem driver with less work to support existing features of different protocols,
>   and with asynchronous way of sending flush commands.
>
>   But this needs to duplicate some of the work which existing pmem driver does but as discussed
>   previously we can separate common code from existing pmem driver and reuse it.
>
> Among these approaches I also prefer 3].

I disagree; the reason we went down this ACPI path was to limit the
needless duplication of most of the pmem driver.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 18:59                                                                               ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 18:59 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: David Hildenbrand, Paolo Bonzini, Rik van Riel, Xiao Guangrong,
	Christoph Hellwig, Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi,
	kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	ross zwisler, Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang,
	Ross Zwisler, Rik van Riel

On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
>
>>
>> >> I'd like to emphasize again, that I would prefer a virtio-pmem only
>> >> solution.
>> >>
>> >> There are architectures out there (e.g. s390x) that don't support
>> >> NVDIMMs - there is no HW interface to expose any such stuff.
>> >>
>> >> However, with virtio-pmem, we could make it work also on architectures
>> >> not having ACPI and friends.
>> >
>> > ACPI and virtio-only can share the same pmem driver. There are two
>> > parts to this, region discovery and setting up the pmem driver. For
>> > discovery you can either have an NFIT-bus defined range, or a new
>> > virtio-pmem-bus define it. As far as the pmem driver itself it's
>> > agnostic to how the range is discovered.
>> >
>>
>> And in addition to discovery + setup, we need the flush via virtio.
>>
>> > In other words, pmem consumes 'regions' from libnvdimm and the a bus
>> > provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
>> >
>>
>> That sounds good to me. I would like to see how the ACPI discovery
>> variant connects to a virtio ring.
>>
>> The natural way for me would be:
>>
>> A virtio-X device supplies a memory region ("discovery") and also the
>> interface for flushes for this device. So one virtio-X corresponds to
>> one pmem device. No ACPI to be involved (also not on architectures that
>> have ACPI)
>
> I agree here if we discover regions with virtio-X we don't need to worry about
> NFIT ACPI. Actually, there are three ways to do it with pros and cons of these
> approaches:
>
> 1] Existing pmem driver & virtio for region discovery:
>   -----------------------------------------------------
>   Use existing pmem driver which is tightly coupled with concepts of namespaces, labels etc
>   from ACPI region discovery and re-implement these concepts with virtio so that existing
>   pmem driver can understand it. In addition to this, task of pmem driver to send flush command
>   using virtio.

It's not tightly coupled. The whole point of libnvdimm is to be
agnostic to ACPI, e820 or any other range discovery. The only work to
do beyond identifying the address range is teaching libnvdimm to pass
along a flush control interface to the pmem driver.

>
> 2] Existing pmem driver & ACPI NFIT for region discovery:
>   ----------------------------------------------------------------
> - If we use NFIT ACPI, we need to teach existing ACPI driver to add this new memory
>   type and teach existing pmem driver to handle this new memory type. Still we need
>   an asynchronous(virtio) way to send flush commands. We need virtio device/driver
>   or arbitrary key/value like pair just to send commands from guest to host using virtio.
>
> 3] New Virtio pmem driver & paravirt device:
>  ----------------------------------------
>   Third way is new virtio pmem driver with less work to support existing features of different protocols,
>   and with asynchronous way of sending flush commands.
>
>   But this needs to duplicate some of the work which existing pmem driver does but as discussed
>   previously we can separate common code from existing pmem driver and reuse it.
>
> Among these approaches I also prefer 3].

I disagree, the reason we went down this ACPI path was to limit the
needless duplication of most of the pmem driver.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 18:59                                                                               ` Dan Williams
  0 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 18:59 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: David Hildenbrand, Paolo Bonzini, Rik van Riel, Xiao Guangrong,
	Christoph Hellwig, Jan Kara, Stefan Hajnoczi, Stefan Hajnoczi,
	kvm-devel, Qemu Developers, linux-nvdimm@lists.01.org,
	ross zwisler, Kevin Wolf, Nitesh Narayan Lal, Haozhong Zhang,
	Ross Zwisler, Rik van Riel

On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
>
>>
>> >> I'd like to emphasize again, that I would prefer a virtio-pmem only
>> >> solution.
>> >>
>> >> There are architectures out there (e.g. s390x) that don't support
>> >> NVDIMMs - there is no HW interface to expose any such stuff.
>> >>
>> >> However, with virtio-pmem, we could make it work also on architectures
>> >> not having ACPI and friends.
>> >
>> > ACPI and virtio-only can share the same pmem driver. There are two
>> > parts to this, region discovery and setting up the pmem driver. For
>> > discovery you can either have an NFIT-bus defined range, or a new
>> > virtio-pmem-bus define it. As far as the pmem driver itself it's
>> > agnostic to how the range is discovered.
>> >
>>
>> And in addition to discovery + setup, we need the flush via virtio.
>>
>> > In other words, pmem consumes 'regions' from libnvdimm and the a bus
>> > provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
>> >
>>
>> That sounds good to me. I would like to see how the ACPI discovery
>> variant connects to a virtio ring.
>>
>> The natural way for me would be:
>>
>> A virtio-X device supplies a memory region ("discovery") and also the
>> interface for flushes for this device. So one virtio-X corresponds to
>> one pmem device. No ACPI to be involved (also not on architectures that
>> have ACPI)
>
> I agree here if we discover regions with virtio-X we don't need to worry about
> NFIT ACPI. Actually, there are three ways to do it with pros and cons of these
> approaches:
>
> 1] Existing pmem driver & virtio for region discovery:
>   -----------------------------------------------------
>   Use existing pmem driver which is tightly coupled with concepts of namespaces, labels etc
>   from ACPI region discovery and re-implement these concepts with virtio so that existing
>   pmem driver can understand it. In addition to this, task of pmem driver to send flush command
>   using virtio.

It's not tightly coupled. The whole point of libnvdimm is to be
agnostic to ACPI, e820 or any other range discovery. The only work to
do beyond identifying the address range is teaching libnvdimm to pass
along a flush control interface to the pmem driver.

>
> 2] Existing pmem driver & ACPI NFIT for region discovery:
>   ----------------------------------------------------------------
> - If we use NFIT ACPI, we need to teach existing ACPI driver to add this new memory
>   type and teach existing pmem driver to handle this new memory type. Still we need
>   an asynchronous(virtio) way to send flush commands. We need virtio device/driver
>   or arbitrary key/value like pair just to send commands from guest to host using virtio.
>
> 3] New Virtio pmem driver & paravirt device:
>  ----------------------------------------
>   Third way is new virtio pmem driver with less work to support existing features of different protocols,
>   and with asynchronous way of sending flush commands.
>
>   But this needs to duplicate some of the work which existing pmem driver does but as discussed
>   previously we can separate common code from existing pmem driver and reuse it.
>
> Among these approaches I also prefer 3].

I disagree, the reason we went down this ACPI path was to limit the
needless duplication of most of the pmem driver.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 19:36                                                                                 ` Pankaj Gupta
  0 siblings, 0 replies; 176+ messages in thread
From: Pankaj Gupta @ 2018-01-18 19:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	David Hildenbrand, Stefan Hajnoczi, Rik van Riel, Ross Zwisler,
	Qemu Developers, Christoph Hellwig, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal


> 
> On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
> >
> >>
> >> >> I'd like to emphasize again, that I would prefer a virtio-pmem only
> >> >> solution.
> >> >>
> >> >> There are architectures out there (e.g. s390x) that don't support
> >> >> NVDIMMs - there is no HW interface to expose any such stuff.
> >> >>
> >> >> However, with virtio-pmem, we could make it work also on architectures
> >> >> not having ACPI and friends.
> >> >
> >> > ACPI and virtio-only can share the same pmem driver. There are two
> >> > parts to this, region discovery and setting up the pmem driver. For
> >> > discovery you can either have an NFIT-bus defined range, or a new
> >> > virtio-pmem-bus define it. As far as the pmem driver itself it's
> >> > agnostic to how the range is discovered.
> >> >
> >>
> >> And in addition to discovery + setup, we need the flush via virtio.
> >>
> >> > In other words, pmem consumes 'regions' from libnvdimm and the a bus
> >> > provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
> >> >
> >>
> >> That sounds good to me. I would like to see how the ACPI discovery
> >> variant connects to a virtio ring.
> >>
> >> The natural way for me would be:
> >>
> >> A virtio-X device supplies a memory region ("discovery") and also the
> >> interface for flushes for this device. So one virtio-X corresponds to
> >> one pmem device. No ACPI to be involved (also not on architectures that
> >> have ACPI)
> >
> > I agree here if we discover regions with virtio-X we don't need to worry
> > about
> > NFIT ACPI. Actually, there are three ways to do it with pros and cons of
> > these
> > approaches:
> >
> > 1] Existing pmem driver & virtio for region discovery:
> >   -----------------------------------------------------
> >   Use existing pmem driver which is tightly coupled with concepts of
> >   namespaces, labels etc
> >   from ACPI region discovery and re-implement these concepts with virtio so
> >   that existing
> >   pmem driver can understand it. In addition to this, task of pmem driver
> >   to send flush command
> >   using virtio.
> 
> It's not tightly coupled. The whole point of libnvdimm is to be
> agnostic to ACPI, e820 or any other range discovery. The only work to
> do beyond identifying the address range is teaching libnvdimm to pass
> along a flush control interface to the pmem driver.

OK, that means we can configure libnvdimm with virtio as well and use the
existing pmem driver. AFAICU it uses the nvdimm bus?

Do we need other features which ACPI provides?

acpi_nfit_init
 nvdimm_bus_register
  ...
    acpi_nfit_register_region
      acpi_region_create
        nvdimm_pmem_region_create
  
Also, we need to check how to pass the virtio flush interface.
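
[On the "how to pass the virtio flush interface" question, a rough guest-side
 sketch could be: place a small request on a virtqueue, kick, and sleep until
 the host reports that it has fsync'ed the backing file. Everything below,
 i.e. the device, the request layout and the queue, is an assumption for
 illustration, not an existing driver.]

    #include <linux/virtio.h>
    #include <linux/scatterlist.h>
    #include <linux/completion.h>

    /* Hypothetical request exchanged with a paravirt pmem device. */
    struct pmem_flush_req {
            __le32 type;                    /* request type, e.g. 1 == FLUSH */
            __le32 ret;                     /* status filled in by the host */
            struct completion done;         /* guest-side wait object */
    };

    /* 'req' must be kmalloc'ed by the caller: virtio cannot map stack memory. */
    static int fake_dax_flush(struct virtqueue *vq, struct pmem_flush_req *req)
    {
            struct scatterlist out, in, *sgs[2] = { &out, &in };
            int err;

            sg_init_one(&out, &req->type, sizeof(req->type));
            sg_init_one(&in, &req->ret, sizeof(req->ret));
            init_completion(&req->done);

            err = virtqueue_add_sgs(vq, sgs, 1, 1, req, GFP_ATOMIC);
            if (err)
                    return err;
            virtqueue_kick(vq);

            /* The queue's interrupt handler is assumed to complete req->done
             * once the host acknowledges the fsync of the backing file. */
            wait_for_completion(&req->done);
            return le32_to_cpu(req->ret);
    }
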

> 
> >
> > 2] Existing pmem driver & ACPI NFIT for region discovery:
> >   ----------------------------------------------------------------
> > - If we use NFIT ACPI, we need to teach existing ACPI driver to add this
> > new memory
> >   type and teach existing pmem driver to handle this new memory type. Still
> >   we need
> >   an asynchronous(virtio) way to send flush commands. We need virtio
> >   device/driver
> >   or arbitrary key/value like pair just to send commands from guest to host
> >   using virtio.
> >
> > 3] New Virtio pmem driver & paravirt device:
> >  ----------------------------------------
> >   Third way is new virtio pmem driver with less work to support existing
> >   features of different protocols,
> >   and with asynchronous way of sending flush commands.
> >
> >   But this needs to duplicate some of the work which existing pmem driver
> >   does but as discussed
> >   previously we can separate common code from existing pmem driver and
> >   reuse it.
> >
> > Among these approaches I also prefer 3].
> 
> I disagree, the reason we went down this ACPI path was to limit the
> needless duplication of most of the pmem driver.

yes.
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2018-01-18 19:36                                                                                 ` Pankaj Gupta
  (?)
@ 2018-01-18 19:48                                                                                   ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 19:48 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	David Hildenbrand, Stefan Hajnoczi, Rik van Riel, Ross Zwisler,
	Qemu Developers, Christoph Hellwig, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal

On Thu, Jan 18, 2018 at 11:36 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
>
>>
>> On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
>> >
>> >>
>> >> >> I'd like to emphasize again, that I would prefer a virtio-pmem only
>> >> >> solution.
>> >> >>
>> >> >> There are architectures out there (e.g. s390x) that don't support
>> >> >> NVDIMMs - there is no HW interface to expose any such stuff.
>> >> >>
>> >> >> However, with virtio-pmem, we could make it work also on architectures
>> >> >> not having ACPI and friends.
>> >> >
>> >> > ACPI and virtio-only can share the same pmem driver. There are two
>> >> > parts to this, region discovery and setting up the pmem driver. For
>> >> > discovery you can either have an NFIT-bus defined range, or a new
>> >> > virtio-pmem-bus define it. As far as the pmem driver itself it's
>> >> > agnostic to how the range is discovered.
>> >> >
>> >>
>> >> And in addition to discovery + setup, we need the flush via virtio.
>> >>
>> >> > In other words, pmem consumes 'regions' from libnvdimm and the a bus
>> >> > provider like nfit, e820, or a new virtio-mechansim produce 'regions'.
>> >> >
>> >>
>> >> That sounds good to me. I would like to see how the ACPI discovery
>> >> variant connects to a virtio ring.
>> >>
>> >> The natural way for me would be:
>> >>
>> >> A virtio-X device supplies a memory region ("discovery") and also the
>> >> interface for flushes for this device. So one virtio-X corresponds to
>> >> one pmem device. No ACPI to be involved (also not on architectures that
>> >> have ACPI)
>> >
>> > I agree here if we discover regions with virtio-X we don't need to worry
>> > about
>> > NFIT ACPI. Actually, there are three ways to do it with pros and cons of
>> > these
>> > approaches:
>> >
>> > 1] Existing pmem driver & virtio for region discovery:
>> >   -----------------------------------------------------
>> >   Use existing pmem driver which is tightly coupled with concepts of
>> >   namespaces, labels etc
>> >   from ACPI region discovery and re-implement these concepts with virtio so
>> >   that existing
>> >   pmem driver can understand it. In addition to this, task of pmem driver
>> >   to send flush command
>> >   using virtio.
>>
>> It's not tightly coupled. The whole point of libnvdimm is to be
>> agnostic to ACPI, e820 or any other range discovery. The only work to
>> do beyond identifying the address range is teaching libnvdimm to pass
>> along a flush control interface to the pmem driver.
>
> o.k that means we can configure libnvdimm with virtio as well and use existing pmem
> driver. AFAICU it uses nvdimm bus?
>
> Do we need other features which ACPI provides?

No, to keep it simple use nvdimm_pmem_region_create without
registering any DIMM devices. I'd start with the e820 driver as a bus
driver reference (drivers/nvdimm/e820.c) rather than try to unwind the
complexity of the nfit driver.
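
[For reference, a hedged sketch of a minimal bus provider along those lines,
 modeled loosely on drivers/nvdimm/e820.c. The provider name and how the
 resource is obtained (presumably from the paravirt device's config space,
 as sketched earlier in the thread) are assumptions; error unwinding is
 omitted.]

    #include <linux/libnvdimm.h>
    #include <linux/module.h>
    #include <linux/ioport.h>
    #include <linux/numa.h>

    static struct nvdimm_bus_descriptor fake_dax_nd_desc;

    /* Register one pmem region with libnvdimm: no DIMMs, no labels,
     * following the shape of the e820 bus driver. 'res' is assumed to
     * describe the range discovered via the paravirt device. */
    static int fake_dax_register_region(struct device *dev, struct resource *res)
    {
            struct nvdimm_bus *nvdimm_bus;
            struct nd_region_desc ndr_desc;

            fake_dax_nd_desc.provider_name = "fake-dax";    /* illustrative */
            fake_dax_nd_desc.module = THIS_MODULE;

            nvdimm_bus = nvdimm_bus_register(dev, &fake_dax_nd_desc);
            if (!nvdimm_bus)
                    return -ENXIO;

            memset(&ndr_desc, 0, sizeof(ndr_desc));
            ndr_desc.res = res;                     /* the discovered range */
            ndr_desc.numa_node = NUMA_NO_NODE;
            set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags); /* struct pages, as e820.c does */
            /* an assumed flush hook (see earlier sketch) would be set here */

            if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
                    return -ENXIO;
            return 0;
    }
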

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
@ 2018-01-18 19:51                                                                                 ` David Hildenbrand
  0 siblings, 0 replies; 176+ messages in thread
From: David Hildenbrand @ 2018-01-18 19:51 UTC (permalink / raw)
  To: Dan Williams, Pankaj Gupta
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, kvm-devel,
	Rik van Riel, Stefan Hajnoczi, Ross Zwisler, Qemu Developers,
	Christoph Hellwig, Stefan Hajnoczi, linux-nvdimm@lists.01.org,
	Paolo Bonzini, Nitesh Narayan Lal


>> 1] Existing pmem driver & virtio for region discovery:
>>   -----------------------------------------------------
>>   Use existing pmem driver which is tightly coupled with concepts of namespaces, labels etc
>>   from ACPI region discovery and re-implement these concepts with virtio so that existing
>>   pmem driver can understand it. In addition to this, task of pmem driver to send flush command
>>   using virtio.
> 
> It's not tightly coupled. The whole point of libnvdimm is to be
> agnostic to ACPI, e820 or any other range discovery. The only work to
> do beyond identifying the address range is teaching libnvdimm to pass
> along a flush control interface to the pmem driver.
> 
>>
>> 2] Existing pmem driver & ACPI NFIT for region discovery:
>>   ----------------------------------------------------------------
>> - If we use NFIT ACPI, we need to teach existing ACPI driver to add this new memory
>>   type and teach existing pmem driver to handle this new memory type. Still we need
>>   an asynchronous(virtio) way to send flush commands. We need virtio device/driver
>>   or arbitrary key/value like pair just to send commands from guest to host using virtio.
>>
>> 3] New Virtio pmem driver & paravirt device:
>>  ----------------------------------------
>>   Third way is new virtio pmem driver with less work to support existing features of different protocols,
>>   and with asynchronous way of sending flush commands.
>>
>>   But this needs to duplicate some of the work which existing pmem driver does but as discussed
>>   previously we can separate common code from existing pmem driver and reuse it.
>>
>> Among these approaches I also prefer 3].
> 
> I disagree, the reason we went down this ACPI path was to limit the
> needless duplication of most of the pmem driver.
> 

I have way too little insight to make qualified statements about the different
approaches here. :)

All I am interested in is making this as independent of architecture-specific
technologies (e.g. ACPI) as possible. We will want this for s390x too, rather
sooner than later. So trying to couple this (somehow) to ACPI just for the sake
of having less code to copy will not pay off in the long run.

Better to have a clean virtio interface / design right from the start.

So I hope my words will be heard. :)

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: KVM "fake DAX" flushing interface - discussion
  2018-01-18 19:51                                                                                 ` David Hildenbrand
  (?)
@ 2018-01-18 20:11                                                                                   ` Dan Williams
  -1 siblings, 0 replies; 176+ messages in thread
From: Dan Williams @ 2018-01-18 20:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, Xiao Guangrong,
	kvm-devel, Stefan Hajnoczi, Rik van Riel, Ross Zwisler,
	Qemu Developers, Christoph Hellwig, Stefan Hajnoczi,
	linux-nvdimm@lists.01.org, Paolo Bonzini, Nitesh Narayan Lal

On Thu, Jan 18, 2018 at 11:51 AM, David Hildenbrand <david@redhat.com> wrote:
>
>>> 1] Existing pmem driver & virtio for region discovery:
>>>   -----------------------------------------------------
>>>   Use existing pmem driver which is tightly coupled with concepts of namespaces, labels etc
>>>   from ACPI region discovery and re-implement these concepts with virtio so that existing
>>>   pmem driver can understand it. In addition to this, task of pmem driver to send flush command
>>>   using virtio.
>>
>> It's not tightly coupled. The whole point of libnvdimm is to be
>> agnostic to ACPI, e820 or any other range discovery. The only work to
>> do beyond identifying the address range is teaching libnvdimm to pass
>> along a flush control interface to the pmem driver.
>>
>>>
>>> 2] Existing pmem driver & ACPI NFIT for region discovery:
>>>   ----------------------------------------------------------------
>>> - If we use NFIT ACPI, we need to teach existing ACPI driver to add this new memory
>>>   type and teach existing pmem driver to handle this new memory type. Still we need
>>>   an asynchronous(virtio) way to send flush commands. We need virtio device/driver
>>>   or arbitrary key/value like pair just to send commands from guest to host using virtio.
>>>
>>> 3] New Virtio pmem driver & paravirt device:
>>>  ----------------------------------------
>>>   Third way is new virtio pmem driver with less work to support existing features of different protocols,
>>>   and with asynchronous way of sending flush commands.
>>>
>>>   But this needs to duplicate some of the work which existing pmem driver does but as discussed
>>>   previously we can separate common code from existing pmem driver and reuse it.
>>>
>>> Among these approaches I also prefer 3].
>>
>> I disagree, the reason we went down this ACPI path was to limit the
>> needless duplication of most of the pmem driver.
>>
>
> I have way to little insight to make qualified statements to different
> approaches here. :)
>
> All I am interesting in is making this as independent of architecture
> specific technologies (e.g. ACPI) as possible. We will want this e.g.
> for s390x too. Rather sooner than later. So trying to couple this
> (somehow) to ACPI just for the sake of less code to copy will not pay of
> in the long run.
>
> Better have a clean virtio interface / design right from the start.
>
> So I hope my words will be heard :)

I think that's reasonable. Once we have the virtio-based discovery, I think
the incremental changes to the libnvdimm core and the pmem driver are small.

^ permalink raw reply	[flat|nested] 176+ messages in thread
