* [Qemu-devel] Some question about savem/qcow2 incremental snapshot
@ 2017-12-25  7:33 He Junyan
  2018-05-08 14:41 ` Eric Blake
  0 siblings, 1 reply; 18+ messages in thread
From: He Junyan @ 2017-12-25  7:33 UTC (permalink / raw)
  To: qemu-devel

hi all:

I am currently working on snapshot optimization for Intel NVDIMM memory.
Unlike normal memory, an NVDIMM may be 128G, 256G or even larger for a
single guest, and it is slower than normal memory, so saving just one
snapshot can sometimes take several minutes. Even with compression
enabled, a single snapshot point may consume more than 30G of disk space.
We would like to add incremental snapshot saving to address this: store
only the difference between snapshot points, to save both time and disk
space. But the current snapshot/savevm framework does not seem to support
this; we would need to add snapshot dependencies and extra operations when
we LOAD and DELETE a snapshot point.
Would it be possible to modify the savevm framework and add incremental
snapshot support to the QCOW2 format?

Thanks


* Re: [Qemu-devel] Some question about savem/qcow2 incremental snapshot
  2017-12-25  7:33 [Qemu-devel] Some question about savem/qcow2 incremental snapshot He Junyan
@ 2018-05-08 14:41 ` Eric Blake
  2018-05-08 15:03   ` [Qemu-devel] [Qemu-block] " Kevin Wolf
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Blake @ 2018-05-08 14:41 UTC (permalink / raw)
  To: He Junyan, qemu-devel, John Snow, qemu block

On 12/25/2017 01:33 AM, He Junyan wrote:
> hi all:
> 
> I am currently working on snapshot optimization for Intel NVDIMM memory.
> Unlike normal memory, an NVDIMM may be 128G, 256G or even larger for a
> single guest, and it is slower than normal memory, so saving just one
> snapshot can sometimes take several minutes. Even with compression
> enabled, a single snapshot point may consume more than 30G of disk space.
> We would like to add incremental snapshot saving to address this: store
> only the difference between snapshot points, to save both time and disk
> space. But the current snapshot/savevm framework does not seem to support
> this; we would need to add snapshot dependencies and extra operations when
> we LOAD and DELETE a snapshot point.
> Would it be possible to modify the savevm framework and add incremental
> snapshot support to the QCOW2 format?

In general, the list has tended to focus on external snapshots rather 
than internal; where persistent bitmaps have been the proposed mechanism 
for tracking incremental differences between snapshots.  But yes, it is 
certainly feasible that patches to improve internal snapshots to take 
advantage of incremental relationships may prove useful.  You will need 
to document all enhancements to the qcow2 file format and get that 
approved first, as interoperability demands that others reading the same 
spec would be able to interpret the image you create that is utilizing 
an internal snapshot with an incremental diff.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-08 14:41 ` Eric Blake
@ 2018-05-08 15:03   ` Kevin Wolf
  2018-05-09 10:16     ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: Kevin Wolf @ 2018-05-08 15:03 UTC (permalink / raw)
  To: Eric Blake; +Cc: He Junyan, qemu-devel, John Snow, qemu block, stefanha

Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> On 12/25/2017 01:33 AM, He Junyan wrote:
> > hi all:
> > 
> > I am currently working on snapshot optimization for Intel NVDIMM memory.
> > Unlike normal memory, an NVDIMM may be 128G, 256G or even larger for a
> > single guest, and it is slower than normal memory, so saving just one
> > snapshot can sometimes take several minutes. Even with compression
> > enabled, a single snapshot point may consume more than 30G of disk space.
> > We would like to add incremental snapshot saving to address this: store
> > only the difference between snapshot points, to save both time and disk
> > space. But the current snapshot/savevm framework does not seem to support
> > this; we would need to add snapshot dependencies and extra operations when
> > we LOAD and DELETE a snapshot point.
> > Would it be possible to modify the savevm framework and add incremental
> > snapshot support to the QCOW2 format?
> 
> In general, the list has tended to focus on external snapshots rather than
> internal; where persistent bitmaps have been the proposed mechanism for
> tracking incremental differences between snapshots.  But yes, it is
> certainly feasible that patches to improve internal snapshots to take
> advantage of incremental relationships may prove useful.  You will need to
> document all enhancements to the qcow2 file format and get that approved
> first, as interoperability demands that others reading the same spec would
> be able to interpret the image you create that is utilizing an internal
> snapshot with an incremental diff.

Snapshots are incremental by their very nature. That is, the snapshot of
the disk content is incremental. We don't diff VM state. Persistent
bitmaps are a completely separate thing.

I may be misunderstanding the problem, but to me it sounds as if the
content of the nvdimm device ended up in the VM state, which is stored
in a (non-nvdimm) qcow2 image. Having the nvdimm in the VM state is
certainly not the right approach. Instead, it needs to be treated like
a block device.

What I believe you really need is two things:

1. Stop the nvdimm from ending up in the VM state. This should be fairly
   easy.

2. Make the nvdimm device use the QEMU block layer so that it is backed
   by a non-raw disk image (such as a qcow2 file representing the
   content of the nvdimm) that supports snapshots.

   This part is hard because it requires some completely new
   infrastructure such as mapping clusters of the image file to guest
   pages, and doing cluster allocation (including the copy on write
   logic) by handling guest page faults.

I think it makes sense to invest some effort into such interfaces, but
be prepared for a long journey.
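As a rough, self-contained illustration of what the copy-on-write part of
item 2 boils down to (this is a toy model, not QEMU code; the cluster size,
the flat image layout and the fault hook are all made up):

  /* toy-cow.c: the first write to a "guest" cluster allocates space at the
   * end of the image and seeds it from a backing file (copy-on-write). */
  #include <fcntl.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  #define CLUSTER_SIZE 65536
  #define NUM_CLUSTERS 16

  /* 0 means "not allocated yet", otherwise the byte offset in the image */
  static uint64_t cluster_map[NUM_CLUSTERS];

  static uint64_t allocate_cluster(int img_fd, int backing_fd, unsigned idx)
  {
      static uint8_t buf[CLUSTER_SIZE];
      off_t new_off = lseek(img_fd, 0, SEEK_END);      /* grow the image */

      memset(buf, 0, sizeof(buf));
      if (backing_fd >= 0) {                           /* copy-on-write seed */
          pread(backing_fd, buf, CLUSTER_SIZE, (off_t)idx * CLUSTER_SIZE);
      }
      pwrite(img_fd, buf, CLUSTER_SIZE, new_off);
      cluster_map[idx] = (uint64_t)new_off;
      return cluster_map[idx];
  }

  /* would be called from the (not yet existing) guest page fault hook */
  static void handle_write_fault(int img_fd, int backing_fd, uint64_t guest_off,
                                 const void *data, size_t len)
  {
      unsigned idx = guest_off / CLUSTER_SIZE;
      uint64_t host_off = cluster_map[idx];

      if (host_off == 0) {
          host_off = allocate_cluster(img_fd, backing_fd, idx);
      }
      pwrite(img_fd, data, len, host_off + guest_off % CLUSTER_SIZE);
      /* a real implementation would now map the cluster into the guest so
       * that subsequent stores no longer fault at all */
  }

  int main(void)
  {
      int img = open("toy.img", O_RDWR | O_CREAT | O_TRUNC, 0644);
      ftruncate(img, CLUSTER_SIZE);   /* reserve a "header" so offset 0 = free */
      handle_write_fault(img, -1, 3 * CLUSTER_SIZE + 10, "hello", 6);
      close(img);
      return 0;
  }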

Kevin


* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-08 15:03   ` [Qemu-devel] [Qemu-block] " Kevin Wolf
@ 2018-05-09 10:16     ` Stefan Hajnoczi
  2018-05-09 17:54       ` Max Reitz
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2018-05-09 10:16 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Eric Blake, He Junyan, qemu-devel, John Snow, qemu block, Pankaj Gupta


On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> > On 12/25/2017 01:33 AM, He Junyan wrote:
> 2. Make the nvdimm device use the QEMU block layer so that it is backed
>    by a non-raw disk image (such as a qcow2 file representing the
>    content of the nvdimm) that supports snapshots.
> 
>    This part is hard because it requires some completely new
>    infrastructure such as mapping clusters of the image file to guest
>    pages, and doing cluster allocation (including the copy on write
>    logic) by handling guest page faults.
> 
> I think it makes sense to invest some effort into such interfaces, but
> be prepared for a long journey.

I like the suggestion but it needs to be followed up with a concrete
design that is feasible and fair for Junyan and others to implement.
Otherwise the "long journey" is really just a way of rejecting this
feature.

Let's discuss the details of using the block layer for NVDIMM and try to
come up with a plan.

The biggest issue with using the block layer is that persistent memory
applications use load/store instructions to directly access data.  This
is fundamentally different from the block layer, which transfers blocks
of data to and from the device.
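For illustration, this is all the "I/O" a persistent memory application does
once the region is mapped (the path is made up; inside a guest it would be a
DAX-capable file or a /dev/dax device):

  /* pmem-store.c: the write is an ordinary store, there is no request that
   * the hypervisor's block layer could intercept. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("/mnt/pmem0/data", O_RDWR);    /* DAX file inside the guest */
      if (fd < 0) { perror("open"); return 1; }

      char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }

      /* the actual write: no syscall, no DMA, nothing to trap */
      strcpy(p, "persistent data");

      /* with a MAP_SYNC mapping the application persists this purely with
       * CPU cache flush instructions (e.g. libpmem's pmem_persist()); the
       * portable msync() fallback is the only point the kernel sees at all */
      msync(p, 4096, MS_SYNC);

      munmap(p, 4096);
      close(fd);
      return 0;
  }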

Because of block DMA, QEMU is able to perform processing at each block
driver graph node.  This doesn't exist for persistent memory because
software does not trap I/O.  Therefore the concept of filter nodes
doesn't make sense for persistent memory - we certainly do not want to
trap every I/O because performance would be terrible.

Another difference is that persistent memory I/O is synchronous.
Load/store instructions execute quickly.  Perhaps we could use KVM async
page faults in cases where QEMU needs to perform processing, but again
the performance would be bad.

Most protocol drivers do not support direct memory access.  iscsi, curl,
etc just don't fit the model.  One might be tempted to implement
buffering but at that point it's better to just use block devices.

I have CCed Pankaj, who is working on the virtio-pmem device.  I need to
be clear that emulated NVDIMM cannot be supported with the block layer
since it lacks a guest flush mechanism.  There is no way for
applications to let the hypervisor know the file needs to be fsynced.
That's what virtio-pmem addresses.

Summary:
A subset of the block layer could be used to back virtio-pmem.  This
requires a new block driver API and the KVM async page fault mechanism
for trapping and mapping pages.  Actual emulated NVDIMM devices cannot
be supported unless the hardware specification is extended with a
virtualization-friendly interface in the future.

Please let me know your thoughts.

Stefan


* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-09 10:16     ` Stefan Hajnoczi
@ 2018-05-09 17:54       ` Max Reitz
  2018-05-10  8:26         ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: Max Reitz @ 2018-05-09 17:54 UTC (permalink / raw)
  To: Stefan Hajnoczi, Kevin Wolf
  Cc: Pankaj Gupta, qemu block, qemu-devel, He Junyan


On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
>> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
>>> On 12/25/2017 01:33 AM, He Junyan wrote:
>> 2. Make the nvdimm device use the QEMU block layer so that it is backed
>>    by a non-raw disk image (such as a qcow2 file representing the
>>    content of the nvdimm) that supports snapshots.
>>
>>    This part is hard because it requires some completely new
>>    infrastructure such as mapping clusters of the image file to guest
>>    pages, and doing cluster allocation (including the copy on write
>>    logic) by handling guest page faults.
>>
>> I think it makes sense to invest some effort into such interfaces, but
>> be prepared for a long journey.
> 
> I like the suggestion but it needs to be followed up with a concrete
> design that is feasible and fair for Junyan and others to implement.
> Otherwise the "long journey" is really just a way of rejecting this
> feature.
> 
> Let's discuss the details of using the block layer for NVDIMM and try to
> come up with a plan.
> 
> The biggest issue with using the block layer is that persistent memory
> applications use load/store instructions to directly access data.  This
> is fundamentally different from the block layer, which transfers blocks
> of data to and from the device.
> 
> Because of block DMA, QEMU is able to perform processing at each block
> driver graph node.  This doesn't exist for persistent memory because
> software does not trap I/O.  Therefore the concept of filter nodes
> doesn't make sense for persistent memory - we certainly do not want to
> trap every I/O because performance would be terrible.
> 
> Another difference is that persistent memory I/O is synchronous.
> Load/store instructions execute quickly.  Perhaps we could use KVM async
> page faults in cases where QEMU needs to perform processing, but again
> the performance would be bad.

Let me first say that I have no idea how the interface to NVDIMM looks.
I just assume it works pretty much like normal RAM (so the interface is
just that it’s a part of the physical address space).

Also, it sounds a bit like you are already discarding my idea, but here
goes anyway.

Would it be possible to introduce a buffering block driver that presents
the guest an area of RAM/NVDIMM through an NVDIMM interface (so I
suppose as part of the guest address space)?  For writing, we’d keep a
dirty bitmap on it, and then we’d asynchronously move the dirty areas
through the block layer, so basically like mirror.  On flushing, we’d
block until everything is clean.

For reading, we’d follow a COR/stream model, basically, where everything
is unpopulated in the beginning and everything is loaded through the
block layer both asynchronously all the time and on-demand whenever the
guest needs something that has not been loaded yet.

Now I notice that that looks pretty much like a backing file model where
we constantly run both a stream and a commit job at the same time.

The user could decide how much memory to use for the buffer, so it could
either hold everything or be partially unallocated.

You’d probably want to back the buffer by NVDIMM normally, so that
nothing is lost on crashes (though this would imply that for partial
allocation the buffering block driver would need to know the mapping
between the area in real NVDIMM and its virtual representation of it).
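To make the idea concrete, a toy model of the write side might look like the
following (the chunk size, the static buffer and the plain file descriptor
backend are all made up, and the COR/stream read side is left out):

  /* toy-buffer.c: RAM buffer fronting a block backend, per-chunk dirty
   * bitmap, mirror-like write-back, flush drains everything. */
  #include <fcntl.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>

  #define CHUNK_SIZE (64 * 1024)
  #define NUM_CHUNKS 1024                        /* 64 MiB buffer for the toy */

  static uint8_t buffer[NUM_CHUNKS][CHUNK_SIZE];
  static bool dirty[NUM_CHUNKS];

  /* guest store path: write into the buffer and mark the chunk dirty */
  static void buffer_write(uint64_t offset, const void *data, size_t len)
  {
      memcpy(&buffer[offset / CHUNK_SIZE][offset % CHUNK_SIZE], data, len);
      dirty[offset / CHUNK_SIZE] = true;
  }

  /* background pass, "basically like mirror": push dirty chunks out */
  static void writeback_pass(int backend_fd)
  {
      for (unsigned i = 0; i < NUM_CHUNKS; i++) {
          if (dirty[i]) {
              dirty[i] = false;      /* clear first; a new store re-dirties */
              pwrite(backend_fd, buffer[i], CHUNK_SIZE, (off_t)i * CHUNK_SIZE);
          }
      }
  }

  /* guest flush: block until everything is clean on the backend */
  static void buffer_flush(int backend_fd)
  {
      writeback_pass(backend_fd);
      fsync(backend_fd);
  }

  int main(void)
  {
      int fd = open("backend.img", O_RDWR | O_CREAT | O_TRUNC, 0644);
      buffer_write(5 * CHUNK_SIZE + 100, "hello", 6);
      buffer_flush(fd);
      close(fd);
      return 0;
  }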

Just my two cents while scanning through qemu-block to find emails that
don’t actually concern me...

Max

> Most protocol drivers do not support direct memory access.  iscsi, curl,
> etc just don't fit the model.  One might be tempted to implement
> buffering but at that point it's better to just use block devices.
> 
> I have CCed Pankaj, who is working on the virtio-pmem device.  I need to
> be clear that emulated NVDIMM cannot be supported with the block layer
> since it lacks a guest flush mechanism.  There is no way for
> applications to let the hypervisor know the file needs to be fsynced.
> That's what virtio-pmem addresses.
> 
> Summary:
> A subset of the block layer could be used to back virtio-pmem.  This
> requires a new block driver API and the KVM async page fault mechanism
> for trapping and mapping pages.  Actual emulated NVDIMM devices cannot
> be supported unless the hardware specification is extended with a
> virtualization-friendly interface in the future.
> 
> Please let me know your thoughts.
> 
> Stefan
> 




* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-09 17:54       ` Max Reitz
@ 2018-05-10  8:26         ` Stefan Hajnoczi
  2018-05-11 17:25           ` Kevin Wolf
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2018-05-10  8:26 UTC (permalink / raw)
  To: Max Reitz; +Cc: Kevin Wolf, Pankaj Gupta, qemu block, qemu-devel, He Junyan


On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> >> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> >> 2. Make the nvdimm device use the QEMU block layer so that it is backed
> >>    by a non-raw disk image (such as a qcow2 file representing the
> >>    content of the nvdimm) that supports snapshots.
> >>
> >>    This part is hard because it requires some completely new
> >>    infrastructure such as mapping clusters of the image file to guest
> >>    pages, and doing cluster allocation (including the copy on write
> >>    logic) by handling guest page faults.
> >>
> >> I think it makes sense to invest some effort into such interfaces, but
> >> be prepared for a long journey.
> > 
> > I like the suggestion but it needs to be followed up with a concrete
> > design that is feasible and fair for Junyan and others to implement.
> > Otherwise the "long journey" is really just a way of rejecting this
> > feature.
> > 
> > Let's discuss the details of using the block layer for NVDIMM and try to
> > come up with a plan.
> > 
> > The biggest issue with using the block layer is that persistent memory
> > applications use load/store instructions to directly access data.  This
> > is fundamentally different from the block layer, which transfers blocks
> > of data to and from the device.
> > 
> > Because of block DMA, QEMU is able to perform processing at each block
> > driver graph node.  This doesn't exist for persistent memory because
> > software does not trap I/O.  Therefore the concept of filter nodes
> > doesn't make sense for persistent memory - we certainly do not want to
> > trap every I/O because performance would be terrible.
> > 
> > Another difference is that persistent memory I/O is synchronous.
> > Load/store instructions execute quickly.  Perhaps we could use KVM async
> > page faults in cases where QEMU needs to perform processing, but again
> > the performance would be bad.
> 
> Let me first say that I have no idea how the interface to NVDIMM looks.
> I just assume it works pretty much like normal RAM (so the interface is
> just that it’s a part of the physical address space).
> 
> Also, it sounds a bit like you are already discarding my idea, but here
> goes anyway.
> 
> Would it be possible to introduce a buffering block driver that presents
> the guest an area of RAM/NVDIMM through an NVDIMM interface (so I
> suppose as part of the guest address space)?  For writing, we’d keep a
> dirty bitmap on it, and then we’d asynchronously move the dirty areas
> through the block layer, so basically like mirror.  On flushing, we’d
> block until everything is clean.
> 
> For reading, we’d follow a COR/stream model, basically, where everything
> is unpopulated in the beginning and everything is loaded through the
> block layer both asynchronously all the time and on-demand whenever the
> guest needs something that has not been loaded yet.
> 
> Now I notice that that looks pretty much like a backing file model where
> we constantly run both a stream and a commit job at the same time.
> 
> The user could decide how much memory to use for the buffer, so it could
> either hold everything or be partially unallocated.
> 
> You’d probably want to back the buffer by NVDIMM normally, so that
> nothing is lost on crashes (though this would imply that for partial
> allocation the buffering block driver would need to know the mapping
> between the area in real NVDIMM and its virtual representation of it).
> 
> Just my two cents while scanning through qemu-block to find emails that
> don’t actually concern me...

The guest kernel already implements this - it's the page cache and the
block layer!

Doing it in QEMU with dirty memory logging enabled is less efficient
than doing it in the guest.

That's why I said it's better to just use block devices than to
implement buffering.

I'm saying that persistent memory emulation on top of the iscsi:// block
driver (for example) does not make sense.  It could be implemented but
the performance wouldn't be better than block I/O and the
complexity/code size in QEMU isn't justified IMO.

Stefan

> > Most protocol drivers do not support direct memory access.  iscsi, curl,
> > etc just don't fit the model.  One might be tempted to implement
> > buffering but at that point it's better to just use block devices.
> > 
> > I have CCed Pankaj, who is working on the virtio-pmem device.  I need to
> > be clear that emulated NVDIMM cannot be supported with the block layer
> > since it lacks a guest flush mechanism.  There is no way for
> > applications to let the hypervisor know the file needs to be fsynced.
> > That's what virtio-pmem addresses.
> > 
> > Summary:
> > A subset of the block layer could be used to back virtio-pmem.  This
> > requires a new block driver API and the KVM async page fault mechanism
> > for trapping and mapping pages.  Actual emulated NVDIMM devices cannot
> > be supported unless the hardware specification is extended with a
> > virtualization-friendly interface in the future.
> > 
> > Please let me know your thoughts.
> > 
> > Stefan
> > 
> 
> 





* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-10  8:26         ` Stefan Hajnoczi
@ 2018-05-11 17:25           ` Kevin Wolf
  2018-05-14 13:48             ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: Kevin Wolf @ 2018-05-11 17:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Max Reitz, Pankaj Gupta, qemu block, qemu-devel, He Junyan


Am 10.05.2018 um 10:26 hat Stefan Hajnoczi geschrieben:
> On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> > On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> > >> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> > >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> > >> 2. Make the nvdimm device use the QEMU block layer so that it is backed
> > >>    by a non-raw disk image (such as a qcow2 file representing the
> > >>    content of the nvdimm) that supports snapshots.
> > >>
> > >>    This part is hard because it requires some completely new
> > >>    infrastructure such as mapping clusters of the image file to guest
> > >>    pages, and doing cluster allocation (including the copy on write
> > >>    logic) by handling guest page faults.
> > >>
> > >> I think it makes sense to invest some effort into such interfaces, but
> > >> be prepared for a long journey.
> > > 
> > > I like the suggestion but it needs to be followed up with a concrete
> > > design that is feasible and fair for Junyan and others to implement.
> > > Otherwise the "long journey" is really just a way of rejecting this
> > > feature.
> > > 
> > > Let's discuss the details of using the block layer for NVDIMM and try to
> > > come up with a plan.
> > > 
> > > The biggest issue with using the block layer is that persistent memory
> > > applications use load/store instructions to directly access data.  This
> > > is fundamentally different from the block layer, which transfers blocks
> > > of data to and from the device.
> > > 
> > > Because of block DMA, QEMU is able to perform processing at each block
> > > driver graph node.  This doesn't exist for persistent memory because
> > > software does not trap I/O.  Therefore the concept of filter nodes
> > > doesn't make sense for persistent memory - we certainly do not want to
> > > trap every I/O because performance would be terrible.
> > > 
> > > Another difference is that persistent memory I/O is synchronous.
> > > Load/store instructions execute quickly.  Perhaps we could use KVM async
> > > page faults in cases where QEMU needs to perform processing, but again
> > > the performance would be bad.
> > 
> > Let me first say that I have no idea how the interface to NVDIMM looks.
> > I just assume it works pretty much like normal RAM (so the interface is
> > just that it’s a part of the physical address space).
> > 
> > Also, it sounds a bit like you are already discarding my idea, but here
> > goes anyway.
> > 
> > Would it be possible to introduce a buffering block driver that presents
> > the guest an area of RAM/NVDIMM through an NVDIMM interface (so I
> > suppose as part of the guest address space)?  For writing, we’d keep a
> > dirty bitmap on it, and then we’d asynchronously move the dirty areas
> > through the block layer, so basically like mirror.  On flushing, we’d
> > block until everything is clean.
> > 
> > For reading, we’d follow a COR/stream model, basically, where everything
> > is unpopulated in the beginning and everything is loaded through the
> > block layer both asynchronously all the time and on-demand whenever the
> > guest needs something that has not been loaded yet.
> > 
> > Now I notice that that looks pretty much like a backing file model where
> > we constantly run both a stream and a commit job at the same time.
> > 
> > The user could decide how much memory to use for the buffer, so it could
> > either hold everything or be partially unallocated.
> > 
> > You’d probably want to back the buffer by NVDIMM normally, so that
> > nothing is lost on crashes (though this would imply that for partial
> > allocation the buffering block driver would need to know the mapping
> > between the area in real NVDIMM and its virtual representation of it).
> > 
> > Just my two cents while scanning through qemu-block to find emails that
> > don’t actually concern me...
> 
> The guest kernel already implements this - it's the page cache and the
> block layer!
> 
> Doing it in QEMU with dirty memory logging enabled is less efficient
> than doing it in the guest.
> 
> That's why I said it's better to just use block devices than to
> implement buffering.
> 
> I'm saying that persistent memory emulation on top of the iscsi:// block
> driver (for example) does not make sense.  It could be implemented but
> the performance wouldn't be better than block I/O and the
> complexity/code size in QEMU isn't justified IMO.

I think it could make sense if you put everything together.

The primary motivation to use this would of course be that you can
directly map the guest clusters of a qcow2 file into the guest. We'd
potentially fault on the first access, but once it's mapped, you get raw
speed. You're right about flushing, and I was indeed thinking of
Pankaj's work there; maybe I should have been more explicit about that.
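As a rough sketch of the mapping step (not QEMU code: the cluster lookup is a
stub and the "image" is just a flat toy file), the "map it once, then run at
raw speed" idea is essentially:

  /* toy-map.c: map one cluster of the image file straight into the address
   * range backing the guest page; later loads/stores do not exit at all. */
  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define CLUSTER_SIZE 65536

  /* stand-in for the real qcow2 L1/L2 lookup (which might also have to
   * allocate the cluster first) */
  static uint64_t lookup_host_cluster_offset(uint64_t guest_offset)
  {
      (void)guest_offset;
      return 9 * CLUSTER_SIZE;        /* pretend the data cluster lives here */
  }

  static int map_cluster(int image_fd, void *guest_hva, uint64_t guest_offset)
  {
      uint64_t host_off = lookup_host_cluster_offset(guest_offset);

      /* replace the faulting page's backing with the image file itself */
      void *p = mmap(guest_hva, CLUSTER_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_FIXED, image_fd, (off_t)host_off);
      return p == MAP_FAILED ? -1 : 0;
  }

  int main(void)
  {
      int fd = open("toy-image.bin", O_RDWR | O_CREAT, 0644);
      ftruncate(fd, 16 * CLUSTER_SIZE);
      void *guest_hva = mmap(NULL, CLUSTER_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      map_cluster(fd, guest_hva, 0);       /* "first access" to this cluster */
      ((char *)guest_hva)[0] = 42;         /* this store now hits the image */

      munmap(guest_hva, CLUSTER_SIZE);
      close(fd);
      return 0;
  }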

Now buffering in QEMU might come in useful when you want to run a block
job on the device. Block jobs are usually just temporary, and accepting
temporarily lower performance might be very acceptable when the
alternative is that you can't perform block jobs at all.

If we want to offer something nvdimm-like not only for the extreme
"performance only, no features" case, but as a viable option for the
average user, we need to be fast in the normal case, and allow to use
any block layer features without having to restart the VM with a
different storage device, even if at a performance penalty.

On iscsi, you still don't gain anything compared to just using a block
device, but support for that might just happen as a side effect when you
implement the interesting features.

Kevin


* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-11 17:25           ` Kevin Wolf
@ 2018-05-14 13:48             ` Stefan Hajnoczi
  2018-05-28  7:01               ` He, Junyan
  2018-05-30 14:44               ` Stefan Hajnoczi
  0 siblings, 2 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2018-05-14 13:48 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, Pankaj Gupta, He Junyan, qemu-devel, qemu block,
	Max Reitz


On Fri, May 11, 2018 at 07:25:31PM +0200, Kevin Wolf wrote:
> Am 10.05.2018 um 10:26 hat Stefan Hajnoczi geschrieben:
> > On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> > > On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > > > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> > > >> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> > > >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> > > >> 2. Make the nvdimm device use the QEMU block layer so that it is backed
> > > >>    by a non-raw disk image (such as a qcow2 file representing the
> > > >>    content of the nvdimm) that supports snapshots.
> > > >>
> > > >>    This part is hard because it requires some completely new
> > > >>    infrastructure such as mapping clusters of the image file to guest
> > > >>    pages, and doing cluster allocation (including the copy on write
> > > >>    logic) by handling guest page faults.
> > > >>
> > > >> I think it makes sense to invest some effort into such interfaces, but
> > > >> be prepared for a long journey.
> > > > 
> > > > I like the suggestion but it needs to be followed up with a concrete
> > > > design that is feasible and fair for Junyan and others to implement.
> > > > Otherwise the "long journey" is really just a way of rejecting this
> > > > feature.
> > > > 
> > > > Let's discuss the details of using the block layer for NVDIMM and try to
> > > > come up with a plan.
> > > > 
> > > > The biggest issue with using the block layer is that persistent memory
> > > > applications use load/store instructions to directly access data.  This
> > > > is fundamentally different from the block layer, which transfers blocks
> > > > of data to and from the device.
> > > > 
> > > > Because of block DMA, QEMU is able to perform processing at each block
> > > > driver graph node.  This doesn't exist for persistent memory because
> > > > software does not trap I/O.  Therefore the concept of filter nodes
> > > > doesn't make sense for persistent memory - we certainly do not want to
> > > > trap every I/O because performance would be terrible.
> > > > 
> > > > Another difference is that persistent memory I/O is synchronous.
> > > > Load/store instructions execute quickly.  Perhaps we could use KVM async
> > > > page faults in cases where QEMU needs to perform processing, but again
> > > > the performance would be bad.
> > > 
> > > Let me first say that I have no idea how the interface to NVDIMM looks.
> > > I just assume it works pretty much like normal RAM (so the interface is
> > > just that it’s a part of the physical address space).
> > > 
> > > Also, it sounds a bit like you are already discarding my idea, but here
> > > goes anyway.
> > > 
> > > Would it be possible to introduce a buffering block driver that presents
> > > the guest an area of RAM/NVDIMM through an NVDIMM interface (so I
> > > suppose as part of the guest address space)?  For writing, we’d keep a
> > > dirty bitmap on it, and then we’d asynchronously move the dirty areas
> > > through the block layer, so basically like mirror.  On flushing, we’d
> > > block until everything is clean.
> > > 
> > > For reading, we’d follow a COR/stream model, basically, where everything
> > > is unpopulated in the beginning and everything is loaded through the
> > > block layer both asynchronously all the time and on-demand whenever the
> > > guest needs something that has not been loaded yet.
> > > 
> > > Now I notice that that looks pretty much like a backing file model where
> > > we constantly run both a stream and a commit job at the same time.
> > > 
> > > The user could decide how much memory to use for the buffer, so it could
> > > either hold everything or be partially unallocated.
> > > 
> > > You’d probably want to back the buffer by NVDIMM normally, so that
> > > nothing is lost on crashes (though this would imply that for partial
> > > allocation the buffering block driver would need to know the mapping
> > > between the area in real NVDIMM and its virtual representation of it).
> > > 
> > > Just my two cents while scanning through qemu-block to find emails that
> > > don’t actually concern me...
> > 
> > The guest kernel already implements this - it's the page cache and the
> > block layer!
> > 
> > Doing it in QEMU with dirty memory logging enabled is less efficient
> > than doing it in the guest.
> > 
> > That's why I said it's better to just use block devices than to
> > implement buffering.
> > 
> > I'm saying that persistent memory emulation on top of the iscsi:// block
> > driver (for example) does not make sense.  It could be implemented but
> > the performance wouldn't be better than block I/O and the
> > complexity/code size in QEMU isn't justified IMO.
> 
> I think it could make sense if you put everything together.
> 
> The primary motivation to use this would of course be that you can
> directly map the guest clusters of a qcow2 file into the guest. We'd
> potentially fault on the first access, but once it's mapped, you get raw
> speed. You're right about flushing, and I was indeed thinking of
> Pankaj's work there; maybe I should have been more explicit about that.
> 
> Now buffering in QEMU might come in useful when you want to run a block
> job on the device. Block jobs are usually just temporary, and accepting
> temporarily lower performance might be very acceptable when the
> alternative is that you can't perform block jobs at all.

Why is buffering needed for block jobs?  They access the image using
traditional block layer I/O requests.

> If we want to offer something nvdimm-like not only for the extreme
> "performance only, no features" case, but as a viable option for the
> average user, we need to be fast in the normal case, and allow to use
> any block layer features without having to restart the VM with a
> different storage device, even if at a performance penalty.

What are the details involved in making this possible?

Persistent memory does not trap I/O but that is what filter drivers and
before write notifiers need.  So a page protection mechanism is required
for the block layer to trap persistent memory accesses.

Next, this needs to be integrated with BdrvTrackedRequest and
req->serialising so that copy-on-read, blockjobs, etc work correctly
when both traditional block I/O requests from blockjobs and direct
memory access from guest are taking place at the same time.

Page protection is only realistic with KVM async page faults, otherwise
faults freeze the vcpu until they are resolved.  kvm.ko needs to return
the page fault information to QEMU and QEMU must be able to resolve the
async page fault once it has mapped.  Perhaps userfaultfd(2) can be used
for this.
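A minimal userfaultfd(2) sketch of that idea, reduced to one page and one
fault (error handling omitted; unprivileged use may need
vm.unprivileged_userfaultfd=1 on newer kernels, and in QEMU the page would be
filled from the image file rather than from a zero buffer):

  /* uffd-sketch.c  (build: cc -pthread uffd-sketch.c)
   * Register a region, let another thread touch it, resolve the resulting
   * missing-page fault with UFFDIO_COPY. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <poll.h>
  #include <pthread.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static void *toucher(void *p)
  {
      /* first access to the registered range: this thread blocks in the
       * kernel until the fault is resolved with UFFDIO_COPY below */
      volatile char c = ((volatile char *)p)[0];
      (void)c;
      return NULL;
  }

  int main(void)
  {
      long page = sysconf(_SC_PAGESIZE);
      int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

      struct uffdio_api api = { .api = UFFD_API };
      ioctl(uffd, UFFDIO_API, &api);

      /* stands in for the guest-visible pmem area */
      char *region = mmap(NULL, 16 * page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      struct uffdio_register reg = {
          .range = { .start = (unsigned long)region, .len = 16 * page },
          .mode  = UFFDIO_REGISTER_MODE_MISSING,
      };
      ioctl(uffd, UFFDIO_REGISTER, &reg);

      pthread_t t;
      pthread_create(&t, NULL, toucher, region);

      /* fault loop (one iteration here; really a dedicated thread) */
      struct pollfd pfd = { .fd = uffd, .events = POLLIN };
      poll(&pfd, 1, -1);

      struct uffd_msg msg;
      read(uffd, &msg, sizeof(msg));
      if (msg.event == UFFD_EVENT_PAGEFAULT) {
          static char fill[65536];          /* page contents; in QEMU this
                                               would be read from the image */
          struct uffdio_copy copy = {
              .dst = msg.arg.pagefault.address & ~(unsigned long)(page - 1),
              .src = (unsigned long)fill,
              .len = page,
          };
          ioctl(uffd, UFFDIO_COPY, &copy);  /* maps the page, wakes toucher */
      }

      pthread_join(t, NULL);
      return 0;
  }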

This is as far as I've gotten before thinking about how buffering would
work.

> On iscsi, you still don't gain anything compared to just using a block
> device, but support for that might just happen as a side effect when you
> implement the interesting features.

If we get the feature for free as part of addressing another use case, I
won't complain :).

Stefan


* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-14 13:48             ` Stefan Hajnoczi
@ 2018-05-28  7:01               ` He, Junyan
  2018-05-30 14:44               ` Stefan Hajnoczi
  1 sibling, 0 replies; 18+ messages in thread
From: He, Junyan @ 2018-05-28  7:01 UTC (permalink / raw)
  To: zy107165
  Cc: Stefan Hajnoczi, Pankaj Gupta, qemu-devel, qemu block, Max Reitz,
	Stefan Hajnoczi, Kevin Wolf, Zhang, Yu C, Zhang, Yi Z

Hi Yang,

Alibaba made this proposal for NVDIMM snapshot optimization. Can you give
some advice on this discussion?

Thanks



-----Original Message-----
From: Stefan Hajnoczi [mailto:stefanha@gmail.com] 
Sent: Monday, May 14, 2018 9:49 PM
To: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>; Pankaj Gupta <pagupta@redhat.com>; He, Junyan <junyan.he@intel.com>; qemu-devel@nongnu.org; qemu block <qemu-block@nongnu.org>; Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-block] [Qemu-devel] Some question about savem/qcow2 incremental snapshot

On Fri, May 11, 2018 at 07:25:31PM +0200, Kevin Wolf wrote:
> Am 10.05.2018 um 10:26 hat Stefan Hajnoczi geschrieben:
> > On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> > > On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > > > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> > > >> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> > > >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> > > >> 2. Make the nvdimm device use the QEMU block layer so that it is backed
> > > >>    by a non-raw disk image (such as a qcow2 file representing the
> > > >>    content of the nvdimm) that supports snapshots.
> > > >>
> > > >>    This part is hard because it requires some completely new
> > > >>    infrastructure such as mapping clusters of the image file to guest
> > > >>    pages, and doing cluster allocation (including the copy on write
> > > >>    logic) by handling guest page faults.
> > > >>
> > > >> I think it makes sense to invest some effort into such 
> > > >> interfaces, but be prepared for a long journey.
> > > > 
> > > > I like the suggestion but it needs to be followed up with a 
> > > > concrete design that is feasible and fair for Junyan and others to implement.
> > > > Otherwise the "long journey" is really just a way of rejecting 
> > > > this feature.
> > > > 
> > > > Let's discuss the details of using the block layer for NVDIMM 
> > > > and try to come up with a plan.
> > > > 
> > > > The biggest issue with using the block layer is that persistent 
> > > > memory applications use load/store instructions to directly 
> > > > access data.  This is fundamentally different from the block 
> > > > layer, which transfers blocks of data to and from the device.
> > > > 
> > > > Because of block DMA, QEMU is able to perform processing at each 
> > > > block driver graph node.  This doesn't exist for persistent 
> > > > memory because software does not trap I/O.  Therefore the 
> > > > concept of filter nodes doesn't make sense for persistent memory 
> > > > - we certainly do not want to trap every I/O because performance would be terrible.
> > > > 
> > > > Another difference is that persistent memory I/O is synchronous.
> > > > Load/store instructions execute quickly.  Perhaps we could use 
> > > > KVM async page faults in cases where QEMU needs to perform 
> > > > processing, but again the performance would be bad.
> > > 
> > > Let me first say that I have no idea how the interface to NVDIMM looks.
> > > I just assume it works pretty much like normal RAM (so the 
> > > interface is just that it’s a part of the physical address space).
> > > 
> > > Also, it sounds a bit like you are already discarding my idea, but 
> > > here goes anyway.
> > > 
> > > Would it be possible to introduce a buffering block driver that 
> > > presents the guest an area of RAM/NVDIMM through an NVDIMM 
> > > interface (so I suppose as part of the guest address space)?  For 
> > > writing, we’d keep a dirty bitmap on it, and then we’d 
> > > asynchronously move the dirty areas through the block layer, so 
> > > basically like mirror.  On flushing, we’d block until everything is clean.
> > > 
> > > For reading, we’d follow a COR/stream model, basically, where 
> > > everything is unpopulated in the beginning and everything is 
> > > loaded through the block layer both asynchronously all the time 
> > > and on-demand whenever the guest needs something that has not been loaded yet.
> > > 
> > > Now I notice that that looks pretty much like a backing file model 
> > > where we constantly run both a stream and a commit job at the same time.
> > > 
> > > The user could decide how much memory to use for the buffer, so it 
> > > could either hold everything or be partially unallocated.
> > > 
> > > You’d probably want to back the buffer by NVDIMM normally, so that 
> > > nothing is lost on crashes (though this would imply that for 
> > > partial allocation the buffering block driver would need to know 
> > > the mapping between the area in real NVDIMM and its virtual representation of it).
> > > 
> > > Just my two cents while scanning through qemu-block to find emails 
> > > that don’t actually concern me...
> > 
> > The guest kernel already implements this - it's the page cache and 
> > the block layer!
> > 
> > Doing it in QEMU with dirty memory logging enabled is less efficient 
> > than doing it in the guest.
> > 
> > That's why I said it's better to just use block devices than to 
> > implement buffering.
> > 
> > I'm saying that persistent memory emulation on top of the iscsi:// 
> > block driver (for example) does not make sense.  It could be 
> > implemented but the performance wouldn't be better than block I/O 
> > and the complexity/code size in QEMU isn't justified IMO.
> 
> I think it could make sense if you put everything together.
> 
> The primary motivation to use this would of course be that you can 
> directly map the guest clusters of a qcow2 file into the guest. We'd 
> potentially fault on the first access, but once it's mapped, you get 
> raw speed. You're right about flushing, and I was indeed thinking of 
> Pankaj's work there; maybe I should have been more explicit about that.
> 
> Now buffering in QEMU might come in useful when you want to run a 
> block job on the device. Block jobs are usually just temporary, and 
> accepting temporarily lower performance might be very acceptable when 
> the alternative is that you can't perform block jobs at all.

Why is buffering needed for block jobs?  They access the image using traditional block layer I/O requests.

> If we want to offer something nvdimm-like not only for the extreme 
> "performance only, no features" case, but as a viable option for the 
> average user, we need to be fast in the normal case, and allow to use 
> any block layer features without having to restart the VM with a 
> different storage device, even if at a performance penalty.

What are the details involved in making this possible?

Persistent memory does not trap I/O but that is what filter drivers and before write notifiers need.  So a page protection mechanism is required for the block layer to trap persistent memory accesses.

Next, this needs to be integrated with BdrvTrackedRequest and
req->serialising so that copy-on-read, blockjobs, etc work correctly
when both traditional block I/O requests from blockjobs and direct memory access from guest are taking place at the same time.

Page protection is only realistic with KVM async page faults, otherwise faults freeze the vcpu until they are resolved.  kvm.ko needs to return the page fault information to QEMU and QEMU must be able to resolve the async page fault once it has mapped.  Perhaps userfaultfd(2) can be used for this.

This is as far as I've gotten before thinking about how buffering would work.

> On iscsi, you still don't gain anything compared to just using a block 
> device, but support for that might just happen as a side effect when 
> you implement the interesting features.

If we get the feature for free as part of addressing another use case, I won't complain :).

Stefan


* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-14 13:48             ` Stefan Hajnoczi
  2018-05-28  7:01               ` He, Junyan
@ 2018-05-30 14:44               ` Stefan Hajnoczi
  2018-05-30 16:07                 ` Kevin Wolf
  1 sibling, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2018-05-30 14:44 UTC (permalink / raw)
  To: Kevin Wolf, Max Reitz, He Junyan; +Cc: Pankaj Gupta, qemu-devel, qemu block


On Mon, May 14, 2018 at 02:48:47PM +0100, Stefan Hajnoczi wrote:
> On Fri, May 11, 2018 at 07:25:31PM +0200, Kevin Wolf wrote:
> > Am 10.05.2018 um 10:26 hat Stefan Hajnoczi geschrieben:
> > > On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> > > > On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > > > > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> > > > >> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> > > > >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> > > > >> I think it makes sense to invest some effort into such interfaces, but
> > > > >> be prepared for a long journey.
> > > > > 
> > > > > I like the suggestion but it needs to be followed up with a concrete
> > > > > design that is feasible and fair for Junyan and others to implement.
> > > > > Otherwise the "long journey" is really just a way of rejecting this
> > > > > feature.

The discussion on NVDIMM via the block layer has run its course.  It
would be a big project and I don't think it's fair to ask Junyan to
implement it.

My understanding is this patch series doesn't modify the qcow2 on-disk
file format.  Rather, it just uses existing qcow2 mechanisms and extends
live migration to identify the NVDIMM state region to share the
clusters.

Since this feature does not involve qcow2 format changes and is just an
optimization (dirty blocks still need to be allocated), it can be
removed from QEMU in the future if a better alternative becomes
available.

Junyan: Can you rebase the series and send a new revision?

Kevin and Max: Does this sound alright?

Stefan


* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-30 14:44               ` Stefan Hajnoczi
@ 2018-05-30 16:07                 ` Kevin Wolf
  2018-05-31 10:48                   ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: Kevin Wolf @ 2018-05-30 16:07 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Max Reitz, He Junyan, Pankaj Gupta, qemu-devel, qemu block


Am 30.05.2018 um 16:44 hat Stefan Hajnoczi geschrieben:
> On Mon, May 14, 2018 at 02:48:47PM +0100, Stefan Hajnoczi wrote:
> > On Fri, May 11, 2018 at 07:25:31PM +0200, Kevin Wolf wrote:
> > > Am 10.05.2018 um 10:26 hat Stefan Hajnoczi geschrieben:
> > > > On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> > > > > On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > > > > > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> > > > > >> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> > > > > >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> > > > > >> I think it makes sense to invest some effort into such interfaces, but
> > > > > >> be prepared for a long journey.
> > > > > > 
> > > > > > I like the suggestion but it needs to be followed up with a concrete
> > > > > > design that is feasible and fair for Junyan and others to implement.
> > > > > > Otherwise the "long journey" is really just a way of rejecting this
> > > > > > feature.
> 
> The discussion on NVDIMM via the block layer has run its course.  It
> would be a big project and I don't think it's fair to ask Junyan to
> implement it.
> 
> My understanding is this patch series doesn't modify the qcow2 on-disk
> file format.  Rather, it just uses existing qcow2 mechanisms and extends
> live migration to identify the NVDIMM state region to share the
> clusters.
> 
> Since this feature does not involve qcow2 format changes and is just an
> optimization (dirty blocks still need to be allocated), it can be
> removed from QEMU in the future if a better alternative becomes
> available.
> 
> Junyan: Can you rebase the series and send a new revision?
> 
> Kevin and Max: Does this sound alright?

Do patches exist? I've never seen any, so I thought this was just the
early design stage.

I suspect that while it wouldn't change the qcow2 on-disk format in a
way that the qcow2 spec would have to be changed, it does need to change
the VMState format that is stored as a blob within the qcow2 file.
At least, you need to store which other snapshot it is based upon so
that you can actually resume a VM from the incremental state.
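For example (purely hypothetical, nothing like this exists today), the
dependency could be carried as an extra vmstate section whose fields record
which snapshot the incremental data is based on; the fragment below uses
QEMU's existing vmstate macros and would only build inside the QEMU tree:

  /* Sketch only: struct and field names are invented, the macros are QEMU's. */
  #include "migration/vmstate.h"

  typedef struct NVDIMMIncrementalState {
      uint32_t based_on_snapshot_id;   /* snapshot holding the shared clusters */
      uint64_t dirty_bitmap_size;      /* bytes of dirty bitmap that follow */
  } NVDIMMIncrementalState;

  static const VMStateDescription vmstate_nvdimm_incremental = {
      .name = "nvdimm/incremental",
      .version_id = 1,
      .minimum_version_id = 1,
      .fields = (VMStateField[]) {
          VMSTATE_UINT32(based_on_snapshot_id, NVDIMMIncrementalState),
          VMSTATE_UINT64(dirty_bitmap_size, NVDIMMIncrementalState),
          VMSTATE_END_OF_LIST()
      }
  };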

Once you modify the VMState format/the migration stream, removing it
from QEMU again later means that you can't load your old snapshots any
more. Doing that, even with the two-release deprecation period, would be
quite nasty.

But you're right, depending on how the feature is implemented, it might
not be a thing that affects qcow2 much, but one that the migration
maintainers need to have a look at. I kind of suspect that it would
actually touch both parts to a degree that it would need approval from
both sides.

Kevin


* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-30 16:07                 ` Kevin Wolf
@ 2018-05-31 10:48                   ` Stefan Hajnoczi
  2018-06-08  5:02                     ` He, Junyan
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2018-05-31 10:48 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Max Reitz, He Junyan, Pankaj Gupta, qemu-devel, qemu block


On Wed, May 30, 2018 at 06:07:19PM +0200, Kevin Wolf wrote:
> Am 30.05.2018 um 16:44 hat Stefan Hajnoczi geschrieben:
> > On Mon, May 14, 2018 at 02:48:47PM +0100, Stefan Hajnoczi wrote:
> > > On Fri, May 11, 2018 at 07:25:31PM +0200, Kevin Wolf wrote:
> > > > Am 10.05.2018 um 10:26 hat Stefan Hajnoczi geschrieben:
> > > > > On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> > > > > > On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > > > > > > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> > > > > > >> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> > > > > > >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> > > > > > >> I think it makes sense to invest some effort into such interfaces, but
> > > > > > >> be prepared for a long journey.
> > > > > > > 
> > > > > > > I like the suggestion but it needs to be followed up with a concrete
> > > > > > > design that is feasible and fair for Junyan and others to implement.
> > > > > > > Otherwise the "long journey" is really just a way of rejecting this
> > > > > > > feature.
> > 
> > The discussion on NVDIMM via the block layer has run its course.  It
> > would be a big project and I don't think it's fair to ask Junyan to
> > implement it.
> > 
> > My understanding is this patch series doesn't modify the qcow2 on-disk
> > file format.  Rather, it just uses existing qcow2 mechanisms and extends
> > live migration to identify the NVDIMM state region to share the
> > clusters.
> > 
> > Since this feature does not involve qcow2 format changes and is just an
> > optimization (dirty blocks still need to be allocated), it can be
> > removed from QEMU in the future if a better alternative becomes
> > available.
> > 
> > Junyan: Can you rebase the series and send a new revision?
> > 
> > Kevin and Max: Does this sound alright?
> 
> Do patches exist? I've never seen any, so I thought this was just the
> early design stage.

Sorry for the confusion, the earlier patch series was here:

  https://lists.nongnu.org/archive/html/qemu-devel/2018-03/msg04530.html

> I suspect that while it wouldn't change the qcow2 on-disk format in a
> way that the qcow2 spec would have to be changed, it does need to change
> the VMState format that is stored as a blob within the qcow2 file.
> At least, you need to store which other snapshot it is based upon so
> that you can actually resume a VM from the incremental state.
> 
> Once you modify the VMState format/the migration stream, removing it
> from QEMU again later means that you can't load your old snapshots any
> more. Doing that, even with the two-release deprecation period, would be
> quite nasty.
> 
> But you're right, depending on how the feature is implemented, it might
> not be a thing that affects qcow2 much, but one that the migration
> maintainers need to have a look at. I kind of suspect that it would
> actually touch both parts to a degree that it would need approval from
> both sides.

VMState wire format changes are minimal.  The only issue is that the
previous snapshot's nvdimm vmstate can start at an arbitrary offset in
the qcow2 cluster.  We can find a solution to the misalignment problem
(I think Junyan's patch series adds padding).

The approach references existing clusters in the previous snapshot's
vmstate area and only allocates new clusters for dirty NVDIMM regions.
In the non-qcow2 case we fall back to writing the entire NVDIMM
contents.

So instead of:

  write(qcow2_bs, all_vmstate_data); /* duplicates nvdimm contents :( */

do:

  write(bs, vmstate_data_upto_nvdimm);
  if (is_qcow2(bs)) {
      snapshot_clone_vmstate_range(bs, previous_snapshot,
                                   offset_to_nvdimm_vmstate);
      overwrite_nvdimm_dirty_blocks(bs, nvdimm);
  } else {
      write(bs, nvdimm_vmstate_data);
  }
  write(bs, vmstate_data_after_nvdimm);

Stefan


* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-05-31 10:48                   ` Stefan Hajnoczi
@ 2018-06-08  5:02                     ` He, Junyan
  2018-06-08  7:59                       ` Pankaj Gupta
  2018-06-08 13:29                       ` Stefan Hajnoczi
  0 siblings, 2 replies; 18+ messages in thread
From: He, Junyan @ 2018-06-08  5:02 UTC (permalink / raw)
  To: Stefan Hajnoczi, Kevin Wolf
  Cc: Max Reitz, Pankaj Gupta, qemu-devel, qemu block

Dear all:

I only switched from the graphics/media field to virtualization at the end of last year,
so I am sorry that, although I have tried my best, I still feel a little dizzy about your
previous discussion of NVDIMM via the block layer :)
In today's QEMU, the SaveVMHandlers functions handle both snapshots and migration. So nvdimm
memory is migrated and snapshotted the same way as RAM (savevm_ram_handlers). The difference
is that an nvdimm may be huge, and its load and store speed is slower. In my usage, with a
256G nvdimm as the memory backend, it can take more than 5 minutes to complete one snapshot
save, and afterwards the qcow2 image is bigger than 50G. For migration this may not be a
problem, because we do not need extra disk space and the guest is not paused during the
migration process. But for a snapshot we need to pause the VM, so the user experience is bad,
and we have concerns about that.
I posted this question in January this year but did not get enough replies. Then I sent an
RFC patch set in March; the basic idea is to use snapshot dependencies and the kernel's dirty
log tracking to optimize this.

https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg04530.html

I handle this in a simple way:
1. Separate the nvdimm region from RAM when taking a snapshot.
2. For the first snapshot, dump all the nvdimm data the same way as RAM, and enable dirty log
tracking for nvdimm regions.
3. For later snapshots, find the previous snapshot point and add references to the clusters it
uses to store nvdimm data; this time we only save the dirty page bitmap and the dirty pages.
Because the previous snapshot's nvdimm data clusters have their reference counts increased, we
do not need to worry about them being deleted.

I encountered a lot of problems:
1. The migration and snapshot logic is mixed together and needs to be separated for nvdimm.
2. Clusters have alignment requirements. When taking a snapshot we just write data to disk
contiguously, but because we need to add references to clusters we really have to consider
alignment. For now I use a little trick of padding data up to the alignment, and I do not
think that is a good way.
3. Dirty log tracking may have some performance impact.

In theory, this approach can be used for snapshots of any kind of huge memory; we need to find
the balance between guest performance (because of dirty log tracking) and snapshot saving time.
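To make steps 1-3 a bit more concrete, here is a self-contained sketch of the
save-side flow (not the actual patches: plain pwrite() to a file stands in for
the qcow2 vmstate area, a bool array stands in for the KVM dirty bitmap, and
the cluster refcounting for clean pages is only indicated by a comment):

  /* save-sketch.c */
  #include <fcntl.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <unistd.h>

  #define PAGE_SIZE    4096
  #define CLUSTER_SIZE 65536              /* qcow2 cluster size used for padding */

  static off_t align_up(off_t off, off_t align)
  {
      return (off + align - 1) & ~(align - 1);
  }

  /* first snapshot: dump the whole nvdimm region, then enable dirty logging */
  static off_t save_full(int fd, off_t off, const uint8_t *nvdimm, size_t size)
  {
      off = align_up(off, CLUSTER_SIZE);  /* so later snapshots can share clusters */
      pwrite(fd, nvdimm, size, off);
      /* enable_dirty_logging();  -- hypothetical, done via KVM in reality */
      return off + (off_t)size;
  }

  /* later snapshots: dirty bitmap plus dirty pages only; clean pages keep
   * pointing at the previous snapshot's clusters (their refcount is bumped) */
  static off_t save_incremental(int fd, off_t off, const uint8_t *nvdimm,
                                size_t size, const bool *dirty /* per page */)
  {
      size_t pages = size / PAGE_SIZE;

      off = align_up(off, CLUSTER_SIZE);
      pwrite(fd, dirty, pages * sizeof(bool), off);       /* the dirty bitmap */
      off += (off_t)(pages * sizeof(bool));

      for (size_t i = 0; i < pages; i++) {
          if (dirty[i]) {
              pwrite(fd, nvdimm + i * PAGE_SIZE, PAGE_SIZE, off);
              off += PAGE_SIZE;
          }
          /* else: reference the cluster written by the previous snapshot */
      }
      return off;
  }

  int main(void)
  {
      static uint8_t nvdimm[4 * PAGE_SIZE];               /* tiny stand-in region */
      static const bool dirty[4] = { false, true, false, false };
      int fd = open("vmstate.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);

      off_t off = save_full(fd, 0, nvdimm, sizeof(nvdimm));
      save_incremental(fd, off, nvdimm, sizeof(nvdimm), dirty);
      close(fd);
      return 0;
  }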

Thanks
Junyan


-----Original Message-----
From: Stefan Hajnoczi [mailto:stefanha@redhat.com] 
Sent: Thursday, May 31, 2018 6:49 PM
To: Kevin Wolf <kwolf@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>; He, Junyan <junyan.he@intel.com>; Pankaj Gupta <pagupta@redhat.com>; qemu-devel@nongnu.org; qemu block <qemu-block@nongnu.org>
Subject: Re: [Qemu-block] [Qemu-devel] Some question about savem/qcow2 incremental snapshot

On Wed, May 30, 2018 at 06:07:19PM +0200, Kevin Wolf wrote:
> Am 30.05.2018 um 16:44 hat Stefan Hajnoczi geschrieben:
> > On Mon, May 14, 2018 at 02:48:47PM +0100, Stefan Hajnoczi wrote:
> > > On Fri, May 11, 2018 at 07:25:31PM +0200, Kevin Wolf wrote:
> > > > Am 10.05.2018 um 10:26 hat Stefan Hajnoczi geschrieben:
> > > > > On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> > > > > > On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > > > > > > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> > > > > > >> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> > > > > > >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> > > > > > >> I think it makes sense to invest some effort into such 
> > > > > > >> interfaces, but be prepared for a long journey.
> > > > > > > 
> > > > > > > I like the suggestion but it needs to be followed up with 
> > > > > > > a concrete design that is feasible and fair for Junyan and others to implement.
> > > > > > > Otherwise the "long journey" is really just a way of 
> > > > > > > rejecting this feature.
> > 
> > The discussion on NVDIMM via the block layer has run its course.
> > It would be a big project and I don't think it's fair to ask Junyan 
> > to implement it.
> > 
> > My understanding is this patch series doesn't modify the qcow2 
> > on-disk file format.  Rather, it just uses existing qcow2 mechanisms 
> > and extends live migration to identify the NVDIMM state region 
> > to share the clusters.
> > 
> > Since this feature does not involve qcow2 format changes and is just 
> > an optimization (dirty blocks still need to be allocated), it can be 
> > removed from QEMU in the future if a better alternative becomes 
> > available.
> > 
> > Junyan: Can you rebase the series and send a new revision?
> > 
> > Kevin and Max: Does this sound alright?
> 
> Do patches exist? I've never seen any, so I thought this was just the 
> early design stage.

Sorry for the confusion, the earlier patch series was here:

  https://lists.nongnu.org/archive/html/qemu-devel/2018-03/msg04530.html

> I suspect that while it wouldn't change the qcow2 on-disk format in a 
> way that the qcow2 spec would have to be changed, it does need to 
> change the VMState format that is stored as a blob within the qcow2 file.
> At least, you need to store which other snapshot it is based upon so 
> that you can actually resume a VM from the incremental state.
> 
> Once you modify the VMState format/the migration stream, removing it 
> from QEMU again later means that you can't load your old snapshots any 
> more. Doing that, even with the two-release deprecation period, would 
> be quite nasty.
> 
> But you're right, depending on how the feature is implemented, it 
> might not be a thing that affects qcow2 much, but one that the 
> migration maintainers need to have a look at. I kind of suspect that 
> it would actually touch both parts to a degree that it would need 
> approval from both sides.

VMState wire format changes are minimal.  The only issue is that the previous snapshot's nvdimm vmstate can start at an arbitrary offset in the qcow2 cluster.  We can find a solution to the misalignment problem (I think Junyan's patch series adds padding).

The approach references existing clusters in the previous snapshot's vmstate area and only allocates new clusters for dirty NVDIMM regions.
In the non-qcow2 case we fall back to writing the entire NVDIMM contents.

So instead of:

  write(qcow2_bs, all_vmstate_data); /* duplicates nvdimm contents :( */

do:

  write(bs, vmstate_data_upto_nvdimm);
  if (is_qcow2(bs)) {
      snapshot_clone_vmstate_range(bs, previous_snapshot,
                                   offset_to_nvdimm_vmstate);
      overwrite_nvdimm_dirty_blocks(bs, nvdimm);
  } else {
      write(bs, nvdimm_vmstate_data);
  }
  write(bs, vmstate_data_after_nvdimm);

Stefan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-06-08  5:02                     ` He, Junyan
@ 2018-06-08  7:59                       ` Pankaj Gupta
  2018-06-08 15:58                         ` Junyan He
  2018-06-08 13:29                       ` Stefan Hajnoczi
  1 sibling, 1 reply; 18+ messages in thread
From: Pankaj Gupta @ 2018-06-08  7:59 UTC (permalink / raw)
  To: Junyan He; +Cc: Stefan Hajnoczi, Kevin Wolf, qemu-devel, qemu block, Max Reitz


Hi Junyan,

AFAICU you are trying to utilize qcow2 capabilities to do incremental
snapshots. As I understand it, an NVDIMM device (be it real or emulated) always has
its contents backed up in the backing device.

Now the question is how to take a snapshot at some point in time. You are
trying to achieve this with the qcow2 format (I have not checked the code yet), and
I have the following queries:

- Are you implementing this feature for both actual DAX device pass-through
  and emulated DAX?
- Are you using an additional qcow2 disk for storing/taking snapshots? How are we
  planning to use this feature?

The reason I ask is that if we concentrate on integrating qcow2
with DAX, we will have a full-fledged solution for most of the use cases.

Thanks,
Pankaj 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-06-08  5:02                     ` He, Junyan
  2018-06-08  7:59                       ` Pankaj Gupta
@ 2018-06-08 13:29                       ` Stefan Hajnoczi
  1 sibling, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2018-06-08 13:29 UTC (permalink / raw)
  To: He, Junyan; +Cc: Kevin Wolf, Max Reitz, Pankaj Gupta, qemu-devel, qemu block

On Fri, Jun 08, 2018 at 05:02:58AM +0000, He, Junyan wrote:
> I use the simple way to handle this,
> 1. Separate the nvdimm region from ram when do snapshot.
> 2. If the first time, we dump all the nvdimm data the same as ram, and enable dirty log trace
> for nvdimm kind region.
> 3. If not the first time, we find the previous snapshot point and add reference to its clusters
> which is used to store nvdimm data. And this time, we just save dirty page bitmap and dirty pages.
> Because the previous nvdimm data clusters is ref added, we do not need to worry about its deleting.
> 
> I encounter a lot of problems:
> 1. Migration and snapshot logic is mixed and need to separate them for nvdimm.
> 2. Cluster has its alignment. When do snapshot, we just save data to disk continuous. Because we
> need to add ref to cluster, we really need to consider the alignment. I just use a little trick way 
> to padding some data to alignment now, and I think it is not a good way.
> 3. Dirty log trace may have some performance problem.
> 
> In theory, this manner can be used to handle all kind of huge memory snapshot, we need to find the 
> balance between guest performance(Because of dirty log trace) and snapshot saving time.

If the snapshots are placed on the NVDIMM then save/load times should be
shorter.  I'm not sure how practical that is since this approach may be
too expensive for users.

Stefan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-06-08  7:59                       ` Pankaj Gupta
@ 2018-06-08 15:58                         ` Junyan He
  2018-06-08 16:38                           ` Pankaj Gupta
  0 siblings, 1 reply; 18+ messages in thread
From: Junyan He @ 2018-06-08 15:58 UTC (permalink / raw)
  To: Pankaj Gupta, Junyan He
  Cc: Kevin Wolf, qemu block, qemu-devel, Stefan Hajnoczi, Max Reitz

I think NVDIMM-backed memory can indeed persist its contents (whether real or emulated).
But it is still memory, so as I understand it, its data should be stored in the qcow2 image
or in some external snapshot data image, so that we can copy the qcow2 image somewhere else
and restore the same environment. A qcow2 image contains all the VM state, disk data and
memory data, so I think the NVDIMM's data should also be stored in that qcow2 image.

I am really new to the VMM field and do not know how qcow2 would be used with DAX. As far
as I know, DAX is a kernel filesystem option that lets page mappings bypass the block
device layer to improve performance, while qcow2 is a userspace file format used by QEMU
to emulate disks (am I right?), so I have no idea how the two would fit together.

Thanks
Junyan

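
As a rough illustration of that file-backed mapping (this is not QEMU code; the file
name and sizes below are made up), the emulated-NVDIMM case boils down to something
like this:

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      const size_t size = 16 * 1024 * 1024;   /* 16 MiB toy region */
      int fd = open("nvdimm-backing.raw", O_RDWR | O_CREAT, 0644);
      if (fd < 0 || ftruncate(fd, size) < 0) {
          perror("open/ftruncate");
          return 1;
      }

      /* MAP_SHARED: stores by the "guest" land in these pages and are synced
       * back to the backing file, which is what makes the contents persist. */
      void *region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (region == MAP_FAILED) {
          perror("mmap");
          return 1;
      }

      memcpy(region, "hello", 5);        /* a "guest" write */
      msync(region, size, MS_SYNC);      /* flush dirty pages to the file */

      munmap(region, size);
      close(fd);
      return 0;
  }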
^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
  2018-06-08 16:38                           ` Pankaj Gupta
@ 2018-06-08 16:49                             ` Pankaj Gupta
  0 siblings, 0 replies; 18+ messages in thread
From: Pankaj Gupta @ 2018-06-08 16:49 UTC (permalink / raw)
  To: Junyan He
  Cc: Junyan He, Kevin Wolf, qemu block, qemu-devel, Stefan Hajnoczi,
	Max Reitz

> Hi Junyan,

Just want to add that in this same email thread there is a discussion by Stefan & Kevin
describing ways of handling NVDIMM similarly to a block device, with the pros & cons of
that approach.

In light of that, we have to treat this as memory, but qcow2 support is still required
for snapshots, handling block errors and other advanced features.

Hope this helps.

Thanks, 
Pankaj 

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-06-08 16:49 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-25  7:33 [Qemu-devel] Some question about savem/qcow2 incremental snapshot He Junyan
2018-05-08 14:41 ` Eric Blake
2018-05-08 15:03   ` [Qemu-devel] [Qemu-block] " Kevin Wolf
2018-05-09 10:16     ` Stefan Hajnoczi
2018-05-09 17:54       ` Max Reitz
2018-05-10  8:26         ` Stefan Hajnoczi
2018-05-11 17:25           ` Kevin Wolf
2018-05-14 13:48             ` Stefan Hajnoczi
2018-05-28  7:01               ` He, Junyan
2018-05-30 14:44               ` Stefan Hajnoczi
2018-05-30 16:07                 ` Kevin Wolf
2018-05-31 10:48                   ` Stefan Hajnoczi
2018-06-08  5:02                     ` He, Junyan
2018-06-08  7:59                       ` Pankaj Gupta
2018-06-08 15:58                         ` Junyan He
2018-06-08 16:38                           ` Pankaj Gupta
2018-06-08 16:49                             ` Pankaj Gupta
2018-06-08 13:29                       ` Stefan Hajnoczi
