* [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED
       [not found] <OF3C9DAE9F.EC6B5878-ON85257826.00715C10-85257826.007A14FB@LocalDomain>
@ 2011-02-15 19:45 ` Chunqiang Tang
  2011-02-16 12:34   ` Kevin Wolf
  2011-02-16 13:21   ` [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED Stefan Hajnoczi
  0 siblings, 2 replies; 87+ messages in thread
From: Chunqiang Tang @ 2011-02-15 19:45 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
> As you requested, I set up a wiki page for FVD at
> http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a
> detailed specification of FVD, and a comparison of the design and
> performance of FVD and QED.
>
> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
> figure shows that the file creation throughput of NetApp's PostMark
> benchmark under FVD is 74.9% to 215% higher than that under QED.

Hi Anthony,

Please let me know if more information is needed. I would appreciate your 
feedback and advice on the best way to proceed with FVD. 

BTW, I recently added QCOW2 to the performance comparison figure on the
wiki.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang


* Re: [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED
  2011-02-15 19:45 ` [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED Chunqiang Tang
@ 2011-02-16 12:34   ` Kevin Wolf
  2011-02-17 16:04     ` Chunqiang Tang
  2011-02-18  9:12     ` Strategic decision: COW format (was: [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED) Markus Armbruster
  2011-02-16 13:21   ` [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED Stefan Hajnoczi
  1 sibling, 2 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-02-16 12:34 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

Am 15.02.2011 20:45, schrieb Chunqiang Tang:
>> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
>> As you requested, I set up a wiki page for FVD at
>> http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a
>> detailed specification of FVD, and a comparison of the design and
>> performance of FVD and QED.
>
>> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
>> figure shows that the file creation throughput of NetApp's PostMark
>> benchmark under FVD is 74.9% to 215% higher than that under QED.
> 
> Hi Anthony,
> 
> Please let me know if more information is needed. I would appreciate your 
> feedback and advice on the best way to proceed with FVD. 

Yet another file format with yet another implementation is definitely
not what we need. We should probably take some of the ideas in FVD and
consider them for qcow3.

However, I think some of them like the "no-alloc" mode aren't that
useful: If I want the features and the performance of raw, I can just
take raw.

> BTW, I recently added QCOW2 to the performance comparison figure on
> the wiki.

It's obvious why you have only one case for QED (it doesn't support
anything else), but qcow2 works on block devices, too, and you can also
use metadata preallocation. Are you aware of this?
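
A minimal example of the kind of invocation being referred to, assuming
the qemu-img option syntax of that era (treat the exact spelling as an
assumption to check against your version):

    # qcow2 with preallocated metadata; no backing file, since the reply
    # below notes that preallocation and backing files don't mix
    qemu-img create -f qcow2 -o preallocation=metadata test.qcow2 50G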

Kevin


* Re: [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED
  2011-02-15 19:45 ` [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED Chunqiang Tang
  2011-02-16 12:34   ` Kevin Wolf
@ 2011-02-16 13:21   ` Stefan Hajnoczi
  2011-02-17 16:04     ` Chunqiang Tang
  1 sibling, 1 reply; 87+ messages in thread
From: Stefan Hajnoczi @ 2011-02-16 13:21 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On Tue, Feb 15, 2011 at 7:45 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
>> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
>> As you requested, I set up a wiki page for FVD at
>> http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a
>> detailed specification of FVD, and a comparison of the design and
>> performance of FVD and QED.
>
>> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
>> figure shows that the file creation throughput of NetApp's PostMark
>> benchmark under FVD is 74.9% to 215% higher than that under QED.

File creation on a sparse image is currently limited by the fact that
the QED implementation serializes allocating writes.  In an earlier QED
patch series I had fine-grained metadata locking so that allocating
writes scale better, but that led to a regression on another benchmark.
I have this on my todo list.

This is an implementation-specific limitation and is unrelated to the
file format.  It doesn't tell us about a difference between formats.
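
A minimal sketch of the serialization being described, assuming a
simplified request model; none of these names are from the actual QED
code:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Req { struct Req *next; } Req;

    typedef struct {
        bool alloc_in_flight;   /* cluster allocation in progress? */
        Req *head, *tail;       /* parked allocating writes */
    } AllocState;

    static void allocating_write_done(AllocState *s);

    /* Stand-in for "allocate a cluster, update L1/L2, write the data". */
    static void do_allocate_and_write(AllocState *s, Req *r)
    {
        (void)r;
        allocating_write_done(s);
    }

    /* Only one allocating write runs at a time; the rest queue up.  This
     * is what serializes a file-creation workload on a sparse image. */
    static void submit_allocating_write(AllocState *s, Req *r)
    {
        r->next = NULL;
        if (s->alloc_in_flight) {
            if (s->tail) s->tail->next = r; else s->head = r;
            s->tail = r;
            return;
        }
        s->alloc_in_flight = true;
        do_allocate_and_write(s, r);
    }

    static void allocating_write_done(AllocState *s)
    {
        s->alloc_in_flight = false;
        Req *r = s->head;
        if (r) {
            s->head = r->next;
            if (!s->head) s->tail = NULL;
            submit_allocating_write(s, r);
        }
    }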

Stefan


* Re: [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED
  2011-02-16 12:34   ` Kevin Wolf
@ 2011-02-17 16:04     ` Chunqiang Tang
  2011-02-18  9:12     ` Strategic decision: COW format (was: [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED) Markus Armbruster
  1 sibling, 0 replies; 87+ messages in thread
From: Chunqiang Tang @ 2011-02-17 16:04 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel

> Am 15.02.2011 20:45, schrieb Chunqiang Tang:
> >> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
> >> As you requested, I set up a wiki page for FVD at
> >> http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a
> >> detailed specification of FVD, and a comparison of the design and
> >> performance of FVD and QED.
> >
> >> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
> >> figure shows that the file creation throughput of NetApp's PostMark
> >> benchmark under FVD is 74.9% to 215% higher than that under QED.
> >
> > Hi Anthony,
> >
> > Please let me know if more information is needed. I would appreciate
> > your feedback and advice on the best way to proceed with FVD.
> 
> Yet another file format with yet another implementation is definitely
> not what we need. We should probably take some of the ideas in FVD and
> consider them for qcow3.

I certainly agree that convergence is ideal, and FVD was designed from
the very beginning with convergence in mind, going a long way to make it
flexible and fully configurable to suit different use cases. FVD is by
no means just another random format coming out of a research hacking
project.

QCOW2 is feature rich (especially snapshots) and I am certain that it
will be highly valuable to many users for the foreseeable future, but it
was simply not designed to achieve the very high performance needed in
server environments. One key thing missing in QEMU is a high-performance
copy-on-write image format, which is dreadfully needed, at least based
on my experience in the cloud. Even VirtualBox VDI is in a much better
position to achieve high performance than QCOW2, because of VDI's simple
design. If we are going to move beyond QCOW2, as indicated by the move
to adopt QED, then regardless of the name of the next image format, be
it QCOW3, QED, or FVD, it is important that we seriously learn the
lessons from past image formats. Unfortunately, I was unaware of QED
when it was under development. In my view, QED is so similar to QCOW2
(both by design and by implementation) that it does not achieve the goal
of addressing the main limitations of QCOW2 and moving on to the next
level. This will lead to further image format fragmentation and
suffering, rather than achieving the goal of convergence.

Why not make FVD the basis of QCOW3, with additional requirements
incorporated? I have posted the spec and am quite open to suggestions. I
performed a careful study of image formats, and truly believe that FVD
is a great leap beyond the existing image formats (VMDK, VDI, VHD,
QCOW2, etc.). FVD is quite mature, as I spent more than half a year
hardening it. Even five years down the road, say by 2016, the chance of
another image format beating FVD's performance is probably slim.

> However, I think some of them like the "no-alloc" mode aren't that
> useful: If I want the features and the performance of raw, I can just
> take raw.

FVD-no-alloc is a copy-on-write image format, whereas raw is not. We are
interested in a copy-on-write image format; otherwise, raw would be
great. The experiment uses a 50GB QCOW2/QED/FVD image based on a 1GB
base image, which reflects the flavor of a typical cloud configuration.
This information is available in the detailed description of the
experiment setup.

> > BTW, I recently added QCOW2 into the performance comparison figure on 
> > wiki.
> 
> It's obvious why you have only one case for QED (it doesn't support
> anything else), but qcow2 works on block devices, too, and you can also
> use metadata preallocation. Are you aware of this?

The QCOW2 code shows that preallocation does not work with a backing
file, and we need a copy-on-write image. Later I will add new QCOW2
results on block devices to the wiki; my paper has results for an older
version of QCOW2 on block devices. Based on my earlier, much more
extensive benchmarking of QCOW2 in QEMU 0.12.3, increasing QCOW2's
metadata cache would be more effective at improving performance. I
recommend increasing QCOW2's cache size to at least QED's level.




* Re: [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED
  2011-02-16 13:21   ` [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED Stefan Hajnoczi
@ 2011-02-17 16:04     ` Chunqiang Tang
  0 siblings, 0 replies; 87+ messages in thread
From: Chunqiang Tang @ 2011-02-17 16:04 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel

> On Tue, Feb 15, 2011 at 7:45 PM, Chunqiang Tang <ctang@us.ibm.com>
> wrote:
> >> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
> >> As you requested, I set up a wiki page for FVD at
> >> http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a
> >> detailed specification of FVD, and a comparison of the design and
> >> performance of FVD and QED.
> >
> >> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
> >> figure shows that the file creation throughput of NetApp's PostMark
> >> benchmark under FVD is 74.9% to 215% higher than that under QED.
> 
> File creation on a sparse image is currently limited by the fact that
> the QED implementation serializes allocating writes.  In an earlier QED
> patch series I had fine-grained metadata locking so that allocating
> writes scale better, but that led to a regression on another benchmark.
> I have this on my todo list.
> 
> This is an implementation-specific limitation and is unrelated to the
> file format.  It doesn't tell us about a difference between formats.

I agree that a good comparison of the design differences is as important
as the quantitative numbers. I previously posted the spec of FVD as well
as a comparison of FVD and QED on the wiki at
http://wiki.qemu.org/Features/FVD . Part of the comparison was also
copied into an earlier email in this thread. I welcome comments and
feedback on the design aspects.


* Strategic decision: COW format (was: [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED)
  2011-02-16 12:34   ` Kevin Wolf
  2011-02-17 16:04     ` Chunqiang Tang
@ 2011-02-18  9:12     ` Markus Armbruster
  2011-02-18  9:57       ` [Qemu-devel] Re: Strategic decision: COW format Kevin Wolf
  1 sibling, 1 reply; 87+ messages in thread
From: Markus Armbruster @ 2011-02-18  9:12 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, qemu-devel, Stefan Hajnoczi

Kevin Wolf <kwolf@redhat.com> writes:

> Am 15.02.2011 20:45, schrieb Chunqiang Tang:
>>> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
>>> As you requested, I set up a wiki page for FVD at
>>> http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a
>>> detailed specification of FVD, and a comparison of the design and
>>> performance of FVD and QED.
>>
>>> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
>>> figure shows that the file creation throughput of NetApp's PostMark
>>> benchmark under FVD is 74.9% to 215% higher than that under QED.
>>
>> Hi Anthony,
>>
>> Please let me know if more information is needed. I would appreciate
>> your feedback and advice on the best way to proceed with FVD.
>
> Yet another file format with yet another implementation is definitely
> not what we need. We should probably take some of the ideas in FVD and
> consider them for qcow3.

Got an assumption there: that the one COW format we need must be qcow3,
i.e. an evolution of qcow2.  Needs to be justified.  If that discussion
has happened on the list already, I missed it.  If not, it's overdue,
and then we better start it right away.

The choice of COW format will have a significant impact over a long
term.  That makes it a strategic decision.  Such decisions should not be
made purely on short-term considerations.

What are our core requirements for The One COW Format?

What's merely nice to have?

How much of it do we want to have ready for prime time in 0.15?

[...]


* [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18  9:12     ` Strategic decision: COW format (was: [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED) Markus Armbruster
@ 2011-02-18  9:57       ` Kevin Wolf
  2011-02-18 14:20         ` Anthony Liguori
                           ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-02-18  9:57 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Chunqiang Tang, qemu-devel, Stefan Hajnoczi

Am 18.02.2011 10:12, schrieb Markus Armbruster:
> Kevin Wolf <kwolf@redhat.com> writes:
> 
>> Am 15.02.2011 20:45, schrieb Chunqiang Tang:
>>>> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
>>>> As you requested, I set up a wiki page for FVD at
>>>> http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a
>>>> detailed specification of FVD, and a comparison of the design and
>>>> performance of FVD and QED.
>>>
>>>> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
>>>> figure shows that the file creation throughput of NetApp's PostMark
>>>> benchmark under FVD is 74.9% to 215% higher than that under QED.
>>>
>>> Hi Anthony,
>>>
>>> Please let me know if more information is needed. I would appreciate
>>> your feedback and advice on the best way to proceed with FVD.
>>
>> Yet another file format with yet another implementation is definitely
>> not what we need. We should probably take some of the ideas in FVD and
>> consider them for qcow3.
> 
> Got an assumption there: that the one COW format we need must be qcow3,
> i.e. an evolution of qcow2.  Needs to be justified.  If that discussion
> has happened on the list already, I missed it.  If not, it's overdue,
> and then we better start it right away.

Right. I probably wasn't very clear about what I mean with qcow3 either,
so let me try to summarize my reasoning.


The first point is an assumption that you made, too: That we want to
have only one format. I hope it's easy to agree on this, duplication is
bad and every additional format creates new maintenance burden,
especially if we're taking it seriously. Until now, there were exactly two
formats for which we managed to do this, raw and qcow2. raw is more or
less for free, so with the introduction of another format, we basically
double the supported block driver code overnight (while not doubling the
number of developers).

The consequence of having only one file format is that it must be able
to obsolete the existing ones, most notably qcow2. We can only neglect
qcow1 today because we can tell users to use qcow2. It supports
everything that qcow1 supports and more. We couldn't have done this if
qcow2 lacked features compared to qcow1.

So the one really essential requirement that I see is that we provide a
way forward for _all_ users by maintaining all of qcow2's features. This
is the only way of getting people to not stay with qcow2.


Of course, you could invent another format that implements the same
features, but I think just carefully extending qcow2 has some real
advantages.

The first is that conversion of existing images would be really easy.
Basically increment the version number in the image header and you're
done. Structures would be compatible. If you compare it to file systems,
I rarely ever change the file system on a non-empty partition. Even if I
wanted to, it's usually just too painful. Except when I was able to use
"tune2fs -j" to make ext3 out of ext2, that was really easy. We can
provide the same for qcow2-to-qcow3 conversion, but not with a
completely new format.
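
A sketch of how cheap such an in-place upgrade could be, assuming only
the public qcow2 header layout (magic "QFI\xfb" followed by a big-endian
32-bit version field) and ignoring any extra header fields a qcow3 might
add; an illustration, not a real conversion tool:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Bump a qcow2 image from version 2 to version 3 in place. */
    int qcow2_to_qcow3(const char *filename)
    {
        uint8_t hdr[8];
        FILE *f = fopen(filename, "r+b");

        if (!f || fread(hdr, 1, 8, f) != 8) goto fail;
        if (memcmp(hdr, "QFI\xfb", 4) != 0) goto fail;  /* not qcow2 */

        uint32_t version = ((uint32_t)hdr[4] << 24) | (hdr[5] << 16)
                         | (hdr[6] << 8) | hdr[7];
        if (version != 2) goto fail;

        hdr[7] = 3;  /* version 2 -> 3, still big-endian */
        if (fseek(f, 0, SEEK_SET) != 0 || fwrite(hdr, 1, 8, f) != 8) {
            goto fail;
        }
        return fclose(f) == 0 ? 0 : -1;

    fail:
        if (f) fclose(f);
        return -1;
    }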

Also, while obsoleting a file format means that we need not put much
effort in its maintenance, we still need to keep the code around for
reading old images. With an extension of qcow2, it would be the same
code that is used for both versions.

Third, qcow2 already exists, is used in practice and we have put quite
some effort into QA. At least initially confidence would be higher than
in a completely new, yet untested format. Remember that with qcow3 I'm
not talking about rewriting everything, it's a careful evolution, mostly
with optional additions here and there.

Kevin


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18  9:57       ` [Qemu-devel] Re: Strategic decision: COW format Kevin Wolf
@ 2011-02-18 14:20         ` Anthony Liguori
  2011-02-22  8:37           ` Markus Armbruster
  2011-02-18 17:43         ` Stefan Weil
  2011-02-20 22:13         ` [Qemu-devel] Re: Strategic decision: COW format Aurelien Jarno
  2 siblings, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-18 14:20 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, Markus Armbruster, Stefan Hajnoczi, qemu-devel

On 02/18/2011 03:57 AM, Kevin Wolf wrote:
> Am 18.02.2011 10:12, schrieb Markus Armbruster:
>> Kevin Wolf <kwolf@redhat.com> writes:
>>
>>> Am 15.02.2011 20:45, schrieb Chunqiang Tang:
>>>>> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
>>>>> As you requested, I set up a wiki page for FVD at
>>>>> http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a
>>>>> detailed specification of FVD, and a comparison of the design and
>>>>> performance of FVD and QED.
>>>>>
>>>>> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
>>>>> figure shows that the file creation throughput of NetApp's PostMark
>>>>> benchmark under FVD is 74.9% to 215% higher than that under QED.
>>>>
>>>> Hi Anthony,
>>>>
>>>> Please let me know if more information is needed. I would appreciate
>>>> your feedback and advice on the best way to proceed with FVD.
>>>
>>> Yet another file format with yet another implementation is definitely
>>> not what we need. We should probably take some of the ideas in FVD and
>>> consider them for qcow3.
>>
>> Got an assumption there: that the one COW format we need must be qcow3,
>> i.e. an evolution of qcow2.  Needs to be justified.  If that discussion
>> has happened on the list already, I missed it.  If not, it's overdue,
>> and then we better start it right away.
>
> Right. I probably wasn't very clear about what I mean with qcow3 either,
> so let me try to summarize my reasoning.
>
>
> The first point is an assumption that you made, too: That we want to
> have only one format. I hope it's easy to agree on this, duplication is
> bad and every additional format creates new maintenance burden,
> especially if we're taking it seriously. Until now, there were exactly two
> formats for which we managed to do this, raw and qcow2. raw is more or
> less for free, so with the introduction of another format, we basically
> double the supported block driver code overnight (while not doubling the
> number of developers).

Not sure what project you're following, but we've had an awful lot of
formats before qcow2 :-)

And qcow2 was never all that special; it was just dropped into the code
base one day.  You've put a lot of work into qcow2, but there are other
folks contributing additional formats, and that means more developers.

> The consequence of having only one file format is that it must be able
> to obsolete the existing ones, most notably qcow2. We can only neglect
> qcow1 today because we can tell users to use qcow2. It supports
> everything that qcow1 supports and more. We couldn't have done this if
> qcow2 lacked features compared to qcow1.
>
> So the one really essential requirement that I see is that we provide a
> way forward for _all_ users by maintaining all of qcow2's features. This
> is the only way of getting people to not stay with qcow2.
>
>
> Of course, you could invent another format that implements the same
> features, but I think just carefully extending qcow2 has some real
> advantages.
>
> The first is that conversion of existing images would be really easy.
> Basically increment the version number in the image header and you're
> done. Structures would be compatible.

qemu-img convert is a reasonable path for conversion.

>   If you compare it to file systems,
> I rarely ever change the file system on a non-empty partition. Even if I
> wanted to, it's usually just too painful. Except when I was able to use
> "tune2fs -j" to make ext3 out of ext2, that was really easy. We can
> provide the same for qcow2 to qcow3 conversion, but not with a
> completely new format.
>
> Also, while obsoleting a file format means that we need not put much
> effort in its maintenance, we still need to keep the code around for
> reading old images. With an extension of qcow2, it would be the same
> code that is used for both versions.
>
> Third, qcow2 already exists, is used in practice and we have put quite
> some effort into QA. At least initially confidence would be higher than
> in a completely new, yet untested format. Remember that with qcow3 I'm
> not talking about rewriting everything, it's a careful evolution, mostly
> with optional additions here and there.
>    

My requirements for a new format are as follows:

1) a documented, thought-out specification that is covered under an open
license with a clear process for extension.

2) ability to add both compatible and incompatible features in a 
graceful way

3) ability to achieve performance that's close to raw.  I want our new
format to be usable universally, both for servers and desktops.

I think qcow2 has some misfeatures like compression and internal 
snapshots.  I think preserving those misfeatures is a mistake because I 
don't think we can satisfy the above while trying to preserve those 
features.  If the image format degrades when those features are enabled, 
then it decreases confidence in the format.

I think QED satisfies all of these today.

Regards,

Anthony Liguori

> Kevin
>
>    


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18  9:57       ` [Qemu-devel] Re: Strategic decision: COW format Kevin Wolf
  2011-02-18 14:20         ` Anthony Liguori
@ 2011-02-18 17:43         ` Stefan Weil
  2011-02-18 19:11           ` Kevin Wolf
                             ` (2 more replies)
  2011-02-20 22:13         ` [Qemu-devel] Re: Strategic decision: COW format Aurelien Jarno
  2 siblings, 3 replies; 87+ messages in thread
From: Stefan Weil @ 2011-02-18 17:43 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, Markus Armbruster, Stefan Hajnoczi, qemu-devel

Am 18.02.2011 10:57, schrieb Kevin Wolf:
> Am 18.02.2011 10:12, schrieb Markus Armbruster:
>> Kevin Wolf <kwolf@redhat.com> writes:
>>
>>> Yet another file format with yet another implementation is definitely
>>> not what we need. We should probably take some of the ideas in FVD and
>>> consider them for qcow3.
>>
>> Got an assumption there: that the one COW format we need must be qcow3,
>> i.e. an evolution of qcow2. Needs to be justified. If that discussion
>> has happened on the list already, I missed it. If not, it's overdue,
>> and then we better start it right away.
>
> Right. I probably wasn't very clear about what I mean with qcow3 either,
> so let me try to summarize my reasoning.
>
>
> The first point is an assumption that you made, too: That we want to
> have only one format. I hope it's easy to agree on this, duplication is
> bad and every additional format creates new maintenance burden,
> especially if we're taking it seriously. Until now, there were exactly two
> formats for which we managed to do this, raw and qcow2. raw is more or
> less for free, so with the introduction of another format, we basically
> double the supported block driver code overnight (while not doubling the
> number of developers).
>
> The consequence of having only one file format is that it must be able
> to obsolete the existing ones, most notably qcow2. We can only neglect
> qcow1 today because we can tell users to use qcow2. It supports
> everything that qcow1 supports and more. We couldn't have done this if
> qcow2 lacked features compared to qcow1.
>
> So the one really essential requirement that I see is that we provide a
> way forward for _all_ users by maintaining all of qcow2's features. This
> is the only way of getting people to not stay with qcow2.


The support of several different file formats is one of the
strong points of QEMU, at least in my opinion.

Reducing this to offline conversion would be a bad idea because it costs
too much time and disk space for quick tests (for production environments,
this might be totally different).

Is maintaining an additional file format really so much work?
I have only some personal experience with vdi.c, and there maintenance
was largely caused by interface changes and was done by Kevin.
Hopefully interfaces will stabilize, so changes will become less frequent.

A new file format like fvd would be a challenge for the existing ones.
Declare its support as unsupported or experimental, but let users
decide which one is best suited to their needs!

Maybe adding a staging tree (like the linux kernel's) for experimental
drivers, devices, file formats, tcg targets and so on would make it easier
to add new code and reduce the need for QEMU forks. I'd very much
appreciate that, or any other solution which allows this!

Regards,
Stefan


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18 17:43         ` Stefan Weil
@ 2011-02-18 19:11           ` Kevin Wolf
  2011-02-18 19:47             ` Anthony Liguori
  2011-02-19 17:19             ` Stefan Hajnoczi
  2011-02-18 20:31           ` Anthony Liguori
  2011-02-19 12:27           ` [Qemu-devel] Bugs in the VDI Block Device Driver Chunqiang Tang
  2 siblings, 2 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-02-18 19:11 UTC (permalink / raw)
  To: Stefan Weil
  Cc: Chunqiang Tang, Markus Armbruster, Stefan Hajnoczi, qemu-devel

Am 18.02.2011 18:43, schrieb Stefan Weil:
> Am 18.02.2011 10:57, schrieb Kevin Wolf:
>> Am 18.02.2011 10:12, schrieb Markus Armbruster:
>>> Kevin Wolf <kwolf@redhat.com> writes:
>>>
>>>> Yet another file format with yet another implementation is definitely
>>>> not what we need. We should probably take some of the ideas in FVD and
>>>> consider them for qcow3.
>>>
>>> Got an assumption there: that the one COW format we need must be qcow3,
>>> i.e. an evolution of qcow2. Needs to be justified. If that discussion
>>> has happened on the list already, I missed it. If not, it's overdue,
>>> and then we better start it right away.
>>
>> Right. I probably wasn't very clear about what I mean with qcow3 either,
>> so let me try to summarize my reasoning.
>>
>>
>> The first point is an assumption that you made, too: That we want to
>> have only one format. I hope it's easy to agree on this, duplication is
>> bad and every additional format creates new maintenance burden,
>> especially if we're taking it seriously. Until now, there were exactly two
>> formats for which we managed to do this, raw and qcow2. raw is more or
>> less for free, so with the introduction of another format, we basically
>> double the supported block driver code overnight (while not doubling the
>> number of developers).
>>
>> The consequence of having only one file format is that it must be able
>> to obsolete the existing ones, most notably qcow2. We can only neglect
>> qcow1 today because we can tell users to use qcow2. It supports
>> everything that qcow1 supports and more. We couldn't have done this if
>> qcow2 lacked features compared to qcow1.
>>
>> So the one really essential requirement that I see is that we provide a
>> way forward for _all_ users by maintaining all of qcow2's features. This
>> is the only way of getting people to not stay with qcow2.
> 
> 
> The support of several different file formats is one of the
> strong points of QEMU, at least in my opinion.

I totally agree. qemu-img is known as a Swiss army knife for disk images
and this is definitely a strength.

However, it's not useful just because it supports a high number of
formats, but because these formats are in active use. Most of them are
the native formats of some other software.

I think things look a bit different when we're talking about
qemu-specific formats. qcow1 isn't in use any more because nobody needs
it for compatibility with other software, and for use with qemu there is
qcow2. Yet the qcow1 driver is still around and bitrots.

> Reducing this to offline conversion would be a bad idea because it costs
> too much time and disk space for quick tests (for production environments,
> this might be totally different).

Either I'm misunderstanding what you're trying to say here, or you
misunderstood what I said. I agree that we don't want to have to do
qemu-img convert (i.e. a full copy) in order to upgrade. This is one of
the reasons why I think we should have a qcow3 which can be upgraded
basically by increasing the version number in the header (look at it as
an incompatible feature flag, if you want) instead of starting something
completely new.

> Is maintaining an additional file format really so much work?
> I have only some personal experience with vdi.c, and there maintenance
> was largely caused by interface changes and was done by Kevin.
> Hopefully interfaces will stabilize, so changes will become less frequent.

Well, there are different types of "maintenance".

It's not much work to just drop the code into qemu and let it bitrot.
This is what happens to the funky formats like bochs or dmg. They are
usually patched enough so that they still build, but nobody tries if
they actually work.

Then there are formats in which there is at least some interest, like
vmdk or vdi. Occasionally they get some fixes, they are probably fine
for image conversion, but I wouldn't really trust them for production use.

And then there's raw and qcow2, which are used by a lot of people for
running VMs, that are actively maintained, get a decent level of review
and fixes etc. Getting a format into this group really takes a lot of
work. Taking something like FVD would only make sense if we are willing
to do that work - I mean, really nobody wants to convert from/to a file
format that isn't implemented anywhere else.

> A new file format like fvd would be a challenge for the existing ones.
> Declare its support as unsupported or experimental, but let users
> decide which one is best suited to their needs!

Basically this is what we did for QED. In hindsight I consider it a
mistake because it set a bad precedent of inventing something new
instead of fixing what's there. I really don't want to convert all my
images each time to take advantage of a new qemu version.

Kevin


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18 19:11           ` Kevin Wolf
@ 2011-02-18 19:47             ` Anthony Liguori
  2011-02-18 20:49               ` Kevin Wolf
  2011-02-19 17:19             ` Stefan Hajnoczi
  1 sibling, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-18 19:47 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, qemu-devel, Markus Armbruster, Stefan Hajnoczi

On 02/18/2011 01:11 PM, Kevin Wolf wrote:
>> A new file format like fvd would be a challenge for the existing ones.
>> Declare its support as unsupported or experimental, but let users
>> decide which one is best suited to their needs!
>>      
> Basically this is what we did for QED. In hindsight I consider it a
> mistake because it set a bad precedent of inventing something new
> instead of fixing what's there.

I don't see how qcow3 is fixing something that's there since it's still 
an incompatible format.

It'd be a stronger argument if you were suggesting something that was 
still fully compatible with qcow2 but once compatibility is broken, it's 
broken.

Regards,

Anthony Liguori

>   I really don't want to convert all my
> images each time to take advantage of a new qemu version.
>
> Kevin
>
>    


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18 17:43         ` Stefan Weil
  2011-02-18 19:11           ` Kevin Wolf
@ 2011-02-18 20:31           ` Anthony Liguori
  2011-02-19 12:27           ` [Qemu-devel] Bugs in the VDI Block Device Driver Chunqiang Tang
  2 siblings, 0 replies; 87+ messages in thread
From: Anthony Liguori @ 2011-02-18 20:31 UTC (permalink / raw)
  To: Stefan Weil
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Markus Armbruster,
	Stefan Hajnoczi

On 02/18/2011 11:43 AM, Stefan Weil wrote:
>
> Is maintaining an additional file format really so much work?
> I have only some personal experience with vdi.c, and there maintenance
> was largely caused by interface changes and was done by Kevin.
> Hopefully interfaces will stabilize, so changes will become less 
> frequent.
>
> A new file format like fvd would be a challenge for the existing ones.

FVD isn't merged because it's gotten almost no review.  If it turns out
that it is identical to an existing format and the existing format just
has a crappy implementation, it wouldn't be merged; fixing the existing
format would be preferred.

But if it has a compelling advantage for a reasonable use-case, it will 
be merged.

I don't know where this whole discussion of "strategic formats" for QEMU 
came from but that's never been the way the project has operated.

Regards,

Anthony Liguori


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18 19:47             ` Anthony Liguori
@ 2011-02-18 20:49               ` Kevin Wolf
  2011-02-18 20:50                 ` Anthony Liguori
  0 siblings, 1 reply; 87+ messages in thread
From: Kevin Wolf @ 2011-02-18 20:49 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, qemu-devel, Markus Armbruster, Stefan Hajnoczi

Am 18.02.2011 20:47, schrieb Anthony Liguori:
> On 02/18/2011 01:11 PM, Kevin Wolf wrote:
>>> A new file format like fvd would be a challenge for the existing ones.
>>> Declare its support as unsupported or experimental, but let users
>>> decide which one is best suited to their needs!
>>>      
>> Basically this is what we did for QED. In hindsight I consider it a
>> mistake because it set a bad precedent of inventing something new
>> instead of fixing what's there.
> 
> I don't see how qcow3 is fixing something that's there since it's still 
> an incompatible format.
> 
> It'd be a stronger argument if you were suggesting something that was 
> still fully compatible with qcow2 but once compatibility is broken, it's 
> broken.

It's really more like adding an incompatible feature flag in QED. You
still have one implementation for old and new images instead of
splitting up development efforts, you still have all of the features and
so on. It's a completely different story than QED.
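
A sketch of the mechanism being invoked here, assuming feature bitmasks
along the lines of QED's incompatible-feature field (the names below are
invented for illustration):

    #include <stdint.h>

    /* Feature bits this build of the driver understands. */
    #define SUPPORTED_INCOMPAT_FEATURES 0x0003ULL

    /* An old binary must refuse an image that uses an incompatible
     * feature it does not know; unknown *compatible* features can
     * simply be ignored. */
    static int check_incompat_features(uint64_t features)
    {
        if (features & ~SUPPORTED_INCOMPAT_FEATURES) {
            return -1;  /* unknown incompatible feature: reject image */
        }
        return 0;
    }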

Kevin


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18 20:49               ` Kevin Wolf
@ 2011-02-18 20:50                 ` Anthony Liguori
  2011-02-18 21:27                   ` Kevin Wolf
  0 siblings, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-18 20:50 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, qemu-devel, Markus Armbruster, Stefan Hajnoczi

On 02/18/2011 02:49 PM, Kevin Wolf wrote:
> Am 18.02.2011 20:47, schrieb Anthony Liguori:
>    
>> On 02/18/2011 01:11 PM, Kevin Wolf wrote:
>>      
>>>> A new file format like fvd would be a challenge for the existing ones.
>>>> Declare its support as unsupported or experimental, but let users
>>>> decide which one is best suited to their needs!
>>>>
>>>>          
>>> Basically this is what we did for QED. In hindsight I consider it a
>>>> mistake because it set a bad precedent of inventing something new
>>> instead of fixing what's there.
>>>        
>> I don't see how qcow3 is fixing something that's there since it's still
>> an incompatible format.
>>
>> It'd be a stronger argument if you were suggesting something that was
>> still fully compatible with qcow2 but once compatibility is broken, it's
>> broken.
>>      
> It's really more like adding an incompatible feature flag in QED. You
> still have one implementation for old and new images instead of
> splitting up development efforts, you still have all of the features and
> so on.

In theory.  Since an implementation doesn't exist, we have no idea how 
much code is actually going to be shared at the end of the day.

I suspect that, especially if you drop the ref table updates, there 
won't be an awful lot of common code in the two paths.

Regards,

Anthony Liguori

>   It's a completely different story than QED.
>
> Kevin
>    


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18 20:50                 ` Anthony Liguori
@ 2011-02-18 21:27                   ` Kevin Wolf
  0 siblings, 0 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-02-18 21:27 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, qemu-devel, Markus Armbruster, Stefan Hajnoczi

Am 18.02.2011 21:50, schrieb Anthony Liguori:
> On 02/18/2011 02:49 PM, Kevin Wolf wrote:
>> Am 18.02.2011 20:47, schrieb Anthony Liguori:
>>    
>>> On 02/18/2011 01:11 PM, Kevin Wolf wrote:
>>>      
>>>>> A new file format like fvd would be a challenge for the existing ones.
>>>>> Declare its support as unsupported or experimental, but let users
>>>>> decide which one is best suited to their needs!
>>>>>
>>>>>          
>>>> Basically this is what we did for QED. In hindsight I consider it a
>>>> mistake because it set a bad precedent of inventing something new
>>>> instead of fixing what's there.
>>>>        
>>> I don't see how qcow3 is fixing something that's there since it's still
>>> an incompatible format.
>>>
>>> It'd be a stronger argument if you were suggesting something that was
>>> still fully compatible with qcow2 but once compatibility is broken, it's
>>> broken.
>>>      
>> It's really more like adding an incompatible feature flag in QED. You
>> still have one implementation for old and new images instead of
>> splitting up development efforts, you still have all of the features and
>> so on.
> 
> In theory.  Since an implementation doesn't exist, we have no idea how 
> much code is actually going to be shared at the end of the day.
> 
> I suspect that, especially if you drop the ref table updates, there 
> won't be an awful lot of common code in the two paths.

Allowing refcounts to be inconsistent, protected by a dirty flag, is
only an option, and you should only take it if you absolutely need it
(i.e. your guest is broken and requires cache=writethrough, but you
desperately need performance).

My preferred way of implementing it is telling the refcount cache that
it should ignore flushes and write its data back only when another
refcount block must be loaded into the cache (which happens rarely
enough that it doesn't really hurt performance). This makes the
difference from the existing code more or less one if statement that
returns early.
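
A sketch of that one if statement, with invented names; the real qcow2
cache code will differ:

    typedef struct RefcountCache {
        int ignore_flushes;  /* refcounts may lag behind, guarded by a
                              * dirty flag in the image header */
        /* ... cached refcount blocks ... */
    } RefcountCache;

    static int refcount_cache_flush(RefcountCache *c)
    {
        if (c->ignore_flushes) {
            return 0;  /* early return: write back lazily, only when a
                        * cached block must be evicted to load another */
        }
        /* ... write out all dirty refcount blocks, then fsync ... */
        return 0;
    }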

Kevin


* [Qemu-devel] Bugs in the VDI Block Device Driver
  2011-02-18 17:43         ` Stefan Weil
  2011-02-18 19:11           ` Kevin Wolf
  2011-02-18 20:31           ` Anthony Liguori
@ 2011-02-19 12:27           ` Chunqiang Tang
  2011-02-19 16:21             ` Stefan Hajnoczi
  2 siblings, 1 reply; 87+ messages in thread
From: Chunqiang Tang @ 2011-02-19 12:27 UTC (permalink / raw)
  To: Stefan Weil; +Cc: qemu-devel

Hi Stefan,

I applied FVD's fully automated testing tool to the VDI block device 
driver and found several bugs. Some bugs are easy to fix whereas others 
need some thought on the design. Therefore, I thought you might be able to 
handle the bugs better than me. These bugs occur only if I/O errors or 
aio_cancel are injected, and hence are not that critical. If VDI is meant 
for read-only image format conversion, then these bugs are "ok". If write 
is enabled in VDI, these bugs may corrupt data. FVD's testing tool 
discovered these bugs because it automatically injects errors and race 
conditions, and performs exhaustive randomized tests. Below is a list of 
the bugs.

Bug 1. The most serious bug is caused by a race condition in updating a 
new bmap entry in memory and on disk. Consider the following operation 
sequence:
  O1: VM issues a write to sector X
  O2: VDI allocates a new bmap entry and updates the in-memory s->bmap
  O3: VDI writes data to disk
  O4: The disk I/O for writing sector X fails
  O5: VDI reports the error to the VM and returns.

Note that the bmap entry is updated in memory, but not persisted on disk. 
Now consider another write that immediately follows:
  P1: VM issues a write to sector X+1, which is located in the same block 
as the previously used sector X.
  P2: s->bmap already has one entry for the block, and hence VDI writes 
the data directly without persisting the new s->bmap entry on disk.
  P3: The write disk I/O succeeds
  P4: VDI reports success to the VM, but the bitmap entry is still not 
persisted on disk.

Now suppose the VM powers off gracefully (i.e., the QEMU process quits) 
and reboots. The second write, to sector X+1, which was reported as 
finished successfully, is simply lost, because the corresponding in-memory 
s->bmap entry was never persisted on disk. This is exactly what FVD's 
testing tool discovers. After the block device is closed and then 
re-opened, disk content verification fails.

This is just one example of the problem. A race condition plus a host 
crash also causes problems. Consider another example below.
  Q1: VM issues a write to sector X
  Q2: VDI allocates a new bmap entry and updates the in-memory s->bmap
  Q3: VDI writes sector X to disk and waits for the callback
  Q4: VM issues a write to another sector X+1, which is in the same block 
as sector X.
  Q5: VDI sees the bitmap entry in s->bmap is already allocated, and 
writes sector X+1 to disk.
  Q6: The write to sector X+1 finishes, and VDI's callback is invoked.
  Q7: VDI acknowledges to the VM the completion of writing sector X+1
  Q8: After observing the completion of writing sector X+1, the VM issues 
a flush to ensure that sector X+1 is persisted on disk.
  Q9: VDI finishes the flush and acknowledges the completion of the 
operation.
  Q10: ... (some other arbitrary operations, but the disk I/O for writing 
sector X is still not finished....)
  Q11: The host crashes

Now the new bitmap entry is not persisted on disk, while both the write to 
sector X+1 and the flush have been acknowledged as finished. Sector X+1 is 
lost, which is a corruption. This problem exists even if O_DSYNC is used. 
The root cause of the problem is that if one request updates the in-memory 
s->bmap, another request that sees this update assumes that the update has 
already been persisted on disk, which it has not.
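
In pseudo-C, the unsafe ordering that both examples boil down to; the
names are invented stand-ins for vdi.c's state, not the actual code:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct BDRVVdiState BDRVVdiState;   /* illustration only */
    bool bmap_is_allocated(BDRVVdiState *s, uint32_t block);
    void bmap_set_in_memory(BDRVVdiState *s, uint32_t block);
    int  write_data(BDRVVdiState *s, uint32_t block);
    int  write_bmap_entry_to_disk(BDRVVdiState *s, uint32_t block);

    /* BUGGY ordering: the in-memory bmap update is visible to later
     * requests before it has been persisted. */
    int vdi_write_buggy(BDRVVdiState *s, uint32_t block)
    {
        if (!bmap_is_allocated(s, block)) {
            bmap_set_in_memory(s, block);       /* visible immediately */
            if (write_data(s, block) < 0) {
                return -1;                      /* entry never persisted */
            }
            return write_bmap_entry_to_disk(s, block);
        }
        /* A second write to the same block lands here and is acked even
         * though the bmap entry may never reach the disk. */
        return write_data(s, block);
    }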

Bug 2: Similar to the bugs the FVD testing tool found for QCOW2, there are 
several instances of the code below on the failure-handling path that do 
not set the error return code, and hence mistakenly report failure as 
success. This mistake is caught by FVD when doing image content 
validation.
        if (acb->hd_aiocb == NULL) {
            /* missing:  ret = -EIO; */
            goto done;
        }

Bug 3: Similar to the bugs the FVD testing tool found for QCOW2, 
vdi_aio_cancel does not perform a complete cleanup, and there are several 
related bugs. First, the memory buffers acb->orig_buf and 
acb->block_buffer are not freed. Second, acb->bh is not cancelled. Third, 
vdi_aio_setup() does not initialize acb->bh to NULL, so that when a 
request's acb is cancelled and then later reused for another request, its 
acb->bh != NULL and the new request fails in vdi_schedule_bh(). This is 
caught by FVD's testing tool when it observes that no I/O failure was 
injected but VDI reports a failed I/O request, which indicates a bug in 
the driver.
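
A sketch of the missing cleanup, using the field names from this report
and QEMU helpers of the era (qemu_bh_delete, qemu_vfree); how a real fix
should look is of course the maintainer's call:

    static void vdi_aio_cancel(BlockDriverAIOCB *blockacb)
    {
        VdiAIOCB *acb = container_of(blockacb, VdiAIOCB, common);

        if (acb->hd_aiocb) {
            bdrv_aio_cancel(acb->hd_aiocb);
        }
        if (acb->bh) {
            qemu_bh_delete(acb->bh);   /* bug 3: bh was never cancelled */
            acb->bh = NULL;            /* ... and never reset to NULL */
        }
        qemu_vfree(acb->orig_buf);     /* bug 3: buffers were leaked */
        qemu_vfree(acb->block_buffer);
        qemu_aio_release(acb);
    }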

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang


* Re: [Qemu-devel] Bugs in the VDI Block Device Driver
  2011-02-19 12:27           ` [Qemu-devel] Bugs in the VDI Block Device Driver Chunqiang Tang
@ 2011-02-19 16:21             ` Stefan Hajnoczi
  2011-02-19 18:49               ` Stefan Weil
  0 siblings, 1 reply; 87+ messages in thread
From: Stefan Hajnoczi @ 2011-02-19 16:21 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On Sat, Feb 19, 2011 at 12:27 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
> I applied FVD's fully automated testing tool to the VDI block device
> driver and found several bugs. Some bugs are easy to fix whereas others
> need some thought on the design. Therefore, I thought you might be able to
> handle the bugs better than me. These bugs occur only if I/O errors or
> aio_cancel are injected, and hence are not that critical. If VDI is meant
> for read-only image format conversion, then these bugs are "ok". If write
> is enabled in VDI, these bugs may corrupt data. FVD's testing tool
> discovered these bugs because it automatically injects errors and race
> conditions, and performs exhaustive randomized tests. Below is a list of
> the bugs.

Thanks for this detailed list.  vdi is there for conversion and not a
high priority for me, but I have filed it as a bug so it is not
forgotten:

https://bugs.launchpad.net/qemu/+bug/721825

Stefan


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18 19:11           ` Kevin Wolf
  2011-02-18 19:47             ` Anthony Liguori
@ 2011-02-19 17:19             ` Stefan Hajnoczi
  1 sibling, 0 replies; 87+ messages in thread
From: Stefan Hajnoczi @ 2011-02-19 17:19 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, qemu-devel, Markus Armbruster, Stefan Hajnoczi

On Fri, Feb 18, 2011 at 7:11 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 18.02.2011 18:43, schrieb Stefan Weil:
>> Is maintaining an additional file format really so much work?
>> I have only some personal experience with vdi.c, and there maintenance
>> was largely caused by interface changes and was done by Kevin.
>> Hopefully interfaces will stabilize, so changes will become less frequent.
>
> Well, there are different types of "maintenance".
>
> It's not much work to just drop the code into qemu and let it bitrot.
> This is what happens to the funky formats like bochs or dmg. They are
> usually patched enough so that they still build, but nobody tries if
> they actually work.
>
> Then there are formats in which there is at least some interest, like
> vmdk or vdi. Occasionally they get some fixes, they are probably fine
> for image conversion, but I wouldn't really trust them for production use.
>
> And then there's raw and qcow2, which are used by a lot of people for
> running VMs, that are actively maintained, get a decent level of review
> and fixes etc. Getting a format into this group really takes a lot of
> work. Taking something like FVD would only make sense if we are willing
> to do that work - I mean, really nobody wants to convert from/to a file
> format that isn't implemented anywhere else.

This is a good thing to agree on so I want to reiterate:

There are two types of image formats in QEMU today.

1. Native formats that are maintained and suitable for running VMs.
This includes raw, qcow2, and qed.

2. Convert-only formats that may not be maintained and are not
suitable for running VMs.  All other formats in qemu.git.

The convert-only formats have synchronous implementations, which makes
it a bad idea to run VMs with them.  They don't fit into QEMU's
event-driven architecture and will cause poor performance and possible
hangs.
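
The split is visible in the driver interface itself; a hedged sketch
with simplified signatures (the real BlockDriver struct of that era has
more fields):

    /* Convert-only drivers implement the blocking form, which stalls
     * the event loop for the duration of the I/O: */
    int (*bdrv_read)(BlockDriverState *bs, int64_t sector_num,
                     uint8_t *buf, int nb_sectors);

    /* Native formats implement the asynchronous form, which returns
     * immediately and completes via a callback: */
    BlockDriverAIOCB *(*bdrv_aio_readv)(BlockDriverState *bs,
                                        int64_t sector_num,
                                        QEMUIOVector *qiov, int nb_sectors,
                                        BlockDriverCompletionFunc *cb,
                                        void *opaque);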

I hope folks agree on this.

The next step is to consider that native support requires at least an
order of magnitude more work and code.  It would be wise to focus on a
flagship format in order to share that effort.  So I think this thread
is a useful discussion to have even if no one can be forced to
collaborate on just one format.

Kevin's position seems to be that an evolution of qcow2 is best for
code maintenance and reuse.

The position that QED and FVD have taken is to start from a clean
slate in order to make incompatible changes and leave out problematic
features.

I think we can get there eventually with either approach but we'll be
introducing incompatible changes either way.  In terms of code reuse,
it's initially nice to share code with qcow2 but in the long run the
two formats might diverge far enough that it becomes a liability due
to extra complexity.

For reference, here is the QCOW3 roadmap wiki page:
http://wiki.qemu.org/Qcow3_Roadmap
Here is the QED outstanding work page:
http://wiki.qemu.org/Features/QED/OutstandingWork

Does FVD have a roadmap or future features?

Stefan


* Re: [Qemu-devel] Bugs in the VDI Block Device Driver
  2011-02-19 16:21             ` Stefan Hajnoczi
@ 2011-02-19 18:49               ` Stefan Weil
  0 siblings, 0 replies; 87+ messages in thread
From: Stefan Weil @ 2011-02-19 18:49 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Chunqiang Tang, qemu-devel

Am 19.02.2011 17:21, schrieb Stefan Hajnoczi:
> On Sat, Feb 19, 2011 at 12:27 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
>> I applied FVD's fully automated testing tool to the VDI block device
>> driver and found several bugs. Some bugs are easy to fix whereas others
>> need some thought on the design. Therefore, I thought you might be able to
>> handle the bugs better than me. These bugs occur only if I/O errors or
>> aio_cancel are injected, and hence are not that critical. If VDI is meant
>> for read-only image format conversion, then these bugs are "ok". If write
>> is enabled in VDI, these bugs may corrupt data. FVD's testing tool
>> discovered these bugs because it automatically injects errors and race
>> conditions, and performs exhaustive randomized tests. Below is a list of
>> the bugs.
>
> Thanks for this detailed list. vdi is there for conversion and not a
> high priority for me, but I have filed it as a bug so it is not
> forgotten:
>
> https://bugs.launchpad.net/qemu/+bug/721825
>
> Stefan


Hi Stefan (H.),

the bug report was for me, and I'm already looking after it.

Regards,
Stefan (W.)


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18  9:57       ` [Qemu-devel] Re: Strategic decision: COW format Kevin Wolf
  2011-02-18 14:20         ` Anthony Liguori
  2011-02-18 17:43         ` Stefan Weil
@ 2011-02-20 22:13         ` Aurelien Jarno
  2011-02-21  8:59           ` Kevin Wolf
  2011-02-22  8:40           ` Markus Armbruster
  2 siblings, 2 replies; 87+ messages in thread
From: Aurelien Jarno @ 2011-02-20 22:13 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, Markus Armbruster, Stefan Hajnoczi, qemu-devel

On Fri, Feb 18, 2011 at 10:57:05AM +0100, Kevin Wolf wrote:
> Am 18.02.2011 10:12, schrieb Markus Armbruster:
> > Kevin Wolf <kwolf@redhat.com> writes:
> > 
> >> Am 15.02.2011 20:45, schrieb Chunqiang Tang:
> >>>> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
> >>>> As you requested, I set up a wiki page for FVD at
> >>>> http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a
> >>>> detailed specification of FVD, and a comparison of the design and
> >>>> performance of FVD and QED.
> >>>
> >>>> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
> >>>> figure shows that the file creation throughput of NetApp's PostMark
> >>>> benchmark under FVD is 74.9% to 215% higher than that under QED.
> >>>
> >>> Hi Anthony,
> >>>
> >>> Please let me know if more information is needed. I would appreciate
> >>> your feedback and advice on the best way to proceed with FVD.
> >>
> >> Yet another file format with yet another implementation is definitely
> >> not what we need. We should probably take some of the ideas in FVD and
> >> consider them for qcow3.
> > 
> > Got an assumption there: that the one COW format we need must be qcow3,
> > i.e. an evolution of qcow2.  Needs to be justified.  If that discussion
> > has happened on the list already, I missed it.  If not, it's overdue,
> > and then we better start it right away.
> 
> Right. I probably wasn't very clear about what I mean with qcow3 either,
> so let me try to summarize my reasoning.
> 
> 
> The first point is an assumption that you made, too: That we want to
> have only one format. I hope it's easy to agree on this, duplication is
> bad and every additional format creates new maintenance burden,
> especially if we're taking it seriously. Until now, there were exactly two
> formats for which we managed to do this, raw and qcow2. raw is more or
> less for free, so with the introduction of another format, we basically
> double the supported block driver code overnight (while not doubling the
> number of developers).
> 
> The consequence of having only one file format is that it must be able
> to obsolete the existing ones, most notably qcow2. We can only neglect
> qcow1 today because we can tell users to use qcow2. It supports
> everything that qcow1 supports and more. We couldn't have done this if
> qcow2 lacked features compared to qcow1.
> 
> So the one really essential requirement that I see is that we provide a
> way forward for _all_ users by maintaining all of qcow2's features. This
> is the only way of getting people to not stay with qcow2.
> 

I agree that the best would be to have a single format, and it's
probably a goal to aim for. That said, what is most important in my view
is having one or two formats which together have _all_ the features (and
here I consider speed a feature) of the existing qcow2 format. QED and
FVD have been designed with "virtualization in a datacenter" in mind,
and are very good for this use. OTOH they don't support compression or
snapshotting, which are quite useful for demos, debugging, testing, or
even for occasionally running a Windows VM; in other words, in situations
where speed is not the priority.

If we can't find a tradeoff for that, we should go for two image formats
instead of one.

-- 
Aurelien Jarno	                        GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net


* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-20 22:13         ` [Qemu-devel] Re: Strategic decision: COW format Aurelien Jarno
@ 2011-02-21  8:59           ` Kevin Wolf
  2011-02-21 13:44             ` Stefan Hajnoczi
  2011-02-22  8:40           ` Markus Armbruster
  1 sibling, 1 reply; 87+ messages in thread
From: Kevin Wolf @ 2011-02-21  8:59 UTC (permalink / raw)
  To: Aurelien Jarno
  Cc: Chunqiang Tang, Markus Armbruster, Stefan Hajnoczi, qemu-devel

Am 20.02.2011 23:13, schrieb Aurelien Jarno:
> On Fri, Feb 18, 2011 at 10:57:05AM +0100, Kevin Wolf wrote:
>> Am 18.02.2011 10:12, schrieb Markus Armbruster:
>>> Kevin Wolf <kwolf@redhat.com> writes:
>>>
>>>> Am 15.02.2011 20:45, schrieb Chunqiang Tang:
>>>>>> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
>>>>>> As you requested, I set up a wiki page for FVD at
>>>>>> http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a
>>>>>> detailed specification of FVD, and a comparison of the design and
>>>>>> performance of FVD and QED.
>>>>>
>>>>>> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
>>>>>> figure shows that the file creation throughput of NetApp's PostMark
>>>>>> benchmark under FVD is 74.9% to 215% higher than that under QED.
>>>>>
>>>>> Hi Anthony,
>>>>>
>>>>> Please let me know if more information is needed. I would appreciate
>>>>> your feedback and advice on the best way to proceed with FVD.
>>>>
>>>> Yet another file format with yet another implementation is definitely
>>>> not what we need. We should probably take some of the ideas in FVD and
>>>> consider them for qcow3.
>>>
>>> Got an assumption there: that the one COW format we need must be qcow3,
>>> i.e. an evolution of qcow2.  Needs to be justified.  If that discussion
>>> has happened on the list already, I missed it.  If not, it's overdue,
>>> and then we better start it right away.
>>
>> Right. I probably wasn't very clear about what I mean by qcow3 either,
>> so let me try to summarize my reasoning.
>>
>>
>> The first point is an assumption that you made, too: That we want to
>> have only one format. I hope it's easy to agree on this, duplication is
>> bad and every additional format creates new maintenance burden,
>> especially if we're taking it seriously. Until now, there were exactly two
>> formats for which we managed to do this, raw and qcow2. raw is more or
>> less for free, so with the introduction of another format, we basically
>> double the supported block driver code overnight (while not doubling the
>> number of developers).
>>
>> The consequence of having only one file format is that it must be able
>> to obsolete the existing ones, most notably qcow2. We can only neglect
>> qcow1 today because we can tell users to use qcow2. It supports
>> everything that qcow1 supports and more. We couldn't have done this if
>> qcow2 lacked features compared to qcow1.
>>
>> So the one really essential requirement that I see is that we provide a
>> way forward for _all_ users by maintaining all of qcow2's features. This
>> is the only way of getting people to not stay with qcow2.
> 
> I agree that having a single format would be best, and it's probably a
> goal worth having. That said, what is most important in my view is
> having one or two formats which together have _all_ the features (and
> here I consider speed a feature) of the existing qcow2 format. QED and
> FVD have been designed with "virtualization in a datacenter" in mind,
> and are very good for this use. OTOH they don't support compression or
> snapshotting, which are quite useful for demos, debugging, testing, or
> even for occasionally running a Windows VM; in other words, in
> situations where speed is not the priority.
> 
> If we can't find a tradeoff for that, we should go for two image
> formats instead of one.

I agree. Though that's purely theoretical, because there's no reason why we
shouldn't find a way to get both. ;-)

In fact, the only area where qcow2 performs really badly in 0.14 is
cache=writethrough (which unfortunately is the default...). With
cache=none it's easy to find scenarios where it provides higher
throughput than QED.

Anyway, there's really only one crucial difference between QED and
qcow2, which is that qcow2 ensures that metadata is consistent on disk
at any time whereas QED relies on a dirty flag and rebuilds metadata
after a crash (basically requiring an fsck). The obvious solution, if you
want to have this in qcow2, is adding a dirty flag there as well.
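
For illustration, here is a minimal sketch of how such a dirty flag
could work (invented names and layout, not the actual qcow2 code):

#include <stdint.h>
#include <unistd.h>

#define HDR_FLAG_DIRTY 0x1u

struct hdr {
    uint32_t flags;
    /* ... rest of the image header ... */
};

/* Before the first allocating write, set the dirty flag and flush the
 * header once; afterwards, metadata writes need not be flushed on
 * every update. */
static int mark_dirty(struct hdr *h, int fd)
{
    if (h->flags & HDR_FLAG_DIRTY)
        return 0;                     /* already marked, fast path */
    h->flags |= HDR_FLAG_DIRTY;
    if (pwrite(fd, h, sizeof(*h), 0) != (ssize_t)sizeof(*h))
        return -1;
    return fsync(fd);                 /* flag must hit the disk first */
}

/* On clean shutdown, flush all metadata and clear the flag again. If
 * the flag is still set when the image is opened, rebuild (fsck) the
 * metadata before use. */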

In my opinion, an additional flag certainly doesn't justify maintaining
an additional format instead of extending the existing one.

Likewise, I think FVD might provide some ideas that we can integrate as
well, I just don't see a justification to include it as a separate format.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-21  8:59           ` Kevin Wolf
@ 2011-02-21 13:44             ` Stefan Hajnoczi
  2011-02-21 14:10               ` Kevin Wolf
                                 ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Stefan Hajnoczi @ 2011-02-21 13:44 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Chunqiang Tang, qemu-devel, Markus Armbruster, Aurelien Jarno,
	Stefan Hajnoczi

On Mon, Feb 21, 2011 at 8:59 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> In fact, the only area where qcow2 performs really badly in 0.14 is
> cache=writethrough (which unfortunately is the default...). With
> cache=none it's easy to find scenarios where it provides higher
> throughput than QED.

Yeah, I'm tempted to implement parallel allocating writes now so I can
pick on qcow2 in all benchmarks again ;).

> Anyway, there's really only one crucial difference between QED and
> qcow2, which is that qcow2 ensures that metadata is consistent on disk
> at any time whereas QED relies on a dirty flag and rebuilds metadata
> after a crash (basically requiring an fsck). The obvious solution if you
> want to have this in qcow2, is adding a dirty flag there as well.
>
> Likewise, I think FVD might provide some ideas that we can integrate as
> well, I just don't see a justification to include it as a separate format.

You think that QED and FVD can be integrated into a QCOW2-based
format.  I agree it's possible and has some value.  It isn't pretty
and I would prefer to work on a clean new format because that, too,
has value.

In any case, the next step is to get down to specifics.  Here is the
page with the current QCOW3 roadmap:

http://wiki.qemu.org/Qcow3_Roadmap

Please raise concrete requirements or features so they can be
discussed and captured.

For example, journalling is an alternative to the dirty bit approach.
If you feel that journalling is the best technique to address
consistent updates, then make your case outside the context of today's
qcow2, QED, and FVD implementations (although benchmark data will rely
on current implementations).  Explain how the technique would fit into
QCOW3 and what format changes need to be made.
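
To make the journalling option concrete, it could be as simple as a
fixed-size record appended for every metadata change and replayed at
open time. A rough sketch with an invented record layout:

#include <stdint.h>

struct journal_rec {
    uint32_t magic;      /* identifies a valid record */
    uint32_t type;       /* e.g. cluster allocation, table update */
    uint64_t offset;     /* metadata location being updated */
    uint64_t value;      /* new value for that location */
    uint32_t checksum;   /* detects torn writes after a crash */
};

/* Records are appended sequentially, so metadata updates become cheap
 * sequential writes; at open time, replay every valid record since the
 * last checkpoint, stopping at the first bad checksum. */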

I think this is the level we need to discuss at rather than qcow2 vs QED vs FVD.

Stefan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-21 13:44             ` Stefan Hajnoczi
@ 2011-02-21 14:10               ` Kevin Wolf
  2011-02-21 15:16                 ` Anthony Liguori
  2011-02-23  3:32               ` Chunqiang Tang
       [not found]               ` <OFAEB4CD91.BE989F29-ON8525783F.007366B8-85257840.00130B47@LocalDomain>
  2 siblings, 1 reply; 87+ messages in thread
From: Kevin Wolf @ 2011-02-21 14:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Chunqiang Tang, qemu-devel, Markus Armbruster, Aurelien Jarno,
	Stefan Hajnoczi

Am 21.02.2011 14:44, schrieb Stefan Hajnoczi:
> On Mon, Feb 21, 2011 at 8:59 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> In fact, the only area where qcow2 performs really badly in 0.14 is
>> cache=writethrough (which unfortunately is the default...). With
>> cache=none it's easy to find scenarios where it provides higher
>> throughput than QED.
> 
> Yeah, I'm tempted to implement parallel allocating writes now so I can
> pick on qcow2 in all benchmarks again ;).

Heh. ;-)

In the end it just shows that the differences are mainly in the
implementation, not in the format.

>> Anyway, there's really only one crucial difference between QED and
>> qcow2, which is that qcow2 ensures that metadata is consistent on disk
>> at any time whereas QED relies on a dirty flag and rebuilds metadata
>> after a crash (basically requiring an fsck). The obvious solution if you
>> want to have this in qcow2, is adding a dirty flag there as well.
>>
>> Likewise, I think FVD might provide some ideas that we can integrate as
>> well, I just don't see a justification to include it as a separate format.
> 
> You think that QED and FVD can be integrated into a QCOW2-based
> format.  I agree it's possible and has some value.  It isn't pretty
> and I would prefer to work on a clean new format because that, too,
> has value.
> 
> In any case, the next step is to get down to specifics.  Here is the
> page with the current QCOW3 roadmap:
> 
> http://wiki.qemu.org/Qcow3_Roadmap
> 
> Please raise concrete requirements or features so they can be
> discussed and captured.
> 
> For example, journalling is an alternative to the dirty bit approach.
> If you feel that journalling is the best technique to address
> consistent updates, then make your case outside the context of today's
> qcow2, QED, and FVD implementations (although benchmark data will rely
> on current implementations).  Explain how the technique would fit into
> QCOW3 and what format changes need to be made.

I think journalling is an interesting option, but I'm not sure if we
should target it for 0.15. As you know, there's already more than enough
stuff to do until then, with coroutines etc. The dirty flag thing would
be way easier to implement. We can always add a journal as a compatible
feature in 0.16.

To be honest, I'm not even sure any more that the dirty flag is that
important. Originally we were talking about cache=none and it
definitely makes a big difference there because we save flushes.
However, we're talking about cache=writethrough now and you flush on any
write. It might be more important to make things parallel for writethrough.

Maybe not writing out refcounts is something we should measure before we
start implementing anything. (It's easy to disable all writes for a
benchmark, even if the image will be broken afterwards)
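
Something like the following would do for such a throwaway measurement
(update_refcount_on_disk is an invented name, not the real qcow2
function; the resulting image is broken by design):

#include <stdbool.h>
#include <stdint.h>

/* Benchmark-only switch: skip refcount writeback to measure its cost. */
static bool benchmark_skip_refcounts = true;

static int update_refcount_on_disk(uint64_t offset, int count)
{
    (void)offset;
    (void)count;
    if (benchmark_skip_refcounts)
        return 0;          /* pretend the update succeeded */
    /* ... the normal refcount block update and flush would go here ... */
    return 0;
}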

> I think this is the level we need to discuss at rather than qcow2 vs QED vs FVD.

Definitely more productive, yes.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-21 14:10               ` Kevin Wolf
@ 2011-02-21 15:16                 ` Anthony Liguori
  2011-02-21 15:26                   ` Kevin Wolf
  0 siblings, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-21 15:16 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, qemu-devel, Markus Armbruster,
	Chunqiang Tang, Aurelien Jarno

On 02/21/2011 08:10 AM, Kevin Wolf wrote:
> Am 21.02.2011 14:44, schrieb Stefan Hajnoczi:
>    
>> On Mon, Feb 21, 2011 at 8:59 AM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>      
>>> In fact, the only area where qcow2 performs really badly in 0.14 is
>>> cache=writethrough (which unfortunately is the default...). With
>>> cache=none it's easy to find scenarios where it provides higher
>>> throughput than QED.
>>>        
>> Yeah, I'm tempted to implement parallel allocating writes now so I can
>> pick on qcow2 in all benchmarks again ;).
>>      
> Heh. ;-)
>
> In the end it just shows that the differences are mainly in the
> implementation, not in the format.
>
>    
>>> Anyway, there's really only one crucial difference between QED and
>>> qcow2, which is that qcow2 ensures that metadata is consistent on disk
>>> at any time whereas QED relies on a dirty flag and rebuilds metadata
>>> after a crash (basically requiring an fsck). The obvious solution if you
>>> want to have this in qcow2, is adding a dirty flag there as well.
>>>
>>> Likewise, I think FVD might provide some ideas that we can integrate as
>>> well, I just don't see a justification to include it as a separate format.
>>>        
>> You think that QED and FVD can be integrated into a QCOW2-based
>> format.  I agree it's possible and has some value.  It isn't pretty
>> and I would prefer to work on a clean new format because that, too,
>> has value.
>>
>> In any case, the next step is to get down to specifics.  Here is the
>> page with the current QCOW3 roadmap:
>>
>> http://wiki.qemu.org/Qcow3_Roadmap
>>
>> Please raise concrete requirements or features so they can be
>> discussed and captured.
>>
>> For example, journalling is an alternative to the dirty bit approach.
>> If you feel that journalling is the best technique to address
>> consistent updates, then make your case outside the context of today's
>> qcow2, QED, and FVD implementations (although benchmark data will rely
>> on current implementations).  Explain how the technique would fit into
>> QCOW3 and what format changes need to be made.
>>      
> I think journalling is an interesting option, but I'm not sure if we
> should target it for 0.15. As you know, there's already more than enough
> stuff to do until then, with coroutines etc. The dirty flag thing would
> be way easier to implement. We can always add a journal as a compatible
> feature in 0.16.
>
> To be honest, I'm not even sure any more that the dirty flag is that
> important. Originally we were talking about cache=none and it
> definitely makes a big difference there because we save flushes.
> However, we're talking about cache=writethrough now and you flush on any
> write. It might be more important to make things parallel for writethrough.
>    

One thing I wonder about is whether we really need to have cache=X and 
wce=X.  I never really minded the fact that cache=none advertised wce=on 
because we behaved effectively as if wce=on.  But now that qcow2 
triggers on wce=on, I'm a bit concerned that we're introducing a subtle 
degradation that most people won't realize.

Ignoring some of the problems with O_DIRECT, semantically, I think 
there's a strong use-case for cache=none, wce=off.
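
Spelled out as command lines (the wce= option is the proposed
extension, not something QEMU accepts today):

# host page cache bypassed, guest still sees a volatile write cache
qemu -drive file=disk.qcow2,cache=none

# proposed: O_DIRECT on the host *and* no guest-visible write cache,
# so a write is durable by the time it completes
qemu -drive file=disk.qcow2,cache=none,wce=off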

Regards,

Anthony Liguori

> Maybe not writing out refcounts is something we should measure before we
> start implementing anything. (It's easy to disable all writes for a
> benchmark, even if the image will be broken afterwards)
>
>    
>> I think this is the level we need to discuss at rather than qcow2 vs QED vs FVD.
>>      
> Definitely more productive, yes.
>
> Kevin
>
>    

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-21 15:16                 ` Anthony Liguori
@ 2011-02-21 15:26                   ` Kevin Wolf
  0 siblings, 0 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-02-21 15:26 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, qemu-devel, Markus Armbruster,
	Chunqiang Tang, Christoph Hellwig, Aurelien Jarno

Am 21.02.2011 16:16, schrieb Anthony Liguori:
> On 02/21/2011 08:10 AM, Kevin Wolf wrote:
>> Am 21.02.2011 14:44, schrieb Stefan Hajnoczi:
>>    
>>> On Mon, Feb 21, 2011 at 8:59 AM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>>      
>>>> In fact, the only area where qcow2 performs really badly in 0.14 is
>>>> cache=writethrough (which unfortunately is the default...). With
>>>> cache=none it's easy to find scenarios where it provides higher
>>>> throughput than QED.
>>>>        
>>> Yeah, I'm tempted to implement parallel allocating writes now so I can
>>> pick on qcow2 in all benchmarks again ;).
>>>      
>> Heh. ;-)
>>
>> In the end it just shows that the differences are mainly in the
>> implementation, not in the format.
>>
>>    
>>>> Anyway, there's really only one crucial difference between QED and
>>>> qcow2, which is that qcow2 ensures that metadata is consistent on disk
>>>> at any time whereas QED relies on a dirty flag and rebuilds metadata
>>>> after a crash (basically requiring an fsck). The obvious solution if you
>>>> want to have this in qcow2, is adding a dirty flag there as well.
>>>>
>>>> Likewise, I think FVD might provide some ideas that we can integrate as
>>>> well, I just don't see a justification to include it as a separate format.
>>>>        
>>> You think that QED and FVD can be integrated into a QCOW2-based
>>> format.  I agree it's possible and has some value.  It isn't pretty
>>> and I would prefer to work on a clean new format because that, too,
>>> has value.
>>>
>>> In any case, the next step is to get down to specifics.  Here is the
>>> page with the current QCOW3 roadmap:
>>>
>>> http://wiki.qemu.org/Qcow3_Roadmap
>>>
>>> Please raise concrete requirements or features so they can be
>>> discussed and captured.
>>>
>>> For example, journalling is an alternative to the dirty bit approach.
>>> If you feel that journalling is the best technique to address
>>> consistent updates, then make your case outside the context of today's
>>> qcow2, QED, and FVD implementations (although benchmark data will rely
>>> on current implementations).  Explain how the technique would fit into
>>> QCOW3 and what format changes need to be made.
>>>      
>> I think journalling is an interesting option, but I'm not sure if we
>> should target it for 0.15. As you know, there's already more than enough
>> stuff to do until then, with coroutines etc. The dirty flag thing would
>> be way easier to implement. We can always add a journal as a compatible
>> feature in 0.16.
>>
>> To be honest, I'm not even sure any more that the dirty flag is that
>> important. Originally we were talking about cache=none and it
>> definitely makes a big difference there because we save flushes.
>> However, we're talking about cache=writethrough now and you flush on any
>> write. It might be more important to make things parallel for writethrough.
>>    
> 
> One thing I wonder about is whether we really need to have cache=X and 
> wce=X.  I never really minded the fact that cache=none advertised wce=on 
> because we behaved effectively as if wce=on.  But now that qcow2 
> triggers on wce=on, I'm a bit concerned that we're introducing a subtle 
> degradation that most people won't realize.
> 
> Ignoring some of the problems with O_DIRECT, semantically, I think 
> there's a strong use-case for cache=none, wce=off.

Fully agree, there's no real reason for having three writeback modes,
but only one writethrough mode. It should be completely symmetrical.
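
Laid out as a matrix, the missing combination is obvious (the name in
parentheses is illustrative, not an existing option):

                       guest write cache on    guest write cache off
  host page cache:     cache=writeback         cache=writethrough
  O_DIRECT:            cache=none              (missing, "directsync"?)
  no flushes at all:   cache=unsafe            n/a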

I think Christoph has mentioned several times that he has some patches
for this. What's the status of them, Christoph?

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-18 14:20         ` Anthony Liguori
@ 2011-02-22  8:37           ` Markus Armbruster
  2011-02-22  8:56             ` Kevin Wolf
  0 siblings, 1 reply; 87+ messages in thread
From: Markus Armbruster @ 2011-02-22  8:37 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Stefan Hajnoczi

Anthony Liguori <anthony@codemonkey.ws> writes:

> On 02/18/2011 03:57 AM, Kevin Wolf wrote:
>> Am 18.02.2011 10:12, schrieb Markus Armbruster:
>>    
>>> Kevin Wolf<kwolf@redhat.com>  writes:
>>>
>>>      
>>>> Am 15.02.2011 20:45, schrieb Chunqiang Tang:
>>>>        
>>>>>> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
>>>>>> As you requested, I set up a wiki page for FVD at
>>>>>>            
>>>>> http://wiki.qemu.org/Features/FVD
>>>>>          
>>>>>> . It includes a summary of FVD, a detailed specification of FVD, and a
>>>>>> comparison of the design and performance of FVD and QED.
>>>>>>            
>>>>>          
>>>>>> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
>>>>>>            
>>>>> figure
>>>>>          
>>>>>> shows that the file creation throughput of NetApp's PostMark benchmark
>>>>>>            
>>>>> under
>>>>>          
>>>>>> FVD is 74.9% to 215% higher than that under QED.
>>>>>>            
>>>>> Hi Anthony,
>>>>>
>>>>> Please let me know if more information is needed. I would appreciate your
>>>>> feedback and advice on the best way to proceed with FVD.
>>>>>          
>>>> Yet another file format with yet another implementation is definitely
>>>> not what we need. We should probably take some of the ideas in FVD and
>>>> consider them for qcow3.
>>>>        
>>> Got an assumption there: that the one COW format we need must be qcow3,
>>> i.e. an evolution of qcow2.  Needs to be justified.  If that discussion
>>> has happened on the list already, I missed it.  If not, it's overdue,
>>> and then we better start it right away.
>>>      
>> Right. I probably wasn't very clear about what I mean by qcow3 either,
>> so let me try to summarize my reasoning.
>>
>>
>> The first point is an assumption that you made, too: That we want to
>> have only one format. I hope it's easy to agree on this, duplication is
>> bad and every additional format creates new maintenance burden,
>> especially if we're taking it seriously. Until now, there were exactly two
>> formats for which we managed to do this, raw and qcow2. raw is more or
>> less for free, so with the introduction of another format, we basically
>> double the supported block driver code overnight (while not doubling the
>> number of developers).
>>    
>
> Not sure what project you're following, but we've had an awful lot of
> formats before qcow2 :-)
>
> And qcow2 was never all that special, it just was dropped in the code
> base one day.  You've put a lot of work into qcow2, but there are
> other folks that are contributing additional formats and that means
> more developers.
>
>> The consequence of having only one file format is that it must be able
>> to obsolete the existing ones, most notably qcow2. We can only neglect
>> qcow1 today because we can tell users to use qcow2. It supports
>> everything that qcow1 supports and more. We couldn't have done this if
>> qcow2 lacked features compared to qcow1.
>>
>> So the one really essential requirement that I see is that we provide a
>> way forward for _all_ users by maintaining all of qcow2's features. This
>> is the only way of getting people to not stay with qcow2.
>>
>>
>> Of course, you could invent another format that implements the same
>> features, but I think just carefully extending qcow2 has some real
>> advantages.
>>
>> The first is that conversion of existing images would be really easy.
>> Basically increment the version number in the image header and you're
>> done. Structures would be compatible.
>
> qemu-img convert is a reasonable path for conversion.
>
>>   If you compare it to file systems,
>> I rarely ever change the file system on a non-empty partition. Even if I
>> wanted, it's usually just too painful. Except when I was able to use
>> "tune2fs -j" to make ext3 out of ext2, that was really easy. We can
>> provide the same for qcow2 to qcow3 conversion, but not with a
>> completely new format.
>>
>> Also, while obsoleting a file format means that we need not put much
>> effort in its maintenance, we still need to keep the code around for
>> reading old images. With an extension of qcow2, it would be the same
>> code that is used for both versions.
>>
>> Third, qcow2 already exists, is used in practice and we have put quite
>> some effort into QA. At least initially confidence would be higher than
>> in a completely new, yet untested format. Remember that with qcow3 I'm
>> not talking about rewriting everything, it's a careful evolution, mostly
>> with optional additions here and there.
>>    
>
> My requirements for a new format are as follows:
>
> 1) documented, thought-out specification that is covered under an
> open license with a clear process for extension.
>
> 2) ability to add both compatible and incompatible features in a
> graceful way
>
> 3) ability to achieve performance that's close to raw.  I want our new
> format to be able to be used universally both for servers and
> desktops.

I'd like to add

4) minimize complexity and maximize maintainability of the code.  I'd
gladly sacrifice nice-to-have features for that.

> I think qcow2 has some misfeatures like compression and internal
> snapshots.  I think preserving those misfeatures is a mistake because
> I don't think we can satisfy the above while trying to preserve those
> features.  If the image format degrades when those features are
> enabled, then it decreases confidence in the format.

I'm inclined to agree.  There's one way to prove us wrong: implement the
misfeatures without compromising the requirements.

> I think QED satisfies all of these today.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-20 22:13         ` [Qemu-devel] Re: Strategic decision: COW format Aurelien Jarno
  2011-02-21  8:59           ` Kevin Wolf
@ 2011-02-22  8:40           ` Markus Armbruster
  1 sibling, 0 replies; 87+ messages in thread
From: Markus Armbruster @ 2011-02-22  8:40 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Stefan Hajnoczi

Aurelien Jarno <aurelien@aurel32.net> writes:

[...]
> I agree that having a single format would be best, and it's probably a
> goal worth having. That said, what is most important in my view is
> having one or two formats which together have _all_ the features (and
> here I consider speed a feature) of the existing qcow2 format. QED and
> FVD have been designed with "virtualization in a datacenter" in mind,
> and are very good for this use. OTOH they don't support compression or
> snapshotting, which are quite useful for demos, debugging, testing, or
> even for occasionally running a Windows VM; in other words, in
> situations where speed is not the priority.

Speed not a priority means the requirements are pretty radically
different.  Satisfying two radically different sets of requirements with
the same format could be difficult.  Great to have, but possibly
difficult.

> If we can't find a tradeoff for that, we should go for two image
> formats instead of one.

Less bad than a jack-of-all-trades.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-22  8:37           ` Markus Armbruster
@ 2011-02-22  8:56             ` Kevin Wolf
  2011-02-22 10:21               ` Markus Armbruster
                                 ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-02-22  8:56 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Chunqiang Tang, qemu-devel, Stefan Hajnoczi

Am 22.02.2011 09:37, schrieb Markus Armbruster:
> Anthony Liguori <anthony@codemonkey.ws> writes:
> 
>> On 02/18/2011 03:57 AM, Kevin Wolf wrote:
>>> Am 18.02.2011 10:12, schrieb Markus Armbruster:
>>>    
>>>> Kevin Wolf<kwolf@redhat.com>  writes:
>>>>
>>>>      
>>>>> Am 15.02.2011 20:45, schrieb Chunqiang Tang:
>>>>>        
>>>>>>> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
>>>>>>> As you requested, I set up a wiki page for FVD at
>>>>>>>            
>>>>>> http://wiki.qemu.org/Features/FVD
>>>>>>          
>>>>>>> . It includes a summary of FVD, a detailed specification of FVD, and a
>>>>>>> comparison of the design and performance of FVD and QED.
>>>>>>>            
>>>>>>          
>>>>>>> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
>>>>>>>            
>>>>>> figure
>>>>>>          
>>>>>>> shows that the file creation throughput of NetApp's PostMark benchmark
>>>>>>>            
>>>>>> under
>>>>>>          
>>>>>>> FVD is 74.9% to 215% higher than that under QED.
>>>>>>>            
>>>>>> Hi Anthony,
>>>>>>
>>>>>> Please let me know if more information is needed. I would appreciate your
>>>>>> feedback and advice on the best way to proceed with FVD.
>>>>>>          
>>>>> Yet another file format with yet another implementation is definitely
>>>>> not what we need. We should probably take some of the ideas in FVD and
>>>>> consider them for qcow3.
>>>>>        
>>>> Got an assumption there: that the one COW format we need must be qcow3,
>>>> i.e. an evolution of qcow2.  Needs to be justified.  If that discussion
>>>> has happened on the list already, I missed it.  If not, it's overdue,
>>>> and then we better start it right away.
>>>>      
>>> Right. I probably wasn't very clear about what I mean by qcow3 either,
>>> so let me try to summarize my reasoning.
>>>
>>>
>>> The first point is an assumption that you made, too: That we want to
>>> have only one format. I hope it's easy to agree on this, duplication is
>>> bad and every additional format creates new maintenance burden,
>>> especially if we're taking it seriously. Until now, there were exactly two
>>> formats for which we managed to do this, raw and qcow2. raw is more or
>>> less for free, so with the introduction of another format, we basically
>>> double the supported block driver code overnight (while not doubling the
>>> number of developers).
>>>    
>>
>> Not sure what project you're following, but we've had an awful lot of
>> formats before qcow2 :-)
>>
>> And qcow2 was never all that special, it just was dropped in the code
>> base one day.  You've put a lot of work into qcow2, but there are
>> other folks that are contributing additional formats and that means
>> more developers.
>>
>>> The consequence of having only one file format is that it must be able
>>> to obsolete the existing ones, most notably qcow2. We can only neglect
>>> qcow1 today because we can tell users to use qcow2. It supports
>>> everything that qcow1 supports and more. We couldn't have done this if
>>> qcow2 lacked features compared to qcow1.
>>>
>>> So the one really essential requirement that I see is that we provide a
>>> way forward for _all_ users by maintaining all of qcow2's features. This
>>> is the only way of getting people to not stay with qcow2.
>>>
>>>
>>> Of course, you could invent another format that implements the same
>>> features, but I think just carefully extending qcow2 has some real
>>> advantages.
>>>
>>> The first is that conversion of existing images would be really easy.
>>> Basically increment the version number in the image header and you're
>>> done. Structures would be compatible.
>>
>> qemu-img convert is a reasonable path for conversion.
>>
>>>   If you compare it to file systems,
>>> I rarely ever change the file system on a non-empty partition. Even if I
>>> wanted, it's usually just too painful. Except when I was able to use
>>> "tune2fs -j" to make ext3 out of ext2, that was really easy. We can
>>> provide the same for qcow2 to qcow3 conversion, but not with a
>>> completely new format.
>>>
>>> Also, while obsoleting a file format means that we need not put much
>>> effort in its maintenance, we still need to keep the code around for
>>> reading old images. With an extension of qcow2, it would be the same
>>> code that is used for both versions.
>>>
>>> Third, qcow2 already exists, is used in practice and we have put quite
>>> some effort into QA. At least initially confidence would be higher than
>>> in a completely new, yet untested format. Remember that with qcow3 I'm
>>> not talking about rewriting everything, it's a careful evolution, mostly
>>> with optional additions here and there.
>>>    
>>
>> My requirements for a new format are as follows:
>>
>> 1) documented, thought-out specification that is covered under an
>> open license with a clear process for extension.
>>
>> 2) ability to add both compatible and incompatible features in a
>> graceful way
>>
>> 3) ability to achieve performance that's close to raw.  I want our new
>> format to be able to be used universally both for servers and
>> desktops.
> 
> I'd like to add
> 
> 4) minimize complexity and maximize maintainability of the code.  I'd
> gladly sacrifice nice-to-have features for that.

Especially if they are features that only other users use, right?

What's the "Sankt-Florians-Prinzip" called in English?

>> I think qcow2 has some misfeatures like compression and internal
>> snapshots.  I think preserving those misfeatures is a mistake because
>> I don't think we can satisfy the above while trying to preserve those
>> features.  If the image format degrades when those features are
>> enabled, then it decreases confidence in the format.
> 
> I'm inclined to agree.  There's one way to prove us wrong: implement the
> misfeatures without compromising the requirements.

*sigh*

It starts to get annoying, but if you really insist, I can repeat it
once more: These features that you don't need (this is the correct
description for what you call "misfeatures") _are_ implemented in a way
that they don't impact the "normal" case. And they already are today.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-22  8:56             ` Kevin Wolf
@ 2011-02-22 10:21               ` Markus Armbruster
  2011-02-22 15:57               ` Anthony Liguori
  2011-02-23 13:43               ` Avi Kivity
  2 siblings, 0 replies; 87+ messages in thread
From: Markus Armbruster @ 2011-02-22 10:21 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, qemu-devel, Stefan Hajnoczi

Kevin Wolf <kwolf@redhat.com> writes:

> Am 22.02.2011 09:37, schrieb Markus Armbruster:
>> Anthony Liguori <anthony@codemonkey.ws> writes:
>> 
>>> On 02/18/2011 03:57 AM, Kevin Wolf wrote:
>>>> Am 18.02.2011 10:12, schrieb Markus Armbruster:
>>>>    
>>>>> Kevin Wolf<kwolf@redhat.com>  writes:
>>>>>
>>>>>      
>>>>>> Am 15.02.2011 20:45, schrieb Chunqiang Tang:
>>>>>>        
>>>>>>>> Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM:
>>>>>>>> As you requested, I set up a wiki page for FVD at
>>>>>>>>            
>>>>>>> http://wiki.qemu.org/Features/FVD
>>>>>>>          
>>>>>>>> . It includes a summary of FVD, a detailed specification of FVD, and a
>>>>>>>> comparison of the design and performance of FVD and QED.
>>>>>>>>            
>>>>>>>          
>>>>>>>> See the figure at http://wiki.qemu.org/Features/FVD/Compare . This
>>>>>>>>            
>>>>>>> figure
>>>>>>>          
>>>>>>>> shows that the file creation throughput of NetApp's PostMark benchmark
>>>>>>>>            
>>>>>>> under
>>>>>>>          
>>>>>>>> FVD is 74.9% to 215% higher than that under QED.
>>>>>>>>            
>>>>>>> Hi Anthony,
>>>>>>>
>>>>>>> Please let me know if more information is needed. I would appreciate your
>>>>>>> feedback and advice on the best way to proceed with FVD.
>>>>>>>          
>>>>>> Yet another file format with yet another implementation is definitely
>>>>>> not what we need. We should probably take some of the ideas in FVD and
>>>>>> consider them for qcow3.
>>>>>>        
>>>>> Got an assumption there: that the one COW format we need must be qcow3,
>>>>> i.e. an evolution of qcow2.  Needs to be justified.  If that discussion
>>>>> has happened on the list already, I missed it.  If not, it's overdue,
>>>>> and then we better start it right away.
>>>>>      
>>>> Right. I probably wasn't very clear about what I mean by qcow3 either,
>>>> so let me try to summarize my reasoning.
>>>>
>>>>
>>>> The first point is an assumption that you made, too: That we want to
>>>> have only one format. I hope it's easy to agree on this, duplication is
>>>> bad and every additional format creates new maintenance burden,
>>>> especially if we're taking it seriously. Until now, there were exactly two
>>>> formats for which we managed to do this, raw and qcow2. raw is more or
>>>> less for free, so with the introduction of another format, we basically
>>>> double the supported block driver code overnight (while not doubling the
>>>> number of developers).
>>>>    
>>>
>>> Not sure what project you're following, but we've had an awful lot of
>>> formats before qcow2 :-)
>>>
>>> And qcow2 was never all that special, it just was dropped in the code
>>> base one day.  You've put a lot of work into qcow2, but there are
>>> other folks that are contributing additional formats and that means
>>> more developers.
>>>
>>>> The consequence of having only one file format is that it must be able
>>>> to obsolete the existing ones, most notably qcow2. We can only neglect
>>>> qcow1 today because we can tell users to use qcow2. It supports
>>>> everything that qcow1 supports and more. We couldn't have done this if
>>>> qcow2 lacked features compared to qcow1.
>>>>
>>>> So the one really essential requirement that I see is that we provide a
>>>> way forward for _all_ users by maintaining all of qcow2's features. This
>>>> is the only way of getting people to not stay with qcow2.
>>>>
>>>>
>>>> Of course, you could invent another format that implements the same
>>>> features, but I think just carefully extending qcow2 has some real
>>>> advantages.
>>>>
>>>> The first is that conversion of existing images would be really easy.
>>>> Basically increment the version number in the image header and you're
>>>> done. Structures would be compatible.
>>>
>>> qemu-img convert is a reasonable path for conversion.
>>>
>>>>   If you compare it to file systems,
>>>> I rarely ever change the file system on a non-empty partition. Even if I
>>>> wanted, it's usually just too painful. Except when I was able to use
>>>> "tune2fs -j" to make ext3 out of ext2, that was really easy. We can
>>>> provide the same for qcow2 to qcow3 conversion, but not with a
>>>> completely new format.
>>>>
>>>> Also, while obsoleting a file format means that we need not put much
>>>> effort in its maintenance, we still need to keep the code around for
>>>> reading old images. With an extension of qcow2, it would be the same
>>>> code that is used for both versions.
>>>>
>>>> Third, qcow2 already exists, is used in practice and we have put quite
>>>> some effort into QA. At least initially confidence would be higher than
>>>> in a completely new, yet untested format. Remember that with qcow3 I'm
>>>> not talking about rewriting everything, it's a careful evolution, mostly
>>>> with optional additions here and there.
>>>>    
>>>
>>> My requirements for a new format are as follows:
>>>
>>> 1) documented, thought-out specification that is covered under an
>>> open license with a clear process for extension.
>>>
>>> 2) ability to add both compatible and incompatible features in a
>>> graceful way
>>>
>>> 3) ability to achieve performance that's close to raw.  I want our new
>>> format to be able to be used universally both for servers and
>>> desktops.
>> 
>> I'd like to add
>> 
>> 4) minimize complexity and maximize maintainability of the code.  I'd
>> gladly sacrifice nice-to-have features for that.
>
> Especially if they are features that only other users use, right?
>
> What's the "Sankt-Florians-Prinzip" called in English?

NIMBY (not in my backyard)

Separating must-have from nice-to-have is basic requirements analysis.
Calling anything anybody would ever want a requirement isn't.

>>> I think qcow2 has some misfeatures like compression and internal
>>> snapshots.  I think preserving those misfeatures is a mistake because
>>> I don't think we can satisfy the above while trying to preserve those
>>> features.  If the image format degrades when those features are
>>> enabled, then it decreases confidence in the format.
>> 
>> I'm inclined to agree.  There's one way to prove us wrong: implement the
>> misfeatures without compromising the requirements.
>
> *sigh*
>
> It starts to get annoying, but if you really insist, I can repeat it
> once more: These features that you don't need (this is the correct
> description for what you call "misfeatures") _are_ implemented in a way
> that they don't impact the "normal" case. And they already are today.

Then you're on track to proving us wrong.  No need to get annoyed, just
finish the proof by showing us code that satisfies the nice-to-have
requirements in addition to the must-have requirements :)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-22  8:56             ` Kevin Wolf
  2011-02-22 10:21               ` Markus Armbruster
@ 2011-02-22 15:57               ` Anthony Liguori
  2011-02-22 16:15                 ` Kevin Wolf
  2011-02-23 13:43               ` Avi Kivity
  2 siblings, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-22 15:57 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, Markus Armbruster, Stefan Hajnoczi, qemu-devel

On 02/22/2011 02:56 AM, Kevin Wolf wrote:
>
> *sigh*
>
> It starts to get annoying, but if you really insist, I can repeat it
> once more: These features that you don't need (this is the correct
> description for what you call "misfeatures") _are_ implemented in a way
> that they don't impact the "normal" case.

Except that they require a refcount table that adds additional metadata 
that needs to be updated in the fast path.  I consider that impacting 
the normal case.

Regards,

Anthony Liguori

>   And they already are today.
>
> Kevin
>    

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-22 15:57               ` Anthony Liguori
@ 2011-02-22 16:15                 ` Kevin Wolf
  2011-02-22 18:18                   ` Anthony Liguori
  0 siblings, 1 reply; 87+ messages in thread
From: Kevin Wolf @ 2011-02-22 16:15 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, Markus Armbruster, Stefan Hajnoczi, qemu-devel

Am 22.02.2011 16:57, schrieb Anthony Liguori:
> On 02/22/2011 02:56 AM, Kevin Wolf wrote:
>>
>> *sigh*
>>
>> It starts to get annoying, but if you really insist, I can repeat it
>> once more: These features that you don't need (this is the correct
>> description for what you call "misfeatures") _are_ implemented in a way
>> that they don't impact the "normal" case.
> 
> Except that they require a refcount table that adds additional metadata 
> that needs to be updated in the fast path.  I consider that impacting 
> the normal case.

Like it or not, this requirement exists anyway, without any of your
"misfeatures".

You chose to use the dirty flag in QED in order to avoid having to flush
metadata too often, which is an approach that any other format, even one
using refcounts, can take as well.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-22 16:15                 ` Kevin Wolf
@ 2011-02-22 18:18                   ` Anthony Liguori
  2011-02-23  9:13                     ` Kevin Wolf
  0 siblings, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-22 18:18 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, Markus Armbruster, Stefan Hajnoczi, qemu-devel

On 02/22/2011 10:15 AM, Kevin Wolf wrote:
> Am 22.02.2011 16:57, schrieb Anthony Liguori:
>    
>> On 02/22/2011 02:56 AM, Kevin Wolf wrote:
>>      
>>> *sigh*
>>>
>>> It starts to get annoying, but if you really insist, I can repeat it
>>> once more: These features that you don't need (this is the correct
>>> description for what you call "misfeatures") _are_ implemented in a way
>>> that they don't impact the "normal" case.
>>>        
>> Except that they require a refcount table that adds additional metadata
>> that needs to be updated in the fast path.  I consider that impacting
>> the normal case.
>>      
> Like it or not, this requirement exists anyway, without any of your
> "misfeatures".
>
> You chose to use the dirty flag in QED in order to avoid having to flush
> metadata too often, which is an approach that any other format, even one
> using refcounts, can take as well.
>    

It's a minor detail, but flushing and the amount of metadata are 
separate points.

The dirty flag prevents metadata from being flushed to disk very often 
but the use of a refcount table adds additional metadata.

A refcount table is definitely not required even if you claim the 
requirement exists for other features.  I assume you mean to implement 
trim/discard support but instead of a refcount table, a free list would 
work just as well and would leave the metadata update out of the fast 
path (allocating writes) and instead only be in the slow path 
(trim/discard).

As a format feature, a refcount table really only makes sense if the 
refcount is required to be greater than a single bit.  There are more 
optimal data structures that can be used if the refcount of a block is 
fixed to 1-bit (like a free list), which is the fundamental design
difference between qcow2 and qed.

The only use of a refcount of more than 1-bit is internal snapshots AFAICT.
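
To spell out the difference (invented in-memory layouts, not the qcow2
or QED on-disk formats):

#include <stdint.h>

#define NUM_CLUSTERS 16384

/* qcow2-style: a refcount per cluster; values above 1 are exactly what
 * internal snapshots and compressed clusters rely on. */
static uint16_t refcount[NUM_CLUSTERS];

/* free-list style: a cluster is either in use or free, i.e. an
 * effectively 1-bit refcount. */
struct free_node {
    uint64_t cluster;
    struct free_node *next;
};
static struct free_node *free_list;

/* allocate: pop the free list head, or grow the image file;
 * discard:  push the cluster back. Only discard (the slow path) has
 * to persist a free-list change. */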

Regards,

Anthony Liguori

> Kevin
>    

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-21 13:44             ` Stefan Hajnoczi
  2011-02-21 14:10               ` Kevin Wolf
@ 2011-02-23  3:32               ` Chunqiang Tang
  2011-02-23 13:20                 ` Markus Armbruster
       [not found]               ` <OFAEB4CD91.BE989F29-ON8525783F.007366B8-85257840.00130B47@LocalDomain>
  2 siblings, 1 reply; 87+ messages in thread
From: Chunqiang Tang @ 2011-02-23  3:32 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Markus Armbruster, qemu-devel, Aurelien Jarno

> In any case, the next step is to get down to specifics.  Here is the
> page with the current QCOW3 roadmap:
> 
> http://wiki.qemu.org/Qcow3_Roadmap
>
> Please raise concrete requirements or features so they can be
> discussed and captured.

Now this is turning into a more productive discussion, but it seems to lose the 
big picture too quickly by going too narrowly into issues like the 
“dirty bit”. Let’s try to answer a bigger question: how to take a holistic 
approach to address all the factors that make a virtual disk slower than a 
physical disk? Even if issues like the “dirty bit” are addressed 
perfectly, they may still only be a small part of the total solution. The 
discussion of internal snapshot is at the end of this email.

Compared with a physical disk, a virtual disk (even RAW) incurs some or 
all of the following overheads. Obviously, the way to achieve high 
performance is to eliminate or reduce these overheads.

Overhead at the image level:
I1: Data fragmentation caused by an image format.
I2: Overhead in reading an image format’s metadata from disk.
I3: Overhead in writing an image format’s metadata to disk.
I4: Inefficiency and complexity in the block driver implementation, e.g., 
waiting synchronously for reading or writing metadata, submitting I/O 
requests sequentially when they should be done concurrently, performing a 
flush unnecessarily, etc.

Overhead at the host file system level:
H1: Data fragmentation caused by a host file system.
H2: Overhead in reading a host file system’s metadata.
H3: Overhead in writing a host file system’s metadata.

Existing image formats by design do not address many of these issues, 
which is the reason why FVD was invented (
http://wiki.qemu.org/Features/FVD).  Let’s look at these issues one by 
one.

Regarding I1: Data fragmentation caused by an image format:
This problem exists in most image formats, as they insist on doing storage 
allocation a second time at the image level (including QCOW2, QED, 
VMDK, VDI, VHD, etc.), even if the host file system already does storage 
allocation. These image formats unnecessarily mix the function of storage 
allocation with the function of copy-on-write, i.e., they determine 
whether a cluster is dirty by checking whether it has storage space 
allocated at the image level. This is wrong. Storage allocation and 
tracking dirty clusters are two separate functions. Data fragmentation at 
the image level can be totally avoided by using a RAW image plus a bitmap 
header to indicate whether clusters are dirty due to copy-on-write. FVD 
can be configured to take this approach, although it can also be 
configured to do storage allocation.  Doing storage allocation at the 
image level can be optional, but should never be mandatory.
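
As a sketch of this separation (hypothetical layout: a raw-format
payload preceded by one copy-on-write bit per cluster):

#include <stdint.h>

#define CLUSTER_BITS 16    /* 64KB copy-on-write units */

/* A set bit means "this cluster has been written in this image"; a
 * clear bit means "read it from the base image". Data keeps the same
 * offsets as in a raw image, so the format adds no fragmentation. */
static inline int cluster_is_dirty(const uint8_t *bitmap, uint64_t offset)
{
    uint64_t c = offset >> CLUSTER_BITS;
    return bitmap[c >> 3] & (1u << (c & 7));
}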

Regarding I2: Overhead in reading an image format’s metadata from disk:
Obviously, the solution is to make the metadata small so that it can be 
cached entirely in memory. In this aspect, QCOW1/QCOW2/QED and 
VMDK-workstation-version are wrong, and VirtualBox VDI, Microsoft VHD, and 
VMDK-esx-server-version are right. With QCOW1/QCOW2/QED, for a 1TB virtual 
disk, the metadata size is at least 128MB. By contrast, with VDI, for a 
1TB virtual disk, the metadata size is only 4MB. The “wrong formats” all 
use a two-level lookup table to do storage allocation at a small 
granularity (e.g., 64KB), whereas the “right formats” all use a one-level 
lookup table to do storage allocation at a large granularity (1MB or 2MB). 
The one-level table is easier to implement. Note that VMware VMDK 
started wrong in VMware’s workstation version, and then was corrected to 
be right in the ESX server version, which is a good move. As virtual disks 
grow bigger, it is likely that the storage allocation unit will be 
increased in the future, e.g., to 10MB or even larger. In existing image 
formats, one limitation of using a large storage allocation unit is that 
it forces copy-on-write to be performed on a large cluster (e.g., 10MB in 
the future), which is sort of wrong. FVD gets the best of both worlds. It 
uses a one-level table to perform storage allocation at a large 
granularity, but uses a bitmap to track copy-on-write at a smaller 
granularity. For a 1TB virtual disk, this approach needs only 6MB 
metadata, slightly larger than VDI’s 4MB.
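
The numbers can be checked directly (assuming 8-byte table entries for
the two-level formats and 4-byte entries for the one-level ones):

QCOW2/QED, 64KB clusters: 1TB / 64KB = 16M entries x 8B = 128MB of tables
VDI, 1MB blocks:          1TB / 1MB  =  1M entries x 4B =   4MB table
FVD:                      1MB chunks ->  4MB one-level table
                        + 1 bit per 64KB = 16M bits     =   2MB bitmap
                                                        =  ~6MB total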

Regarding I3: Overhead in writing an image format’s metadata to disk:
This is where the “dirty bit” discussion fits, but FVD goes way beyond 
that to reduce metadata updates.  When an FVD image is fully optimized 
(e.g., the one-level lookup table is disabled and the base image is 
reduced to its minimum size), FVD has almost zero overhead in metadata 
update and the data layout is just like a RAW image. More specifically, 
metadata updates are skipped, delayed, batched, or merged as much as 
possible without compromising data integrity. First, even with 
cache=writethrough (i.e., O_DSYNC), all metadata updates are sequential 
writes to FVD’s journal, which can be merged into a single write by the 
host Linux kernel. Second, when cache!=writethrough, metadata updates are 
batched and sent to the journal on a flush, under memory pressure, or 
periodically, just like the page cache in the kernel. Third, FVD’s table 
can be (preferably) disabled and hence it incurs no update overhead. Even 
if the table is enabled, FVD’s chunk is much larger than QCOW2/QED’s 
cluster, and hence needs fewer updates. Finally, although QCOW2/QED and FVD 
use the same block/cluster size, FVD can be optimized to eliminate most 
bitmap updates with several techniques: A) Use resize2fs to reduce the 
base image to its minimum size (which is what a Cloud can do) so that most 
writes occur at locations beyond the size of the base image, without the 
need to update the bitmap; B) ‘qemu-img create’ can find zero-filled 
sectors in a sparse base image and preset the corresponding bits of the 
bitmap, which then requires no runtime update; and C) copy-on-read and 
prefetching do not update the bitmap, and once prefetching finishes, there 
is no need at all for FVD to read or write the bitmap. Again, when an 
FVD image is fully optimized (e.g., the table is disabled and the base 
image is reduced to its minimum size), FVD has almost zero overhead in 
metadata update and the data layout is just like a RAW image.
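
A rough sketch of the batching described above (invented names; the
real code is in the FVD patches):

#include <stdint.h>

struct pending_update {
    uint64_t off;   /* metadata location */
    uint64_t val;   /* new value */
};

static struct pending_update queue[256];
static int queued;

/* Hypothetical: append all queued records to the journal with one
 * sequential write, then reset the queue. */
static void journal_commit(void)
{
    queued = 0;
}

/* Updates accumulate in memory and are committed on a flush, under
 * memory pressure, or periodically -- not one write per update. */
static void queue_metadata_update(uint64_t off, uint64_t val)
{
    queue[queued].off = off;
    queue[queued].val = val;
    if (++queued == 256)
        journal_commit();
}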

Regarding I4: Inefficiency in the block driver, e.g., synchronous metadata 
read/write:
Today, FVD is the only fully asynchronous, nonblocking COW driver 
implemented for QEMU, and has the best performance. This is partially due 
to its simple design. The one-level table is easier to implement than a 
two-level table. The journal avoids sophisticated locking that would 
otherwise be required for performing metadata updates. FVD parallelizes 
I/Os to the maximum degree possible. For example, if processing a 
VM-generated read request needs to read data from the base image as well 
as several non-contiguous chunks in the FVD image, FVD issues all I/O 
requests in parallel rather than sequentially.
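
Schematically, the splitting works like this (hypothetical callback
scheme, not the actual QEMU block layer API):

#include <stddef.h>

struct req {
    int pending;                  /* pieces still in flight */
    void (*done)(struct req *r);  /* completion callback */
};

/* Each piece (a base-image read, reads of individual chunks) is
 * submitted immediately and completes independently; the guest request
 * finishes when the last piece does. */
static void piece_done(struct req *r)
{
    if (--r->pending == 0)
        r->done(r);
}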

Regarding H1&H2&H3: fragmentation and metadata read/write overhead caused 
by the host file system:
FVD can be optionally configured to get rid of the host file system and 
store an image on a logical volume directly. This seems straightforward 
but a naïve solution like that currently in QCOW2 would not be able to 
achieve storage thin provisioning (i.e., storage over-commit), as the 
logical volume would initially need to be allocated at the full size of the 
image. FVD supports thin provisioning on a logical volume, by starting 
with a small one and growing it automatically when needed. It is quite 
easy for FVD to track the size of used space, without the need to update a 
size field in the image header on every storage allocation (which is a 
problem in VDI). There are multiple efficient solutions possible in FVD. 
One solution is to piggyback the size field as part of the journal entry 
that records a new storage allocation. Alternatively, even doing an 
‘fsck’-like scan on FVD’s one-level lookup table to figure out the used space is 
trivial. Because the table is only 4MB for a 1TB virtual disk and it is 
contiguous in the image, a scan takes only about 20 milliseconds: 15 
milliseconds to load 4MB from disk and less than 5 milliseconds to scan 
4MB in memory. This is more efficient than a dirty bit in QCOW2 or QED.
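
The scan itself is trivial; for example (assumed entry format: 4-byte
entries holding allocated chunk numbers, 0 meaning unallocated):

#include <stdint.h>

/* For a 1TB image with 1MB chunks the table has 1M entries (4MB); the
 * highest allocated chunk number bounds the used space. */
static uint64_t used_space(const uint32_t *table, int entries)
{
    uint32_t max = 0;
    for (int i = 0; i < entries; i++)
        if (table[i] > max)
            max = table[i];
    return ((uint64_t)max + 1) << 20;   /* 1MB chunks, numbered from 0 */
}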

In summary, it seems that people’s imagination for QCOW3 is unfortunately 
limited by the overwhelming experience from QCOW2, without even looking at 
what VirtualBox VDI, VMware VMDK, and Microsoft VHD have done, not to 
mention going beyond all those to ascend to the next level. Regardless of 
its name, I hope QCOW3 will take the right actions to fix wrong things in 
QCOW2, including:

A1: abandon the two-level table and adopt a one-level table, as in VDI, 
VMDK, and VHD, for simplicity and much smaller metadata size.

A2: introduce a bitmap to allow copy-on-write without doing storage 
allocation, which 1) avoids image-level fragmentation, 2) eliminates 
metadata update overhead for storage allocation, and 3) allows copy-on-write 
to be performed on a smaller storage unit (64KB) while still having a very 
small metadata size.

A3: introduce a journal to batch and merge metadata updates and to reduce 
fsck recovery time after a host crash.
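
Putting A1-A3 together, the header of such a format could look roughly
like this (invented field layout, neither the FVD nor the qcow2 format):

#include <stdint.h>

struct cow_header {
    uint32_t magic;
    uint32_t version;
    uint64_t disk_size;
    uint64_t table_offset;     /* A1: one-level lookup table */
    uint32_t table_entries;
    uint64_t bitmap_offset;    /* A2: per-64KB copy-on-write bitmap */
    uint64_t bitmap_size;
    uint64_t journal_offset;   /* A3: metadata journal */
    uint64_t journal_size;
};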

This is exactly the process by which I arrived at the design of FVD. It is not 
by chance, but instead by taking a holistic approach to analyze problems 
in a virtual disk. I think the status of “QCOW3” today is comparable to 
FVD’s status 10 months ago when the design started to emerge, but FVD’s 
implementation today is very mature. It is the only asynchronous, 
nonblocking COW driver implemented for QEMU with undoubtedly the best 
performance, both by design and by implementation. 

Now let’s talk about features. It seems that there is great interest in 
QCOW2’s internal snapshot feature. If we really want to do that, the right 
solution is to follow VMDK’s approach of storing each snapshot as a 
separate COW file (see http://www.vmware.com/app/vmdk/?src=vmdk ), rather 
than using the reference count table. VMDK’s approach can be easily 
implemented for any COW format, or even as a function of the generic block 
layer, without complicating any COW format or hurting its performance. I 
know the snapshots are then not really “internal”, since they are not stored 
in a single file but are more like external snapshots; however, users don’t 
care about that so long as the same use cases are supported. Probably many people who use 
VMware don't even know that the snapshots are stored as separate files. Do 
they care?
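
For reference, this is essentially what backing files already give us
today (standard qemu-img usage; file names are examples):

qemu-img create -f qcow2 base.qcow2 10G
# take a "snapshot": freeze base.qcow2 and write into a new overlay
qemu-img create -f qcow2 -b base.qcow2 snap1.qcow2
# take another one on top
qemu-img create -f qcow2 -b snap1.qcow2 current.qcow2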

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-22 18:18                   ` Anthony Liguori
@ 2011-02-23  9:13                     ` Kevin Wolf
  2011-02-23 14:21                       ` Anthony Liguori
  0 siblings, 1 reply; 87+ messages in thread
From: Kevin Wolf @ 2011-02-23  9:13 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, Markus Armbruster, Stefan Hajnoczi, qemu-devel

Am 22.02.2011 19:18, schrieb Anthony Liguori:
> On 02/22/2011 10:15 AM, Kevin Wolf wrote:
>> Am 22.02.2011 16:57, schrieb Anthony Liguori:
>>    
>>> On 02/22/2011 02:56 AM, Kevin Wolf wrote:
>>>      
>>>> *sigh*
>>>>
>>>> It starts to get annoying, but if you really insist, I can repeat it
>>>> once more: These features that you don't need (this is the correct
>>>> description for what you call "misfeatures") _are_ implemented in a way
>>>> that they don't impact the "normal" case.
>>>>        
>>> Except that they require a refcount table that adds additional metadata
>>> that needs to be updated in the fast path.  I consider that impacting
>>> the normal case.
>>>      
>> Like it or not, this requirement exists anyway, without any of your
>> "misfeatures".
>>
>> You chose to use the dirty flag in QED in order to avoid having to flush
>> metadata too often, which is an approach that any other format, even one
>> using refcounts, can take as well.
>>    
> 
> It's a minor detail, but flushing and the amount of metadata are 
> separate points.

I agree that they are separate...

> 
> The dirty flag prevents metadata from being flushed to disk very often 
> but the use of a refcount table adds additional metadata.
> 
> A refcount table is definitely not required even if you claim the 
> requirement exists for other features.  I assume you mean to implement 
> trim/discard support but instead of a refcount table, a free list would 
> work just as well and would leave the metadata update out of the fast 
> path (allocating writes) and instead only be in the slow path 
> (trim/discard).

...but here you're arguing about writing metadata out in the fast path,
so you're actually not interested in the amount of metadata but in the
overhead of flushing it. Which is a problem that's solved.

A refcount table is essential for internal snapshots and compression;
it's useful for discard and for running on block devices; and it's
necessary for avoiding the dirty flag and fsck on startup.

These are five use cases that I can enumerate without thinking a lot
about it; there might be more. You propose using three different
mechanisms for normal allocations (use the file size), block devices
(add a size field to the header), and discard (a free list); and the
other three features, for which you can't think of a hack, you declare
"misfeatures".

I don't think what you're proposing is a satisfactory solution. In my
book, a single data structure that can provide all of the features is
better than a bunch of independent hacks that allow only half of them.

> As a format feature, a refcount table really only makes sense if the 
> refcount is required to be greater than a single bit.  There are more 
> optimal data structures that can be used if the refcount of a block is 
> fixed to 1-bit (like a free list) which is what the fundamental design 
> difference between qcow2 and qed is.

Okay, so even assuming that there's something like misfeatures that we
can kick out (with which I strongly disagree), what's the crucial
advantage of free lists that would make you switch the image format?

That you only access it in the slow path (discard) isn't true, because
you certainly want to reallocate freed clusters. Otherwise you could
just leak them without maintaining a list of leaked clusters...
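In sketch form (my illustration, not QED's code), the point is that a
free list is touched by allocating writes too, once freed clusters are
reused:

    #include <stddef.h>
    #include <stdint.h>

    struct free_list {
        uint64_t *clusters;   /* offsets of freed clusters            */
        size_t    len;        /* capacity management elided           */
        uint64_t  file_end;   /* grow the file when the list is empty */
    };

    static uint64_t alloc_cluster(struct free_list *fl, uint64_t cluster_size)
    {
        if (fl->len > 0) {
            return fl->clusters[--fl->len];   /* fast path reads the list */
        }
        uint64_t off = fl->file_end;          /* otherwise append          */
        fl->file_end += cluster_size;
        return off;
    }

    static void discard_cluster(struct free_list *fl, uint64_t off)
    {
        fl->clusters[fl->len++] = off;        /* slow path (trim/discard)  */
    }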

> The only use of a refcount of more than 1-bit is internal snapshots AFAICT.

Of the currently implemented features, internal snapshots and compression.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23  3:32               ` Chunqiang Tang
@ 2011-02-23 13:20                 ` Markus Armbruster
  0 siblings, 0 replies; 87+ messages in thread
From: Markus Armbruster @ 2011-02-23 13:20 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Aurelien Jarno

Chunqiang Tang <ctang@us.ibm.com> writes:

[...]
> Now let’s talk about features. It seems that there is great interest in 
> QCOW2’s internal snapshot feature. If we really want to do that, the right 

Great interest?  Its use cases are demo, debugging, testing and such.
Kind of useful for developers, but I wouldn't want to use it in anger.
Nice to have if we can get it cheaply, but I'm not prepared to pay much
for it in performance or complexity, and I doubt I'm the only one.

Users always say "yes" when you ask them whether they need some feature.
Hence, the question is useless.  A better question to ask is "how much
are you willing to pay for it?"

> solution is to follow VMDK’s approach of storing each snapshot as a 
> separate COW file (see http://www.vmware.com/app/vmdk/?src=vmdk ), rather 
> than using the reference count table. VMDK’s approach can be easily 
> implemented for any COW format, or even as a function of the generic block 
> layer, without complicating any COW format or hurting its performance. I 
> know the snapshots are not really “internal”, in the sense of being 
> stored in a single file, and are more like external snapshots, but users 
> don’t care about that so long as the same use cases are supported. 
> Probably many people who use VMware don't even know that the snapshots 
> are stored as separate files. Do they care?

I certainly wouldn't.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-22  8:56             ` Kevin Wolf
  2011-02-22 10:21               ` Markus Armbruster
  2011-02-22 15:57               ` Anthony Liguori
@ 2011-02-23 13:43               ` Avi Kivity
  2011-02-23 14:23                 ` Anthony Liguori
  2 siblings, 1 reply; 87+ messages in thread
From: Avi Kivity @ 2011-02-23 13:43 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, Markus Armbruster, Stefan Hajnoczi, qemu-devel

On 02/22/2011 10:56 AM, Kevin Wolf wrote:
> *sigh*
>
> It starts to get annoying, but if you really insist, I can repeat it
> once more: These features that you don't need (this is the correct
> description for what you call "misfeatures") _are_ implemented in a way
> that they don't impact the "normal" case. And they are it today.
>

Plus, encryption and snapshots can be implemented in a way that doesn't 
impact performance more than is reasonable.  Compression perhaps not, 
but if you choose compression, then performance is not your top 
consideration.  That's the case with filesystems that support 
compression as well.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23  9:13                     ` Kevin Wolf
@ 2011-02-23 14:21                       ` Anthony Liguori
  2011-02-23 14:55                         ` Kevin Wolf
  0 siblings, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-23 14:21 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Chunqiang Tang, qemu-devel, Markus Armbruster, Stefan Hajnoczi

On 02/23/2011 03:13 AM, Kevin Wolf wrote:
> Am 22.02.2011 19:18, schrieb Anthony Liguori:
>    
>> On 02/22/2011 10:15 AM, Kevin Wolf wrote:
>>      
>>> Am 22.02.2011 16:57, schrieb Anthony Liguori:
>>>
>>>        
>>>> On 02/22/2011 02:56 AM, Kevin Wolf wrote:
>>>>
>>>>          
>>>>> *sigh*
>>>>>
>>>>> It starts to get annoying, but if you really insist, I can repeat it
>>>>> once more: These features that you don't need (this is the correct
>>>>> description for what you call "misfeatures") _are_ implemented in a way
>>>>> that they don't impact the "normal" case.
>>>>>
>>>>>            
>>>> Except that they require a refcount table that adds additional metadata
>>>> that needs to be updated in the fast path.  I consider that impacting
>>>> the normal case.
>>>>
>>>>          
>>> Like it or not, this requirement exists anyway, without any of your
>>> "misfeatures".
>>>
>>> You chose to use the dirty flag in QED in order to avoid having to flush
>>> metadata too often, which is an approach that any other format, even one
>>> using refcounts, can take as well.
>>>
>>>        
>> It's a minor detail, but flushing and the amount of metadata are
>> separate points.
>>      
> I agree that they are separate...
>
>    
>> The dirty flag prevents metadata from being flushed to disk very often
>> but the use of a refcount table adds additional metadata.
>>
>> A refcount table is definitely not required even if you claim the
>> requirement exists for other features.  I assume you mean to implement
>> trim/discard support but instead of a refcount table, a free list would
>> work just as well and would leave the metadata update out of the fast
>> path (allocating writes) and instead only be in the slow path
>> (trim/discard).
>>      
> ...but here you're arguing about writing metadata out in the fast path,
> so you're actually not interested in the amount of metadata but in the
> overhead of flushing it. Which is a problem that's solved.
>    

I'm interested in both.  An extra write is always going to be an extra 
write.  The flush just makes it very painful.

> A refcount table is essential for internal snapshots and compression;
> it's useful for discard and for running on block devices; and it's
> necessary for avoiding the dirty flag and fsck on startup.
>    

No, as designed today, qcow2 still needs a dirty flag to avoid leaking 
blocks.

> These are five use cases that I can enumerate without thinking a lot
> about it; there might be more. You propose using three different
> mechanisms for normal allocations (use the file size), block devices
> (add a size field to the header), and discard (a free list); and the
> other three features, for which you can't think of a hack, you declare
> "misfeatures".
>    

No, I only label compression and internal snapshots as misfeatures.  
Encryption is a completely reasonable feature.

So even with qcow3, what's the expectation of snapshots?  Are we going 
to scale to images with over 1000 snapshots?  I believe snapshot support 
in qcow2 is not a feature that has been designed with any serious 
thought.  If we truly want to support internal snapshots, let's design 
it correctly.

>> As a format feature, a refcount table really only makes sense if the
>> refcount is required to be greater than a single bit.  There are more
>> optimal data structures that can be used if the refcount of a block is
>> fixed to 1-bit (like a free list) which is what the fundamental design
>> difference between qcow2 and qed is.
>>      
> Okay, so even assuming that there's something like misfeatures that we
> can kick out (with which I strongly disagree), what's the crucial
> advantage of free lists that would make you switch the image format?
>    

Performance.  One thing we haven't tested with qcow2 is O_SYNC 
performance in the guest but my suspicion is that an O_SYNC workload is 
going to perform poorly even with cache=none.

Starting with a simple format that we don't have to jump through 
tremendous hoops to get reasonable performance out of has a lot of virtues.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 13:43               ` Avi Kivity
@ 2011-02-23 14:23                 ` Anthony Liguori
  2011-02-23 14:38                   ` Kevin Wolf
  2011-02-23 15:23                   ` Avi Kivity
  0 siblings, 2 replies; 87+ messages in thread
From: Anthony Liguori @ 2011-02-23 14:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Markus Armbruster,
	Stefan Hajnoczi

On 02/23/2011 07:43 AM, Avi Kivity wrote:
> On 02/22/2011 10:56 AM, Kevin Wolf wrote:
>> *sigh*
>>
>> It starts to get annoying, but if you really insist, I can repeat it
>> once more: These features that you don't need (this is the correct
>> description for what you call "misfeatures") _are_ implemented in a way
>> that they don't impact the "normal" case. And they are it today.
>>
>
> Plus, encryption and snapshots can be implemented in a way that 
> doesn't impact performance more than is reasonable.

We're still missing the existence proof of this, but even assuming it 
existed, what about snapshots?  Are we okay having a feature in a 
prominent format that isn't going to meet users' expectations?

Is there any hope that an image with 100, 1000, or 10000 snapshots is 
going to have even reasonable performance in qcow2?

Regards,

Anthony Liguori

>   Compression perhaps not, but if you choose compression, then 
> performance is not your top consideration.  That's the case with 
> filesystems that support compression as well.
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 14:23                 ` Anthony Liguori
@ 2011-02-23 14:38                   ` Kevin Wolf
  2011-02-23 15:29                     ` Anthony Liguori
  2011-02-23 15:23                   ` Avi Kivity
  1 sibling, 1 reply; 87+ messages in thread
From: Kevin Wolf @ 2011-02-23 14:38 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, qemu-devel, Avi Kivity, Stefan Hajnoczi,
	Markus Armbruster

Am 23.02.2011 15:23, schrieb Anthony Liguori:
> On 02/23/2011 07:43 AM, Avi Kivity wrote:
>> On 02/22/2011 10:56 AM, Kevin Wolf wrote:
>>> *sigh*
>>>
>>> It starts to get annoying, but if you really insist, I can repeat it
>>> once more: These features that you don't need (this is the correct
>>> description for what you call "misfeatures") _are_ implemented in a way
>>> that they don't impact the "normal" case. And they are it today.
>>>
>>
>> Plus, encryption and snapshots can be implemented in a way that 
>> doesn't impact performance more than is reasonable.
> 
> We're still missing the existence proof of this, but even assuming it 

Define "reasonable". I sent you some numbers not too long for
encryption, and I consider them reasonable (iirc, between 25% and 40%
slower than without encryption).

> existed, what about snapshots?  Are we okay having a feature in a 
> prominent format that isn't going to meet users' expectations?
> 
> Is there any hope that an image with 100, 1000, or 10000 snapshots is 
> going to have even reasonable performance in qcow2?

Is there any hope for backing file chains of 1000 files or more? I
haven't tried it out, but in theory I'd expect that internal snapshots
could cope better with it than external ones because internal snapshots
don't have to go through the whole chain all the time.

What are the points where you think that performance of internal
snapshots suffers?

The argument that I would understand is that internal snapshots are
probably not as handy in all scenarios.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 14:21                       ` Anthony Liguori
@ 2011-02-23 14:55                         ` Kevin Wolf
  0 siblings, 0 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-02-23 14:55 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, qemu-devel, Markus Armbruster, Stefan Hajnoczi

Am 23.02.2011 15:21, schrieb Anthony Liguori:
> On 02/23/2011 03:13 AM, Kevin Wolf wrote:
>> Am 22.02.2011 19:18, schrieb Anthony Liguori:
>>    
>>> On 02/22/2011 10:15 AM, Kevin Wolf wrote:
>>>      
>>>> Am 22.02.2011 16:57, schrieb Anthony Liguori:
>>>>
>>>>        
>>>>> On 02/22/2011 02:56 AM, Kevin Wolf wrote:
>>>>>
>>>>>          
>>>>>> *sigh*
>>>>>>
>>>>>> It starts to get annoying, but if you really insist, I can repeat it
>>>>>> once more: These features that you don't need (this is the correct
>>>>>> description for what you call "misfeatures") _are_ implemented in a way
>>>>>> that they don't impact the "normal" case.
>>>>>>
>>>>>>            
>>>>> Except that they require a refcount table that adds additional metadata
>>>>> that needs to be updated in the fast path.  I consider that impacting
>>>>> the normal case.
>>>>>
>>>>>          
>>>> Like it or not, this requirement exists anyway, without any of your
>>>> "misfeatures".
>>>>
>>>> You chose to use the dirty flag in QED in order to avoid having to flush
>>>> metadata too often, which is an approach that any other format, even one
>>>> using refcounts, can take as well.
>>>>
>>>>        
>>> It's a minor detail, but flushing and the amount of metadata are
>>> separate points.
>>>      
>> I agree that they are separate...
>>
>>    
>>> The dirty flag prevents metadata from being flushed to disk very often
>>> but the use of a refcount table adds additional metadata.
>>>
>>> A refcount table is definitely not required even if you claim the
>>> requirement exists for other features.  I assume you mean to implement
>>> trim/discard support but instead of a refcount table, a free list would
>>> work just as well and would leave the metadata update out of the fast
>>> path (allocating writes) and instead only be in the slow path
>>> (trim/discard).
>>>      
>> ...but here you're arguing about writing metadata out in the fast path,
>> so you're actually not interested in the amount of metadata but in the
>> overhead of flushing it. Which is a problem that's solved.
>>    
> 
> I'm interested in both.  An extra write is always going to be an extra 
> write.  The flush just makes it very painful.

One extra write of 64k every 2 GB. Hardly relevant.
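The arithmetic behind that figure, assuming qcow2's defaults of 64KB
clusters and 16-bit refcount entries: one 64KB refcount block holds
65536 / 2 = 32768 entries, and 32768 clusters x 64KB = 2GB, so a fresh
refcount block (one extra 64KB write) is needed only once per 2GB of
allocated data.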

>> A refcount table is essential for internal snapshots and compression;
>> it's useful for discard and for running on block devices; and it's
>> necessary for avoiding the dirty flag and fsck on startup.
>>    
> 
> No, as designed today, qcow2 still needs a dirty flag to avoid leaking 
> blocks.

I know that this is your opinion and I do respect it; this is one of
the reasons why there is the suggestion to add the dirty flag for you.

On the other hand, it would be about time for you to accept that there
are people who think differently about it and who don't want the same as
you. This is why using the dirty flag should be optional.
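For readers following the thread, a minimal sketch of the dirty-flag
protocol under discussion (my illustration; all function names are
hypothetical):

    #include <stdbool.h>

    struct header { bool dirty; };   /* one bit persisted in the image header */

    /* Hypothetical helpers, for illustration only. */
    void rebuild_metadata(void);
    void flush_header(struct header *h);
    void sync_metadata(void);

    /* On open: if the image wasn't closed cleanly, rebuild whatever
     * metadata was allowed to go stale (the "fsck on startup" case). */
    void image_open(struct header *h)
    {
        if (h->dirty) {
            rebuild_metadata();
        }
    }

    /* Before the first allocating write: one header flush, after which
     * later metadata updates need not be flushed eagerly. */
    void first_allocating_write(struct header *h)
    {
        if (!h->dirty) {
            h->dirty = true;
            flush_header(h);
        }
    }

    /* On clean shutdown: sync metadata, then clear the flag. */
    void image_close(struct header *h)
    {
        sync_metadata();
        h->dirty = false;
        flush_header(h);
    }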

>> These are five use cases that I can enumerate without thinking a lot
>> about it; there might be more. You propose using three different
>> mechanisms for normal allocations (use the file size), block devices
>> (add a size field to the header), and discard (a free list); and the
>> other three features, for which you can't think of a hack, you declare
>> "misfeatures".
>>    
> 
> No, I only label compression and internal snapshots as misfeatures.  
> Encryption is a completely reasonable feature.

I didn't even mention encryption. It's obvious that it's a "reasonable
feature" and not a "misfeature", because it fits relatively easily into
your QED design. :-)

The three features you don't like because they don't fit are
compression, internal snapshots, and not having to fsck (thanks for
proving the latter above).

> So even with qcow3, what's the expectation of snapshots?  Are we going 
> to scale to images with over 1000 snapshots?  I believe snapshot support 
> in qcow2 is not a feature that has been designed with any serious 
> thought.  If we truly want to support internal snapshots, let's design 
> it correctly.

So what would be the key differences between your design and qcow2's? We
can always check if there's room to improve.

>>> As a format feature, a refcount table really only makes sense if the
>>> refcount is required to be greater than a single bit.  There are more
>>> optimal data structures that can be used if the refcount of a block is
>>> fixed to 1-bit (like a free list) which is what the fundamental design
>>> difference between qcow2 and qed is.
>>>      
>> Okay, so even assuming that there's something like misfeatures that we
>> can kick out (with which I strongly disagree), what's the crucial
>> advantage of free lists that would make you switch the image format?
> 
> Performance.  One thing we haven't tested with qcow2 is O_SYNC 
> performance in the guest but my suspicion is that an O_SYNC workload is 
> going to perform poorly even with cache=none.

But wasn't it you who wanted to use the dirty flag in any case? The
refcounts aren't even written then.

> Starting with a simple format that we don't have to jump through 
> tremendous hoops to get reasonable performance out of has a lot of virtues.

I know that you don't mean it like I read this, but it's entirely true:
You're _starting_ with a simple format, but once you add features you're
going to get something much more complex than qcow2 because you just
don't have proper cluster allocation infrastructure and need to invent
new hacks every time.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 14:23                 ` Anthony Liguori
  2011-02-23 14:38                   ` Kevin Wolf
@ 2011-02-23 15:23                   ` Avi Kivity
  2011-02-23 15:31                     ` Anthony Liguori
  2011-02-23 15:33                     ` Daniel P. Berrange
  1 sibling, 2 replies; 87+ messages in thread
From: Avi Kivity @ 2011-02-23 15:23 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Markus Armbruster,
	Stefan Hajnoczi

On 02/23/2011 04:23 PM, Anthony Liguori wrote:
> On 02/23/2011 07:43 AM, Avi Kivity wrote:
>> On 02/22/2011 10:56 AM, Kevin Wolf wrote:
>>> *sigh*
>>>
>>> It starts to get annoying, but if you really insist, I can repeat it
>>> once more: These features that you don't need (this is the correct
>>> description for what you call "misfeatures") _are_ implemented in a way
>>> that they don't impact the "normal" case. And they are it today.
>>>
>>
>> Plus, encryption and snapshots can be implemented in a way that 
>> doesn't impact performance more than is reasonable.
>
> We're still missing the existence proof of this, but even assuming it 
> existed,

dm-crypt isn't any more complicated, and it's used by default in most 
distributions these days.

> what about snapshots?  Are we okay having a feature in a prominent 
> format that isn't going to meet users' expectations?
>
> Is there any hope that an image with 100, 1000, or 10000 snapshots is 
> going to have even reasonable performance in qcow2?
>

Are thousands of snapshots for a single image a reasonable user 
expectation?  What's the use case?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 14:38                   ` Kevin Wolf
@ 2011-02-23 15:29                     ` Anthony Liguori
  2011-02-23 15:36                       ` Avi Kivity
  2011-02-23 15:54                       ` Kevin Wolf
  0 siblings, 2 replies; 87+ messages in thread
From: Anthony Liguori @ 2011-02-23 15:29 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Chunqiang Tang, Markus Armbruster, qemu-devel, Stefan Hajnoczi,
	Avi Kivity

On 02/23/2011 08:38 AM, Kevin Wolf wrote:
> Am 23.02.2011 15:23, schrieb Anthony Liguori:
>    
>> On 02/23/2011 07:43 AM, Avi Kivity wrote:
>>      
>>> On 02/22/2011 10:56 AM, Kevin Wolf wrote:
>>>        
>>>> *sigh*
>>>>
>>>> It starts to get annoying, but if you really insist, I can repeat it
>>>> once more: These features that you don't need (this is the correct
>>>> description for what you call "misfeatures") _are_ implemented in a way
>>>> that they don't impact the "normal" case. And they are it today.
>>>>
>>>>          
>>> Plus, encryption and snapshots can be implemented in a way that
>>> doesn't impact performance more than is reasonable.
>>>        
>> We're still missing the existence proof of this, but even assuming it
>>      
> Define "reasonable". I sent you some numbers not too long ago for
> encryption, and I consider them reasonable (iirc, between 25% and 40%
> slower than without encryption).
>    

I was really referring to snapshots.  I have absolutely no doubt that 
encryption can be implemented with a reasonable performance overhead.

>> existed, what about snapshots?  Are we okay having a feature in a
>> prominent format that isn't going to meet users' expectations?
>>
>> Is there any hope that an image with 100, 1000, or 10000 snapshots is
>> going to have even reasonable performance in qcow2?
>>      
> Is there any hope for backing file chains of 1000 files or more? I
> haven't tried it out, but in theory I'd expect that internal snapshots
> could cope better with it than external ones because internal snapshots
> don't have to go through the whole chain all the time.
>    

I don't think there's a user expectation of backing file chains of 1000 
files performing well.  However, I've talked to a number of customers 
that have been interested in using internal snapshots for checkpointing, 
which would involve a large number of snapshots.

In fact, Fabrice originally added qcow2 because he was interested in 
doing reverse debugging.  The idea of internal snapshots was to store a 
high number of checkpoints to allow reverse debugging to be optimized.

I think the way snapshot metadata is stored makes this unrealistic, 
since snapshots are stored in more or less a linear array.  I think to really 
support a high number of snapshots, you'd want to store a hash with each 
block that contained a refcount > 1.  I think you quickly end up 
reinventing btrfs though in the process.

Regards,

Anthony Liguori

> What are the points where you think that performance of internal
> snapshots suffers?
>
> The argument that I would understand is that internal snapshots are
> probably not as handy in all scenarios.
>
> Kevin
>
>    

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:23                   ` Avi Kivity
@ 2011-02-23 15:31                     ` Anthony Liguori
  2011-02-23 15:37                       ` Avi Kivity
  2011-02-23 15:33                     ` Daniel P. Berrange
  1 sibling, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-23 15:31 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Stefan Hajnoczi,
	Markus Armbruster

On 02/23/2011 09:23 AM, Avi Kivity wrote:
> On 02/23/2011 04:23 PM, Anthony Liguori wrote:
>> On 02/23/2011 07:43 AM, Avi Kivity wrote:
>>> On 02/22/2011 10:56 AM, Kevin Wolf wrote:
>>>> *sigh*
>>>>
>>>> It starts to get annoying, but if you really insist, I can repeat it
>>>> once more: These features that you don't need (this is the correct
>>>> description for what you call "misfeatures") _are_ implemented in a 
>>>> way
>>>> that they don't impact the "normal" case. And they are it today.
>>>>
>>>
>>> Plus, encryption and snapshots can be implemented in a way that 
>>> doesn't impact performance more than is reasonable.
>>
>> We're still missing the existence proof of this, but even assuming it 
>> existed,
>
> dm-crypt isn't any more complicated, and it's used by default in most 
> distributions these days.
>
>> what about snapshots?  Are we okay having a feature in a prominent 
>> format that isn't going to meet users' expectations?
>>
>> Is there any hope that an image with 100, 1000, or 10000 snapshots 
>> is going to have even reasonable performance in qcow2?
>>
>
> Are thousands of snapshots for a single image a reasonable user 
> expectation?  What's the use case?

Checkpointing.  It was the original use-case that led to qcow2 being 
invented.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:23                   ` Avi Kivity
  2011-02-23 15:31                     ` Anthony Liguori
@ 2011-02-23 15:33                     ` Daniel P. Berrange
  2011-02-23 15:38                       ` Avi Kivity
  1 sibling, 1 reply; 87+ messages in thread
From: Daniel P. Berrange @ 2011-02-23 15:33 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Markus Armbruster,
	Chunqiang Tang

On Wed, Feb 23, 2011 at 05:23:33PM +0200, Avi Kivity wrote:
> On 02/23/2011 04:23 PM, Anthony Liguori wrote:
> >On 02/23/2011 07:43 AM, Avi Kivity wrote:
> >>On 02/22/2011 10:56 AM, Kevin Wolf wrote:
> >>>*sigh*
> >>>
> >>>It starts to get annoying, but if you really insist, I can repeat it
> >>>once more: These features that you don't need (this is the correct
> >>>description for what you call "misfeatures") _are_ implemented in a way
> >>>that they don't impact the "normal" case. And they are it today.
> >>>
> >>
> >>Plus, encryption and snapshots can be implemented in a way that
> >>doesn't impact performance more than is reasonable.
> >
> >We're still missing the existence proof of this, but even assuming
> >it existed,
> 
> dm-crypt isn't any more complicated, and it's used by default in
> most distributions these days.

IMHO dm-crypt isn't a generally usable alternative to native built-in
encryption in qcow2. It isn't usable at all by non-root. If you want
to use it with plain files, then you need to turn the file into a
loopback device and then layer dm-crypt on top. It is generally just
a PITA to manage.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:29                     ` Anthony Liguori
@ 2011-02-23 15:36                       ` Avi Kivity
  2011-02-23 15:47                         ` Anthony Liguori
  2011-02-23 15:54                       ` Kevin Wolf
  1 sibling, 1 reply; 87+ messages in thread
From: Avi Kivity @ 2011-02-23 15:36 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Markus Armbruster,
	Stefan Hajnoczi

On 02/23/2011 05:29 PM, Anthony Liguori wrote:
>
>>> existed, what about snapshots?  Are we okay having a feature in a
>>> prominent format that isn't going to meet users' expectations?
>>>
>>> Is there any hope that an image with 100, 1000, or 10000 snapshots is
>>> going to have even reasonable performance in qcow2?
>> Is there any hope for backing file chains of 1000 files or more? I
>> haven't tried it out, but in theory I'd expect that internal snapshots
>> could cope better with it than external ones because internal snapshots
>> don't have to go through the whole chain all the time.
>
> I don't think there's a user expectation of backing file chains of 
> 1000 files performing well.  However, I've talked to a number of 
> customers that have been interested in using internal snapshots for 
> checkpointing, which would involve a large number of snapshots.
>
> In fact, Fabrice originally added qcow2 because he was interested in 
> doing reverse debugging.  The idea of internal snapshots was to store 
> a high number of checkpoints to allow reverse debugging to be optimized.

I don't see how that works, since the memory image is duplicated for 
each snapshot.  So thousands of snapshots = terabytes of storage, and 
hours of creating the snapshots.

Migrate-to-file with block live migration, or even better, something 
based on Kemari would be a lot faster.

>
> I think the way snapshot metadata is stored makes this unrealistic, 
> since snapshots are stored in more or less a linear array.  I think to 
> really support a high number of snapshots, you'd want to store a hash 
> with each block that contained a refcount > 1.  I think you quickly 
> end up reinventing btrfs though in the process.

Can you elaborate?  What's the problem with a linear array of snapshots 
(say up to 10,000 snapshots)?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:31                     ` Anthony Liguori
@ 2011-02-23 15:37                       ` Avi Kivity
  2011-02-23 15:50                         ` Anthony Liguori
  2011-02-23 15:52                         ` Anthony Liguori
  0 siblings, 2 replies; 87+ messages in thread
From: Avi Kivity @ 2011-02-23 15:37 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Stefan Hajnoczi,
	Markus Armbruster

On 02/23/2011 05:31 PM, Anthony Liguori wrote:
>>
>>> what about snapshots?  Are we okay having a feature in a prominent 
>>> format that isn't going to meet users' expectations?
>>>
>>> Is there any hope that an image with 100, 1000, or 10000 snapshots 
>>> is going to have even reasonable performance in qcow2?
>>>
>>
>> Are thousands of snapshots for a single image a reasonable user 
>> expectation?  What's the use case?
>
>
> Checkpointing.  It was the original use-case that led to qcow2 being 
> invented.

I still don't see.  What would you do with thousands of checkpoints?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:33                     ` Daniel P. Berrange
@ 2011-02-23 15:38                       ` Avi Kivity
  0 siblings, 0 replies; 87+ messages in thread
From: Avi Kivity @ 2011-02-23 15:38 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Markus Armbruster,
	Chunqiang Tang

On 02/23/2011 05:33 PM, Daniel P. Berrange wrote:
> On Wed, Feb 23, 2011 at 05:23:33PM +0200, Avi Kivity wrote:
> >  On 02/23/2011 04:23 PM, Anthony Liguori wrote:
> >  >On 02/23/2011 07:43 AM, Avi Kivity wrote:
> >  >>On 02/22/2011 10:56 AM, Kevin Wolf wrote:
> >  >>>*sigh*
> >  >>>
> >  >>>It starts to get annoying, but if you really insist, I can repeat it
> >  >>>once more: These features that you don't need (this is the correct
> >  >>>description for what you call "misfeatures") _are_ implemented in a way
> >  >>>that they don't impact the "normal" case. And they are it today.
> >  >>>
> >  >>
> >  >>Plus, encryption and snapshots can be implemented in a way that
> >  >>doesn't impact performance more than is reasonable.
> >  >
> >  >We're still missing the existence proof of this, but even assuming
> >  >it existed,
> >
> >  dm-crypt isn't any more complicated, and it's used by default in
> >  most distributions these days.
>
> IMHO dm-crypt isn't a generally usable alternative to native built-in
> encryption in qcow2. It isn't usable at all by non-root. If you want
> to use it with plain files, then you need to turn the file into a
> loopback device and then layer dm-crypt on top. It is generally just
> a PITA to manage.

I wasn't suggesting dm-crypt is a replacement for qcow2 encyption, just 
that it shows that block-level encryption can be done with reasonable 
overhead.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:36                       ` Avi Kivity
@ 2011-02-23 15:47                         ` Anthony Liguori
  2011-02-23 15:59                           ` Avi Kivity
  0 siblings, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-23 15:47 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Markus Armbruster,
	Stefan Hajnoczi

On 02/23/2011 09:36 AM, Avi Kivity wrote:
> On 02/23/2011 05:29 PM, Anthony Liguori wrote:
>>
>>>> existed, what about snapshots?  Are we okay having a feature in a
>>>> prominent format that isn't going to meet users' expectations?
>>>>
>>>> Is there any hope that an image with 100, 1000, or 10000 snapshots is
>>>> going to have even reasonable performance in qcow2?
>>> Is there any hope for backing file chains of 1000 files or more? I
>>> haven't tried it out, but in theory I'd expect that internal snapshots
>>> could cope better with it than external ones because internal snapshots
>>> don't have to go through the whole chain all the time.
>>
>> I don't think there's a user expectation of backing file chains of 
>> 1000 files performing well.  However, I've talked to a number of 
>> customers that have been interested in using internal snapshots for 
>> checkpointing which would involve a large number of snapshots.
>>
>> In fact, Fabrice originally added qcow2 because he was interested in 
>> doing reverse debugging.  The idea of internal snapshots was to store 
>> a high number of checkpoints to allow reverse debugging to be optimized.
>
> I don't see how that works, since the memory image is duplicated for 
> each snapshot.  So thousands of snapshots = terabytes of storage, and 
> hours of creating the snapshots.

Fabrice wanted to use CoW as a mechanism to deduplicate the memory 
contents with the on-disk state, specifically to address this problem.  
For the longest time, there was a comment in the savevm code along these 
lines.  It might still be there.

I think the lack of on-disk hashes was a critical missing bit to make 
this feature really work well.

> Migrate-to-file with block live migration, or even better, something 
> based on Kemari would be a lot faster.
>
>>
>> I think the way snapshot metadata is stored makes this not realistic 
>> since they're stored in more or less a linear array.  I think to 
>> really support a high number of snapshots, you'd want to store a hash 
>> with each block that contained a refcount > 1.  I think you quickly 
>> end up reinventing btrfs though in the process.
>
> Can you elaborate?  What's the problem with a linear array of 
> snapshots (say up to 10,000 snapshots)?

Lots of things.  The array will start to consume quite a bit of 
contiguous space as it gets larger, which means it needs to be 
relocated.  Deleting a snapshot is a far more expensive operation than 
it needs to be.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:37                       ` Avi Kivity
@ 2011-02-23 15:50                         ` Anthony Liguori
  2011-02-23 16:03                           ` Avi Kivity
  2011-02-23 15:52                         ` Anthony Liguori
  1 sibling, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-23 15:50 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Stefan Hajnoczi,
	Markus Armbruster

On 02/23/2011 09:37 AM, Avi Kivity wrote:
> On 02/23/2011 05:31 PM, Anthony Liguori wrote:
>>>
>>>> what about snapshots?  Are we okay having a feature in a prominent 
>>>> format that isn't going to meet users' expectations?
>>>>
>>>> Is there any hope that an image with 100, 1000, or 10000 snapshots 
>>>> is going to have even reasonable performance in qcow2?
>>>>
>>>
>>> Are thousands of snapshots for a single image a reasonable user 
>>> expectation?  What's the use case?
>>
>>
>> Checkpointing.  It was the original use-case that led to qcow2 being 
>> invented.
>
> I still don't see.  What would you do with thousands of checkpoints?

For reverse debugging, if you store checkpoints at a rate of, say, every 
10ms, and then degrade to storing every 100ms after 1 second, etc., 
you'll have quite a large number of snapshots pretty quickly.  The idea 
of snapshotting with reverse debugging is that instead of undoing every 
instruction, you can revert to the snapshot before, and then replay the 
instruction stream until you get to the desired point in time.

For disaster recovery, there are some workloads that you can meaningfully 
revert to a snapshot provided that the snapshot is stored at some 
frequency (like once a second).  Think of something like a 
webserver where the only accumulated data is logs.  Losing some of the 
logs is better than losing all of the logs.
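In sketch form (my illustration; the thresholds are the ones mentioned
above, and the policy itself is hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    /* As checkpoints age, keep only those aligned to a coarser grid.
     * Timestamps in milliseconds; checkpoints are taken every 10ms. */
    static bool keep_checkpoint(uint64_t ts_ms, uint64_t now_ms)
    {
        uint64_t age = now_ms - ts_ms;
        if (age < 1000)   return true;               /* every 10ms    */
        if (age < 10000)  return ts_ms % 100 == 0;   /* one per 100ms */
        if (age < 100000) return ts_ms % 1000 == 0;  /* one per 1s    */
        return ts_ms % 10000 == 0;                   /* one per 10s   */
    }

A background pass would delete every snapshot for which
keep_checkpoint() returns false.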

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:37                       ` Avi Kivity
  2011-02-23 15:50                         ` Anthony Liguori
@ 2011-02-23 15:52                         ` Anthony Liguori
  2011-02-23 15:59                           ` Gleb Natapov
  2011-02-23 16:00                           ` Avi Kivity
  1 sibling, 2 replies; 87+ messages in thread
From: Anthony Liguori @ 2011-02-23 15:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Stefan Hajnoczi,
	Markus Armbruster

On 02/23/2011 09:37 AM, Avi Kivity wrote:
> On 02/23/2011 05:31 PM, Anthony Liguori wrote:
>>>
>>>> what about snapshots?  Are we okay having a feature in a prominent 
>>>> format that isn't going to meet users' expectations?
>>>>
>>>> Is there any hope that an image with 100, 1000, or 10000 snapshots 
>>>> is going to have even reasonable performance in qcow2?
>>>>
>>>
>>> Are thousands of snapshots for a single image a reasonable user 
>>> expectation?  What's the use case?
>>
>>
>> Checkpointing.  It was the original use-case that led to qcow2 being 
>> invented.
>
> I still don't see.  What would you do with thousands of checkpoints?

Er, hit send too quickly.

HPC is a big space where checkpointing is actually useful.  An HPC 
workload may take weeks to run to completion.  If something fails during 
the run, it's a huge waste of time.  However, if you do regular 
checkpointing, a failure may only lose a few minutes of work instead of 
the entire week's worth of work.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:29                     ` Anthony Liguori
  2011-02-23 15:36                       ` Avi Kivity
@ 2011-02-23 15:54                       ` Kevin Wolf
  1 sibling, 0 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-02-23 15:54 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, Markus Armbruster, qemu-devel, Stefan Hajnoczi,
	Avi Kivity

Am 23.02.2011 16:29, schrieb Anthony Liguori:
> On 02/23/2011 08:38 AM, Kevin Wolf wrote:
>> Am 23.02.2011 15:23, schrieb Anthony Liguori:
>>    
>>> On 02/23/2011 07:43 AM, Avi Kivity wrote:
>>>      
>>>> On 02/22/2011 10:56 AM, Kevin Wolf wrote:
>>>>        
>>>>> *sigh*
>>>>>
>>>>> It starts to get annoying, but if you really insist, I can repeat it
>>>>> once more: These features that you don't need (this is the correct
>>>>> description for what you call "misfeatures") _are_ implemented in a way
>>>>> that they don't impact the "normal" case. And they are it today.
>>>>>
>>>>>          
>>>> Plus, encryption and snapshots can be implemented in a way that
>>>> doesn't impact performance more than is reasonable.
>>>>        
>>> We're still missing the existence proof of this, but even assuming it
>>>      
>> Define "reasonable". I sent you some numbers not too long ago for
>> encryption, and I consider them reasonable (iirc, between 25% and 40%
>> slower than without encryption).
>>    
> 
> I was really referring to snapshots.  I have absolutely no doubt that 
> encryption can be implemented with a reasonable performance overhead.

Alright. Last time you complained about things being too slow you were
explicitly referring to encryption, so sometimes it's hard for me to
follow you jumping from one topic to another.

>>> existed, what about snapshots?  Are we okay having a feature in a
>>> prominent format that isn't going to meet users' expectations?
>>>
>>> Is there any hope that an image with 100, 1000, or 10000 snapshots is
>>> going to have even reasonable performance in qcow2?
>>>      
>> Is there any hope for backing file chains of 1000 files or more? I
>> haven't tried it out, but in theory I'd expect that internal snapshots
>> could cope better with it than external ones because internal snapshots
>> don't have to go through the whole chain all the time.
> 
> I don't think there's a user expectation of backing file chains of 1000 
> files performing well.  However, I've talked to a number of customers 
> that have been interested in using internal snapshots for checkpointing, 
> which would involve a large number of snapshots.

So if there's no expectation that a chain of 1000 external snapshots
works fine, why is it a requirement for internal snapshots?

You might have a point if the external snapshots were actually not a
chain, but a snapshot tree with lots of branches, but checkpointing
means exactly creating a single chain.

That said, while I haven't tried it out, I don't see any theoretical
problems with using 1000 internal snapshots.

> In fact, Fabrice originally added qcow2 because he was interested in 
> doing reverse debugging.  The idea of internal snapshots was to store a 
> high number of checkpoints to allow reverse debugging to be optimized.
> 
> I think the way snapshot metadata is stored makes this unrealistic, 
> since snapshots are stored in more or less a linear array.  I think to really 
> support a high number of snapshots, you'd want to store a hash with each 
> block that contained a refcount > 1.  I think you quickly end up 
> reinventing btrfs though in the process.

I share Avi's problem here: I don't really understand what the problem
with a linear list of snapshots is.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:47                         ` Anthony Liguori
@ 2011-02-23 15:59                           ` Avi Kivity
  0 siblings, 0 replies; 87+ messages in thread
From: Avi Kivity @ 2011-02-23 15:59 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Markus Armbruster,
	Stefan Hajnoczi

On 02/23/2011 05:47 PM, Anthony Liguori wrote:
>> I don't see how that works, since the memory image is duplicated for 
>> each snapshot.  So thousands of snapshots = terabytes of storage, and 
>> hours of creating the snapshots.
>
>
> Fabrice wanted to use CoW as a mechanism to deduplicate the memory 
> contents with the on-disk state, specifically to address this problem.  
> For the longest time, there was a comment in the savevm code along 
> these lines.  It might still be there.
>
> I think the lack of on-disk hashes was a critical missing bit to make 
> this feature really work well.

So you have to use dirty logging to see which pages changed; otherwise 
you have to dedup all of them.  Still, I think migration/Kemari is a 
better fit for this.

>> Can you elaborate?  What's the problem with a linear array of 
>> snapshots (say up to 10,000 snapshots)?
>
> Lots of things.  The array will start to consume quite a bit of 
> contiguous space as it gets larger, which means it needs to be relocated.

If you double the space each time, it amortizes out.

A snapshot seems to be around 40 bytes.  So 10K snapshots = 400KB, 
hardly a huge amount (sans pointed-to data which doesn't need to move).

> Deleting a snapshot is a far more expensive operation than it needs to 
> be.
>

Move the last snapshot into the deleted entry?
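In sketch form (mine; qcow2 actually serializes its snapshot table
differently), both points together make the directory cheap even at
10,000 entries:

    #include <stdint.h>
    #include <stdlib.h>

    struct snapshot { char id[16]; uint64_t l1_offset; /* ... */ };

    struct snap_dir {
        struct snapshot *snaps;
        size_t len, cap;
    };

    /* Doubling the array gives amortized O(1) appends. */
    static int snap_add(struct snap_dir *d, const struct snapshot *s)
    {
        if (d->len == d->cap) {
            size_t ncap = d->cap ? d->cap * 2 : 16;
            struct snapshot *n = realloc(d->snaps, ncap * sizeof(*n));
            if (!n) {
                return -1;
            }
            d->snaps = n;
            d->cap = ncap;
        }
        d->snaps[d->len++] = *s;
        return 0;
    }

    /* O(1) delete: move the last entry into the hole (note this does
     * not preserve the order of the remaining entries). */
    static void snap_delete(struct snap_dir *d, size_t i)
    {
        d->snaps[i] = d->snaps[--d->len];
    }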

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:52                         ` Anthony Liguori
@ 2011-02-23 15:59                           ` Gleb Natapov
  2011-02-23 16:00                           ` Avi Kivity
  1 sibling, 0 replies; 87+ messages in thread
From: Gleb Natapov @ 2011-02-23 15:59 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, Markus Armbruster, qemu-devel,
	Avi Kivity, Chunqiang Tang

On Wed, Feb 23, 2011 at 09:52:02AM -0600, Anthony Liguori wrote:
> On 02/23/2011 09:37 AM, Avi Kivity wrote:
> >On 02/23/2011 05:31 PM, Anthony Liguori wrote:
> >>>
> >>>>what about snapshots?  Are we okay having a feature in a
> >>>>prominent format that isn't going to meet users'
> >>>>expectations?
> >>>>
> >>>>Is there any hope that an image with 100, 1000, or 10000
> >>>>snapshots is going to have even reasonable performance in
> >>>>qcow2?
> >>>>
> >>>
> >>>Are thousands of snapshots for a single image a reasonable
> >>>user expectation?  What's the use case?
> >>
> >>
> >>Checkpointing.  It was the original use-case that led to qcow2
> >>being invented.
> >
> >I still don't see.  What would you do with thousands of checkpoints?
> 
> Er, hit send too quickly.
> 
> HPC is a big space where checkpointing is actually useful.  An HPC
> workload may take weeks to run to completion.  If something fails
> during the run, it's a huge waste of time.  However, if you do
> regular checkpointing, a failure may only lose a few minutes of
> work instead of the entire week's worth of work.
> 
HPC workloads mostly run on clusters nowadays. Getting a consistent
distributed snapshot without messages in flight is not as simple as
snapshotting a bunch of VMs at a random time. Anyway, in an HPC scenario
you need only one (the last) snapshot, not thousands of them.

--
			Gleb.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:52                         ` Anthony Liguori
  2011-02-23 15:59                           ` Gleb Natapov
@ 2011-02-23 16:00                           ` Avi Kivity
  1 sibling, 0 replies; 87+ messages in thread
From: Avi Kivity @ 2011-02-23 16:00 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Stefan Hajnoczi,
	Markus Armbruster

On 02/23/2011 05:52 PM, Anthony Liguori wrote:
>> I still don't see.  What would you do with thousands of checkpoints?
>
>
> Er, hit send to quickly.
>
> HPC is a big space where checkpointing is actually useful.  An HPC 
> workload may take weeks to run to completion.  If something fails 
> during the run, it's a huge waste of time.  However, if you do 
> regularl checkpointing, a failure may only lose a few minutes of work 
> instead of the entire weeks worth of work.

The trick is to delete snapshot N-M after taking snapshot N (for a small 
constant M).
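In sketch form (hypothetical helpers, illustration only):

    #include <stdint.h>

    void take_snapshot(uint64_t n);     /* hypothetical */
    void delete_snapshot(uint64_t n);   /* hypothetical */

    /* After taking checkpoint n, drop checkpoint n - m, keeping a
     * sliding window of the m most recent checkpoints. */
    void checkpoint_tick(uint64_t n, uint64_t m)
    {
        take_snapshot(n);
        if (n >= m) {
            delete_snapshot(n - m);
        }
    }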

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 15:50                         ` Anthony Liguori
@ 2011-02-23 16:03                           ` Avi Kivity
  2011-02-23 16:04                             ` Anthony Liguori
                                               ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Avi Kivity @ 2011-02-23 16:03 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Stefan Hajnoczi,
	Markus Armbruster

On 02/23/2011 05:50 PM, Anthony Liguori wrote:
>> I still don't see.  What would you do with thousands of checkpoints?
>
>
> For reverse debugging, if you store checkpoints at a rate of, say, 
> every 10ms, and then degrade to storing every 100ms after 1 second, 
> etc., you'll have quite a large number of snapshots pretty quickly.  
> The idea of snapshotting with reverse debugging is that instead of 
> undoing every instruction, you can revert to the snapshot before, and 
> then replay the instruction stream until you get to the desired point 
> in time.

You cannot replay the instruction stream since inputs (interrupts, rdtsc 
or other timers, I/O) will be different.  You need Kemari for this.

>
> For disaster recovery, there are some workloads that you can 
> meaningfully revert to a snapshot provided that the snapshot is stored 
> at some frequency (like once a second).  Think of 
> something like a webserver where the only accumulated data is logs.  
> Losing some of the logs is better than losing all of the logs.

Are static webservers that interesting?  For disaster recovery? Anything 
else will need Kemari.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 16:03                           ` Avi Kivity
@ 2011-02-23 16:04                             ` Anthony Liguori
  2011-02-23 16:15                               ` Kevin Wolf
  2011-02-25 11:20                             ` Pavel Dovgaluk
       [not found]                             ` <-1737654525499315352@unknownmsgid>
  2 siblings, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-02-23 16:04 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Chunqiang Tang, qemu-devel, Stefan Hajnoczi,
	Markus Armbruster

On 02/23/2011 10:03 AM, Avi Kivity wrote:
> On 02/23/2011 05:50 PM, Anthony Liguori wrote:
>>> I still don't see.  What would you do with thousands of checkpoints?
>>
>>
>> For reverse debugging, if you store checkpoints at a rate of, say, 
>> every 10ms, and then degrade to storing every 100ms after 1 second, 
>> etc., you'll have quite a large number of snapshots pretty quickly.  
>> The idea of snapshotting with reverse debugging is that instead of 
>> undoing every instruction, you can revert to the snapshot before, and 
>> then replay the instruction stream until you get to the desired point 
>> in time.
>
> You cannot replay the instruction stream since inputs (interrupts, 
> rdtsc or other timers, I/O) will be different.  You need Kemari for this.

Yes, I'm well aware of this.  I don't think all the pieces were ever 
really there to do this.

Regards,

Anthony Liguori

>>
>> For disaster recovery, there are some workloads that you can 
>> meaningfully revert to a snapshot provided that the snapshot is stored 
>> at some frequency (like once a second).  Think of 
>> something like a webserver where the only accumulated data is logs.  
>> Losing some of the logs is better than losing all of the logs.
>
> Are static webservers that interesting?  For disaster recovery? 
> Anything else will need Kemari.
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 16:04                             ` Anthony Liguori
@ 2011-02-23 16:15                               ` Kevin Wolf
  0 siblings, 0 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-02-23 16:15 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, Markus Armbruster, Avi Kivity, Stefan Hajnoczi,
	qemu-devel

Am 23.02.2011 17:04, schrieb Anthony Liguori:
> On 02/23/2011 10:03 AM, Avi Kivity wrote:
>> On 02/23/2011 05:50 PM, Anthony Liguori wrote:
>>>> I still don't see.  What would you do with thousands of checkpoints?
>>>
>>>
>>> For reverse debugging, if you store checkpoints at a rate of, say, 
>>> every 10ms, and then degrade to storing every 100ms after 1 second, 
>>> etc., you'll have quite a large number of snapshots pretty quickly.  
>>> The idea of snapshotting with reverse debugging is that instead of 
>>> undoing every instruction, you can revert to the snapshot before, and 
>>> then replay the instruction stream until you get to the desired point 
>>> in time.
>>
>> You cannot replay the instruction stream since inputs (interrupts, 
>> rdtsc or other timers, I/O) will be different.  You need Kemari for this.
> 
> Yes, I'm well aware of this.  I don't think all the pieces were ever 
> really there to do this.

So why exactly was this a requirement for internal snapshots to be
considered usable in a reasonable way? ;-)

Anyway, I actually think with internal snapshots you're better suited to
implement something like this than with external snapshots.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [Qemu-devel] Re: Strategic decision: COW format
  2011-02-23 16:03                           ` Avi Kivity
  2011-02-23 16:04                             ` Anthony Liguori
@ 2011-02-25 11:20                             ` Pavel Dovgaluk
       [not found]                             ` <-1737654525499315352@unknownmsgid>
  2 siblings, 0 replies; 87+ messages in thread
From: Pavel Dovgaluk @ 2011-02-25 11:20 UTC (permalink / raw)
  To: 'Avi Kivity', 'Anthony Liguori'
  Cc: 'Kevin Wolf', 'Chunqiang Tang',
	qemu-devel, 'Markus Armbruster',
	'Stefan Hajnoczi'


> On 02/23/2011 05:50 PM, Anthony Liguori wrote:
> >> I still don't see.  What would you do with thousands of checkpoints?
> >
> >
> > For reverse debugging, if you store checkpoints at a rate of, say,
> > every 10ms, and then degrade to storing every 100ms after 1 second,
> > etc., you'll have quite a large number of snapshots pretty quickly.
> > The idea of snapshotting with reverse debugging is that instead of
> > undoing every instruction, you can revert to the snapshot before, and
> > then replay the instruction stream until you get to the desired point
> > in time.
> 
> You cannot replay the instruction stream since inputs (interrupts, rdtsc
> or other timers, I/O) will be different.  You need Kemari for this.

  I've created the technology for replaying the instruction stream and all 
of the inputs. This technology is similar to deterministic replay in VMware.
  Now I need something to save the machine state at many checkpoints to
implement reverse debugging.
  I think COW2 may be useful for this (or I should create something like it).


Pavel Dovgaluk

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
       [not found]                             ` <-1737654525499315352@unknownmsgid>
@ 2011-02-25 13:22                               ` Stefan Hajnoczi
  0 siblings, 0 replies; 87+ messages in thread
From: Stefan Hajnoczi @ 2011-02-25 13:22 UTC (permalink / raw)
  To: Pavel Dovgaluk
  Cc: Kevin Wolf, Stefan Hajnoczi, Markus Armbruster, qemu-devel,
	Avi Kivity, Chunqiang Tang

On Fri, Feb 25, 2011 at 11:20 AM, Pavel Dovgaluk
<Pavel.Dovgaluk@ispras.ru> wrote:
>
>> On 02/23/2011 05:50 PM, Anthony Liguori wrote:
>> >> I still don't see.  What would you do with thousands of checkpoints?
>> >
>> >
>> > For reverse debugging, if you store checkpoints at a rate of, say,
>> > every 10ms, and then degrade to storing every 100ms after 1 second,
>> > etc., you'll have quite a large number of snapshots pretty quickly.
>> > The idea of snapshotting with reverse debugging is that instead of
>> > undoing every instruction, you can revert to the snapshot before, and
>> > then replay the instruction stream until you get to the desired point
>> > in time.
>>
>> You cannot replay the instruction stream since inputs (interrupts, rdtsc
>> or other timers, I/O) will be different.  You need Kemari for this.
>
>  I've created the technology for replaying the instruction stream and all
> of the inputs. This technology is similar to deterministic replay in
> VMware.
>  Now I need something to save the machine state at many checkpoints to
> implement reverse debugging.
>  I think QCOW2 may be useful for it (or I should create something like
> this).

Or the BTRFS_IOC_CLONE ioctl on the btrfs filesystem.  You can
copy-on-write clone a file using it.

Stefan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
       [not found]               ` <OFAEB4CD91.BE989F29-ON8525783F.007366B8-85257840.00130B47@LocalDomain>
@ 2011-03-13  5:51                 ` Chunqiang Tang
  2011-03-13 17:48                   ` Anthony Liguori
  2011-03-14 10:12                   ` Kevin Wolf
  0 siblings, 2 replies; 87+ messages in thread
From: Chunqiang Tang @ 2011-03-13  5:51 UTC (permalink / raw)
  To: Anthony Liguori, Markus Armbruster, Aurelien Jarno, Kevin Wolf,
	Stefan Hajnoczi, Stefan Weil
  Cc: qemu-devel

> It seems that there is great interest in QCOW2's internal snapshot 
> feature. If we really want to do that, the right solution is to follow 
> VMDK's approach of storing each snapshot as a separate COW file (see 
> http://www.vmware.com/app/vmdk/?src=vmdk ), rather than using the 
> reference count table. VMDK's approach can be easily implemented for any 
> COW format, or even as a function of the generic block layer, without 
> complicating any COW format or hurting its performance. 

After the heated debate, I thought more about the right approach to 
implementing snapshots, and it became clear to me that there are major 
limitations with both VMDK's external snapshot approach (which stores each 
snapshot as a separate CoW file) and QCOW2's internal snapshot approach 
(which stores all snapshots in one file and uses a reference count table 
to keep track of them). I just posted to the mailing list a patch that 
implements internal snapshots in FVD but does it in a way without the 
limitations of VMDK and QCOW2. 

Let's first list the properties of an ideal virtual disk snapshot 
solution, and then discuss how to achieve them.

G1: Do no harm (or avoid being a misfeature), i.e., the added snapshot 
code should not slow down the runtime performance of an image that has no 
snapshots.  This implies that an image without snapshots should not cache 
the reference count table in memory and should not update the on-disk 
reference count table.

G2: Even better, an image with 1 snapshot runs as fast as an image without 
snapshots.

G3: Even even better, an image with 1,000 snapshots runs as fast as an 
image without snapshots. This basically means getting the snapshot feature 
for free.

G4: An image with 1,000 snapshots consumes no more memory than an image 
without snapshots. This again means getting the snapshot feature for free.

G5: Regardless of the number of existing snapshots, creating a new 
snapshot is fast, e.g., taking no more than 1 second.

G6: Regardless of the number of existing snapshots, deleting a snapshot is 
fast, e.g., taking no more than 1 second.

Now let's evaluate VMDK and QCOW2 against these ideal properties. 

G1: VMDK good; QCOW2 poor
G2: VMDK ok; QCOW2 poor
G3: VMDK very poor; QCOW2 poor
G4: VMDK very poor; QCOW2 poor
G5: VMDK good; QCOW2 good
G6: VMDK poor; QCOW2 good

The evaluation above assumes a straightforward VMDK implementation that, 
when handling a long chain of snapshots, s0<-s1<-s2<- … <-s1000, uses a 
chain of 1,000 VMDK driver instances to represent the chain of backing 
files. This is slow and consumes a lot of memory, but it is the behavior 
of QEMU's block device architecture today.
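
For illustration, a rough sketch (made-up names, not VMDK or QEMU code) of
how a read falls through such a chain of backing files:

#include <stdint.h>
#include <string.h>

#define CLUSTER_SIZE 65536

typedef struct Layer {
    struct Layer *backing;   /* parent snapshot in the chain, or NULL */
    int (*is_allocated)(struct Layer *l, uint64_t cluster);
    int (*read_cluster)(struct Layer *l, uint64_t cluster, uint8_t *buf);
} Layer;

/* Read one cluster, deferring to the backing layer when the current
 * layer holds no copy. With 1,000 snapshots, a worst-case read walks
 * 1,000 layers, and 1,000 open driver instances sit in memory; hence
 * the poor ratings for G3 and G4 above. */
int chain_read(Layer *top, uint64_t cluster, uint8_t *buf)
{
    for (Layer *l = top; l != NULL; l = l->backing) {
        if (l->is_allocated(l, cluster)) {
            return l->read_cluster(l, cluster, buf);
        }
    }
    memset(buf, 0, CLUSTER_SIZE);   /* allocated nowhere: reads as zeros */
    return 0;
}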

Even if the QEMU architecture can be revised and the VMDK implementation 
is optimized to the extreme, a fundamental limitation of VMDK (by design 
rather than by implementation) is G6, i.e., deleting a snapshot X in the 
middle of a snapshot chain is slow (this is also what I observed with the 
VMware software). Because each snapshot is stored as a separate file, when 
a snapshot X is deleted, those of X's data blocks that are still needed by 
its child Y must be physically copied from file X to file Y, which is 
slow, and the VM is halted during the copy operation. QCOW2's internal 
snapshot approach avoids this problem. Since all snapshots are stored in 
one file, when a snapshot is deleted, QCOW2 only needs to update its 
reference count table without physically moving data blocks.

On the other hand, QCOW2's internal snapshot has two major limitations 
that hurt runtime performance: caching the reference count table in memory 
and updating the on-disk reference count table. If we can eliminate both, 
then it is an ideal solution. This is exactly what FVD's internal snapshot 
solution does. Below is the key observation that lets FVD do this so 
efficiently.

In an internal snapshot implementation, the reference count table is used 
to track used blocks and free blocks. It serves no other purpose. In FVD, 
its "static" reference count table only tracks blocks used by (static) 
snapshots, and it does not track blocks (dynamically) allocated (on a 
write) or freed (on a trim) for the running VM. This is a simple but 
fundamental difference w.r.t. QCOW2, whose reference count table tracks 
both the static content and the dynamic content. Because data blocks used 
by snapshots are static and do not change unless a snapshot is created or 
deleted, there is no need to update FVD's "static" reference count table 
when a VM runs, and actually there is even no need to cache it in memory. 
Data blocks that are dynamically allocated or freed for a running VM are 
already tracked by FVD's one-level lookup table (which is similar to 
QCOW2's two-level table, but in FVD it is much smaller and faster) even 
before introducing the snapshot feature, and hence it comes for free. 
Updating FVD's one-level lookup table is efficient because of FVD's 
journal.
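
To make this concrete, a rough sketch (illustrative only, not the actual
FVD patch; all names are made up) of what an allocating write then looks
like:

#include <stdint.h>

typedef struct FvdState {
    uint32_t *lookup_table;   /* one-level: virtual chunk -> physical chunk */
    uint8_t  *used_bitmap;    /* in-memory only: chunks used by snapshots
                               * or by the current state */
    uint64_t  nb_chunks;
} FvdState;

/* Pick a chunk that neither a snapshot nor the current state uses. */
static int64_t alloc_chunk(FvdState *s)
{
    for (uint64_t i = 0; i < s->nb_chunks; i++) {
        if (!(s->used_bitmap[i / 8] & (1u << (i % 8)))) {
            s->used_bitmap[i / 8] |= 1u << (i % 8);
            return (int64_t)i;
        }
    }
    return -1;   /* image full */
}

/* First guest write to a virtual chunk: only the lookup table changes
 * (made durable via the journal); the on-disk reference count table,
 * which tracks snapshot chunks only, is not touched. */
int fvd_allocating_write(FvdState *s, uint64_t virtual_chunk)
{
    int64_t physical = alloc_chunk(s);
    if (physical < 0) {
        return -1;
    }
    s->lookup_table[virtual_chunk] = (uint32_t)physical;
    /* journal_append(virtual_chunk, physical); -- the only metadata I/O
     * on this path; the refcount table is rewritten only when a
     * snapshot is created or deleted. */
    return 0;
}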

When the VM boots, FVD scans the reference count table once to build a 
so-called free-block-bitmap in memory, which identifies blocks not used by 
static snapshots. The reference count table is then thrown away and never 
updated when the VM runs. For an image with 1TB of snapshot data, the 
free-block-bitmap is only 125KB, so the memory overhead is negligible, and 
the reference count table itself is 2MB, so scanning it once at VM boot 
time takes no more than 20 milliseconds. 
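
A sketch of that boot-time scan (again illustrative, not the actual patch),
assuming 2-byte refcount entries and one bit per block; with 1MB blocks,
1TB of snapshot data is 10^6 entries, which yields exactly the 2MB table
and 125KB bitmap mentioned above:

#include <stdint.h>
#include <stdlib.h>

/* Build the in-memory bitmap from the on-disk refcount table: a set bit
 * marks a block held by some snapshot; clear bits are free for use by
 * the running VM. */
uint8_t *build_free_block_bitmap(const uint16_t *refcount, uint64_t nb_blocks)
{
    uint8_t *bitmap = calloc((nb_blocks + 7) / 8, 1);
    if (!bitmap) {
        return NULL;
    }
    for (uint64_t i = 0; i < nb_blocks; i++) {
        if (refcount[i] > 0) {
            bitmap[i / 8] |= 1u << (i % 8);
        }
    }
    /* The refcount table itself can now be dropped from memory; it is
     * consulted or rewritten again only at snapshot create/delete. */
    return bitmap;
}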

In short, FVD's internal snapshot achieves the ideal properties of G1-G6, 
by 1) using the reference count table to only track "static" snapshots, 2) 
not keeping the reference count table in memory, 3) not updating the 
on-disk "static" reference count table when the VM runs, and 4) 
efficiently tracking dynamically allocated blocks by piggybacking on FVD's 
other features, i.e., its journal and small one-level lookup table.

Regards,

ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-13  5:51                 ` Chunqiang Tang
@ 2011-03-13 17:48                   ` Anthony Liguori
  2011-03-14  2:28                     ` Chunqiang Tang
  2011-03-14 10:12                   ` Kevin Wolf
  1 sibling, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-03-13 17:48 UTC (permalink / raw)
  To: Chunqiang Tang
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Markus Armbruster,
	Aurelien Jarno

On 03/12/2011 11:51 PM, Chunqiang Tang wrote:
>
> In short, FVD's internal snapshot achieves the ideal properties of G1-G6,
> by 1) using the reference count table to only track "static" snapshots, 2)
> not keeping the reference count table in memory, 3) not updating the
> on-disk "static" reference count table when the VM runs, and 4)
> efficiently tracking dynamically allocated blocks by piggybacking on FVD's
> other features, i.e., its journal and small one-level lookup table.

Are you assuming snapshots are read-only?

It's not clear to me how this would work with writeable snapshots.  It's 
not clear to me that writeable snapshots are really that important, but 
this is an advantage of having a refcount table.

External snapshots are essentially read-only snapshots so I can 
understand the argument for it.

Regards,

Anthony Liguori

> Regards,
>
> ChunQiang (CQ) Tang
> Homepage: http://www.research.ibm.com/people/c/ctang
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-13 17:48                   ` Anthony Liguori
@ 2011-03-14  2:28                     ` Chunqiang Tang
  2011-03-14 13:22                       ` Anthony Liguori
  0 siblings, 1 reply; 87+ messages in thread
From: Chunqiang Tang @ 2011-03-14  2:28 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, Markus Armbruster, qemu-devel,
	Aurelien Jarno

> > In short, FVD's internal snapshot achieves the ideal properties of G1-G6,
> > by 1) using the reference count table to only track "static" snapshots, 2)
> > not keeping the reference count table in memory, 3) not updating the
> > on-disk "static" reference count table when the VM runs, and 4)
> > efficiently tracking dynamically allocated blocks by piggybacking on FVD's
> > other features, i.e., its journal and small one-level lookup table.
> 
> Are you assuming snapshots are read-only?
> 
> It's not clear to me how this would work with writeable snapshots.  It's 
> not clear to me that writeable snapshots are really that important, but 
> this is an advantage of having a refcount table.
> 
> External snapshots are essentially read-only snapshots so I can 
> understand the argument for it.

By definition, a snapshot itself must be immutable (read-only), but a 
writeable image state can be derived from an immutable snapshot by using 
copy-on-write, which I guess is what you meant by "writeable snapshot." 
Perhaps the following concrete use cases will make things clear. These 
use cases are supported by QCOW2, VMware, and FVD, regardless of the 
differences in their internal implementations. 

Suppose an image's initial state is: 

Image: (current-disk-state-observed-by-the-running-VM)

Below, I simply refer to "current-disk-state-observed-by-the-running-VM" 
as "current-state."  The VM issues writes and continuously modifies the 
"current-state". At one point in time, a snapshot s1 is taken, and the 
image becomes:

Image: s1->(current-state)

The VM issues more writes and subsequently takes three snapshots, s2, s3, 
and s4. Now the image becomes:

Image: s1->s2->s3->s4->(current-state)

Suppose the action "goto snapshot s2" is taken, which does not affect the 
immutable snapshots s1-s4, but the "current-state" is abandoned and lost. 
Now the image becomes:

Image: s1->s2->s3->s4
           |->(current-state)

(Note: depending on your email client, the two lines in the diagram may 
not be properly aligned.) 
The new "current-state" is writeable and is derived from the immutable 
snapshot s2. When the VM issues a write, it does copy-on-write and stores 
dirty data in the "current-state" without modifying the original snapshot 
s2. Perhaps this is what you meant by "writeable snapshot"? 
The diagram above is at the conceptual level. In implementation, both 
QCOW2 and FVD store all snapshots s1-s4 and the current-state in one image 
file, and the snapshots and current-state may share data chunks. 

Suppose the VM issues some writes and subsequently takes two snapshots, s5 
and s6. Now the image becomes: 

Image: s1->s2->s3->s4
           |->s5->s6->(current-state)

Suppose the action "goto snapshot s2" is taken again. Now the image 
becomes:

Image: s1->s2->s3->s4
           |->s5->s6
           |->(current-state)

The new "current-state" is writeable and is derived from the immutable 
snapshot s2. Right after the "goto" action, the running VM sees the state 
of s2, instead of the state of s5 created after the first "goto snapshot 
s2" action. Again, this is because a snapshot itself is immutable. 

Again, all of these use cases are supported by QCOW2, VMware, and FVD, 
regardless of the differences in their internal implementations. 

Now let's come back to the discussion of FVD. Perhaps my description in 
the previous email was not clear. In the diagrams above, FVD's reference 
count table only tracks the snapshots (s1, s2, ...), but does not track 
the "current-state". Instead, FVD's default mechanism (one-level lookup 
table, journal, etc.), which exists even before introducing snapshots, 
already tracks the "current-state". Working together, FVD's reference 
count table and its default mechanism track all the states. In QCOW2, when 
a new cluster is allocated while handling a running VM's write request, it 
updates both the lookup table and the reference count table, which is 
unnecessary because their information is redundant. By contrast, in FVD, 
when a new chunk is allocated while handling a running VM's write request, 
it only updates the lookup table without updating the reference count 
table, because by design the reference count table does not track the 
"current-state" and this chunk allocation operation belongs to the 
"current-state". This is the key to how FVD gets all the functions of 
QCOW2's internal snapshot without QCOW2's memory overhead of caching the 
reference count table and its disk I/O overhead of reading and writing the 
reference count table during normal execution of the VM.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-13  5:51                 ` Chunqiang Tang
  2011-03-13 17:48                   ` Anthony Liguori
@ 2011-03-14 10:12                   ` Kevin Wolf
  1 sibling, 0 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-03-14 10:12 UTC (permalink / raw)
  To: Chunqiang Tang
  Cc: Stefan Hajnoczi, qemu-devel, Markus Armbruster, Aurelien Jarno

Am 13.03.2011 06:51, schrieb Chunqiang Tang:
> After the heated debate, I thought more about the right approach to 
> implementing snapshots, and it became clear to me that there are major 
> limitations with both VMDK's external snapshot approach (which stores each 
> snapshot as a separate CoW file) and QCOW2's internal snapshot approach 
> (which stores all snapshots in one file and uses a reference count table 
> to keep track of them). I just posted to the mailing list a patch that 
> implements internal snapshots in FVD but does it in a way without the 
> limitations of VMDK and QCOW2. 
> 
> Let's first list the properties of an ideal virtual disk snapshot 
> solution, and then discuss how to achieve them.
> 
> G1: Do no harm (or avoid being a misfeature), i.e., the added snapshot 
> code should not slow down the runtime performance of an image that has no 
> snapshots.  This implies that an image without snapshots should not cache 
> the reference count table in memory and should not update the on-disk 
> reference count table.
> 
> G2: Even better, an image with 1 snapshot runs as fast as an image without 
> snapshots.
> 
> G3: Even even better, an image with 1,000 snapshots runs as fast as an 
> image without snapshots. This basically means getting the snapshot feature 
> for free.
> 
> G4: An image with 1,000 snapshots consumes no more memory than an image 
> without snapshots. This again means getting the snapshot feature for free.
> 
> G5: Regardless of the number of existing snapshots, creating a new 
> snapshot is fast, e.g., taking no more than 1 second.
> 
> G6: Regardless of the number of existing snapshots, deleting a snapshot is 
> fast, e.g., taking no more than 1 second.
> 
> Now let's evaluate VMDK and QCOW2 against these ideal properties. 
> 
> G1: VMDK good; QCOW2 poor
> G2: VMDK ok; QCOW2 poor
> G3: VMDK very poor; QCOW2 poor
> G4: VMDK very poor; QCOW2 poor
> G5: VMDK good; QCOW2 good
> G6: VMDK poor; QCOW2 good

Okay. I think I don't agree with all of these. I'm not entirely sure how
VMDK works, so I take this as "random image format that uses backing
files" (so it also applies to qcow2 with backing files, which I hope
isn't too confusing).

G1: VMDK good; QCOW2 poor for cache=writethrough, ok otherwise; QCOW3 good
G2: VMDK ok; QCOW2 good
G3: VMDK poor; QCOW2 good
G4: VMDK very poor; QCOW2 ok
G5: VMDK good; QCOW2 good
G6: VMDK very poor; QCOW2 good

Also, let me add another feature which I believe is an important factor
in the decision between internal and external snapshots:

G7: Loading/Reverting to a snapshot is fast
G7: VMDK good; QCOW2 ok

> On the other hand, QCOW2's internal snapshot has two major limitations that 
> hurt runtime performance: caching the reference count table in memory and 
> updating the on-disk reference count table. If we can eliminate both, then 
> it is an ideal solution.

It's not even necessary to get completely rid of it. What hurts is
writing the additional metadata. So if you can delay writing the
metadata and only write out a refcount block once you need to load the
next one into memory, the overhead is lost in the noise (remember, even
with 64k clusters, a refcount block covers 2 GB of virtual disk space).
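
(For reference, the arithmetic behind that number, given qcow2's 16-bit
refcount entries: a 64k refcount block holds 65536 / 2 = 32768 entries,
and 32768 entries * 64k clusters = 2 GB of virtual disk space.)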

We already do that for qcow2 in all writeback cache modes. We can't do
it yet for cache=writethrough, but we were planning to allow using QED's
dirty flag approach which would get rid of the writes also in
writethrough modes.
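
As a sketch of that dirty flag protocol (illustrative only, modelled
loosely on QED; names are made up):

#include <stdint.h>

#define HDR_DIRTY 0x1u   /* illustrative header feature bit */

typedef struct ImageHeader {
    uint32_t flags;
} ImageHeader;

/* Before the first allocating write after open: persist the dirty bit
 * (one header write + flush), then skip refcount writes entirely. */
void before_first_allocating_write(ImageHeader *h)
{
    if (!(h->flags & HDR_DIRTY)) {
        h->flags |= HDR_DIRTY;
        /* write_header(h); flush(); -- must reach disk before any
         * cluster-allocation metadata does */
    }
}

/* On open: a set bit means the image was not closed cleanly, so the
 * on-disk refcounts may be stale; rebuild them with an fsck-style scan
 * of the L1/L2 tables. A clean close clears the bit. */
void check_on_open(ImageHeader *h)
{
    if (h->flags & HDR_DIRTY) {
        /* rebuild_refcounts_from_l2_tables(); */
    }
}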

I think this explains my estimation for G1.

For G2 and G3, I'm not sure why you think that having internal snapshots
slows down operation. It's basically just data that sits in the image
file and is unused. After startup or after deleting a snapshot you
probably have to look at all of the refcount table again for cluster
allocations. Is this what you mean?

For G4, the size of snapshots in memory, the only overhead of internal
snapshots that I could think of is the snapshot table. I would hardly
rate this as "poor".

For G5 and G6 I basically agree with your estimation, except that I
think that the overhead of deleting a snapshot is _really_ bad. This is
one of the major problems we have with external snapshots today.

> In an internal snapshot implementation, the reference count table is used 
> to track used blocks and free blocks. It serves no other purpose. In FVD, 
> its "static" reference count table only tracks blocks used by (static) 
> snapshots, and it does not track blocks (dynamically) allocated (on a 
> write) or freed (on a trim) for the running VM. This is a simple but 
> fundamental difference w.r.t. QCOW2, whose reference count table tracks 
> both the static content and the dynamic content. Because data blocks used 
> by snapshots are static and do not change unless a snapshot is created or 
> deleted, there is no need to update FVD's "static" reference count table 
> when a VM runs, and actually there is even no need to cache it in memory. 
> Data blocks that are dynamically allocated or freed for a running VM are 
> already tracked by FVD's one-level lookup table (which is similar to 
> QCOW2's two-level table, but in FVD it is much smaller and faster) even 
> before introducing the snapshot feature, and hence it comes for free. 
> Updating FVD's one-level lookup table is efficient because of FVD's 
> journal.

So when is a cluster considered free? Only if both its refcount is 0 and
it's not referenced by a used lookup table entry?

How do you check the latter condition without scanning the whole lookup
table?

> When the VM boots, FVD scans the reference count table once to build a 
> so-called free-block-bitmap in memory, which identifies blocks not used by 
> static snapshots. The reference count table is then thrown away and never 
> updated when the VM runs.

This is an implementation detail and not related to the format.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14  2:28                     ` Chunqiang Tang
@ 2011-03-14 13:22                       ` Anthony Liguori
  2011-03-14 13:53                         ` Chunqiang Tang
  2011-03-14 14:15                         ` Kevin Wolf
  0 siblings, 2 replies; 87+ messages in thread
From: Anthony Liguori @ 2011-03-14 13:22 UTC (permalink / raw)
  To: Chunqiang Tang
  Cc: Kevin Wolf, Stefan Hajnoczi, Markus Armbruster, Aurelien Jarno,
	qemu-devel

On 03/13/2011 09:28 PM, Chunqiang Tang wrote:
>>> In short, FVD's internal snapshot achieves the ideal properties of
> G1-G6,
>>> by 1) using the reference count table to only track "static"
> snapshots, 2)
>>> not keeping the reference count table in memory, 3) not updating the
>>> on-disk "static" reference count table when the VM runs, and 4)
>>> efficiently tracking dynamically allocated blocks by piggybacking on
> FVD's
>>> other features, i.e., its journal and small one-level lookup table.
>> Are you assuming snapshots are read-only?
>>
>> It's not clear to me how this would work with writeable snapshots.  It's
>> not clear to me that writeable snapshots are really that important, but
>> this is an advantage of having a refcount table.
>>
>> External snapshots are essentially read-only snapshots so I can
>> understand the argument for it.
>> By definition, a snapshot itself must be immutable (read-only), but a
>> writeable image state can be derived from an immutable snapshot by using
>> copy-on-write, which I guess is what you meant by "writeable snapshot."

No, because the copy-on-write is another layer on top of the snapshot 
and AFAICT, they don't persist when moving between snapshots.

The equivalent for external snapshots would be:

base0 <- base1 <- base2 <- image

And then if I wanted to move to base1 without destroying base2 and 
image, I could do:

qemu-img create -f qcow2 -b base1 base1-overlay.img

The file system can keep a lot of these things around pretty easily but 
with your proposal, it seems like there can only be one.  If you support 
many of them, I think you'll degenerate to something as complex as a 
reference count table.

On the other hand, I think it's reasonable to just avoid the CoW overlay 
entirely and say that moving to a previous snapshot destroys any of its 
children.  I think this ends up being a simplifying assumption that is 
worth investigating further.

 From the use-cases that I'm aware of (backup and RAS), I think these 
semantics are okay.

I'm curious what other people think (Kevin/Stefan?).

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 13:22                       ` Anthony Liguori
@ 2011-03-14 13:53                         ` Chunqiang Tang
  2011-03-14 14:02                           ` Anthony Liguori
  2011-03-14 14:26                           ` Stefan Hajnoczi
  2011-03-14 14:15                         ` Kevin Wolf
  1 sibling, 2 replies; 87+ messages in thread
From: Chunqiang Tang @ 2011-03-14 13:53 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, Markus Armbruster, Aurelien Jarno,
	qemu-devel

> No, because the copy-on-write is another layer on top of the snapshot 
> and AFAICT, they don't persist when moving between snapshots.
> 
> The equivalent for external snapshots would be:
> 
> base0 <- base1 <- base2 <- image
> 
> And then if I wanted to move to base1 without destroying base2 and 
> image, I could do:
> 
> qemu-img create -f qcow2 -b base1 base1-overlay.img
> 
> The file system can keep a lot of these things around pretty easily but 
> with your proposal, it seems like there can only be one.  If you support 
> many of them, I think you'll degenerate to something as complex as a 
> reference count table.
> 
> On the other hand, I think it's reasonable to just avoid the CoW overlay 
> entirely and say that moving to a previous snapshot destroys any of its 
> children.  I think this ends up being a simplifying assumption that is 
> worth investigating further.

No, both VMware and FVD have the same semantics as QCOW2. Moving to a 
previous snapshot does not destroy any of its children. In the example I 
gave (copied below), it goes from 

Image: s1->s2->s3->s4->(current-state)

back to snapshot s2, and now the state is

Image: s1->s2->s3->s4
           |->(current-state)

where all snapshots s1-s4 are kept. From there, it can take another 
snapshot s5, and then further go back to snapshot s4, ending up with 

Image: s1->s2->s3->s4
           |->s5   |
                   |-> (current-state)

FVD does have a reference count table like that in QCOW2, but it avoids 
the need for updating the reference count table during normal execution of 
the VM. The reference count table is only updated at the time of creating 
a snapshot or deleting a snapshot. Therefore, during normal execution of a 
VM, images with snapshots are as fast as images without snapshots. 

FVD can do this because of the following:

"FVD's reference count table only tracks the snapshots (s1, s2, ...), 
but does not track the "current-state". Instead,
FVD's default mechanism (one-level lookup table, journal, etc.), which 
exists
even before introducing snapshot, already tracks the "current-state". 
Working
together, FVD's reference count table and its default mechanism tracks all 
the
states. In QCOW2, when a new cluster is allocated during handling a 
running VM's
write request, it updates both the lookup table and the reference count 
table,
which is unnecessary because their information is redundant. By contrast, 
in
FVD, when a new chunk is allocated during handling a running VM's write
request, it only updates the lookup table without updating the reference 
count
table, because by design the reference count table does not track the 
"current-state" and this chunk allocation operation belongs to the 
"current-state."
This is the key why FVD can get all the functions of QCOW2's internal 
snapshot
but without its memory overhead to cache the reference count table and
its disk I/O overhead to read or write the reference count table during 
normal
execution of VM."

Regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 13:53                         ` Chunqiang Tang
@ 2011-03-14 14:02                           ` Anthony Liguori
  2011-03-14 14:21                             ` Kevin Wolf
  2011-03-14 14:26                           ` Stefan Hajnoczi
  1 sibling, 1 reply; 87+ messages in thread
From: Anthony Liguori @ 2011-03-14 14:02 UTC (permalink / raw)
  To: Chunqiang Tang
  Cc: Kevin Wolf, Stefan Hajnoczi, Markus Armbruster, Aurelien Jarno,
	qemu-devel

On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>> No, because the copy-on-write is another layer on top of the snapshot
>> and AFAICT, they don't persist when moving between snapshots.
>>
>> The equivalent for external snapshots would be:
>>
>> base0<- base1<- base2<- image
>>
>> And then if I wanted to move to base1 without destroying base2 and
>> image, I could do:
>>
>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>
>> The file system can keep a lot of these things around pretty easily but
>> with your proposal, it seems like there can only be one.  If you support
>> many of them, I think you'll degenerate to something as complex as a
>> reference count table.
>>
>> On the other hand, I think it's reasonable to just avoid the CoW overlay
>> entirely and say that moving to a previous snapshot destroys any of it's
>> children.  I think this ends up being a simplifying assumption that is
>> worth investigating further.
> No, both VMware and FVD have the same semantics as QCOW2. Moving to a
> previous snapshot does not destroy any of its children. In the example I
> gave (copied below),
> it goes from
>
> Image: s1->s2->s3->s4->(current-state)
>
> back to snapshot s2, and now the state is
>
> Image: s1->s2->s3->s4
>             |->(current-state)
>
> where all snapshots s1-s4 are kept. From there, it can take another
> snapshot s5, and then further go back to snapshot s4, ending up with
>
> Image: s1->s2->s3->s4
>             |->s5   |
>                     |->  (current-state)

Your use of "current-state" is confusing me because AFAICT, 
current-state is just semantically another snapshot.

It's writable because it has no children.  You only keep around one 
writable snapshot and to make another snapshot writable, you have to 
discard the former.

This is not the semantics of qcow2.  Every time you create a snapshot, 
it's essentially a new image.  You can write directly to it.

While we don't do this today and I don't think we ever should, it's 
entirely possible to have two disks served simultaneously out of the 
same qcow2 file using snapshots.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 13:22                       ` Anthony Liguori
  2011-03-14 13:53                         ` Chunqiang Tang
@ 2011-03-14 14:15                         ` Kevin Wolf
  2011-03-14 14:25                           ` Chunqiang Tang
  2011-03-14 14:47                           ` Anthony Liguori
  1 sibling, 2 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-03-14 14:15 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, Stefan Hajnoczi, Markus Armbruster,
	Aurelien Jarno, qemu-devel

Am 14.03.2011 14:22, schrieb Anthony Liguori:
> On 03/13/2011 09:28 PM, Chunqiang Tang wrote:
>>>> In short, FVD's internal snapshot achieves the ideal properties of
>> G1-G6,
>>>> by 1) using the reference count table to only track "static"
>> snapshots, 2)
>>>> not keeping the reference count table in memory, 3) not updating the
>>>> on-disk "static" reference count table when the VM runs, and 4)
>>>> efficiently tracking dynamically allocated blocks by piggybacking on
>> FVD's
>>>> other features, i.e., its journal and small one-level lookup table.
>>> Are you assuming snapshots are read-only?
>>>
>>> It's not clear to me how this would work with writeable snapshots.  It's
>>> not clear to me that writeable snapshots are really that important, but
>>> this is an advantage of having a refcount table.
>>>
>>> External snapshots are essentially read-only snapshots so I can
>>> understand the argument for it.
>> By definition, a snapshot itself must be immutable (read-only), but a
>> writeable image state can be derived from an immutable snapshot by using
>> copy-on-write, which I guess is what you meant by "writeable snapshot."
> 
> No, because the copy-on-write is another layer on top of the snapshot 
> and AFAICT, they don't persist when moving between snapshots.
> 
> The equivalent for external snapshots would be:
> 
> base0 <- base1 <- base2 <- image
> 
> And then if I wanted to move to base1 without destroying base2 and 
> image, I could do:
> 
> qemu-img create -f qcow2 -b base1 base1-overlay.img
> 
> The file system can keep a lot of these things around pretty easily but 
> with your proposal, it seems like there can only be one.  If you support 
> many of them, I think you'll degenerate to something as complex as a 
> reference count table.

IIUC, he already uses a refcount table. Actually, I think that a
refcount table is a requirement to provide the interesting properties
that internal snapshots have (see my other mail).

Refcount tables aren't a very complex thing either. In fact, it makes a
format much simpler to have one concept like refcount tables instead of
adding another different mechanism for each new feature that would be
natural with refcount tables.

The only problem with them is that they are metadata that must be
updated. However, I think we have discussed enough how to avoid the
greatest part of that cost.

> On the other hand, I think it's reasonable to just avoid the CoW overlay 
> entirely and say that moving to a previous snapshot destroys any of its 
> children.  I think this ends up being a simplifying assumption that is 
> worth investigating further.
> 
>  From the use-cases that I'm aware of (backup and RAS), I think these 
> semantics are okay.

I don't think these semantics would be expected. And anyway, would this
really allow simplification of the format? I'm afraid that you would go
for complicated solutions with odd semantics just because of an
arbitrary dislike of refcounts.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:02                           ` Anthony Liguori
@ 2011-03-14 14:21                             ` Kevin Wolf
  2011-03-14 14:35                               ` Chunqiang Tang
  2011-03-14 14:49                               ` Anthony Liguori
  0 siblings, 2 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-03-14 14:21 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, Stefan Hajnoczi, Markus Armbruster,
	Aurelien Jarno, qemu-devel

Am 14.03.2011 15:02, schrieb Anthony Liguori:
> On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>>> No, because the copy-on-write is another layer on top of the snapshot
>>> and AFAICT, they don't persist when moving between snapshots.
>>>
>>> The equivalent for external snapshots would be:
>>>
>>> base0<- base1<- base2<- image
>>>
>>> And then if I wanted to move to base1 without destroying base2 and
>>> image, I could do:
>>>
>>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>>
>>> The file system can keep a lot of these things around pretty easily but
>>> with your proposal, it seems like there can only be one.  If you support
>>> many of them, I think you'll degenerate to something as complex as a
>>> reference count table.
>>>
>>> On the other hand, I think it's reasonable to just avoid the CoW overlay
>>> entirely and say that moving to a previous snapshot destroys any of its
>>> children.  I think this ends up being a simplifying assumption that is
>>> worth investigating further.
>> No, both VMware and FVD have the same semantics as QCOW2. Moving to a
>> previous snapshot does not destroy any of its children. In the example I
>> gave (copied below),
>> it goes from
>>
>> Image: s1->s2->s3->s4->(current-state)
>>
>> back to snapshot s2, and now the state is
>>
>> Image: s1->s2->s3->s4
>>             |->(current-state)
>>
>> where all snapshots s1-s4 are kept. From there, it can take another
>> snapshot s5, and then further go back to snapshot s4, ending up with
>>
>> Image: s1->s2->s3->s4
>>             |->s5   |
>>                     |->  (current-state)
> 
> Your use of "current-state" is confusing me because AFAICT, 
> current-state is just semantically another snapshot.
> 
> It's writable because it has no children.  You only keep around one 
> writable snapshot and to make another snapshot writable, you have to 
> discard the former.
> 
> This is not the semantics of qcow2.  Every time you create a snapshot, 
> it's essentially a new image.  You can write directly to it.
> 
> While we don't do this today and I don't think we ever should, it's 
> entirely possible to have two disks served simultaneously out of the 
> same qcow2 file using snapshots.

No, CQ is describing the semantics of internal snapshots in qcow2
correctly. You have all the snapshots that are stored in the snapshot
table (all read-only) plus one current state described by the image
header (read-write).

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:15                         ` Kevin Wolf
@ 2011-03-14 14:25                           ` Chunqiang Tang
  2011-03-14 14:31                             ` Stefan Hajnoczi
  2011-03-14 14:34                             ` Kevin Wolf
  2011-03-14 14:47                           ` Anthony Liguori
  1 sibling, 2 replies; 87+ messages in thread
From: Chunqiang Tang @ 2011-03-14 14:25 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, Aurelien Jarno, Markus Armbruster, qemu-devel

> IIUC, he already uses a refcount table. Actually, I think that a
> refcount table is a requirement to provide the interesting properties
> that internal snapshots have (see my other mail).
> 
> Refcount tables aren't a very complex thing either. In fact, it makes a
> format much simpler to have one concept like refcount tables instead of
> adding another different mechanism for each new feature that would be
> natural with refcount tables.
> 
> The only problem with them is that they are metadata that must be
> updated. However, I think we have discussed enough how to avoid the
> greatest part of that cost.

FVD's novel use of the reference count table reduces the metadata update 
overhead to literally zero during normal execution of a VM. This gets 
the best of QCOW2's reference count table but without its overhead. In 
FVD, the reference count table is only updated when creating a new 
snapshot or deleting an existing snapshot. The reference count table is 
never updated during normal execution of a VM.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 13:53                         ` Chunqiang Tang
  2011-03-14 14:02                           ` Anthony Liguori
@ 2011-03-14 14:26                           ` Stefan Hajnoczi
  2011-03-14 14:30                             ` Chunqiang Tang
  1 sibling, 1 reply; 87+ messages in thread
From: Stefan Hajnoczi @ 2011-03-14 14:26 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Kevin Wolf, Aurelien Jarno, Markus Armbruster, qemu-devel

On Mon, Mar 14, 2011 at 1:53 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
> Therefore, during normal execution of a
> VM, images with snapshots are as fast as images without snapshots.

Hang on, an image with a snapshot still needs to do copy-on-write,
just like backing files.  The cost of copy-on-write is reading data
from the backing file, whereas a non-CoW write doesn't need to do
that.

So no, snapshots are not free during normal execution.
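
To spell out where that cost sits, a sketch (illustrative C, made-up
names) of an allocating write to a snapshotted cluster:

#include <stdint.h>
#include <string.h>

#define CLUSTER_SIZE 65536

/* A guest write covering only part of a not-yet-allocated cluster must
 * first fetch the old contents from the snapshot/backing copy; that
 * read is the CoW overhead no format can avoid. */
void cow_write(uint8_t *new_cluster, const uint8_t *old_cluster,
               const uint8_t *guest_buf, size_t offset, size_t length)
{
    memcpy(new_cluster, old_cluster, CLUSTER_SIZE);  /* fill from old copy */
    memcpy(new_cluster + offset, guest_buf, length); /* apply guest write */
    /* new_cluster is then written to a freshly allocated cluster; the
     * snapshot's copy stays untouched. */
}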

Stefan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:26                           ` Stefan Hajnoczi
@ 2011-03-14 14:30                             ` Chunqiang Tang
  0 siblings, 0 replies; 87+ messages in thread
From: Chunqiang Tang @ 2011-03-14 14:30 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Aurelien Jarno, Markus Armbruster, qemu-devel

> On Mon, Mar 14, 2011 at 1:53 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
> > Therefore, during normal execution of a
> > VM, images with snapshots are as fast as images without snapshots.
> 
> Hang on, an image with a snapshot still needs to do copy-on-write,
> just like backing files.  The cost of copy-on-write is reading data
> from the backing file, whereas a non-CoW write doesn't need to do
> that.
> 
> So no, snapshots are not free during normal execution.

You are right. For any implementation of snapshots (internal or external), 
this CoW overhead is unavoidable. What I meant to say was that, other than 
this mandatory CoW overhead, FVD's internal snapshot does not incur any 
additional metadata update overhead (unlike that in QCOW2). 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:25                           ` Chunqiang Tang
@ 2011-03-14 14:31                             ` Stefan Hajnoczi
  2011-03-14 16:32                               ` Chunqiang Tang
       [not found]                               ` <OF7C2FDD40.E76A4E14-ON85257853.005ADD68-85257853.005AF16E@LocalDomain>
  2011-03-14 14:34                             ` Kevin Wolf
  1 sibling, 2 replies; 87+ messages in thread
From: Stefan Hajnoczi @ 2011-03-14 14:31 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Kevin Wolf, Aurelien Jarno, Markus Armbruster, qemu-devel

On Mon, Mar 14, 2011 at 2:25 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
>> IIUC, he already uses a refcount table. Actually, I think that a
>> refcount table is a requirement to provide the interesting properties
>> that internal snapshots have (see my other mail).
>>
>> Refcount tables aren't a very complex thing either. In fact, it makes a
>> format much simpler to have one concept like refcount tables instead of
>> adding another different mechanism for each new feature that would be
>> natural with refcount tables.
>>
>> The only problem with them is that they are metadata that must be
>> updated. However, I think we have discussed enough how to avoid the
>> greatest part of that cost.
>
> FVD's novel use of the reference count table reduces the metadata update
> overhead to literally zero during normal execution of a VM. This gets
> the best of QCOW2's reference count table but without its overhead. In
> FVD, the reference count table is only updated when creating a new
> snapshot or deleting an existing snapshot. The reference count table is
> never updated during normal execution of a VM.

Do you want to send out a break-down of the steps (and cost) involved in doing:

1. Snapshot creation.
2. Snapshot deletion.
3. Opening an image with n snapshots.

Stefan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:25                           ` Chunqiang Tang
  2011-03-14 14:31                             ` Stefan Hajnoczi
@ 2011-03-14 14:34                             ` Kevin Wolf
  1 sibling, 0 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-03-14 14:34 UTC (permalink / raw)
  To: Chunqiang Tang
  Cc: Stefan Hajnoczi, Aurelien Jarno, Markus Armbruster, qemu-devel

Am 14.03.2011 15:25, schrieb Chunqiang Tang:
>> IIUC, he already uses a refcount table. Actually, I think that a
>> refcount table is a requirement to provide the interesting properties
>> that internal snapshots have (see my other mail).
>>
>> Refcount tables aren't a very complex thing either. In fact, it makes a
>> format much simpler to have one concept like refcount tables instead of
>> adding another different mechanism for each new feature that would be
>> natural with refcount tables.
>>
>> The only problem with them is that they are metadata that must be
>> updated. However, I think we have discussed enough how to avoid the
>> greatest part of that cost.
> 
> FVD's novel use of the reference count table reduces the metadata update 
> overhead to literally zero during normal execution of a VM. This gets 
> the best of QCOW2's reference count table but without its overhead. In 
> FVD, the reference count table is only updated when creating a new 
> snapshot or deleting an existing snapshot. The reference count table is 
> never updated during normal execution of a VM.

Yeah, I think that's basically an interesting property. However, I don't
think that it makes a big difference compared to qcow2's refcount table
when you use a writeback metadata cache.

What about the question that I had in my other mail? (How do you
determine if a cluster is free without scanning the whole lookup table?)
I think this might be the missing piece for me to understand how your
approach works.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:21                             ` Kevin Wolf
@ 2011-03-14 14:35                               ` Chunqiang Tang
  2011-03-14 14:49                               ` Anthony Liguori
  1 sibling, 0 replies; 87+ messages in thread
From: Chunqiang Tang @ 2011-03-14 14:35 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, Aurelien Jarno, Markus Armbruster, qemu-devel

> > Your use of "current-state" is confusing me because AFAICT, 
> > current-state is just semantically another snapshot.
> > 
> > It's writable because it has no children.  You only keep around one 
> > writable snapshot and to make another snapshot writable, you have to 
> > discard the former.
> > 
> > This is not the semantics of qcow2.  Every time you create a snapshot, 

> > it's essentially a new image.  You can write directly to it.
> > 
> > While we don't do this today and I don't think we ever should, it's 
> > entirely possible to have two disks served simultaneously out of the 
> > same qcow2 file using snapshots.
> 
> No, CQ is describing the semantics of internal snapshots in qcow2
> correctly. You have all the snapshots that are stored in the snapshot
> table (all read-only) plus one current state described by the image
> header (read-write).

That's also the semantics of VMware's external snapshot. So there is no 
difference in semantics. It is just a difference in implementation and 
performance.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:15                         ` Kevin Wolf
  2011-03-14 14:25                           ` Chunqiang Tang
@ 2011-03-14 14:47                           ` Anthony Liguori
  2011-03-14 15:03                             ` Kevin Wolf
  2011-03-14 15:04                             ` Chunqiang Tang
  1 sibling, 2 replies; 87+ messages in thread
From: Anthony Liguori @ 2011-03-14 14:47 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Chunqiang Tang, Stefan Hajnoczi, Markus Armbruster,
	Aurelien Jarno, qemu-devel

On 03/14/2011 09:15 AM, Kevin Wolf wrote:
>> The file system can keep a lot of these things around pretty easily but
>> with your proposal, it seems like there can only be one.  If you support
>> many of them, I think you'll degenerate to something as complex as a
>> reference count table.
> IIUC, he already uses a refcount table.

Well, he needs a separate mechanism to make trim/discard work, but for 
the snapshot discussion, a reference count table is avoided.

The bitmap only covers whether the guest has accessed a block or not.  
Then there is a separate table that maps guest offsets to offsets within 
the file.

I haven't thought hard about it, but my guess is that there is an 
ordering constraint between these two pieces of metadata which is why 
the journal is necessary.  I get worried about the complexity of a 
journal even more than a reference count table.

>   Actually, I think that a
> refcount table is a requirement to provide the interesting properties
> that internal snapshots have (see my other mail).

Well the trick here AFAICT is that you're basically storing external 
snapshots internally.  So it's sort of like a bunch of FVD formats 
embedded into a single image.

> Refcount tables aren't a very complex thing either. In fact, it makes a
> format much simpler to have one concept like refcount tables instead of
> adding another different mechanism for each new feature that would be
> natural with refcount tables.

I think it's a reasonable design goal to minimize any metadata updates 
in the fast path.  If we can write 1 piece of metadata versus writing 2, 
then it's worth exploring IMHO.

> The only problem with them is that they are metadata that must be
> updated. However, I think we have discussed enough how to avoid the
> greatest part of that cost.

Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid 
the writes for the refcount table?

>> On the other hand, I think it's reasonable to just avoid the CoW overlay
>> entirely and say that moving to a previous snapshot destroys any of it's
>> children.  I think this ends up being a simplifying assumption that is
>> worth investigating further.
>>
>>   From the use-cases that I'm aware of (backup and RAS), I think these
>> semantics are okay.
> I don't think these semantics would be expected. And anyway, would this
> really allow simplification of the format?

I don't know, I'm really just trying to separate out the implementation 
of the format from the use-cases we're trying to address.

Even if we're talking about qcow3, then if we only really care about 
read-only snapshots, perhaps we can add a feature bit for this and take 
advantage of it to make the WCE=0 case much faster.

But the fundamental question is, does this satisfy the use-cases we care 
about?

Regards,

Anthony Liguori

>   I'm afraid that you would go
> for complicated solutions with odd semantics just because of an
> arbitrary dislike of refcounts.
>
> Kevin
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:21                             ` Kevin Wolf
  2011-03-14 14:35                               ` Chunqiang Tang
@ 2011-03-14 14:49                               ` Anthony Liguori
  2011-03-14 15:05                                 ` Stefan Hajnoczi
  2011-03-14 15:08                                 ` Kevin Wolf
  1 sibling, 2 replies; 87+ messages in thread
From: Anthony Liguori @ 2011-03-14 14:49 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Chunqiang Tang, Stefan Hajnoczi, Markus Armbruster,
	Aurelien Jarno, qemu-devel

On 03/14/2011 09:21 AM, Kevin Wolf wrote:
> Am 14.03.2011 15:02, schrieb Anthony Liguori:
>> On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>>>> No, because the copy-on-write is another layer on top of the snapshot
>>>> and AFAICT, they don't persist when moving between snapshots.
>>>>
>>>> The equivalent for external snapshots would be:
>>>>
>>>> base0<- base1<- base2<- image
>>>>
>>>> And then if I wanted to move to base1 without destroying base2 and
>>>> image, I could do:
>>>>
>>>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>>>
>>>> The file system can keep a lot of these things around pretty easily but
>>>> with your proposal, it seems like there can only be one.  If you support
>>>> many of them, I think you'll degenerate to something as complex as a
>>>> reference count table.
>>>>
>>>> On the other hand, I think it's reasonable to just avoid the CoW overlay
>>>> entirely and say that moving to a previous snapshot destroys any of its
>>>> children.  I think this ends up being a simplifying assumption that is
>>>> worth investigating further.
>>> No, both VMware and FVD have the same semantics as QCOW2. Moving to a
>>> previous snapshot does not destroy any of its children. In the example I
>>> gave (copied below),
>>> it goes from
>>>
>>> Image: s1->s2->s3->s4->(current-state)
>>>
>>> back to snapshot s2, and now the state is
>>>
>>> Image: s1->s2->s3->s4
>>>              |->(current-state)
>>>
>>> where all snapshots s1-s4 are kept. From there, it can take another
>>> snapshot s5, and then further go back to snapshot s4, ending up with
>>>
>>> Image: s1->s2->s3->s4
>>>              |->s5   |
>>>                      |->   (current-state)
>> Your use of "current-state" is confusing me because AFAICT,
>> current-state is just semantically another snapshot.
>>
>> It's writable because it has no children.  You only keep around one
>> writable snapshot and to make another snapshot writable, you have to
>> discard the former.
>>
>> This is not the semantics of qcow2.  Every time you create a snapshot,
>> it's essentially a new image.  You can write directly to it.
>>
>> While we don't do this today and I don't think we ever should, it's
>> entirely possible to have two disks served simultaneously out of the
>> same qcow2 file using snapshots.
> No, CQ is describing the semantics of internal snapshots in qcow2
> correctly. You have all the snapshots that are stored in the snapshot
> table (all read-only) plus one current state described by the image
> header (read-write).

But is there any problem (in the format) with writing to the non-current 
state?  I can't think of one.

Regards,

Anthony Liguori

> Kevin
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:47                           ` Anthony Liguori
@ 2011-03-14 15:03                             ` Kevin Wolf
  2011-03-14 15:13                               ` Anthony Liguori
  2011-03-14 15:04                             ` Chunqiang Tang
  1 sibling, 1 reply; 87+ messages in thread
From: Kevin Wolf @ 2011-03-14 15:03 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, Stefan Hajnoczi, Markus Armbruster,
	Aurelien Jarno, qemu-devel

Am 14.03.2011 15:47, schrieb Anthony Liguori:
> On 03/14/2011 09:15 AM, Kevin Wolf wrote:
>>> The file system can keep a lot of these things around pretty easily but
>>> with your proposal, it seems like there can only be one.  If you support
>>> many of them, I think you'll degenerate to something as complex as a
>>> reference count table.
>> IIUC, he already uses a refcount table.
> 
> Well, he needs a separate mechanism to make trim/discard work, but for 
> the snapshot discussion, a reference count table is avoided.
> 
> The bitmap only covers whether the guest has accessed a block or not.  
> Then there is a separate table that maps guest offsets to offsets within 
> the file.
> 
> I haven't thought hard about it, but my guess is that there is an 
> ordering constraint between these two pieces of metadata which is why 
> the journal is necessary.  I get worried about the complexity of a 
> journal even more than a reference count table.

Honestly I think that a journal is a good idea that we'll want to
implement in the long run.

There are people who aren't really happy about the dirty flag + fsck
approach, and there are people who are concerned about cluster leaks
without fsck. Both problems should be solved with a journal.

Compared to other questions in the discussion, I think it's only a
nice-to-have addition, though.

>>   Actually, I think that a
>> refcount table is a requirement to provide the interesting properties
>> that internal snapshots have (see my other mail).
> 
> Well the trick here AFAICT is that you're basically storing external 
> snapshots internally.  So it's sort of like a bunch of FVD formats 
> embedded into a single image.

CQ, can you please clarify? From your description, Anthony seems to
understand something completely different than I do.

Are its characteristics more like qcow2's internal snapshots (which is
what I understand) or more like external snapshots (which is what
Anthony seems to understand)?

>> Refcount tables aren't a very complex thing either. In fact, it makes a
>> format much simpler to have one concept like refcount tables instead of
>> adding another different mechanism for each new feature that would be
>> natural with refcount tables.
> 
> I think it's a reasonable design goal to minimize any metadata updates 
> in the fast path.  If we can write 1 piece of metadata verses writing 2, 
> then it's worth exploring IMHO.
> 
>> The only problem with them is that they are metadata that must be
>> updated. However, I think we have discussed enough how to avoid the
>> greatest part of that cost.
> 
> Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid 
> the writes for the refcount table?

Protected by a dirty flag (and/or a journal), sure. I mean, wasn't that
the whole point of starting the qcow3 discussion?

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:47                           ` Anthony Liguori
  2011-03-14 15:03                             ` Kevin Wolf
@ 2011-03-14 15:04                             ` Chunqiang Tang
  2011-03-14 15:07                               ` Stefan Hajnoczi
  1 sibling, 1 reply; 87+ messages in thread
From: Chunqiang Tang @ 2011-03-14 15:04 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, Markus Armbruster, Aurelien Jarno,
	qemu-devel

> >> The file system can keep a lot of these things around pretty easily but
> >> with your proposal, it seems like there can only be one.  If you support
> >> many of them, I think you'll degenerate to something as complex as a
> >> reference count table.
> > IIUC, he already uses a refcount table.
> 
> Well, he needs a separate mechanism to make trim/discard work, but for 
> the snapshot discussion, a reference count table is avoided.

Kevin is right. FVD does have a refcount table. Sorry for causing 
confusion. I am going to send out a very detailed email which describes 
the operation steps in FVD, as Stefan requested.

> The bitmap only covers whether the guest has accessed a block or not. 
> Then there is a separate table that maps guest offsets to offsets within 
> the file.
> 
> I haven't thought hard about it, but my guess is that there is an 
> ordering constraint between these two pieces of metadata which is why 
> the journal is necessary.  I get worried about the complexity of a 
> journal even more than a reference count table.

No, the journal is not necessary. Actually, a very old version of FVD 
worked without a journal. The journal was later introduced as a 
performance enhancement. 

> Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid 
> the writes for the refcount table?

Yes, this is indeed achieved in FVD, with zero writes to the refcount 
table on the fast path. See details in the other email I am going to send 
out soon.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:49                               ` Anthony Liguori
@ 2011-03-14 15:05                                 ` Stefan Hajnoczi
  2011-03-14 15:08                                 ` Kevin Wolf
  1 sibling, 0 replies; 87+ messages in thread
From: Stefan Hajnoczi @ 2011-03-14 15:05 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Chunqiang Tang, Markus Armbruster, Aurelien Jarno,
	qemu-devel

On Mon, Mar 14, 2011 at 2:49 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 03/14/2011 09:21 AM, Kevin Wolf wrote:
>>
>> Am 14.03.2011 15:02, schrieb Anthony Liguori:
>>>
>>> On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>>>>>
>>>>> No, because the copy-on-write is another layer on top of the snapshot
>>>>> and AFAICT, they don't persist when moving between snapshots.
>>>>>
>>>>> The equivalent for external snapshots would be:
>>>>>
>>>>> base0 <- base1 <- base2 <- image
>>>>>
>>>>> And then if I wanted to move to base1 without destroying base2 and
>>>>> image, I could do:
>>>>>
>>>>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>>>>
>>>>> The file system can keep a lot of these things around pretty easily but
>>>>> with your proposal, it seems like there can only be one.  If you support
>>>>> many of them, I think you'll degenerate to something as complex as a
>>>>> reference count table.
>>>>>
>>>>> On the other hand, I think it's reasonable to just avoid the CoW overlay
>>>>> entirely and say that moving to a previous snapshot destroys any of its
>>>>> children.  I think this ends up being a simplifying assumption that is
>>>>> worth investigating further.
>>>>
>>>> No, both VMware and FVD have the same semantics as QCOW2. Moving to a
>>>> previous snapshot does not destroy any of its children. In the example I
>>>> gave (copied below),
>>>> it goes from
>>>>
>>>> Image: s1->s2->s3->s4->(current-state)
>>>>
>>>> back to snapshot s2, and now the state is
>>>>
>>>> Image: s1->s2->s3->s4
>>>>             |->(current-state)
>>>>
>>>> where all snapshots s1-s4 are kept. From there, it can take another
>>>> snapshot s5, and then further go back to snapshot s4, ending up with
>>>>
>>>> Image: s1->s2->s3->s4
>>>>             |->s5   |
>>>>                     |->   (current-state)
>>>
>>> Your use of "current-state" is confusing me because AFAICT,
>>> current-state is just semantically another snapshot.
>>>
>>> It's writable because it has no children.  You only keep around one
>>> writable snapshot and to make another snapshot writable, you have to
>>> discard the former.
>>>
>>> This is not the semantics of qcow2.  Every time you create a snapshot,
>>> it's essentially a new image.  You can write directly to it.
>>>
>>> While we don't do this today and I don't think we ever should, it's
>>> entirely possible to have two disks served simultaneously out of the
>>> same qcow2 file using snapshots.
>>
>> No, CQ is describing the semantics of internal snapshots in qcow2
>> correctly. You have all the snapshots that are stored in the snapshot
>> table (all read-only) plus one current state described by the image
>> header (read-write).
>
> But is there any problem (in the format) with writing to the non-current
> state?  I can't think of one.

Here is a problem: there is a single global refcount table in QCOW2.
You need to synchronize updates of the refcounts between multiple
writers to avoid introducing incorrect refcounts.

Stefan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 15:04                             ` Chunqiang Tang
@ 2011-03-14 15:07                               ` Stefan Hajnoczi
  0 siblings, 0 replies; 87+ messages in thread
From: Stefan Hajnoczi @ 2011-03-14 15:07 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Kevin Wolf, Aurelien Jarno, Markus Armbruster, qemu-devel

On Mon, Mar 14, 2011 at 3:04 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
>> >> The file system can keep a lot of these things around pretty easily
>> >> but with your proposal, it seems like there can only be one.  If you
>> >> support many of them, I think you'll degenerate to something as
>> >> complex as a reference count table.
>> > IIUC, he already uses a refcount table.
>>
>> Well, he needs a separate mechanism to make trim/discard work, but for
>> the snapshot discussion, a reference count table is avoided.
>
> Kevin is right. FVD does have a refcount table. Sorry for causing
> confusion. I am going to send out a very detailed email which describes
> the operation steps in FVD, as Stefan requested.
>
>> The bitmap only covers whether the guest has accessed a block or not.
>> Then there is a separate table that maps guest offsets to offsets within
>> the file.
>>
>> I haven't thought hard about it, but my guess is that there is an
>> ordering constraint between these two pieces of metadata which is why
>> the journal is necessary.  I get worried about the complexity of a
>> journal even more than a reference count table.
>
> No, the journal is not necessary. Actually, a very old version of FVD
> worked without a journal. The journal was later introduced as a
> performance enhancement.

I like the journal because it allows us to isolate metadata updates
into one specific area that can be scanned on image recovery.  If we
take the QED approach with the dirty bit then we have to scan all
L1/L2 tables.  The journal makes recovery more efficient than a full
consistency check.
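
As a rough illustration of why recovery stays cheap (the record format
and the names below are invented for the sketch, not FVD's actual
layout), replay is a single linear scan of the journal region that
patches the in-memory lookup table:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical journal record -- illustration only. */
struct journal_record {
    uint64_t vchunk;    /* guest chunk index */
    uint32_t chunk;     /* file chunk it now maps to */
    uint32_t checksum;  /* lets replay stop at a torn tail */
};

int record_valid(const struct journal_record *rec);  /* checksum test */

/* Recovery scans just the journal instead of all L1/L2 tables. */
void replay_journal(const struct journal_record *rec, size_t n,
                    uint32_t *lookup)
{
    for (size_t i = 0; i < n && record_valid(&rec[i]); i++) {
        lookup[rec[i].vchunk] = rec[i].chunk;
    }
}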

Stefan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:49                               ` Anthony Liguori
  2011-03-14 15:05                                 ` Stefan Hajnoczi
@ 2011-03-14 15:08                                 ` Kevin Wolf
  1 sibling, 0 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-03-14 15:08 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chunqiang Tang, Stefan Hajnoczi, Markus Armbruster,
	Aurelien Jarno, qemu-devel

Am 14.03.2011 15:49, schrieb Anthony Liguori:
> On 03/14/2011 09:21 AM, Kevin Wolf wrote:
>> Am 14.03.2011 15:02, schrieb Anthony Liguori:
>>> On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>>>>> No, because the copy-on-write is another layer on top of the snapshot
>>>>> and AFAICT, they don't persist when moving between snapshots.
>>>>>
>>>>> The equivalent for external snapshots would be:
>>>>>
>>>>> base0 <- base1 <- base2 <- image
>>>>>
>>>>> And then if I wanted to move to base1 without destroying base2 and
>>>>> image, I could do:
>>>>>
>>>>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>>>>
>>>>> The file system can keep a lot of these things around pretty easily but
>>>>> with your proposal, it seems like there can only be one.  If you support
>>>>> many of them, I think you'll degenerate to something as complex as a
>>>>> reference count table.
>>>>>
>>>>> On the other hand, I think it's reasonable to just avoid the CoW overlay
>>>>> entirely and say that moving to a previous snapshot destroys any of its
>>>>> children.  I think this ends up being a simplifying assumption that is
>>>>> worth investigating further.
>>>> No, both VMware and FVD have the same semantics as QCOW2. Moving to a
>>>> previous snapshot does not destroy any of its children. In the example I
>>>> gave (copied below),
>>>> it goes from
>>>>
>>>> Image: s1->s2->s3->s4->(current-state)
>>>>
>>>> back to snapshot s2, and now the state is
>>>>
>>>> Image: s1->s2->s3->s4
>>>>              |->(current-state)
>>>>
>>>> where all snapshots s1-s4 are kept. From there, it can take another
>>>> snapshot s5, and then further go back to snapshot s4, ending up with
>>>>
>>>> Image: s1->s2->s3->s4
>>>>              |->s5   |
>>>>                      |->   (current-state)
>>> Your use of "current-state" is confusing me because AFAICT,
>>> current-state is just semantically another snapshot.
>>>
>>> It's writable because it has no children.  You only keep around one
>>> writable snapshot and to make another snapshot writable, you have to
>>> discard the former.
>>>
>>> This is not the semantics of qcow2.  Every time you create a snapshot,
>>> it's essentially a new image.  You can write directly to it.
>>>
>>> While we don't do this today and I don't think we ever should, it's
>>> entirely possible to have two disks served simultaneously out of the
>>> same qcow2 file using snapshots.
>> No, CQ is describing the semantics of internal snapshots in qcow2
>> correctly. You have all the snapshots that are stored in the snapshot
>> table (all read-only) plus one current state described by the image
>> header (read-write).
> 
> But is there any problem (in the format) with writing to the non-current 
> state?  I can't think of one.

You would run into problems with the COW flag in the L2 tables. They are
only an optimization, though, so you could probably avoid using them and
directly look up the refcount table for each write, at the cost of
performance.

Anyway, I don't think there's a real use case for something like this.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 15:03                             ` Kevin Wolf
@ 2011-03-14 15:13                               ` Anthony Liguori
  0 siblings, 0 replies; 87+ messages in thread
From: Anthony Liguori @ 2011-03-14 15:13 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Chunqiang Tang, Stefan Hajnoczi, Markus Armbruster,
	Aurelien Jarno, qemu-devel

On 03/14/2011 10:03 AM, Kevin Wolf wrote:
>>> The only problem with them is that they are metadata that must be
>>> updated. However, I think we have discussed enough how to avoid the
>>> greatest part of that cost.
>> Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid
>> the writes for the refcount table?
> Protected by a dirty flag (and/or a journal), sure. I mean, wasn't that
> the whole point of starting the qcow3 discussion?

Okay, I thought you had something else in mind.

Regards,

Anthony Liguori

> Kevin
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 14:31                             ` Stefan Hajnoczi
@ 2011-03-14 16:32                               ` Chunqiang Tang
  2011-03-14 17:57                                 ` Kevin Wolf
       [not found]                               ` <OF7C2FDD40.E76A4E14-ON85257853.005ADD68-85257853.005AF16E@LocalDomain>
  1 sibling, 1 reply; 87+ messages in thread
From: Chunqiang Tang @ 2011-03-14 16:32 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Aurelien Jarno, Markus Armbruster, qemu-devel

> > FVD's novel use of the reference count table reduces the metadata
> > update overhead down to literally zero during normal execution of a VM.
> > This gets the best of QCOW2's reference count table but without its
> > overhead. In FVD, the reference count table is only updated when
> > creating a new snapshot or deleting an existing snapshot. The reference
> > count table is never updated during normal execution of a VM.
> 
> Do you want to send out a break-down of the steps (and cost) involved in 
> doing:
> 
> 1. Snapshot creation.
> 2. Snapshot deletion.
> 3. Opening an image with n snapshots.

Here is a detailed description. Relevant to the discussion of snapshots, 
FVD uses a one-level lookup table and a refcount table. FVD’s one-level 
lookup table is very similar to QCOW2’s two-level lookup table, except 
that it is much smaller in FVD, and is preallocated and hence contiguous 
in the image. 

FVD’s refcount table is almost identical to that of QCOW2, but with a key 
difference. An image consists of an arbitrary number of read-only 
snapshots, and a single writeable image front, which is the current image 
state perceived by the VM. Below, I will simply refer to the read-only 
snapshots as snapshots, and refer to the “writeable image front” as 
“writeable-front.” QCOW2’s refcount table counts clusters that are used by 
either read-only snapshots or writeable-front. Because writeable-front 
changes as the VM runs, QCOW2 needs to update the refcount table on the 
fast path of normal VM execution. 

By contrast, FVD’s refcount table only counts chunks that are used by 
read-only snapshots, and does not count chunks used by writeable-front. 
This is the key that allows FVD to entirely avoid updating the refcount 
table on the fast path of normal VM execution. 

Below are the detailed steps for different operations.

Operation 1: Open an image with n snapshots.

Let me introduce some basic concepts first. The storage allocation unit in 
FVD is called a chunk (like a cluster in QCOW2). The default chunk size is 
1MB, like that in VDI (VMDK and Microsoft VHD use 2MB chunks). An FVD 
image file is conceptually divided into chunks, where chunk 0 is the first 
1MB of the image file, chunk 1 is the second 1MB, … chunk j, … and so 
forth. The size of an image file grows as needed, just like that of QCOW2. 
The refcount table is a linear array “uint16_t refcount[]”. If a chunk j 
is referenced by s different snapshots, then refcount[j] = s. If a new 
snapshot is created and this new snapshot also uses chunk j, then 
refcount[j] is incremented to refcount[j] = s+1.
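
As a rough illustration (a sketch with made-up names, not FVD's actual 
code), the bookkeeping amounts to no more than this, and nothing on the 
fast path of VM execution ever touches the array:

#include <stdint.h>

#define NB_CHUNKS (1024 * 1024)      /* 1TB of snapshot data, 1MB chunks */

static uint16_t refcount[NB_CHUNKS]; /* the 2MB refcount table */

/* Invoked only at snapshot creation, once for every chunk that the
 * new snapshot's lookup table references. */
static void snapshot_ref_chunk(uint64_t j)
{
    refcount[j]++;
}

/* Invoked only at snapshot deletion: the symmetric decrement. */
static void snapshot_unref_chunk(uint64_t j)
{
    refcount[j]--;
}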

If all snapshots together use 1TB of storage space, there are 
1TB/1MB=1,000,000 chunks, and the size of the refcount table is 2MB. 
Loading the entire 2MB refcount table from disk into memory takes about 15 
milliseconds. If the virtual disk size perceived by the VM is also 1TB, 
FVD’s one-level lookup table is 4MB. FVD’s one-level lookup table serves 
the same purpose as QCOW2’s two-level lookup table, but FVD’s one-level 
table is much smaller and is preallocated and hence contiguous in the 
image. Loading the entire 4MB lookup table from disk into memory takes 
about 20 milliseconds. These numbers mean that it is quite affordable to 
scan the entire tables at VM boot time, although the scan can also be 
avoided in FVD. The optimizations will be described later.

When opening an image with n snapshots, an unoptimized version of FVD 
performs the following steps:

O1: Load the entire 2MB reference count table from disk into memory. This 
step takes about 15ms.

O2: Load the entire 4MB lookup table from disk into memory. This step 
takes about 20ms.

O3: Use the two tables to build an in-memory data structure called the 
“free-chunk-bitmap.” This step takes about 2ms. The free-chunk-bitmap 
identifies free chunks that are not used by either snapshots or 
writeable-front, and hence can be allocated for future writes. The size of 
the free-chunk-bitmap is only 125KB for a 1TB disk, and hence the memory 
overhead is negligible. The free-chunk-bitmap also supports trim 
operations. The free-chunk-bitmap does not have to be persisted on disk as 
it can always be rebuilt easily, although as an optimization it can be 
persisted on disk on VM shutdown.

O4: Compare the refcount table and the lookup table to identify chunks 
that appear in both tables (i.e., shared); a write by the running VM to 
such a chunk in writeable-front triggers copy-on-write. This step takes 
about 2ms. One bit in the lookup table’s entry is stolen to mark whether a 
chunk in writeable-front is shared with snapshots and hence needs 
copy-on-write upon a write.
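
For concreteness, steps O3 and O4 could look roughly like the sketch 
below (the entry encoding and the names are my own illustration, not the 
actual FVD on-disk format):

#include <stdint.h>
#include <string.h>

#define NB_CHUNKS        (1024 * 1024)
#define ENTRY_UNMAPPED   0xffffffffu   /* no chunk allocated yet */
#define ENTRY_COW_FLAG   0x80000000u   /* stolen bit: shared with snapshot */
#define ENTRY_CHUNK_MASK 0x7fffffffu   /* low bits: chunk number */

static void build_free_chunk_bitmap(const uint16_t *refcount,
                                    uint32_t *lookup, size_t nb_entries,
                                    uint8_t *free_bitmap /* NB_CHUNKS/8 */)
{
    size_t i;

    /* O3: a chunk starts out free iff no snapshot references it... */
    memset(free_bitmap, 0, NB_CHUNKS / 8);
    for (i = 0; i < NB_CHUNKS; i++) {
        if (refcount[i] == 0) {
            free_bitmap[i / 8] |= 1 << (i % 8);
        }
    }

    for (i = 0; i < nb_entries; i++) {
        if (lookup[i] == ENTRY_UNMAPPED) {
            continue;
        }
        uint32_t chunk = lookup[i] & ENTRY_CHUNK_MASK;
        /* ...and is not free if writeable-front maps it. */
        free_bitmap[chunk / 8] &= ~(1 << (chunk % 8));
        /* O4: chunk in both tables => mark it for copy-on-write. */
        if (refcount[chunk] > 0) {
            lookup[i] |= ENTRY_COW_FLAG;
        }
    }
}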

The whole process above, i.e., opening an image with n (e.g., n=1000) 
snapshots, takes about 39ms, and it is a one-time cost at VM boot. Later, 
I will describe an optimization that can further reduce this 39ms by 
saving the 125KB free-chunk-bitmap to disk on VM shutdown, but that 
optimization is more likely than not an over-engineering effort, given 
that 39ms at VM boot is perhaps not a big deal.

Once the image is opened, the refcount table is discarded and never 
updated during normal execution of the VM. This is how FVD gets the best 
of QCOW2’s refcount table but without its runtime overhead. When the VM 
issues a write to a chunk in writeable-front and this chunk is shared with 
snapshots (this is known by looking at the special marker bit in lookup 
table entries), FVD uses the free-chunk-bitmap to find a free chunk, 
performs copy-on-write, and updates the lookup table to point to that new 
chunk. This metadata change to the lookup table is persisted in FVD’s 
journal and is not written directly to the writeable-front’s on-disk 
lookup table. When the VM shuts down gracefully, the entire lookup table 
is written back to the writeable-front’s on-disk lookup table (this step 
takes about 20ms), and FVD’s image header is updated to indicate that 
there is no need to recover from the journal on the next VM boot.
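
A minimal sketch of that write path (the helpers and the journal call 
are assumptions made for illustration, not FVD's actual functions):

#include <stddef.h>
#include <stdint.h>

#define ENTRY_COW_FLAG   0x80000000u
#define ENTRY_CHUNK_MASK 0x7fffffffu

extern uint32_t lookup[];                    /* in-memory lookup table */
uint32_t alloc_free_chunk(void);             /* scans free-chunk-bitmap */
void copy_chunk(uint32_t from, uint32_t to); /* copy-on-write data move */
void journal_append_mapping(uint64_t vchunk, uint32_t chunk);
void write_data(uint32_t chunk, const void *buf, size_t len);

void fvd_write(uint64_t vchunk, const void *buf, size_t len)
{
    uint32_t entry = lookup[vchunk];

    if (entry & ENTRY_COW_FLAG) {
        /* Shared with a snapshot: redirect the write to a fresh chunk. */
        uint32_t new_chunk = alloc_free_chunk();
        copy_chunk(entry & ENTRY_CHUNK_MASK, new_chunk);
        lookup[vchunk] = new_chunk;               /* COW bit now clear */
        /* Persist the change to the journal, not the on-disk table. */
        journal_append_mapping(vchunk, new_chunk);
    }
    write_data(lookup[vchunk] & ENTRY_CHUNK_MASK, buf, len);
}

The point to notice is what is absent: there is no refcount table update 
anywhere on this path.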

Now let me describe an optimization that can further reduce the 39ms time 
needed for opening an image with n snapshots. When the VM shuts down 
gracefully, FVD can save the in-memory 125KB free-chunk-bitmap to disk. 
When the VM reboots, FVD can load the 125KB free-chunk-bitmap from disk 
and skip steps O1, O2, and O3. Since the lookup table is saved to disk on 
a clean shutdown and one bit in the table entries marks whether a chunk is 
shared with snapshots, the scanning step in O4 can also be avoided. In 
other words, steps O1-O4 can all be avoided on a clean shutdown. They are 
needed only after a host crash. During the recovery process after a host 
crash, the journal rebuilds the lookup table; the refcount table and the 
lookup table rebuild the free-chunk-bitmap.

This is a description of FVD's image open operation.

(…to be continued… I am running out of time now, and will write about 
FVD’s other snapshot operations in separate emails.) 

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang





^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 16:32                               ` Chunqiang Tang
@ 2011-03-14 17:57                                 ` Kevin Wolf
  2011-03-14 19:23                                   ` Chunqiang Tang
  0 siblings, 1 reply; 87+ messages in thread
From: Kevin Wolf @ 2011-03-14 17:57 UTC (permalink / raw)
  To: Chunqiang Tang
  Cc: Stefan Hajnoczi, Aurelien Jarno, Markus Armbruster, qemu-devel

Am 14.03.2011 17:32, schrieb Chunqiang Tang:
>>> FVD's novel use of the reference count table reduces the metadata
>>> update overhead down to literally zero during normal execution of a VM.
>>> This gets the best of QCOW2's reference count table but without its
>>> overhead. In FVD, the reference count table is only updated when
>>> creating a new snapshot or deleting an existing snapshot. The reference
>>> count table is never updated during normal execution of a VM.
>>
>> Do you want to send out a break-down of the steps (and cost) involved in 
>> doing:
>>
>> 1. Snapshot creation.
>> 2. Snapshot deletion.
>> 3. Opening an image with n snapshots.
> 
> Here is a detailed description. Relevant to the discussion of snapshots, 
> FVD uses a one-level lookup table and a refcount table. FVD’s one-level 
> lookup table is very similar to QCOW2’s two-level lookup table, except 
> that it is much smaller in FVD, and is preallocated and hence contiguous 
> in the image.

Does this mean that FVD can't hold VM state of arbitrary size?

> FVD’s refcount table is almost identical to that of QCOW2, but with a key 
> difference. An image consists of an arbitrary number of read-only 
> snapshots, and a single writeable image front, which is the current image 
> state perceived by the VM. Below, I will simply refer to the read-only 
> snapshots as snapshots, and refer to the “writeable image front” as 
> “writeable-front.” QCOW2’s refcount table counts clusters that are used by 
> either read-only snapshots or writeable-front. Because writeable-front 
> changes as the VM runs, QCOW2 needs to update the refcount table on the 
> fast path of normal VM execution. 

Needs to update, but not necessarily on the fast path. Updates can be
delayed and batched.

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 17:57                                 ` Kevin Wolf
@ 2011-03-14 19:23                                   ` Chunqiang Tang
  2011-03-14 20:16                                     ` Kevin Wolf
  0 siblings, 1 reply; 87+ messages in thread
From: Chunqiang Tang @ 2011-03-14 19:23 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, Aurelien Jarno, Markus Armbruster, qemu-devel

> > Here is a detailed description. Relevant to the discussion of 
> > snapshots, FVD uses a one-level lookup table and a refcount table. 
> > FVD’s one-level lookup table is very similar to QCOW2’s two-level 
> > lookup table, except that it is much smaller in FVD, and is 
> > preallocated and hence contiguous in the image.
> 
> Does this mean that FVD can't hold VM state of arbitrary size?

No, FVD can hold VM state of an arbitrary size. Unlike QCOW2, FVD does not 
store the index of the VM state as part of the one-level lookup table. FVD 
could have done so, and then relocated the one-level lookup table in order 
to grow it in size (growing FVD's lookup table through relocation is 
supported, e.g., in order to resize an image to a larger size), but that's 
not an ideal solution. Instead, in FVD, each snapshot has two fields, 
vm_state_offset and vm_state_space_size, which directly point to where the 
VM state is stored, and vm_state_space_size can be arbitrary. BTW, I 
observe "uint32_t QEMUSnapshotInfo.vm_state_size". Does this mean that a 
VM state cannot be larger than 4GB? This seems to be a limitation of QEMU. 
FVD instead uses "uint64_t vm_state_space_size" in the image format, in 
case the size of QEMUSnapshotInfo.vm_state_size is increased in the 
future.
 
> > FVD’s refcount table is almost identical to that of QCOW2, but with a 
> > key difference. An image consists of an arbitrary number of read-only 
> > snapshots, and a single writeable image front, which is the current 
> > image state perceived by the VM. Below, I will simply refer to the 
> > read-only snapshots as snapshots, and refer to the “writeable image 
> > front” as “writeable-front.” QCOW2’s refcount table counts clusters 
> > that are used by either read-only snapshots or writeable-front. Because 
> > writeable-front changes as the VM runs, QCOW2 needs to update the 
> > refcount table on the fast path of normal VM execution. 
> 
> Needs to update, but not necessarily on the fast path. Updates can be
> delayed and batched.

Probably this has been discussed extensively before (as you mentioned in 
some previous emails), but I missed the discussion and still have a naive 
question. Is delaying and batching possible for "wce=0", i.e., 
cache=writethrough? 

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
  2011-03-14 19:23                                   ` Chunqiang Tang
@ 2011-03-14 20:16                                     ` Kevin Wolf
  0 siblings, 0 replies; 87+ messages in thread
From: Kevin Wolf @ 2011-03-14 20:16 UTC (permalink / raw)
  To: Chunqiang Tang
  Cc: Stefan Hajnoczi, Aurelien Jarno, Markus Armbruster, qemu-devel

Am 14.03.2011 20:23, schrieb Chunqiang Tang:
>>> Here is a detailed description. Relevant to the discussion of 
>>> snapshots, FVD uses a one-level lookup table and a refcount table. 
>>> FVD’s one-level lookup table is very similar to QCOW2’s two-level 
>>> lookup table, except that it is much smaller in FVD, and is 
>>> preallocated and hence contiguous in the image.
>>
>> Does this mean that FVD can't hold VM state of arbitrary size?
> 
> No, FVD can hold VM state of an arbitrary size. Unlike QCOW2, FVD does not 
> store the index of the VM state as part of the one-level lookup table. FVD 
> could have done so, and then relocated the one-level lookup table in order 
> to grow it in size (growing FVD's lookup table through relocation is 
> supported, e.g., in order to resize an image to a larger size), but that's 
> not an ideal solution. Instead, in FVD, each snapshot has two fields, 
> vm_state_offset and vm_state_space_size, which directly point to where the 
> VM state is stored, and vm_state_space_size can be arbitrary. 

Okay, makes sense.

> BTW, I 
> observe "uint32_t QEMUSnapshotInfo.vm_state_size". Does this mean that a 
> VM state cannot be larger than 4GB? This seems to be a limitation of QEMU. 
> FVD instead uses "uint64_t vm_state_space_size" in the image format, in 
> case that the size of QEMUSnapshotInfo.vm_state_size is increased in the 
> future.

Yeah, that was a stupid decision, it definitely should be 64 bit.

>> Needs to update, but not necessarily on the fast path. Updates can be
>> delayed and batched.
> 
> Probably this has been discussed extensively before (as you mentioned in 
> some previous emails), but I missed the discussion and still have a naive 
> question. Is delaying and batching possible for "wce=0", i.e., 
> cache=writethrough? 

It's possible with QED's approach: You set a dirty flag in the image
header, and while this flag is set you don't have to care about
consistent refcount tables. Only when you clear the flag must you flush
the refcount cache to the image file.

If qemu crashes, you see the dirty flag and you know that you have an
image with stale refcounts. In this case you must do a metadata scan to
rebuild the refcount table from the L2 tables (or just replay the
journal if you have one).
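
Roughly like this (a sketch with invented names, just to make the
ordering explicit):

#include <stdint.h>

#define FLAG_DIRTY ((uint64_t)1)

struct image_header { uint64_t flags; };
struct image { struct image_header header; };

void rebuild_refcounts_from_l2(struct image *img); /* or replay journal */
void flush_refcount_cache(struct image *img);
void write_header(struct image *img);

void image_open(struct image *img)
{
    if (img->header.flags & FLAG_DIRTY) {
        /* Crashed while dirty: refcounts may be stale, rebuild them. */
        rebuild_refcounts_from_l2(img);
    }
    /* Set the flag once, before the first allocation, not per write. */
    img->header.flags |= FLAG_DIRTY;
    write_header(img);
}

void image_clean_shutdown(struct image *img)
{
    flush_refcount_cache(img);        /* make refcounts consistent */
    img->header.flags &= ~FLAG_DIRTY; /* only now clear the flag */
    write_header(img);
}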

Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] Re: Strategic decision: COW format
       [not found]                               ` <OF7C2FDD40.E76A4E14-ON85257853.005ADD68-85257853.005AF16E@LocalDomain>
@ 2011-03-14 21:32                                 ` Chunqiang Tang
  0 siblings, 0 replies; 87+ messages in thread
From: Chunqiang Tang @ 2011-03-14 21:32 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Aurelien Jarno, Markus Armbruster, qemu-devel

> > Do you want to send out a break-down of the steps (and cost) involved 
> > in doing:
> > 
> > 1. Snapshot creation.
> > 2. Snapshot deletion.
> > 3. Opening an image with n snapshots.
> 
> [... my full description of the image open operation snipped -- see my 
> previous email in this thread ...]
> 
> (…to be continued… I am running out of time now, and will write about 
> FVD’s other snapshot operations in separate emails.) 

Let me continue my writeup and describe FVD's other snapshot-related 
operations.

Operation 2: Snapshot creation.

FVD performs the following operations when creating a new snapshot.

C1: Create a new snapshot by saving a copy of FVD's bitmap and lookup 
table to a new location on disk. FVD's lookup table for a 1TB disk is 4MB 
and the bitmap is even smaller. This step takes about 30ms. 

C2: Load the 2MB refcount table from disk to memory. This step takes about 
15ms.

C3: Use the writeable-front's 4MB lookup table to update the in-memory 2MB 
refcount table. For a chunk j that exists in the lookup table, increment 
the refcount table by doing refcount[j]++. This step is an in-memory 
operation and takes about 2ms.

C4: Write the new 2MB refcount table to disk. This step takes about 15ms.

C5: Write the new list of snapshots to disk. This step takes about 10ms.

The snapshot creation process above takes about 72ms. Recall that on the 
fast path during the VM's normal execution, FVD never updates the on-disk 
refcount table and does not even keep the refcount table in memory. 
Conceptually, updating the refcount table is shifted from the fast path of 
VM execution to steps C2, C3, and C4 of the snapshot creation process. 
From another perspective, this also allows FVD to batch all updates to the 
refcount table and do them in a single, efficient, sequential write at the 
time of snapshot creation. This is the trick that makes internal snapshots 
based on the refcount table so efficient in FVD. 
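
A sketch of steps C2-C4 (the I/O helpers and names are assumed for 
illustration, not the actual FVD code):

#include <stddef.h>
#include <stdint.h>

#define ENTRY_UNMAPPED   0xffffffffu
#define ENTRY_CHUNK_MASK 0x7fffffffu

void load_refcount_table(uint16_t *refcount);        /* C2: ~15ms */
void write_refcount_table(const uint16_t *refcount); /* C4: ~15ms */

void snapshot_create_refcounts(uint16_t *refcount,
                               const uint32_t *lookup, size_t nb_entries)
{
    load_refcount_table(refcount);                   /* C2 */
    for (size_t i = 0; i < nb_entries; i++) {        /* C3: in memory */
        if (lookup[i] != ENTRY_UNMAPPED) {
            refcount[lookup[i] & ENTRY_CHUNK_MASK]++;
        }
    }
    write_refcount_table(refcount);                  /* C4: one batched,
                                                        sequential write */
}

Snapshot deletion (Operation 3 below) is the symmetric pass that 
decrements instead of increments, driven by the deleted snapshot's 
lookup table.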

Operation 3: Snapshot deletion.

FVD performs the following operations when deleting a snapshot. Suppose 
snapshot X is to be deleted.

D1: Load the 2MB refcount table from disk to memory. This step takes about 
15ms.

D2: Load snapshot X's 4MB lookup table from disk. This step takes about 
20ms.

D3: Use snapshot X's 4MB lookup table to update the in-memory 2MB refcount 
table. For a chunk j that exists in snapshot X's lookup table, decrement 
the refcount table by doing refcount[j]--. This step is an in-memory 
operation and takes about 2ms.

D4: Write the new 2MB refcount table to disk. This step takes about 15ms.

D5: Write the new list of snapshots to disk. This step takes about 10ms.

The snapshot deletion process above takes about 62ms. 

Operation 4: Go to a snapshot, i.e., derive a new writeable-front based on 
a snapshot.

FVD performs the following operations when going to a snapshot X.

G1: Load snapshot X's bitmap and lookup table from disk into memory, and 
save a new copy on disk, which will be the on-disk metadata for the new 
writeable-front. FVD's lookup table for a 1TB disk is 4MB and the bitmap 
is even smaller. This step takes about 60ms. 

G2: Update the FVD image's header to point to the new bitmap and the new 
lookup table. This step takes 7ms.

G3: Follow the image open operation, which takes about 39ms in total.

The goto snapshot process above takes about 106ms. Note that when going to 
a snapshot, FVD need not update the refcount table, because the refcount 
table only counts read-only snapshots, which do not change during a goto 
snapshot operation. 

In summary, snapshot creation, snapshot deletion, and snapshot goto are 
all fast operations in FVD, taking 106ms or less. Most importantly, on the 
fast path during the normal execution of the VM, FVD never updates the 
on-disk refcount table and does not even keep the refcount table in 
memory. The only overhead in FVD is the 125KB free-chunk-bitmap that must 
be kept in memory, but the free-chunk-bitmap is already needed for 
supporting trim and for recovering leaked storage space after a host 
crash, even if we do not introduce the snapshot feature into FVD. In other 
words, FVD gets all the benefits of QCOW2's internal snapshot function 
without paying any of its overhead. I truly believe this is the ideal 
solution for snapshots, going beyond the current state of the art in QCOW2 
and VMware. 

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang


^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2011-03-14 21:32 UTC | newest]

Thread overview: 87+ messages
     [not found] <OF3C9DAE9F.EC6B5878-ON85257826.00715C10-85257826.007A14FB@LocalDomain>
2011-02-15 19:45 ` [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED Chunqiang Tang
2011-02-16 12:34   ` Kevin Wolf
2011-02-17 16:04     ` Chunqiang Tang
2011-02-18  9:12     ` Strategic decision: COW format (was: [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED) Markus Armbruster
2011-02-18  9:57       ` [Qemu-devel] Re: Strategic decision: COW format Kevin Wolf
2011-02-18 14:20         ` Anthony Liguori
2011-02-22  8:37           ` Markus Armbruster
2011-02-22  8:56             ` Kevin Wolf
2011-02-22 10:21               ` Markus Armbruster
2011-02-22 15:57               ` Anthony Liguori
2011-02-22 16:15                 ` Kevin Wolf
2011-02-22 18:18                   ` Anthony Liguori
2011-02-23  9:13                     ` Kevin Wolf
2011-02-23 14:21                       ` Anthony Liguori
2011-02-23 14:55                         ` Kevin Wolf
2011-02-23 13:43               ` Avi Kivity
2011-02-23 14:23                 ` Anthony Liguori
2011-02-23 14:38                   ` Kevin Wolf
2011-02-23 15:29                     ` Anthony Liguori
2011-02-23 15:36                       ` Avi Kivity
2011-02-23 15:47                         ` Anthony Liguori
2011-02-23 15:59                           ` Avi Kivity
2011-02-23 15:54                       ` Kevin Wolf
2011-02-23 15:23                   ` Avi Kivity
2011-02-23 15:31                     ` Anthony Liguori
2011-02-23 15:37                       ` Avi Kivity
2011-02-23 15:50                         ` Anthony Liguori
2011-02-23 16:03                           ` Avi Kivity
2011-02-23 16:04                             ` Anthony Liguori
2011-02-23 16:15                               ` Kevin Wolf
2011-02-25 11:20                             ` Pavel Dovgaluk
     [not found]                             ` <-1737654525499315352@unknownmsgid>
2011-02-25 13:22                               ` Stefan Hajnoczi
2011-02-23 15:52                         ` Anthony Liguori
2011-02-23 15:59                           ` Gleb Natapov
2011-02-23 16:00                           ` Avi Kivity
2011-02-23 15:33                     ` Daniel P. Berrange
2011-02-23 15:38                       ` Avi Kivity
2011-02-18 17:43         ` Stefan Weil
2011-02-18 19:11           ` Kevin Wolf
2011-02-18 19:47             ` Anthony Liguori
2011-02-18 20:49               ` Kevin Wolf
2011-02-18 20:50                 ` Anthony Liguori
2011-02-18 21:27                   ` Kevin Wolf
2011-02-19 17:19             ` Stefan Hajnoczi
2011-02-18 20:31           ` Anthony Liguori
2011-02-19 12:27           ` [Qemu-devel] Bugs in the VDI Block Device Driver Chunqiang Tang
2011-02-19 16:21             ` Stefan Hajnoczi
2011-02-19 18:49               ` Stefan Weil
2011-02-20 22:13         ` [Qemu-devel] Re: Strategic decision: COW format Aurelien Jarno
2011-02-21  8:59           ` Kevin Wolf
2011-02-21 13:44             ` Stefan Hajnoczi
2011-02-21 14:10               ` Kevin Wolf
2011-02-21 15:16                 ` Anthony Liguori
2011-02-21 15:26                   ` Kevin Wolf
2011-02-23  3:32               ` Chunqiang Tang
2011-02-23 13:20                 ` Markus Armbruster
     [not found]               ` <OFAEB4CD91.BE989F29-ON8525783F.007366B8-85257840.00130B47@LocalDomain>
2011-03-13  5:51                 ` Chunqiang Tang
2011-03-13 17:48                   ` Anthony Liguori
2011-03-14  2:28                     ` Chunqiang Tang
2011-03-14 13:22                       ` Anthony Liguori
2011-03-14 13:53                         ` Chunqiang Tang
2011-03-14 14:02                           ` Anthony Liguori
2011-03-14 14:21                             ` Kevin Wolf
2011-03-14 14:35                               ` Chunqiang Tang
2011-03-14 14:49                               ` Anthony Liguori
2011-03-14 15:05                                 ` Stefan Hajnoczi
2011-03-14 15:08                                 ` Kevin Wolf
2011-03-14 14:26                           ` Stefan Hajnoczi
2011-03-14 14:30                             ` Chunqiang Tang
2011-03-14 14:15                         ` Kevin Wolf
2011-03-14 14:25                           ` Chunqiang Tang
2011-03-14 14:31                             ` Stefan Hajnoczi
2011-03-14 16:32                               ` Chunqiang Tang
2011-03-14 17:57                                 ` Kevin Wolf
2011-03-14 19:23                                   ` Chunqiang Tang
2011-03-14 20:16                                     ` Kevin Wolf
     [not found]                               ` <OF7C2FDD40.E76A4E14-ON85257853.005ADD68-85257853.005AF16E@LocalDomain>
2011-03-14 21:32                                 ` Chunqiang Tang
2011-03-14 14:34                             ` Kevin Wolf
2011-03-14 14:47                           ` Anthony Liguori
2011-03-14 15:03                             ` Kevin Wolf
2011-03-14 15:13                               ` Anthony Liguori
2011-03-14 15:04                             ` Chunqiang Tang
2011-03-14 15:07                               ` Stefan Hajnoczi
2011-03-14 10:12                   ` Kevin Wolf
2011-02-22  8:40           ` Markus Armbruster
2011-02-16 13:21   ` [Qemu-devel] Re: Comparing New Image Formats: FVD vs. QED Stefan Hajnoczi
2011-02-17 16:04     ` Chunqiang Tang
