* [Qemu-devel] disk image: self-organized format or raw file
@ 2014-08-11 23:38 吴兴博
  2014-08-12  0:52 ` Fam Zheng
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: 吴兴博 @ 2014-08-11 23:38 UTC (permalink / raw)
  To: qemu-devel

Hello,

  The introduction on the wiki page presents several advantages of qcow2
[1], but I'm a little confused. I would really appreciate it if anyone could
give me some help on this :).

 (1) Currently the raw format doesn't support COW. In other words, a raw
image cannot have a backing file. COW depends on a mapping table that
records whether each block/cluster is present (has been modified) in
the current image file. Modern filesystems like xfs/ext4/etc. provide
extent/block allocation information to user level, as 'filefrag' shows
with the 'FIBMAP' and 'FIEMAP' ioctls. I guess the raw file driver (maybe
block/raw-posix.c) could obtain correct 'present' information about blocks
this way, although the information may be limited to the granularity of the
file allocation unit size. Maybe it's just because a raw file has no space
to store the "backing file name"? I don't think this should hinder such a
useful feature.

 (2) As most popular filesystems support delayed allocation/on-demand
allocation/holes, a raw image is thin provisioned just like other
formats. It doesn't consume much disk space by storing useless zeros.
However, I don't know whether fragmented extents could become a burden
on the host filesystem.

 (3) For compression and encryption, I'm not an expert on these topics at
all, but I think these features may not be vital to an image format, as both
the guest's and the host's filesystem can provide similar functionality.

 (4) I don't have much understanding of how snapshots work, but I think
that in theory they need no techniques beyond those used for
COW and backing files.

After all these thoughts, I still find no reason not to use a 'raw' file
image (engineering effort in QEMU should not count, as we wouldn't ask for
more features from the outside world).
I would be very sorry if my ignorance wasted your time.



references:
[1] http://en.wikibooks.org/wiki/QEMU/Images#Image_types
"QEMU supports several image types. The "native" and most flexible type is
*qcow2*, which supports copy on write
<http://en.wikibooks.org/wiki/QEMU/Images#Copy_on_write>, encryption,
compression, and VM snapshots."



Cheers!
       吴兴博  Wu, Xingbo <wuxb45@gmail.com>


* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-11 23:38 [Qemu-devel] disk image: self-organized format or raw file 吴兴博
@ 2014-08-12  0:52 ` Fam Zheng
  2014-08-12 10:46   ` 吴兴博
  2014-08-12 13:08   ` Kirill Batuzov
  2014-08-12 13:23 ` Eric Blake
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 29+ messages in thread
From: Fam Zheng @ 2014-08-12  0:52 UTC (permalink / raw)
  To: 吴兴博; +Cc: qemu-devel

On Mon, 08/11 19:38, 吴兴博 wrote:
> Hello,
> 
>   The introduction in the wiki page present several advantages of qcow2
> [1]. But I'm a little confused. I really appreciate if any one can give me
> some help on this :).
> 
>  (1) Currently the raw format doesn't support COW. In other words, a raw
> image cannot have a backing file. COW depends on the mapping table on which
> we it knows whether each block/cluster is present (has been modified) in
> the current image file. Modern file-systems like xfs/ext4/etc. provide
> extent/block allocation information to user-level. Like what 'filefrag'
> does with ioctl 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe
> block/raw-posix.c) may obtain correct 'present information about blocks.
> However this information may be limited to be aligned with file allocation
> unit size. Maybe it's just because a raw file has no space to store the
> "backing file name"? I don't think this could hinder the useful feature.
> 
>  (2) As most popular filesystems support delay-allocation/on-demand
> allocation/holes, whatever, a raw image is also thin provisioned as other
> formats. It doesn't consume much disk space by storing useless zeros.
> However, I don't know if there is any concern on whether fragmented extents
> would become a burden of the host filesystem.
> 
>  (3) For compression and encryption, I'm not an export on these topics at
> all but I think these features may not be vital to a image format as both
> guest/host's filesystem can also provide similar functionality.
> 
>  (4) I don't have too much understanding on how snapshot works but I think
> theoretically it would be using the techniques no more than that used in
> COW and backing file.
> 
> After all these thoughts, I still found no reason to not using a 'raw' file
> image (engineering efforts in Qemu should not count as we don't ask  for
> more features from outside world).
> I would be very sorry if my ignorance wasted your time.

Hi! I think what you described is theoretically possible, but I'm not so
positive about this feature. What would be the advantages, compared to qcow2?

My major concern is the transparency of filesystem holes: users normally
can't tell whether a "hole" is really zeroes or unallocated, which would
make data loss more likely. The user may expect scp (1) or cp (1) to work on
an image file, just as always, but these tools can legitimately fill the hole
with actual zeroes if the target filesystem does not support holes.
That's too dangerous, and totally out of QEMU's control.
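
A minimal sketch of that failure mode, using an in-memory model rather than
real files. The names read_overlay and naive_copy and the 4-byte cluster size
are assumptions made up for illustration; None stands for a filesystem hole
that the hypothetical raw-with-backing-file driver would treat as "not
present, read from the backing file".

```python
CLUSTER = 4  # bytes per cluster, kept tiny for illustration


def read_overlay(overlay, backing):
    """Read through a hypothetical hole-as-metadata overlay.

    A cluster stored as None models a filesystem hole ("unallocated,
    consult the backing file"); a bytes value models written data.
    """
    return b"".join(
        backing[i] if cluster is None else cluster
        for i, cluster in enumerate(overlay)
    )


def naive_copy(overlay):
    """Model a copy tool that reads holes as zeroes and writes them out."""
    return [bytes(CLUSTER) if c is None else c for c in overlay]


backing = [b"AAAA", b"BBBB", b"CCCC"]
overlay = [None, b"xxxx", None]          # only cluster 1 was modified

before = read_overlay(overlay, backing)
after = read_overlay(naive_copy(overlay), backing)
print(before)  # b'AAAAxxxxCCCC'
print(after)   # b'\x00\x00\x00\x00xxxx\x00\x00\x00\x00'
```

The copy is byte-for-byte identical when read as a plain file, yet the guest
would now see zeroes where the backing file's data used to show through.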

Fam


* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12  0:52 ` Fam Zheng
@ 2014-08-12 10:46   ` 吴兴博
  2014-08-12 11:19     ` Fam Zheng
  2014-08-12 13:08   ` Kirill Batuzov
  1 sibling, 1 reply; 29+ messages in thread
From: 吴兴博 @ 2014-08-12 10:46 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel

Hi Fam,
  Glad to hear from you.
This post says that "All files systems that support inodes
(ext2/3/4, xfs, btfs, etc) support files with holes while creating the
files..."
[
http://serverfault.com/questions/558761/best-linux-filesystem-for-sparse-files
]

I have also heard this claim from other sources; the only "popular"
filesystems that don't support holes in the real world are the old FAT32
and other FAT* variants.
Note that holes appear when a sparse file is created on an
inode-based filesystem. "Punching holes", by contrast, removes existing
contents from a file, and was only recently added, to xfs/ext4 only, in
newer Linux kernels.
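
As a rough illustration of the first kind of hole, the sketch here creates a
sparse file and compares its apparent size with the space the filesystem
reports as allocated. The helper names make_sparse and sizes are assumptions;
the allocated figure depends entirely on the host filesystem, so no particular
value is promised.

```python
import os
import tempfile


def make_sparse(path, size):
    """Create a file of the given apparent size without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size)


def sizes(path):
    """Return (apparent size, bytes reported as allocated) for a file."""
    st = os.stat(path)
    # POSIX defines st_blocks in units of 512 bytes.
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as d:
    image = os.path.join(d, "disk.raw")
    make_sparse(image, 64 * 1024 * 1024)     # 64 MiB apparent size
    apparent, allocated = sizes(image)
    print(apparent, allocated)
```

On a filesystem with hole support the second number is typically far smaller
than the first; on one without, the two are roughly equal.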

In QEMU's disk image, a hole delivers a clear message: the corresponding
sectors/blocks/clusters have never been written. So it's up to the guest
whether to initialize the sectors to zero or just ignore them (filesystems
are never confused by an uninitialized sector, right?). Filesystems should
ignore uninitialized data simply because it's meaningless. Once written, the
data remains meaningful to the guest.

"Punching holes" would add "DISCARD" support for an image, which could then
behave like an SSD. Otherwise the image behaves like a magnetic disk.

Your message may not be entirely accurate:
* cp has a --sparse option to read and create sparse files.
* Sadly, scp doesn't support sparse files.
* rsync also has a -S/--sparse option to handle sparse files properly.

Only recently did I realize that holes are supported in
*almost* all filesystems. That's why I came up with this idea.
I understand your concern about hole support. Is it just because
the "hole" was never standardized in POSIX or elsewhere?

So now I have one clear reason: holes are not guaranteed by any filesystem
standard (I guess POSIX would be enough).
Is there something else? If that's the only reason not to use a sparse raw
file as an image, and the only impediment is the no-one-should-ever-use FAT32
or, say, POSIX, we may be very close to moving one step forward.





Cheers!
       吴兴博  Wu, Xingbo <wuxb45@gmail.com>



* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 10:46   ` 吴兴博
@ 2014-08-12 11:19     ` Fam Zheng
       [not found]       ` <CABPa+v1a7meoEtjLkwygjuZEABTqd8q3efGWJvAsAr-mLTQb-A@mail.gmail.com>
  0 siblings, 1 reply; 29+ messages in thread
From: Fam Zheng @ 2014-08-12 11:19 UTC (permalink / raw)
  To: 吴兴博; +Cc: qemu-devel

On Tue, 08/12 06:46, 吴兴博 wrote:
> Hi Fam,
>   It's glad to hear you,
> It is said in this post that "All files systems that support inodes
> (ext2/3/4, xfs, btfs, etc) support files with holes while creating the
> files..."
> [
> http://serverfault.com/questions/558761/best-linux-filesystem-for-sparse-files
> ]
> 
> I also heard this claim from other sources, and the only "popular"
> filesystems who don't support holes in real world are just the old FAT32
> and other FAT*.
> Note that holes appear in filesystems when creating a sparse file in
> inode-filesystems. While "punching holes" does remove the existent contents
> from the file, and it was  newly added to only xfs/ext4 in newer linux
> kernel.
> 
> In qemu's disk image, a hole delivers clear message---the corresponding
> sectors/blocks/clusters are never written. So it's up to the guest whether
> to initialize the sectors to zero or just ignore them (filesystems never
> confuse with a uninitialized sector right?). Filesystems should ignore
> uninitialized data just because it's meaningless. Once written, the data
> would be ever meaningful to the guest.
> 
> "punching holes" would add support for "DISCARD" for a image which could
> behave like a SSD. Otherwise the image behaves like a magnetic disk.
> 
> The message in below would not be accurate:
> * cp has --sparse option to support read and create sparse files.
> * Sadly scp doesn't support sparse files.
> * rsync also has a -S --sparse option to properly handle sparse files.
> 
> Not until recently did I realize that the hole is just widely supported in
> *almost* all filesystems. That's why I have come up this idea.
> I understand your concern about the support of hole. If this just because
> the "hole" is never standardized as POSIX or something else?
> 
> So now I get one clear reason: hole is not guaranteed by standardized
> filesystems (I guess a POSIX would be enough).
> Is their something else? If it's the only reason of not using a sparse raw
> file as image, and the only impediment is no-one-should-ever-use FAT32 or
> say the POSIX, we may be very close to  move one step forward.
> 

The problem is that cp wouldn't maintain the correctness of a copied
raw-with-holes image, whereas cp does maintain the correctness of every other
thin image type, which carries explicit cluster allocation info.

We can't overcome that, unless we tell users "never use `cp' to copy the image,
it will break your data, you have to use `qemu-img convert'". That's
counterintuitive and a step back.

Fam


* Re: [Qemu-devel] disk image: self-organized format or raw file
       [not found]         ` <20140812113916.GB2803@T430.redhat.com>
@ 2014-08-12 12:03           ` 吴兴博
  2014-08-12 12:21             ` Fam Zheng
  0 siblings, 1 reply; 29+ messages in thread
From: 吴兴博 @ 2014-08-12 12:03 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel

I read your reply and thought about it carefully. I'm sorry that
when I said "I get it" I actually meant "I believe you" rather than "I
understand it".
The problem does not come from cp or rsync -- it's not their fault. They
just have no way to get it right.
The real reason is that filesystems have different allocation
unit sizes.

For example, take a file with an apparent size of 16KB in which 4KB-12KB is a
hole (0KB-4KB and 12KB-16KB hold valid data).
The filesystem holding it has a 4KB block size, so it *could* be allocated
like this. Copying this file to a filesystem with a 16KB block size would
cause the entire 16KB to be filled with data; specifically, the hole is
filled with zeroes and cp/rsync have NO way to tell the difference.
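
The arithmetic of this example can be sketched as follows. The function
allocated_blocks is a hypothetical helper, not anything a filesystem exposes;
it only assumes that any block touched by data must be allocated in full.

```python
def allocated_blocks(data_ranges, file_size, block_size):
    """Return the set of block indices that must be allocated.

    data_ranges is a list of (start, end) byte offsets holding real data,
    with end exclusive. Any block touched by data is allocated in full.
    """
    blocks = set()
    for start, end in data_ranges:
        end = min(end, file_size)
        if start < end:
            first = start // block_size
            last = (end - 1) // block_size
            blocks.update(range(first, last + 1))
    return blocks


KB = 1024
data = [(0, 4 * KB), (12 * KB, 16 * KB)]     # hole at 4KB-12KB

small = allocated_blocks(data, 16 * KB, 4 * KB)
large = allocated_blocks(data, 16 * KB, 16 * KB)
print(sorted(small))  # [0, 3]
print(sorted(large))  # [0]
```

With 4KB blocks, blocks 1 and 2 can stay unallocated and the hole survives.
With 16KB blocks, the single block holds both data ranges, so the hole has
nowhere to live and its bytes are stored as zeroes.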

That's not an engineering issue in cp/rsync. It's a real issue caused by the
fact that (most) filesystems have a configurable block size.

Is that correct?
I would really appreciate your confirmation.


Cheers!
       吴兴博  Wu, Xingbo <wuxb45@gmail.com>


On Tue, Aug 12, 2014 at 7:39 AM, Fam Zheng <famz@redhat.com> wrote:

> On Tue, 08/12 07:22, 吴兴博 wrote:
> > Thanks, I get it.
> > Does rsync have exactly the same problem?
>
> Yes.
>
> Fam
>


* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 12:03           ` 吴兴博
@ 2014-08-12 12:21             ` Fam Zheng
  0 siblings, 0 replies; 29+ messages in thread
From: Fam Zheng @ 2014-08-12 12:21 UTC (permalink / raw)
  To: 吴兴博; +Cc: qemu-devel

On Tue, 08/12 08:03, 吴兴博 wrote:
> I carefully read your reply and thought of it carefully. I'm sorry that
> when I said "I get it" I actually meant "I believe you" but not "I
> understand it".
> The problem would not come from cp or rsync -- It's not their fault. They
> just have no way to make it right.
> The real reason of it would be that filesystems have different allocation
> unit size.
> 
> For example, a file is of 16KB in appearance, and the 4KB-12KB of it is a
> hole (0KB-4KB and 12KB-16KB has valid data).
> The FS held it has 4KB block size, so it *could* be allocated like this.
> Copying this file to a filesystem of 16KB block size would cause the entire
> 16KB filled with data, to be specific, the hole is filled with zero and
> cp/rsync have NO way to make difference.
> 
> That's not a engineering issue of cp/rsync. It's a real issue cause by the
> fact that (most) filesystems have configurable block size.
> 

Correct.

It's not a fault of any party, because there is no contract on this part at
all. What you suggested is not a good use case for filesystem holes.

Fam


* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12  0:52 ` Fam Zheng
  2014-08-12 10:46   ` 吴兴博
@ 2014-08-12 13:08   ` Kirill Batuzov
  1 sibling, 0 replies; 29+ messages in thread
From: Kirill Batuzov @ 2014-08-12 13:08 UTC (permalink / raw)
  To: Fam Zheng; +Cc: 吴兴博, qemu-devel

On Tue, 12 Aug 2014, Fam Zheng wrote:

> On Mon, 08/11 19:38, 吴兴博 wrote:
> > Hello,
> > 
> >   The introduction in the wiki page present several advantages of qcow2
> > [1]. But I'm a little confused. I really appreciate if any one can give me
> > some help on this :).
> > 
> >  (1) Currently the raw format doesn't support COW. In other words, a raw
> > image cannot have a backing file. COW depends on the mapping table on which
> > we it knows whether each block/cluster is present (has been modified) in
> > the current image file. Modern file-systems like xfs/ext4/etc. provide
> > extent/block allocation information to user-level. Like what 'filefrag'
> > does with ioctl 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe
> > block/raw-posix.c) may obtain correct 'present information about blocks.
> > However this information may be limited to be aligned with file allocation
> > unit size. Maybe it's just because a raw file has no space to store the
> > "backing file name"? I don't think this could hinder the useful feature.
> > 
> >  (2) As most popular filesystems support delay-allocation/on-demand
> > allocation/holes, whatever, a raw image is also thin provisioned as other
> > formats. It doesn't consume much disk space by storing useless zeros.
> > However, I don't know if there is any concern on whether fragmented extents
> > would become a burden of the host filesystem.
> > 
> >  (3) For compression and encryption, I'm not an export on these topics at
> > all but I think these features may not be vital to a image format as both
> > guest/host's filesystem can also provide similar functionality.
> > 
> >  (4) I don't have too much understanding on how snapshot works but I think
> > theoretically it would be using the techniques no more than that used in
> > COW and backing file.
> > 
> > After all these thoughts, I still found no reason to not using a 'raw' file
> > image (engineering efforts in Qemu should not count as we don't ask  for
> > more features from outside world).
> > I would be very sorry if my ignorance wasted your time.
> 
> Hi! I think what you described is theoretically possible, but I'm not so
> positive about this feature. What would be the advantages, compared to qcow2?
> 

I think this idea was exploited in the FVD format. The research paper
reported a large performance gain compared to qcow2. The patches can be
found in the mailing list archives (Feb. 2011).

http://wiki.qemu.org/Features/FVD

> My major concern is that the file system hole's transparency, meaning that the
> users normally can't tell if a "hole" is really zeroes or unallocated, would
> cause data loss more easily: the user may expect scp (1) or cp (1) to work on
> an image file, just as always, but these tools can legitimately fill the whole
> with actual zeroes, if the target is filesystem does not supporting hole.
> That's too dangerous but totally out of control of QEMU.
> 
> Fam
> 
>

-- 
Kirill


* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-11 23:38 [Qemu-devel] disk image: self-organized format or raw file 吴兴博
  2014-08-12  0:52 ` Fam Zheng
@ 2014-08-12 13:23 ` Eric Blake
  2014-08-12 13:45   ` 吴兴博
  2014-08-12 18:46 ` Daniel P. Berrange
  2014-08-13 15:54 ` Kevin Wolf
  3 siblings, 1 reply; 29+ messages in thread
From: Eric Blake @ 2014-08-12 13:23 UTC (permalink / raw)
  To: 吴兴博, qemu-devel

On 08/11/2014 05:38 PM, 吴兴博 wrote:
> Hello,
> 
>   The introduction in the wiki page present several advantages of qcow2
> [1]. But I'm a little confused. I really appreciate if any one can give me
> some help on this :).
> 
>  (1) Currently the raw format doesn't support COW. In other words, a raw
> image cannot have a backing file. COW depends on the mapping table on which
> we it knows whether each block/cluster is present (has been modified) in
> the current image file. Modern file-systems like xfs/ext4/etc. provide
> extent/block allocation information to user-level. Like what 'filefrag'
> does with ioctl 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe
> block/raw-posix.c) may obtain correct 'present information about blocks.
> However this information may be limited to be aligned with file allocation
> unit size. Maybe it's just because a raw file has no space to store the
> "backing file name"? I don't think this could hinder the useful feature.

Search the list archives; at one point in the past, an 'addcow' format
was proposed, which is an additional file alongside a raw which provides
enough information to (temporarily) add cow to raw (or any other file
without a native backing file).  I don't know why that format was not
pursued further.

You could use xattr to store a user attribute of a backing file or
addcow file to associate with a raw file.  But file system holes are NOT
a good metadata tool for distinguishing between data not present (refer
to the backing file) vs. data explicitly all zero.  Your proposal of
using holes in raw files as metadata is NOT going to reliably work.

Also, using SEEK_HOLE/SEEK_DATA is a much nicer interface for iterating
raw file holes than FIEMAP.  It conveys less information, but that
information is more portable (POSIX will be adding requirements for
SEEK_HOLE/SEEK_DATA, and even NFSv4.2 is considering[1] adding this
support because of POSIX).  GNU cp is capable of using both FIEMAP and
SEEK_HOLE to optimize copies where the destination tries to preserve the
same hole layout as the source (not always possible, given that not all
systems have the same granularities of holes, and also given that not
all consecutive blocks of all-zero bytes have to be reported as holes).
 The SEEK_HOLE implementation has ALWAYS worked, but the FIEMAP
implementation uncovered various bugs in file systems, and at one point
would corrupt the copy unless cp did a sync() first, which slowed down
the operation and defeated the point of attempting to use it for
optimizations.  While holes are a cool thing, they are best only for
optimizations, and not for reliable metadata information.
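
A minimal sketch of iterating data regions with SEEK_DATA/SEEK_HOLE, assuming
a Linux host where Python exposes os.SEEK_DATA and os.SEEK_HOLE. The helper
name data_extents is made up for illustration. How many extents come back, and
where they start and end, depends on the filesystem's granularity, which is
exactly why the result is only suitable as an optimization hint.

```python
import errno
import os
import tempfile


def data_extents(fd):
    """Yield (start, end) offsets of regions the filesystem reports as data."""
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:   # no data at or after this offset
                return
            raise
        # End of file counts as an implicit hole, so this always succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        offset = end


with tempfile.TemporaryFile() as f:
    f.truncate(1 << 20)                  # 1 MiB, nothing written yet
    f.seek(512 * 1024)
    f.write(b"guest data")
    f.flush()
    extents = list(data_extents(f.fileno()))
    print(extents)
```

A filesystem without hole support may simply report the whole file as one data
extent, which is still a correct answer under this interface.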

[1]
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-26#section-15.12

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 13:23 ` Eric Blake
@ 2014-08-12 13:45   ` 吴兴博
  2014-08-12 14:07     ` Eric Blake
  0 siblings, 1 reply; 29+ messages in thread
From: 吴兴博 @ 2014-08-12 13:45 UTC (permalink / raw)
  To: Eric Blake, Kirill Batuzov; +Cc: qemu-devel

Thanks for the information. It's really helpful.
I think adding a bitmap alongside the raw file (or just within that file)
would suffice to distinguish between clusters present in the image and those
still in the backing file.
The idea in FVD looks similar to 'addcow': use a bitmap but delegate
allocation to the filesystem. However, FVD seems to have been ignored by the
community.
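
A rough sketch of what such a presence bitmap might look like, kept in memory
for simplicity. The class name PresenceBitmap and the 64 KiB cluster size are
assumptions for illustration only; this is not the on-disk layout of addcow or
FVD.

```python
class PresenceBitmap:
    """One bit per cluster: set means the cluster lives in the image itself."""

    def __init__(self, image_size, cluster_size=64 * 1024):
        self.cluster_size = cluster_size
        n_clusters = -(-image_size // cluster_size)   # ceiling division
        self.bits = bytearray(-(-n_clusters // 8))

    def mark_written(self, offset, length):
        """Record that every cluster touched by a write is now present."""
        first = offset // self.cluster_size
        last = (offset + length - 1) // self.cluster_size
        for c in range(first, last + 1):
            self.bits[c // 8] |= 1 << (c % 8)

    def is_present(self, offset):
        """True if the cluster containing offset should be read locally."""
        c = offset // self.cluster_size
        return bool(self.bits[c // 8] & (1 << (c % 8)))


bitmap = PresenceBitmap(image_size=1 << 20)
bitmap.mark_written(offset=70_000, length=100)
print(bitmap.is_present(0))        # False: read from the backing file
print(bitmap.is_present(70_000))   # True: read from the image
```

A real driver would also have to fill the rest of a partially written cluster
from the backing file before setting its bit, and would have to persist the
bitmap safely; both are left out here.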

Cheers!
       吴兴博  Wu, Xingbo <wuxb45@gmail.com>



* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 13:45   ` 吴兴博
@ 2014-08-12 14:07     ` Eric Blake
  2014-08-12 14:14       ` 吴兴博
  2014-08-12 18:39       ` Richard W.M. Jones
  0 siblings, 2 replies; 29+ messages in thread
From: Eric Blake @ 2014-08-12 14:07 UTC (permalink / raw)
  To: 吴兴博, Kirill Batuzov; +Cc: qemu-devel

On 08/12/2014 07:45 AM, 吴兴博 wrote:

[please don't top-post on technical lists]

> Thanks for your information. It's really helpful.
> I think adding a bitmap alongside the raw file ( or just within that file)

Umm, how do you propose to add a bitmap within a raw file?  The moment
the file contains metadata, it is no longer raw, but some other format.
 You'd need a way to reliably delineate the portion of the file that
contains the bitmap and therefore must not be exposed to the guest.

> would be suffice to distinguish between present or in backing file.
> The idea in FVD looks similar to 'addcow'---use bitmap but delegating
> allocation to FS. However FVD seems to have been ignored by community.

Care to give a pointer to a URL describing the FVD format?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 14:07     ` Eric Blake
@ 2014-08-12 14:14       ` 吴兴博
  2014-08-12 15:30         ` Eric Blake
  2014-08-12 18:39       ` Richard W.M. Jones
  1 sibling, 1 reply; 29+ messages in thread
From: 吴兴博 @ 2014-08-12 14:14 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, Kirill Batuzov

On Tue, Aug 12, 2014 at 10:07 AM, Eric Blake <eblake@redhat.com> wrote:

> On 08/12/2014 07:45 AM, 吴兴博 wrote:
>
> [please don't top-post on technical lists]

Sorry about that.

> > Thanks for your information. It's really helpful.
> > I think adding a bitmap alongside the raw file ( or just within that
> > file)
>
> Umm, how do you propose to add a bitmap within a raw file?  The moment
> the file contains metadata, it is no longer raw, but some other format.
>  You'd need a way to reliably delineate the portion of the file that
> contains the bitmap and therefore must not be exposed to the guest.

Yes, I agree. It's not raw anymore. It should be some 'lightweight' format.

> > would be suffice to distinguish between present or in backing file.
> > The idea in FVD looks similar to 'addcow'---use bitmap but delegating
> > allocation to FS. However FVD seems to have been ignored by community.
>
> Care to give a pointer to a URL describing the FVD format?

http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg00398.html

This thread could be the clearest message on FVD.
There is also a paper published at a USENIX conference:
https://www.usenix.org/event/atc11/tech/final_files/Tang.pdf

> --
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
>


* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 14:14       ` 吴兴博
@ 2014-08-12 15:30         ` Eric Blake
  2014-08-12 16:22           ` Xingbo Wu
  2014-08-13 15:42           ` Kevin Wolf
  0 siblings, 2 replies; 29+ messages in thread
From: Eric Blake @ 2014-08-12 15:30 UTC (permalink / raw)
  To: 吴兴博; +Cc: qemu-devel, Kirill Batuzov

On 08/12/2014 08:14 AM, 吴兴博 wrote:
>>> However FVD seems to have been ignored by community.
>>
>> Care to give a pointer to a URL describing the FVD format?
>>
>> http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg00398.html
> 
> This thread could be the clearest message on FVD.

That very message also points out WHY the community has appeared to
ignore FVD:

"For any feature to be seriously considered for inclusion in QEMU,
patches need to be posted to the mailing list against the latest git
tree. That's a pre-requisite for any real discussion."

> It also has a paper published on USENIX conference.
> https://www.usenix.org/event/atc11/tech/final_files/Tang.pdf

Thanks for the references.  Are you interested in posting patches to
revive the work on that format?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 15:30         ` Eric Blake
@ 2014-08-12 16:22           ` Xingbo Wu
  2014-08-13  1:29             ` Fam Zheng
  2014-08-13 15:42           ` Kevin Wolf
  1 sibling, 1 reply; 29+ messages in thread
From: Xingbo Wu @ 2014-08-12 16:22 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, Kirill Batuzov

[-- Attachment #1: Type: text/plain, Size: 1172 bytes --]

On Tue, Aug 12, 2014 at 11:30 AM, Eric Blake <eblake@redhat.com> wrote:

> On 08/12/2014 08:14 AM, 吴兴博 wrote:
> >>> However FVD seems to have been ignored by community.
> >>
> >> Care to give a pointer to a URL describing the FVD format?
> >>
> >> http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg00398.html
> >
> > This thread could be the clearest message on FVD.
>
> That very message also points out WHY the community has appeared to
> ignore FVD:
>
> "For any feature to be seriously considered for inclusion in QEMU,
> patches need to be posted to the mailing list against the latest git
> tree. That's a pre-requisite for any real discussion."
>
> > It also has a paper published on USENIX conference.
> > https://www.usenix.org/event/atc11/tech/final_files/Tang.pdf
>
> Thanks for the references.  Are you interested in posting patches to
> revive the work on that format?
>
> I'm going to study it first. It would take some time :)

>  --
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
>


-- 

Cheers!
       吴兴博  Wu, Xingbo <wuxb45@gmail.com>

[-- Attachment #2: Type: text/html, Size: 2150 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 14:07     ` Eric Blake
  2014-08-12 14:14       ` 吴兴博
@ 2014-08-12 18:39       ` Richard W.M. Jones
  1 sibling, 0 replies; 29+ messages in thread
From: Richard W.M. Jones @ 2014-08-12 18:39 UTC (permalink / raw)
  To: Eric Blake; +Cc: 吴兴博, qemu-devel, Kirill Batuzov

On Tue, Aug 12, 2014 at 08:07:55AM -0600, Eric Blake wrote:
> On 08/12/2014 07:45 AM, 吴兴博 wrote:
> 
> [please don't top-post on technical lists]
> 
> > Thanks for your information. It's really helpful.
> > I think adding a bitmap alongside the raw file ( or just within that file)
> 
> Umm, how do you propose to add a bitmap within a raw file?  The moment
> the file contains metadata, it is no longer raw, but some other format.
>  You'd need a way to reliably delineate the portion of the file that
> contains the bitmap and therefore must not be exposed to the guest.

There was an MSFT format where they used raw but added metadata after
the end of the raw file data.

https://en.wikipedia.org/wiki/VHD_%28file_format%29

This is crazy BTW - I'm not advocating we do it :-)
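
A minimal sketch of that "raw data plus trailing metadata" layout, assuming
the fixed-size VHD variant, where a 512-byte footer beginning with the cookie
"conectix" follows the raw data. The function name is hypothetical and this
is not QEMU code:

```python
import os

VHD_FOOTER_SIZE = 512      # footer appended after the raw disk data
VHD_COOKIE = b"conectix"   # first 8 bytes of the footer


def fixed_vhd_data_length(path):
    """Return the number of guest-visible bytes if the file ends in a
    VHD-style footer, or None if no footer is found."""
    size = os.path.getsize(path)
    if size < VHD_FOOTER_SIZE:
        return None
    with open(path, "rb") as f:
        f.seek(size - VHD_FOOTER_SIZE)
        footer = f.read(VHD_FOOTER_SIZE)
    if not footer.startswith(VHD_COOKIE):
        return None
    # Everything before the footer is what the guest is allowed to see.
    return size - VHD_FOOTER_SIZE
```

Note the weakness: a genuinely raw image whose last sector happens to start
with the cookie would be misdetected, which is exactly the "reliably
delineate" problem raised above.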

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-11 23:38 [Qemu-devel] disk image: self-organized format or raw file 吴兴博
  2014-08-12  0:52 ` Fam Zheng
  2014-08-12 13:23 ` Eric Blake
@ 2014-08-12 18:46 ` Daniel P. Berrange
  2014-08-12 18:52   ` Richard W.M. Jones
  2014-08-13 15:54 ` Kevin Wolf
  3 siblings, 1 reply; 29+ messages in thread
From: Daniel P. Berrange @ 2014-08-12 18:46 UTC (permalink / raw)
  To: 吴兴博; +Cc: qemu-devel

On Mon, Aug 11, 2014 at 07:38:50PM -0400, 吴兴博 wrote:
> Hello,
> 
>   The introduction in the wiki page present several advantages of qcow2
> [1]. But I'm a little confused. I really appreciate if any one can give me
> some help on this :).
> 
>  (1) Currently the raw format doesn't support COW. In other words, a raw
> image cannot have a backing file. COW depends on the mapping table on which
> we it knows whether each block/cluster is present (has been modified) in
> the current image file. Modern file-systems like xfs/ext4/etc. provide
> extent/block allocation information to user-level. Like what 'filefrag'
> does with ioctl 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe
> block/raw-posix.c) may obtain correct 'present information about blocks.
> However this information may be limited to be aligned with file allocation
> unit size. Maybe it's just because a raw file has no space to store the
> "backing file name"? I don't think this could hinder the useful feature.
> 
>  (2) As most popular filesystems support delay-allocation/on-demand
> allocation/holes, whatever, a raw image is also thin provisioned as other
> formats. It doesn't consume much disk space by storing useless zeros.
> However, I don't know if there is any concern on whether fragmented extents
> would become a burden of the host filesystem.
> 
>  (3) For compression and encryption, I'm not an export on these topics at
> all but I think these features may not be vital to a image format as both
> guest/host's filesystem can also provide similar functionality.
> 
>  (4) I don't have too much understanding on how snapshot works but I think
> theoretically it would be using the techniques no more than that used in
> COW and backing file.
> 
> After all these thoughts, I still found no reason to not using a 'raw' file
> image (engineering efforts in Qemu should not count as we don't ask  for
> more features from outside world).
> I would be very sorry if my ignorance wasted your time.

FWIW, much of what you say about features supported in filesystems is
correct, however, that is only considering the needs of deployment on
your specific platform. One value of QCow2 is that it is a portable
format you can use on any platform where QEMU builds, whether it be
Linux, Windows, *BSD or Solaris. If you were to rely on the host
filesystem then obviously you'd have to figure out a different
solution for each particular OS you deploy on.
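
As an illustration of how host-specific this gets, here is a rough sketch of
asking the filesystem which parts of a raw file hold data, using
lseek() with SEEK_DATA/SEEK_HOLE instead of the FIEMAP ioctl mentioned in
the question. It assumes a platform where Python exposes os.SEEK_DATA and
os.SEEK_HOLE (e.g. Linux); the function name is made up for the example.
Filesystems without hole support typically report the whole file as data.

```python
import errno
import os


def data_extents(path):
    """Return a list of (offset, length) ranges the host filesystem
    reports as containing data (i.e. not holes)."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            try:
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:   # only a hole remains after pos
                    break
                raise
            # There is always an implicit hole at end of file.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end - start))
            pos = end
    finally:
        os.close(fd)
    return extents
```

The reported ranges are aligned to whatever granularity the filesystem
chooses, and "allocated" is not the same thing as "modified relative to a
backing file", so this alone does not give COW semantics.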

Taking the compression feature - arguably the biggest benefit of that
is when you distribute disk images. eg if someone provides a root disk
image on a web server, using compression in qcow2 can dramatically
lower the download size, while still allowing QEMU to directly run
from that qcow2 file. Sure you could wrap your disk images in gzip
and then convert to your local filesystem at time of use but this
introduces multiple extra steps.

There are similar arguments for other features in qcow2. That's not to
say you are wrong in your analysis of your own needs. It is simply a
case that different scenarios imply different solutions, so for some
qcow2 may be optimal, while for others using native filesystem features
might be better.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 18:46 ` Daniel P. Berrange
@ 2014-08-12 18:52   ` Richard W.M. Jones
  2014-08-12 19:23     ` Xingbo Wu
  0 siblings, 1 reply; 29+ messages in thread
From: Richard W.M. Jones @ 2014-08-12 18:52 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: 吴兴博, qemu-devel

On Tue, Aug 12, 2014 at 07:46:30PM +0100, Daniel P. Berrange wrote:
> Taking the compression feature - arguably the biggest benefit of that
> is when you distribute disk images. eg if someone provides a root disk
> image on a web server, using compression in qcow2 can dramatically
> lower the download size, while still allowing QEMU to directly run
> from that qcow2 file. Sure you could wrap your disk images in gzip
> and then convert to your local filesystem at time of use but this
> introduces multiple extra steps.

It would be nice if qemu could handle xz-compressed files
transparently, since (when prepared correctly) these files are
seekable.

I have written code to do this here:

  https://github.com/libguestfs/nbdkit/tree/master/plugins/xz

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 18:52   ` Richard W.M. Jones
@ 2014-08-12 19:23     ` Xingbo Wu
  2014-08-12 20:14       ` Richard W.M. Jones
  0 siblings, 1 reply; 29+ messages in thread
From: Xingbo Wu @ 2014-08-12 19:23 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1663 bytes --]

On Tue, Aug 12, 2014 at 2:52 PM, Richard W.M. Jones <rjones@redhat.com>
wrote:

> On Tue, Aug 12, 2014 at 07:46:30PM +0100, Daniel P. Berrange wrote:
> > Taking the compression feature - arguably the biggest benefit of that
> > is when you distribute disk images. eg if someone provides a root disk
> > image on a web server, using compression in qcow2 can dramatically
> > lower the download size, while still allowing QEMU to directly run
> > from that qcow2 file. Sure you could wrap your disk images in gzip
> > and then convert to your local filesystem at time of use but this
> > introduces multiple extra steps.
>
> It would be nice if qemu could handle xz-compressed files
> transparently, since (when prepared correctly) these files are
> seekable.
>
> I have written code to do this here:
>
>   https://github.com/libguestfs/nbdkit/tree/master/plugins/xz
>
> I believe it's ideal for a read-only backing file; the xz-compressed image
would be very space efficient for distribution :).
Would you consider replacing xz with lz4? It has faster decompression speed
(~500MB/s)[1] and client-side decompression would be made painless.

[1]
http://linuxaria.com/article/linux-compressors-comparison-on-centos-6-5-x86-64-lzo-vs-lz4-vs-gzip-vs-bzip2-vs-lzma

> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat
> http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> libguestfs lets you edit virtual machines.  Supports shell scripting,
> bindings from many languages.  http://libguestfs.org
>



-- 

Cheers!
       吴兴博  Wu, Xingbo <wuxb45@gmail.com>

[-- Attachment #2: Type: text/html, Size: 2898 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 19:23     ` Xingbo Wu
@ 2014-08-12 20:14       ` Richard W.M. Jones
  0 siblings, 0 replies; 29+ messages in thread
From: Richard W.M. Jones @ 2014-08-12 20:14 UTC (permalink / raw)
  To: Xingbo Wu; +Cc: qemu-devel

On Tue, Aug 12, 2014 at 03:23:38PM -0400, Xingbo Wu wrote:
> would be very space efficient for distribution :).
> Would you consider replace xz with lz4? it has faster decompression speed
> (~500MB/s)[1] and client-side decompression would be made painless.

No.  The main benefit of xz is it has a well defined stable API and a
file format that supports seeking.
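
The general idea behind a seekable compressed file can be sketched as
independently compressed chunks plus an index. This is only an illustration
under assumptions: it keeps the index in memory (a real .xz file records its
block index inside the file), it is not how the nbdkit plugin is written, and
the function names and chunk size are invented for the example.

```python
import lzma

CHUNK = 64 * 1024  # uncompressed bytes per independently compressed chunk


def compress_chunked(data, chunk_size=CHUNK):
    """Compress data chunk by chunk; return (blob, index) where index[i]
    is the (offset, length) of compressed chunk i inside blob."""
    blob = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        comp = lzma.compress(data[start:start + chunk_size])
        index.append((len(blob), len(comp)))
        blob += comp
    return bytes(blob), index


def read_at(blob, index, offset, length, chunk_size=CHUNK):
    """Read `length` uncompressed bytes at `offset`, decompressing only
    the chunks that overlap the requested range."""
    out = bytearray()
    end = offset + length
    first = offset // chunk_size
    last = (end - 1) // chunk_size
    for i in range(first, min(last, len(index) - 1) + 1):
        pos, clen = index[i]
        plain = lzma.decompress(blob[pos:pos + clen])
        lo = max(offset - i * chunk_size, 0)
        hi = min(end - i * chunk_size, len(plain))
        out += plain[lo:hi]
    return bytes(out)
```

Each chunk here is a complete .xz stream, so the concatenated blob still
decompresses as a whole with lzma.decompress(); random reads only touch the
chunks they need.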

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 16:22           ` Xingbo Wu
@ 2014-08-13  1:29             ` Fam Zheng
  0 siblings, 0 replies; 29+ messages in thread
From: Fam Zheng @ 2014-08-13  1:29 UTC (permalink / raw)
  To: Xingbo Wu; +Cc: qemu-devel, Kirill Batuzov

On Tue, 08/12 12:22, Xingbo Wu wrote:
> On Tue, Aug 12, 2014 at 11:30 AM, Eric Blake <eblake@redhat.com> wrote:
> 
> > On 08/12/2014 08:14 AM, 吴兴博 wrote:
> > >>> However FVD seems to have been ignored by community.
> > >>
> > >> Care to give a pointer to a URL describing the FVD format?
> > >>
> > >> http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg00398.html
> > >
> > > This thread could be the clearest message on FVD.
> >
> > That very message also points out WHY the community has appeared to
> > ignore FVD:
> >
> > "For any feature to be seriously considered for inclusion in QEMU,
> > patches need to be posted to the mailing list against the latest git
> > tree. That's a pre-requisite for any real discussion."
> >
> > > It also has a paper published on USENIX conference.
> > > https://www.usenix.org/event/atc11/tech/final_files/Tang.pdf
> >
> > Thanks for the references.  Are you interested in posting patches to
> > revive the work on that format?
> >
> > I'm going to study it first. It would take some time :)
> 

Please don't add your text after the quote prefix "> >". You should start a new
line.

Fam

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-12 15:30         ` Eric Blake
  2014-08-12 16:22           ` Xingbo Wu
@ 2014-08-13 15:42           ` Kevin Wolf
  1 sibling, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2014-08-13 15:42 UTC (permalink / raw)
  To: Eric Blake; +Cc: 吴兴博, qemu-devel, Kirill Batuzov

[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]

Am 12.08.2014 um 17:30 hat Eric Blake geschrieben:
> On 08/12/2014 08:14 AM, 吴兴博 wrote:
> >>> However FVD seems to have been ignored by community.
> >>
> >> Care to give a pointer to a URL describing the FVD format?
> >>
> >> http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg00398.html
> > 
> > This thread could be the clearest message on FVD.
> 
> That very message also points out WHY the community has appeared to
> ignore FVD:
> 
> "For any feature to be seriously considered for inclusion in QEMU,
> patches need to be posted to the mailing list against the latest git
> tree. That's a pre-requisite for any real discussion."
> 
> > It also has a paper published on USENIX conference.
> > https://www.usenix.org/event/atc11/tech/final_files/Tang.pdf
> 
> Thanks for the references.  Are you interested in posting patches to
> revive the work on that format?

Just to be clear upfront so that you don't waste your time: A new native
image format is not going to be merged. You would have to prove that
your format is capable of replacing qcow2 with all its features, that
it's better in some respect and that qcow2 cannot be extended to provide
the same. Other proposals, including FVD, have failed to provide that
and I'd consider it unlikely to happen this time. (QED fell short of it
and was merged anyway for political reasons; it's clear today that this
was a mistake.)

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-11 23:38 [Qemu-devel] disk image: self-organized format or raw file 吴兴博
                   ` (2 preceding siblings ...)
  2014-08-12 18:46 ` Daniel P. Berrange
@ 2014-08-13 15:54 ` Kevin Wolf
  2014-08-13 16:38   ` Xingbo Wu
  3 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2014-08-13 15:54 UTC (permalink / raw)
  To: 吴兴博; +Cc: qemu-devel

Am 12.08.2014 um 01:38 hat 吴兴博 geschrieben:
> Hello,
> 
>   The introduction in the wiki page present several advantages of qcow2 [1].
> But I'm a little confused. I really appreciate if any one can give me some help
> on this :).
> 
>  (1) Currently the raw format doesn't support COW. In other words, a raw image
> cannot have a backing file. COW depends on the mapping table on which we it
> knows whether each block/cluster is present (has been modified) in the current
> image file. Modern file-systems like xfs/ext4/etc. provide extent/block
> allocation information to user-level. Like what 'filefrag' does with ioctl
> 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe block/raw-posix.c)
> may obtain correct 'present information about blocks. However this information
> may be limited to be aligned with file allocation unit size. Maybe it's just
> because a raw file has no space to store the "backing file name"? I don't think
> this could hinder the useful feature.
> 
>  (2) As most popular filesystems support delay-allocation/on-demand allocation/
> holes, whatever, a raw image is also thin provisioned as other formats. It
> doesn't consume much disk space by storing useless zeros. However, I don't know
> if there is any concern on whether fragmented extents would become a burden of
> the host filesystem.
> 
>  (3) For compression and encryption, I'm not an export on these topics at all
> but I think these features may not be vital to a image format as both guest/
> host's filesystem can also provide similar functionality.
> 
>  (4) I don't have too much understanding on how snapshot works but I think
> theoretically it would be using the techniques no more than that used in COW
> and backing file.
> 
> After all these thoughts, I still found no reason to not using a 'raw' file
> image (engineering efforts in Qemu should not count as we don't ask  for more
> features from outside world).
> I would be very sorry if my ignorance wasted your time.

Even if it did work (that it's problematic is already discussed in other
subthreads) what advantage would you get from using an extended raw
driver compared to simply using qcow2, which supports all of this today?

Kevin

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-13 15:54 ` Kevin Wolf
@ 2014-08-13 16:38   ` Xingbo Wu
  2014-08-13 18:32     ` Kevin Wolf
  0 siblings, 1 reply; 29+ messages in thread
From: Xingbo Wu @ 2014-08-13 16:38 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel

On Wed, Aug 13, 2014 at 11:54 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 12.08.2014 um 01:38 hat 吴兴博 geschrieben:
>> Hello,
>>
>>   The introduction in the wiki page present several advantages of qcow2 [1].
>> But I'm a little confused. I really appreciate if any one can give me some help
>> on this :).
>>
>>  (1) Currently the raw format doesn't support COW. In other words, a raw image
>> cannot have a backing file. COW depends on the mapping table on which we it
>> knows whether each block/cluster is present (has been modified) in the current
>> image file. Modern file-systems like xfs/ext4/etc. provide extent/block
>> allocation information to user-level. Like what 'filefrag' does with ioctl
>> 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe block/raw-posix.c)
>> may obtain correct 'present information about blocks. However this information
>> may be limited to be aligned with file allocation unit size. Maybe it's just
>> because a raw file has no space to store the "backing file name"? I don't think
>> this could hinder the useful feature.
>>
>>  (2) As most popular filesystems support delay-allocation/on-demand allocation/
>> holes, whatever, a raw image is also thin provisioned as other formats. It
>> doesn't consume much disk space by storing useless zeros. However, I don't know
>> if there is any concern on whether fragmented extents would become a burden of
>> the host filesystem.
>>
>>  (3) For compression and encryption, I'm not an export on these topics at all
>> but I think these features may not be vital to a image format as both guest/
>> host's filesystem can also provide similar functionality.
>>
>>  (4) I don't have too much understanding on how snapshot works but I think
>> theoretically it would be using the techniques no more than that used in COW
>> and backing file.
>>
>> After all these thoughts, I still found no reason to not using a 'raw' file
>> image (engineering efforts in Qemu should not count as we don't ask  for more
>> features from outside world).
>> I would be very sorry if my ignorance wasted your time.
>
> Even if it did work (that it's problematic is already discussed in other
> subthreads) what advantage would you get from using an extended raw
> driver compared to simply using qcow2, which supports all of this today?
>
> Kevin


I read several messages from this thread: "[RFC] qed: Add QEMU
Enhanced Disk format". To my understanding, for a new format to be
acceptable to the community:
  It needs to retain all the key features provided by qcow2,
especially compression, encryption, and internal snapshots, as
mentioned in that thread.
  And, needless to say, it must run faster.

Yes, I agree that's at least a subset of the homework one needs to do
before selling a new format to the community.

Thanks, and another question:
What's the magic that makes QED run faster than QCOW2? In some simple
parallel IO tests QED can run an order of magnitude faster than QCOW2.  I saw
differences in simple/complex metadata organization, and coroutine/aio
(however, the "bdrv_co_"s finally call "bdrv_aio_"s via "_em"). If you can
provide some insight on this I would really appreciate it.


-- 

Cheers!
       吴兴博  Wu, Xingbo <wuxb45@gmail.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-13 16:38   ` Xingbo Wu
@ 2014-08-13 18:32     ` Kevin Wolf
  2014-08-13 21:04       ` Xingbo Wu
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2014-08-13 18:32 UTC (permalink / raw)
  To: Xingbo Wu; +Cc: qemu-devel

Am 13.08.2014 um 18:38 hat Xingbo Wu geschrieben:
> On Wed, Aug 13, 2014 at 11:54 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 12.08.2014 um 01:38 hat 吴兴博 geschrieben:
> >> Hello,
> >>
> >>   The introduction in the wiki page present several advantages of qcow2 [1].
> >> But I'm a little confused. I really appreciate if any one can give me some help
> >> on this :).
> >>
> >>  (1) Currently the raw format doesn't support COW. In other words, a raw image
> >> cannot have a backing file. COW depends on the mapping table on which we it
> >> knows whether each block/cluster is present (has been modified) in the current
> >> image file. Modern file-systems like xfs/ext4/etc. provide extent/block
> >> allocation information to user-level. Like what 'filefrag' does with ioctl
> >> 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe block/raw-posix.c)
> >> may obtain correct 'present information about blocks. However this information
> >> may be limited to be aligned with file allocation unit size. Maybe it's just
> >> because a raw file has no space to store the "backing file name"? I don't think
> >> this could hinder the useful feature.
> >>
> >>  (2) As most popular filesystems support delay-allocation/on-demand allocation/
> >> holes, whatever, a raw image is also thin provisioned as other formats. It
> >> doesn't consume much disk space by storing useless zeros. However, I don't know
> >> if there is any concern on whether fragmented extents would become a burden of
> >> the host filesystem.
> >>
> >>  (3) For compression and encryption, I'm not an export on these topics at all
> >> but I think these features may not be vital to a image format as both guest/
> >> host's filesystem can also provide similar functionality.
> >>
> >>  (4) I don't have too much understanding on how snapshot works but I think
> >> theoretically it would be using the techniques no more than that used in COW
> >> and backing file.
> >>
> >> After all these thoughts, I still found no reason to not using a 'raw' file
> >> image (engineering efforts in Qemu should not count as we don't ask  for more
> >> features from outside world).
> >> I would be very sorry if my ignorance wasted your time.
> >
> > Even if it did work (that it's problematic is already discussed in other
> > subthreads) what advantage would you get from using an extended raw
> > driver compared to simply using qcow2, which supports all of this today?
> >
> > Kevin
> 
> 
> I read several messages from this thread: "[RFC] qed: Add QEMU
> Enhanced Disk format". To my understanding, if the new format can be
> acceptable to the community:
>   It needs to retain all the key features provided by qcow2,
> especially for compression, encryption, and internal snapshot, as
> mentioned in that thread.
>   And, needless to say, it must run faster.
> 
> Yes I agree it's at least a subset of the homework one need to do
> before selling the new format to the community.

So your goal is improved performance?

Why do you think that a raw driver with backing file support would run
much faster than qcow2? It would have to solve the same problems, like
doing efficient COW.
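
To make "the same problems" concrete, here is a toy in-memory sketch of
copy-on-write with an allocation bitmap, assuming a tiny cluster size and
hypothetical names throughout. A write that covers only part of an
unallocated cluster must first copy the rest of that cluster from the backing
file, whichever format stores the bitmap:

```python
CLUSTER = 4  # bytes per cluster; unrealistically small, for illustration


class CowOverlay:
    def __init__(self, backing: bytes):
        self.backing = backing
        self.data = bytearray(len(backing))
        self.allocated = set()  # stands in for the on-disk bitmap

    def write(self, offset, buf):
        if not buf:
            return
        first = offset // CLUSTER
        last = (offset + len(buf) - 1) // CLUSTER
        for c in range(first, last + 1):
            if c not in self.allocated:
                lo, hi = c * CLUSTER, (c + 1) * CLUSTER
                # Copy-on-write: fill the cluster from the backing file
                # before any part of it is overwritten.
                self.data[lo:hi] = self.backing[lo:hi]
                self.allocated.add(c)
        self.data[offset:offset + len(buf)] = buf

    def read(self, offset, length):
        out = bytearray()
        for pos in range(offset, offset + length):
            allocated = pos // CLUSTER in self.allocated
            out.append((self.data if allocated else self.backing)[pos])
        return bytes(out)
```

On real storage, the copy, the data write and the bitmap update also have to
reach disk in a safe order, which is where most of the cost comes from.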

> Thanks and another question:
> What's the magic that makes QED runs faster than QCOW2?

During cluster allocation (which is the real critical part), QED is a
lot slower than today's qcow2. And by that I mean not just a few
percent, but like half the performance. After that, when accessing
already allocated data, both perform similarly. Mailing list discussions
of four years ago don't reflect accurately how qemu works today.

The main trick of QED was to introduce a dirty flag, which allowed
calling fdatasync() less often because it was okay for image metadata to
become inconsistent. After a crash, you have to repair the image then.

qcow2 supports the same with lazy_refcounts=on, but it's really only
useful in rare cases, mostly with cache=writethrough.
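
A toy sketch of the dirty-flag idea, not the actual QED or qcow2 on-disk
layout: the one-byte header, the class and its method names are all
assumptions made for the example. The flag is made durable once, later
metadata updates skip the sync, and an image found dirty on open needs
repair. It assumes a Unix host where os.fdatasync() is available.

```python
import os

CLEAN = b"\x00"
DIRTY = b"\x01"


class ToyImage:
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        if os.fstat(self.fd).st_size == 0:
            os.pwrite(self.fd, CLEAN, 0)          # fresh image starts clean
        # A dirty flag left behind means the last user did not close cleanly.
        self.needs_repair = os.pread(self.fd, 1, 0) == DIRTY
        self.marked = self.needs_repair
        self.syncs = 0

    def _sync(self):
        os.fdatasync(self.fd)
        self.syncs += 1

    def update_metadata(self, offset, data):
        if not self.marked:
            # Persist the dirty flag once, before metadata may go stale.
            os.pwrite(self.fd, DIRTY, 0)
            self._sync()
            self.marked = True
        os.pwrite(self.fd, data, offset)          # no sync per update

    def close(self):
        if self.marked and not self.needs_repair:
            self._sync()                          # metadata is durable now
            os.pwrite(self.fd, CLEAN, 0)
            self._sync()
        os.close(self.fd)
```

Three metadata updates cost one sync here instead of three; the price is
that a crash leaves the flag set and the next open has to check the image.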

> In some simple
> parallel IO tests QED can run a magnitude faster than QCOW2.  I saw
> differences on simple/complex metadata organization, and coroutine/aio
> (however "bdrv_co_"s finally call "bdrv_aio_"s via "_em". If you can
> provide some insight on this I would be really appreciate.

Today, everything is internally coroutine operations, so every request
goes through bdrv_co_do_preadv/pwritev. The aio_* versions are just
wrappers around it for callers and block drivers that prefer a callback
based interface.
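
The relationship can be pictured with a small analogy in Python's asyncio
rather than QEMU's C coroutines. The function names only echo the ones
discussed above and are hypothetical: the coroutine does the work, and the
callback-style entry point is a thin wrapper around it.

```python
import asyncio


async def co_preadv(storage, offset, length):
    """Coroutine-style read: the 'real' implementation."""
    await asyncio.sleep(0)            # stand-in for waiting on I/O
    return storage[offset:offset + length]


def aio_preadv(loop, storage, offset, length, callback):
    """Callback-style read: starts the coroutine and invokes
    callback(result) when it completes."""
    task = loop.create_task(co_preadv(storage, offset, length))
    task.add_done_callback(lambda t: callback(t.result()))
    return task
```

A caller that prefers callbacks uses aio_preadv(); everything else awaits
co_preadv() directly, and both paths run the same code.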

Kevin

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-13 18:32     ` Kevin Wolf
@ 2014-08-13 21:04       ` Xingbo Wu
  2014-08-13 21:35         ` Eric Blake
  2014-08-14  2:42         ` Xingbo Wu
  0 siblings, 2 replies; 29+ messages in thread
From: Xingbo Wu @ 2014-08-13 21:04 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel

On Wed, Aug 13, 2014 at 2:32 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 13.08.2014 um 18:38 hat Xingbo Wu geschrieben:
>> On Wed, Aug 13, 2014 at 11:54 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > Am 12.08.2014 um 01:38 hat 吴兴博 geschrieben:
>> >> Hello,
>> >>
>> >>   The introduction in the wiki page present several advantages of qcow2 [1].
>> >> But I'm a little confused. I really appreciate if any one can give me some help
>> >> on this :).
>> >>
>> >>  (1) Currently the raw format doesn't support COW. In other words, a raw image
>> >> cannot have a backing file. COW depends on the mapping table on which we it
>> >> knows whether each block/cluster is present (has been modified) in the current
>> >> image file. Modern file-systems like xfs/ext4/etc. provide extent/block
>> >> allocation information to user-level. Like what 'filefrag' does with ioctl
>> >> 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe block/raw-posix.c)
>> >> may obtain correct 'present information about blocks. However this information
>> >> may be limited to be aligned with file allocation unit size. Maybe it's just
>> >> because a raw file has no space to store the "backing file name"? I don't think
>> >> this could hinder the useful feature.
>> >>
>> >>  (2) As most popular filesystems support delay-allocation/on-demand allocation/
>> >> holes, whatever, a raw image is also thin provisioned as other formats. It
>> >> doesn't consume much disk space by storing useless zeros. However, I don't know
>> >> if there is any concern on whether fragmented extents would become a burden of
>> >> the host filesystem.
>> >>
>> >>  (3) For compression and encryption, I'm not an export on these topics at all
>> >> but I think these features may not be vital to a image format as both guest/
>> >> host's filesystem can also provide similar functionality.
>> >>
>> >>  (4) I don't have too much understanding on how snapshot works but I think
>> >> theoretically it would be using the techniques no more than that used in COW
>> >> and backing file.
>> >>
>> >> After all these thoughts, I still found no reason to not using a 'raw' file
>> >> image (engineering efforts in Qemu should not count as we don't ask  for more
>> >> features from outside world).
>> >> I would be very sorry if my ignorance wasted your time.
>> >
>> > Even if it did work (that it's problematic is already discussed in other
>> > subthreads) what advantage would you get from using an extended raw
>> > driver compared to simply using qcow2, which supports all of this today?
>> >
>> > Kevin
>>
>>
>> I read several messages from this thread: "[RFC] qed: Add QEMU
>> Enhanced Disk format". To my understanding, if the new format can be
>> acceptable to the community:
>>   It needs to retain all the key features provided by qcow2,
>> especially for compression, encryption, and internal snapshot, as
>> mentioned in that thread.
>>   And, needless to say, it must run faster.
>>
>> Yes I agree it's at least a subset of the homework one need to do
>> before selling the new format to the community.
>
> So your goal is improved performance?
>

Yes, if performance is not improved I won't spend more time on it :).
I believe it's going to be very difficult.

> Why do you think that a raw driver with backing file support would run
> much faster than qcow2? It would have to solve the same problems, like
> doing efficient COW.
>
>> Thanks and another question:
>> What's the magic that makes QED runs faster than QCOW2?
>
> During cluster allocation (which is the real critical part), QED is a
> lot slower than today's qcow2. And by that I mean not just a few
> percent, but like half the performance. After that, when accessing
> already allocated data, both perform similar. Mailing list discussions
> of four years ago don't reflect accurately how qemu works today.
>
> The main trick of QED was to introduce a dirty flag, which allowed to
> call fdatasync() less often because it was okay for image metadata to
> become inconsistent. After a crash, you have to repair the image then.
>

I'm very curious about this dirty flag trick. I was surprised when I
observed very fast 'sync write' performance on QED.
If it skips the fdatasync when processing the device 'flush' command from the
guest, it literally cheats the guest, as the data can be lost. Am I correct?
Does the repair make sure that all the data written before the last
successful 'flush' can be recovered?
To my understanding, the 'flush' command in the guest asks for persistence.
Data has to be persistent on host storage after a flush, except for
images opened with 'cache=unsafe' mode.

> qcow2 supports the same with lazy_refcounts=on, but it's really only
> useful in rare cases, mostly with cache=writethrough.
>
>> In some simple
>> parallel IO tests QED can run a magnitude faster than QCOW2.  I saw
>> differences on simple/complex metadata organization, and coroutine/aio
>> (however "bdrv_co_"s finally call "bdrv_aio_"s via "_em". If you can
>> provide some insight on this I would be really appreciate.
>
> Today, everything is internally coroutine operations, so every request
> goes through bdrv_co_do_preadv/pwritev. The aio_* versions are just
> wrappers around it for callers and block drivers that prefer a callback
> based interface.
>
> Kevin


-- 

Cheers!
       吴兴博  Wu, Xingbo <wuxb45@gmail.com>


* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-13 21:04       ` Xingbo Wu
@ 2014-08-13 21:35         ` Eric Blake
  2014-08-14  2:42         ` Xingbo Wu
  1 sibling, 0 replies; 29+ messages in thread
From: Eric Blake @ 2014-08-13 21:35 UTC (permalink / raw)
  To: Xingbo Wu, Kevin Wolf; +Cc: qemu-devel


On 08/13/2014 03:04 PM, Xingbo Wu wrote:

>>> I read several messages from this thread: "[RFC] qed: Add QEMU
>>> Enhanced Disk format". To my understanding, if the new format can be
>>> acceptable to the community:
>>>   It needs to retain all the key features provided by qcow2,
>>> especially for compression, encryption, and internal snapshot, as
>>> mentioned in that thread.

Encryption in qcow2 is currently a joke, that no one in their right mind
should be relying on.  If your new format approaches encryption in a
cryptographically sound manner, then your format might be considered
better even without beating qcow2 in benchmarks.

But from the sound of this thread, you aren't out to improve encrypted
images.  And even if you ARE hoping to improve encrypted images, it
might STILL be better to investigate how to enhance qcow2 to do a
cryptographically sound encryption (the idea floated on the list is to
let qcow2 do LUKS encryption of the guest-visible payload, while still
leaving the metadata unencrypted), rather than trying to do a completely
new format.

>>>   And, needless to say, it must run faster.
>>>
>>> Yes I agree it's at least a subset of the homework one need to do
>>> before selling the new format to the community.
>>
>> So your goal is improved performance?
>>
> 
> Yes if performance is not improved I won't spend more time on it :).
> I believe it's gonna be very difficult.

Good luck if you are willing to try it.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-13 21:04       ` Xingbo Wu
  2014-08-13 21:35         ` Eric Blake
@ 2014-08-14  2:42         ` Xingbo Wu
  2014-08-14  9:06           ` Kevin Wolf
  1 sibling, 1 reply; 29+ messages in thread
From: Xingbo Wu @ 2014-08-14  2:42 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel

On Wed, Aug 13, 2014 at 5:04 PM, Xingbo Wu <wuxb45@gmail.com> wrote:
> On Wed, Aug 13, 2014 at 2:32 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 13.08.2014 um 18:38 hat Xingbo Wu geschrieben:
>>> On Wed, Aug 13, 2014 at 11:54 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> > Am 12.08.2014 um 01:38 hat 吴兴博 geschrieben:
>>> >> Hello,
>>> >>
>>> >>   The introduction in the wiki page present several advantages of qcow2 [1].
>>> >> But I'm a little confused. I really appreciate if any one can give me some help
>>> >> on this :).
>>> >>
>>> >>  (1) Currently the raw format doesn't support COW. In other words, a raw image
>>> >> cannot have a backing file. COW depends on the mapping table on which we it
>>> >> knows whether each block/cluster is present (has been modified) in the current
>>> >> image file. Modern file-systems like xfs/ext4/etc. provide extent/block
>>> >> allocation information to user-level. Like what 'filefrag' does with ioctl
>>> >> 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe block/raw-posix.c)
>>> >> may obtain correct 'present information about blocks. However this information
>>> >> may be limited to be aligned with file allocation unit size. Maybe it's just
>>> >> because a raw file has no space to store the "backing file name"? I don't think
>>> >> this could hinder the useful feature.
>>> >>
>>> >>  (2) As most popular filesystems support delay-allocation/on-demand allocation/
>>> >> holes, whatever, a raw image is also thin provisioned as other formats. It
>>> >> doesn't consume much disk space by storing useless zeros. However, I don't know
>>> >> if there is any concern on whether fragmented extents would become a burden of
>>> >> the host filesystem.
>>> >>
>>> >>  (3) For compression and encryption, I'm not an export on these topics at all
>>> >> but I think these features may not be vital to a image format as both guest/
>>> >> host's filesystem can also provide similar functionality.
>>> >>
>>> >>  (4) I don't have too much understanding on how snapshot works but I think
>>> >> theoretically it would be using the techniques no more than that used in COW
>>> >> and backing file.
>>> >>
>>> >> After all these thoughts, I still found no reason to not using a 'raw' file
>>> >> image (engineering efforts in Qemu should not count as we don't ask  for more
>>> >> features from outside world).
>>> >> I would be very sorry if my ignorance wasted your time.
>>> >
>>> > Even if it did work (that it's problematic is already discussed in other
>>> > subthreads) what advantage would you get from using an extended raw
>>> > driver compared to simply using qcow2, which supports all of this today?
>>> >
>>> > Kevin
>>>
>>>
>>> I read several messages from this thread: "[RFC] qed: Add QEMU
>>> Enhanced Disk format". To my understanding, if the new format can be
>>> acceptable to the community:
>>>   It needs to retain all the key features provided by qcow2,
>>> especially for compression, encryption, and internal snapshot, as
>>> mentioned in that thread.
>>>   And, needless to say, it must run faster.
>>>
>>> Yes I agree it's at least a subset of the homework one need to do
>>> before selling the new format to the community.
>>
>> So your goal is improved performance?
>>
>
> Yes if performance is not improved I won't spend more time on it :).
> I believe it's gonna be very difficult.
>
>> Why do you think that a raw driver with backing file support would run
>> much faster than qcow2? It would have to solve the same problems, like
>> doing efficient COW.
>>
>>> Thanks and another question:
>>> What's the magic that makes QED runs faster than QCOW2?
>>
>> During cluster allocation (which is the real critical part), QED is a
>> lot slower than today's qcow2. And by that I mean not just a few
>> percent, but like half the performance. After that, when accessing
>> already allocated data, both perform similar. Mailing list discussions
>> of four years ago don't reflect accurately how qemu works today.
>>
>> The main trick of QED was to introduce a dirty flag, which allowed to
>> call fdatasync() less often because it was okay for image metadata to
>> become inconsistent. After a crash, you have to repair the image then.
>>
>
> I'm very curious about this dirty flag trick. I was surprised when I
> observed very fast 'sync write' performance on QED.
> If it skips the fdatasync when processing the device 'flush' command from
> guest, it literally cheats the guest as the data can be lost. Am I that correct?
> Does the repairing make sure all the data written before the last
> successful 'flush'
> can be recovered?
> To my understanding, the 'flush' command in guest asks for persistence.
> Data has to be persistent on host storage after flush except for the
> image opened with 'cache=unsafe' mode.
>

I have some different ideas. Please correct me if I have made any mistakes.
The trick may not cause true consistency issues. The relaxed write
ordering (fewer fdatasync calls) seems to be safe.
The analysis is described in this message:
[http://lists.nongnu.org/archive/html/qemu-devel/2010-09/msg00515.html].

In my opinion the reason why the ordering is irrelevant is that any
uninitialized block could exist in a block device.
Unordered L1 updates and allocating L2 writes are also safe because
uninitialized blocks in a file are always zero or beyond EOF.
Any unsuccessful write of the L1/L2/data would cause the loss of the
data. However, at that point the guest cannot have returned from its
last 'flush', so the guest won't see a consistency issue in its data.
The repair process (qed-check.c) doesn't recover data; it only does
some scanning so that new requests can be processed. The 'check' can be
considered a normal part of bdrv_open().

BTW, filesystems heavily use this kind of trick to improve performance.
A sync write can return as an indication that the data was persistently
written, while the data may only have been committed to the journal.
Scanning and recovering from the journal is considered a normal job
of filesystems.


>> qcow2 supports the same with lazy_refcounts=on, but it's really only
>> useful in rare cases, mostly with cache=writethrough.
>>
>>> In some simple
>>> parallel IO tests QED can run a magnitude faster than QCOW2.  I saw
>>> differences on simple/complex metadata organization, and coroutine/aio
>>> (however "bdrv_co_"s finally call "bdrv_aio_"s via "_em". If you can
>>> provide some insight on this I would be really appreciate.
>>
>> Today, everything is internally coroutine operations, so every request
>> goes through bdrv_co_do_preadv/pwritev. The aio_* versions are just
>> wrappers around it for callers and block drivers that prefer a callback
>> based interface.
>>
>> Kevin
>
>
> --
>
> Cheers!
>        吴兴博  Wu, Xingbo <wuxb45@gmail.com>



-- 

Cheers!
       吴兴博  Wu, Xingbo <wuxb45@gmail.com>


* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-14  2:42         ` Xingbo Wu
@ 2014-08-14  9:06           ` Kevin Wolf
  2014-08-14 20:53             ` Xingbo Wu
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2014-08-14  9:06 UTC (permalink / raw)
  To: Xingbo Wu; +Cc: qemu-devel

Am 14.08.2014 um 04:42 hat Xingbo Wu geschrieben:
> On Wed, Aug 13, 2014 at 5:04 PM, Xingbo Wu <wuxb45@gmail.com> wrote:
> > On Wed, Aug 13, 2014 at 2:32 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >> Am 13.08.2014 um 18:38 hat Xingbo Wu geschrieben:
> >>> On Wed, Aug 13, 2014 at 11:54 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> >>> > Am 12.08.2014 um 01:38 hat 吴兴博 geschrieben:
> >>> >> Hello,
> >>> >>
> >>> >>   The introduction in the wiki page present several advantages of qcow2 [1].
> >>> >> But I'm a little confused. I really appreciate if any one can give me some help
> >>> >> on this :).
> >>> >>
> >>> >>  (1) Currently the raw format doesn't support COW. In other words, a raw image
> >>> >> cannot have a backing file. COW depends on the mapping table on which we it
> >>> >> knows whether each block/cluster is present (has been modified) in the current
> >>> >> image file. Modern file-systems like xfs/ext4/etc. provide extent/block
> >>> >> allocation information to user-level. Like what 'filefrag' does with ioctl
> >>> >> 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe block/raw-posix.c)
> >>> >> may obtain correct 'present information about blocks. However this information
> >>> >> may be limited to be aligned with file allocation unit size. Maybe it's just
> >>> >> because a raw file has no space to store the "backing file name"? I don't think
> >>> >> this could hinder the useful feature.
> >>> >>
> >>> >>  (2) As most popular filesystems support delay-allocation/on-demand allocation/
> >>> >> holes, whatever, a raw image is also thin provisioned as other formats. It
> >>> >> doesn't consume much disk space by storing useless zeros. However, I don't know
> >>> >> if there is any concern on whether fragmented extents would become a burden of
> >>> >> the host filesystem.
> >>> >>
> >>> >>  (3) For compression and encryption, I'm not an export on these topics at all
> >>> >> but I think these features may not be vital to a image format as both guest/
> >>> >> host's filesystem can also provide similar functionality.
> >>> >>
> >>> >>  (4) I don't have too much understanding on how snapshot works but I think
> >>> >> theoretically it would be using the techniques no more than that used in COW
> >>> >> and backing file.
> >>> >>
> >>> >> After all these thoughts, I still found no reason to not using a 'raw' file
> >>> >> image (engineering efforts in Qemu should not count as we don't ask  for more
> >>> >> features from outside world).
> >>> >> I would be very sorry if my ignorance wasted your time.
> >>> >
> >>> > Even if it did work (that it's problematic is already discussed in other
> >>> > subthreads) what advantage would you get from using an extended raw
> >>> > driver compared to simply using qcow2, which supports all of this today?
> >>> >
> >>> > Kevin
> >>>
> >>>
> >>> I read several messages from this thread: "[RFC] qed: Add QEMU
> >>> Enhanced Disk format". To my understanding, if the new format can be
> >>> acceptable to the community:
> >>>   It needs to retain all the key features provided by qcow2,
> >>> especially for compression, encryption, and internal snapshot, as
> >>> mentioned in that thread.
> >>>   And, needless to say, it must run faster.
> >>>
> >>> Yes I agree it's at least a subset of the homework one need to do
> >>> before selling the new format to the community.
> >>
> >> So your goal is improved performance?
> >>
> >
> > Yes if performance is not improved I won't spend more time on it :).
> > I believe it's gonna be very difficult.
> >
> >> Why do you think that a raw driver with backing file support would run
> >> much faster than qcow2? It would have to solve the same problems, like
> >> doing efficient COW.
> >>
> >>> Thanks and another question:
> >>> What's the magic that makes QED runs faster than QCOW2?
> >>
> >> During cluster allocation (which is the real critical part), QED is a
> >> lot slower than today's qcow2. And by that I mean not just a few
> >> percent, but like half the performance. After that, when accessing
> >> already allocated data, both perform similar. Mailing list discussions
> >> of four years ago don't reflect accurately how qemu works today.
> >>
> >> The main trick of QED was to introduce a dirty flag, which allowed to
> >> call fdatasync() less often because it was okay for image metadata to
> >> become inconsistent. After a crash, you have to repair the image then.
> >>
> >
> > I'm very curious about this dirty flag trick. I was surprised when I
> > observed very fast 'sync write' performance on QED.
> > If it skips the fdatasync when processing the device 'flush' command from
> > guest, it literally cheats the guest as the data can be lost. Am I that correct?
> > Does the repairing make sure all the data written before the last
> > successful 'flush'
> > can be recovered?
> > To my understanding, the 'flush' command in guest asks for persistence.
> > Data has to be persistent on host storage after flush except for the
> > image opened with 'cache=unsafe' mode.
> >
> 
> I have some different ideas. Please correct me if I make any mistake.
> The trick may not cause true consistency issues. The relaxed write
> ordering (less fdatasync) seems to be safe.
> The analysis on this is described in this
> [http://lists.nongnu.org/archive/html/qemu-devel/2010-09/msg00515.html].

Yes, specifically point 3. Without the dirty flag, you would have to
ensure that the file size is updated first and then the L2 table entry
is written. (This would still allow cluster leaks that cannot be
reclaimed, but at least no data corruption.)
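
To make the dirty flag idea concrete, here is a minimal sketch in Python.
It is not QED's on-disk format or code: ToyImage, its 5-byte header and
the allocate/flush/close methods are all hypothetical, and os.fsync()
stands in for fdatasync(). It only shows the ordering: one sync when the
flag is set, none per allocation, and a repair pass on open if the flag
was left set.

```python
import os
import struct

MAGIC = b"TOYI"                  # hypothetical magic, not a real format
HEADER = struct.Struct("<4sB")   # magic + one-byte dirty flag


class ToyImage:
    """Hypothetical image: a small header followed by appended clusters."""

    def __init__(self, path):
        self.syncs = 0
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(HEADER.pack(MAGIC, 0))
        self.f = open(path, "r+b")
        magic, dirty = HEADER.unpack(self.f.read(HEADER.size))
        if magic != MAGIC:
            raise ValueError("not a toy image")
        # A set flag means the last user did not close cleanly.
        self.repaired = bool(dirty)
        if dirty:
            self._repair()
        self.dirty = False

    def _sync(self):
        self.f.flush()
        os.fsync(self.f.fileno())   # stand-in for fdatasync()
        self.syncs += 1

    def _write_flag(self, dirty):
        self.f.seek(0)
        self.f.write(HEADER.pack(MAGIC, int(dirty)))
        self._sync()
        self.dirty = dirty

    def _repair(self):
        # A real format would rescan its metadata here; the toy version
        # only clears the flag again.
        self._write_flag(False)

    def allocate(self, payload):
        if not self.dirty:
            self._write_flag(True)  # one sync before the first change
        self.f.seek(0, os.SEEK_END)
        self.f.write(payload)       # no sync per allocation

    def flush(self):
        self._sync()                # a guest flush still has to be durable

    def close(self):
        if self.dirty:
            self._sync()            # data first, then clear the flag
            self._write_flag(False)
        self.f.close()
```

With this, two allocations in a row cost a single sync (the one that
sets the flag), and an image whose process died before close() reports
repaired == True the next time it is opened.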

> In my opinion the reason why the ordering is irrelevant is that any
> uninitialized block could exist in a block device.
> Unordered update l1 and alloc-write l2 are also safe because
> uninitialized blocks in a file is always zero or beyond the EOF.

Yes. This holds true because QED (unlike qcow2) cannot be used directly
on block devices. This is a real limitation.

> Any unsuccessful write of the l1/l2/data would cause the loss of the
> data. However, at that point the guest must not have returned from its
> last 'flush' so the guest won't have consistency issue on its data.
> The repair process (qed-check.c) doesn't recover data, it only does
> some scanning for processing new requests. the 'check' can be
> considered as a normal operation of bdrv_open().
> 
> BTW, filesystems heavily use this kind of 'tricks' to improve performance.
> The sync write could return as a indication of data being persistently
> written, while the data may have only been committed to the journal.
> Scanning and recovering from journal is considered as the normal job
> of filesystems.

But this is not a journal. It is something like fsck in ext2 times.

I believe qcow2 could be optimised a bit more if we added a journal to
it, but currently qcow2 performance isn't a problem urgent enough that I
could easily find the time to implement it. (We've discussed it several
times in the past.)

Kevin


* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-14  9:06           ` Kevin Wolf
@ 2014-08-14 20:53             ` Xingbo Wu
  2014-08-15 10:46               ` Kevin Wolf
  0 siblings, 1 reply; 29+ messages in thread
From: Xingbo Wu @ 2014-08-14 20:53 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel

>> >> The main trick of QED was to introduce a dirty flag, which allowed to
>> >> call fdatasync() less often because it was okay for image metadata to
>> >> become inconsistent. After a crash, you have to repair the image then.
>> >>
>> >
>> > I'm very curious about this dirty flag trick. I was surprised when I
>> > observed very fast 'sync write' performance on QED.
>> > If it skips the fdatasync when processing the device 'flush' command from
>> > guest, it literally cheats the guest as the data can be lost. Am I that correct?
>> > Does the repairing make sure all the data written before the last
>> > successful 'flush'
>> > can be recovered?
>> > To my understanding, the 'flush' command in guest asks for persistence.
>> > Data has to be persistent on host storage after flush except for the
>> > image opened with 'cache=unsafe' mode.
>> >
>>
>> I have some different ideas. Please correct me if I make any mistake.
>> The trick may not cause true consistency issues. The relaxed write
>> ordering (less fdatasync) seems to be safe.
>> The analysis on this is described in this
>> [http://lists.nongnu.org/archive/html/qemu-devel/2010-09/msg00515.html].
>
> Yes, specifically point 3. Without the dirty flag, you would have to
> ensure that the file size is updated first and then the L2 table entry
> is written. (This would still allow cluster leaks that cannot be
> reclaimed, but at least no data corruption.)
>
>> In my opinion the reason why the ordering is irrelevant is that any
>> uninitialized block could exist in a block device.
>> Unordered update l1 and alloc-write l2 are also safe because
>> uninitialized blocks in a file is always zero or beyond the EOF.
>
> Yes. This holds true because QED (unlike qcow2) cannot be used directly
> on block devices. This is a real limitation.
>

I don't know much about the best practices in virtualization. Could
you give me some examples? Thanks.
Do some products provide (automatically?) resizable logical volumes
and put one qcow2 image on each LV?
Anyway, does anyone use a physical disk to hold only one qcow2 image
for some special usage?

>> Any unsuccessful write of the l1/l2/data would cause the loss of the
>> data. However, at that point the guest must not have returned from its
>> last 'flush' so the guest won't have consistency issue on its data.
>> The repair process (qed-check.c) doesn't recover data, it only does
>> some scanning for processing new requests. the 'check' can be
>> considered as a normal operation of bdrv_open().
>>
>> BTW, filesystems heavily use this kind of 'tricks' to improve performance.
>> The sync write could return as a indication of data being persistently
>> written, while the data may have only been committed to the journal.
>> Scanning and recovering from journal is considered as the normal job
>> of filesystems.
>
> But this is not a journal. It is something like fsck in ext2 times.
>
> I believe qcow2 could be optimised a bit more if we added a journal to
> it, but currently qcow2 performance isn't a problem urgent enough that I
> could easily find the time to implement it. (We've discussed it several
> times in the past.)
>
> Kevin



-- 

Cheers!
       吴兴博  Wu, Xingbo <wuxb45@gmail.com>


* Re: [Qemu-devel] disk image: self-organized format or raw file
  2014-08-14 20:53             ` Xingbo Wu
@ 2014-08-15 10:46               ` Kevin Wolf
  0 siblings, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2014-08-15 10:46 UTC (permalink / raw)
  To: Xingbo Wu; +Cc: qemu-devel

Am 14.08.2014 um 22:53 hat Xingbo Wu geschrieben:
> >> >> The main trick of QED was to introduce a dirty flag, which allowed to
> >> >> call fdatasync() less often because it was okay for image metadata to
> >> >> become inconsistent. After a crash, you have to repair the image then.
> >> >>
> >> >
> >> > I'm very curious about this dirty flag trick. I was surprised when I
> >> > observed very fast 'sync write' performance on QED.
> >> > If it skips the fdatasync when processing the device 'flush' command from
> >> > guest, it literally cheats the guest as the data can be lost. Am I that correct?
> >> > Does the repairing make sure all the data written before the last
> >> > successful 'flush'
> >> > can be recovered?
> >> > To my understanding, the 'flush' command in guest asks for persistence.
> >> > Data has to be persistent on host storage after flush except for the
> >> > image opened with 'cache=unsafe' mode.
> >> >
> >>
> >> I have some different ideas. Please correct me if I make any mistake.
> >> The trick may not cause true consistency issues. The relaxed write
> >> ordering (less fdatasync) seems to be safe.
> >> The analysis on this is described in this
> >> [http://lists.nongnu.org/archive/html/qemu-devel/2010-09/msg00515.html].
> >
> > Yes, specifically point 3. Without the dirty flag, you would have to
> > ensure that the file size is updated first and then the L2 table entry
> > is written. (This would still allow cluster leaks that cannot be
> > reclaimed, but at least no data corruption.)
> >
> >> In my opinion the reason why the ordering is irrelevant is that any
> >> uninitialized block could exist in a block device.
> >> Unordered update l1 and alloc-write l2 are also safe because
> >> uninitialized blocks in a file is always zero or beyond the EOF.
> >
> > Yes. This holds true because QED (unlike qcow2) cannot be used directly
> > on block devices. This is a real limitation.
> >
> 
> I don't know much about the best practices in virtualization. Could
> you give me some examples? Thanks.
> Do some products provide resizeable (automatically?) Logical Volumes
> and put one qcow2 on each LV?
> Anyway, does someone use a physical disk to hold only one qcow2 image
> for some special usage?

I would be surprised if someone used a whole physical disk for a single
qcow2 image, but some people always do crazier things than you can
imagine...

Anyway, oVirt uses LVs to store qcow2 images on. It resizes the LVs on
the fly as they fill up. This seems to be working quite well.
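
As a rough illustration of "resize as they fill up", here is one
plausible policy as a Python sketch. It is not oVirt's actual algorithm:
plan_extension and its parameters (free_threshold, chunk) are
hypothetical names, and the highest offset written by the image is
assumed to be known to the caller.

```python
def plan_extension(lv_size, highest_offset, free_threshold, chunk):
    """Return the size the LV should have, in bytes.

    lv_size        -- current size of the logical volume
    highest_offset -- highest byte offset the image has written so far
    free_threshold -- minimum free space to keep beyond highest_offset
    chunk          -- granularity of each extension step
    All four parameters are assumptions made for this sketch.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    new_size = lv_size
    # Grow in whole chunks until enough headroom is available again.
    while new_size - highest_offset < free_threshold:
        new_size += chunk
    return new_size
```

For example, with a 10 GiB LV, a 512 MiB threshold and 1 GiB chunks, an
image that has written up to 4 GiB leaves the LV alone, while one that
has written almost to the end of the LV asks for 11 GiB.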

Kevin


Thread overview: 29+ messages
2014-08-11 23:38 [Qemu-devel] disk image: self-organized format or raw file 吴兴博
2014-08-12  0:52 ` Fam Zheng
2014-08-12 10:46   ` 吴兴博
2014-08-12 11:19     ` Fam Zheng
     [not found]       ` <CABPa+v1a7meoEtjLkwygjuZEABTqd8q3efGWJvAsAr-mLTQb-A@mail.gmail.com>
     [not found]         ` <20140812113916.GB2803@T430.redhat.com>
2014-08-12 12:03           ` 吴兴博
2014-08-12 12:21             ` Fam Zheng
2014-08-12 13:08   ` Kirill Batuzov
2014-08-12 13:23 ` Eric Blake
2014-08-12 13:45   ` 吴兴博
2014-08-12 14:07     ` Eric Blake
2014-08-12 14:14       ` 吴兴博
2014-08-12 15:30         ` Eric Blake
2014-08-12 16:22           ` Xingbo Wu
2014-08-13  1:29             ` Fam Zheng
2014-08-13 15:42           ` Kevin Wolf
2014-08-12 18:39       ` Richard W.M. Jones
2014-08-12 18:46 ` Daniel P. Berrange
2014-08-12 18:52   ` Richard W.M. Jones
2014-08-12 19:23     ` Xingbo Wu
2014-08-12 20:14       ` Richard W.M. Jones
2014-08-13 15:54 ` Kevin Wolf
2014-08-13 16:38   ` Xingbo Wu
2014-08-13 18:32     ` Kevin Wolf
2014-08-13 21:04       ` Xingbo Wu
2014-08-13 21:35         ` Eric Blake
2014-08-14  2:42         ` Xingbo Wu
2014-08-14  9:06           ` Kevin Wolf
2014-08-14 20:53             ` Xingbo Wu
2014-08-15 10:46               ` Kevin Wolf
