Re: [Qemu-devel] disk image: self-organized format or raw file

From: Kevin Wolf <kwolf@redhat.com>
To: Xingbo Wu <wuxb45@gmail.com>
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] disk image: self-organized format or raw file
Date: Thu, 14 Aug 2014 11:06:33 +0200
Message-ID: <20140814090633.GA3820@noname.redhat.com>
In-Reply-To: <CABPa+v3+oh1MB8icpN1cSk5Jr6N51et62EcAGGqyWbbxTJeoGA@mail.gmail.com>

On 14.08.2014 at 04:42, Xingbo Wu wrote:
> On Wed, Aug 13, 2014 at 5:04 PM, Xingbo Wu <wuxb45@gmail.com> wrote:
> > On Wed, Aug 13, 2014 at 2:32 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >> On 13.08.2014 at 18:38, Xingbo Wu wrote:
> >>> On Wed, Aug 13, 2014 at 11:54 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> >>> > On 12.08.2014 at 01:38, 吴兴博 wrote:
> >>> >> Hello,
> >>> >>
> >>> >>   The introduction in the wiki page presents several advantages of qcow2 [1].
> >>> >> But I'm a little confused. I would really appreciate it if anyone could give
> >>> >> me some help on this :).
> >>> >>
> >>> >>  (1) Currently the raw format doesn't support COW. In other words, a raw image
> >>> >> cannot have a backing file. COW depends on a mapping table from which it
> >>> >> knows whether each block/cluster is present (has been modified) in the current
> >>> >> image file. Modern file-systems like xfs/ext4/etc. provide extent/block
> >>> >> allocation information to user-level. Like what 'filefrag' does with ioctl
> >>> >> 'FIBMAP' and 'FIEMAP'. I guess the raw file driver (maybe block/raw-posix.c)
> >>> >> may obtain correct 'present' information about blocks. However this information
> >>> >> may be limited to be aligned with file allocation unit size. Maybe it's just
> >>> >> because a raw file has no space to store the "backing file name"? I don't think
> >>> >> this could hinder the useful feature.
> >>> >>
> >>> >>  (2) As most popular filesystems support delay-allocation/on-demand allocation/
> >>> >> holes, whatever, a raw image is also thin provisioned as other formats. It
> >>> >> doesn't consume much disk space by storing useless zeros. However, I don't know
> >>> >> if there is any concern on whether fragmented extents would become a burden of
> >>> >> the host filesystem.
> >>> >>
> >>> >>  (3) For compression and encryption, I'm not an expert on these topics at all
> >>> >> but I think these features may not be vital to an image format as both guest/
> >>> >> host's filesystem can also provide similar functionality.
> >>> >>
> >>> >>  (4) I don't have too much understanding on how snapshot works but I think
> >>> >> theoretically it would be using the techniques no more than that used in COW
> >>> >> and backing file.
> >>> >>
> >>> >> After all these thoughts, I still found no reason not to use a 'raw' file
> >>> >> image (engineering efforts in Qemu should not count as we don't ask for more
> >>> >> features from outside world).
> >>> >> I would be very sorry if my ignorance wasted your time.
> >>> >
> >>> > Even if it did work (that it's problematic is already discussed in other
> >>> > subthreads) what advantage would you get from using an extended raw
> >>> > driver compared to simply using qcow2, which supports all of this today?
> >>> >
> >>> > Kevin
> >>>
> >>>
> >>> I read several messages from this thread: "[RFC] qed: Add QEMU
> >>> Enhanced Disk format". To my understanding, if the new format can be
> >>> acceptable to the community:
> >>>   It needs to retain all the key features provided by qcow2,
> >>> especially for compression, encryption, and internal snapshot, as
> >>> mentioned in that thread.
> >>>   And, needless to say, it must run faster.
> >>>
> >>> Yes I agree it's at least a subset of the homework one needs to do
> >>> before selling the new format to the community.
> >>
> >> So your goal is improved performance?
> >>
> >
> > Yes if performance is not improved I won't spend more time on it :).
> > I believe it's gonna be very difficult.
> >
> >> Why do you think that a raw driver with backing file support would run
> >> much faster than qcow2? It would have to solve the same problems, like
> >> doing efficient COW.
> >>
> >>> Thanks and another question:
> >>> What's the magic that makes QED run faster than QCOW2?
> >>
> >> During cluster allocation (which is the real critical part), QED is a
> >> lot slower than today's qcow2. And by that I mean not just a few
> >> percent, but like half the performance. After that, when accessing
> >> already allocated data, both perform similarly. Mailing list discussions
> >> from four years ago don't accurately reflect how qemu works today.
> >>
> >> The main trick of QED was to introduce a dirty flag, which allowed calling
> >> fdatasync() less often because it was okay for image metadata to
> >> become inconsistent. After a crash, you have to repair the image then.
> >>
> >
> > I'm very curious about this dirty flag trick. I was surprised when I
> > observed very fast 'sync write' performance on QED.
> > If it skips the fdatasync when processing the device 'flush' command from
> > guest, it literally cheats the guest as the data can be lost. Am I correct?
> > Does the repairing make sure all the data written before the last
> > successful 'flush' can be recovered?
> > To my understanding, the 'flush' command in guest asks for persistence.
> > Data has to be persistent on host storage after flush except for the
> > image opened with 'cache=unsafe' mode.
> >
> 
> I have some different ideas. Please correct me if I make any mistake.
> The trick may not cause true consistency issues. The relaxed write
> ordering (less fdatasync) seems to be safe.
> The analysis on this is described in this
> [http://lists.nongnu.org/archive/html/qemu-devel/2010-09/msg00515.html].

Yes, specifically point 3. Without the dirty flag, you would have to
ensure that the file size is updated first and then the L2 table entry
is written. (This would still allow cluster leaks that cannot be
reclaimed, but at least no data corruption.)
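
To illustrate the idea, here is a toy model in Python. It is only a sketch
under my own assumptions: the class, the field names and the 4-byte cluster
size are made up and are not QED's on-disk format or QEMU's code. It shows an
L2 entry reaching the image before the file size is updated, and a check on
open that drops entries pointing past the end of the file when the dirty flag
is set.

```python
# Toy model only: names and layout are hypothetical, not QED's real format.
CLUSTER = 4


class ToyImage:
    def __init__(self):
        self.dirty = False          # header flag: metadata may be inconsistent
        self.file_size = CLUSTER    # cluster 0 is reserved for the header
        self.l2 = {}                # guest cluster index -> host file offset

    def allocating_write(self, guest_cluster, crash_before_size_update=False):
        # The flag is set before any allocation, so ordering no longer matters.
        self.dirty = True
        offset = self.file_size
        self.l2[guest_cluster] = offset      # L2 entry may hit the disk first
        if crash_before_size_update:
            return                           # simulated crash: file not extended
        self.file_size = offset + CLUSTER

    def open_and_repair(self):
        """Check on open: drop L2 entries that point past the end of the file."""
        if not self.dirty:
            return 0
        bad = [g for g, off in self.l2.items()
               if off + CLUSTER > self.file_size]
        for g in bad:
            del self.l2[g]
        self.dirty = False
        return len(bad)


img = ToyImage()
img.allocating_write(0)
img.allocating_write(1, crash_before_size_update=True)
repaired = img.open_and_repair()
print(repaired, img.l2)
```

The dropped entry belongs to a write that never completed, so the guest cannot
have had a successful flush that covered it.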

> In my opinion the reason why the ordering is irrelevant is that any
> uninitialized block could exist in a block device.
> Unordered L1 updates and allocating L2 writes are also safe because
> uninitialized blocks in a file are always zero or beyond the EOF.

Yes. This holds true because QED (unlike qcow2) cannot be used directly
on block devices. This is a real limitation.
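
The file behaviour this relies on can be checked with a few lines of Python.
This is a small sketch, assuming a POSIX host where extending a file with
ftruncate() yields a range that reads back as zeros; the sizes are arbitrary.
A block device gives no such guarantee for ranges that were never written.

```python
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"data")
    f.truncate(4096)          # extend the file; bytes 4..4095 were never written
    f.seek(4)
    hole = f.read(4092)       # unwritten range inside the file
    f.seek(8192)
    past_eof = f.read(16)     # range beyond the end of the file

all_zero = hole == bytes(4092)
print(all_zero, past_eof)
```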

> Any unsuccessful write of the l1/l2/data would cause the loss of the
> data. However, at that point the guest must not have returned from its
> last 'flush' so the guest won't have consistency issue on its data.
> The repair process (qed-check.c) doesn't recover data, it only does
> some scanning for processing new requests. the 'check' can be
> considered as a normal operation of bdrv_open().
> 
> BTW, filesystems heavily use this kind of 'tricks' to improve performance.
> The sync write could return as an indication of data being persistently
> written, while the data may have only been committed to the journal.
> Scanning and recovering from journal is considered as the normal job
> of filesystems.

But this is not a journal. It is more like fsck in the ext2 days.

I believe qcow2 could be optimised a bit more if we added a journal to
it, but currently qcow2 performance isn't a problem urgent enough that I
could easily find the time to implement it. (We've discussed it several
times in the past.)

Kevin