* [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
@ 2011-01-04 21:44 Chunqiang Tang
  2011-01-05 17:29 ` Anthony Liguori
  2011-01-06  9:17 ` Stefan Hajnoczi
  0 siblings, 2 replies; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-04 21:44 UTC (permalink / raw)
  To: qemu-devel

Dear QEMU Community Members,

Happy new year! We would like to contribute a new year gift to the 
community. 

As the community considers next-generation image formats for QEMU, 
hopefully we challenge ourselves hard enough to find the right solution 
for the long term, rather than just a convenient solution for the short 
term, because an image format has long-term impact and is hard to 
change once released.  In this spirit, we would like to argue that QCOW2's 
and QED's use of a two-level lookup table as the basis for implementing 
all features is a fundamental obstacle to achieving high performance. 
Accordingly, we advocate the newly developed Fast Virtual Disk (FVD) image 
format for adoption in the QEMU mainline. FVD achieves the performance of 
a RAW image running on a raw partition, while providing the rich features 
of compact image, copy-on-write, copy-on-read, and adaptive prefetching. 
FVD is extensible and can accommodate additional features. Experiments 
show that the throughput of FVD is 249% higher than that of QCOW2 when 
using the PostMark benchmark to create files.

FVD came out of work done at the IBM T.J. Watson Research Center while 
studying virtual-disk-related issues during the development of the IBM 
Cloud (http://www.ibm.com/services/us/igs/cloud-development/). At IBM 
internally, FVD (a.k.a. ODS) has been widely demonstrated since June 2010. 
Recently, the FVD technical papers were completed and the source code was 
cleared for external release. Now we finally can share FVD with the 
community, and seek your valuable feedback and contributions. All related 
information is available at 
https://researcher.ibm.com/researcher/view_project.php?id=1852 , including 
a high-level overview of FVD, the source code, and the technical papers. 

The FVD patch also includes a fully automated testing framework that 
exercises QEMU block device drivers under stress load and extreme race 
conditions. Currently (as of January 2011), QCOW2 cannot pass the 
automated test. The symptom is that QCOW2 attempts to read beyond the end 
of the base image. QCOW2 experts, please take a look at this "potential" 
bug.

Best Regards,
Chunqiang Tang 

Homepage: http://www.research.ibm.com/people/c/ctang



* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-04 21:44 [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249% Chunqiang Tang
@ 2011-01-05 17:29 ` Anthony Liguori
  2011-01-14 20:56   ` Chunqiang Tang
  2011-01-06  9:17 ` Stefan Hajnoczi
  1 sibling, 1 reply; 21+ messages in thread
From: Anthony Liguori @ 2011-01-05 17:29 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

Hi Chunqiang,

On 01/04/2011 03:44 PM, Chunqiang Tang wrote:
> Dear QEMU Community Members,
>
> Happy new year! We would like to contribute a new year gift to the
> community.
>
> As the community considers the next-generation image formats for QEMU,
> hopefully we really challenge ourselves hard enough to find the right
> solution for the long term, rather than just a convenient solution for the
> short term, because an image format has long-term impacts and is hard to
> change once released.  In this spirit, we would like to argue that QCOW2
> and QED’s use of a two-level lookup table as the basis for implementing
> all features is a fundamental obstacle for achieving high performance.
> Accordingly, we advocate the newly developed Fast Virtual Disk (FVD) image
> format for adoption in the QEMU mainline. FVD achieves the performance of
> a RAW image running on a raw partition, while providing the rich features
> of compact image, copy-on-write, copy-on-read, and adaptive prefetching.
> FVD is extensible and can accommodate additional features. Experiments
> show that the throughput of FVD is 249% higher than that of QCOW2 when
> using the PostMark benchmark to create files.
>
> FVD came out of the work done at IBM T.J. Watson Research Center, when
> studying virtual disk related issues during the development of the IBM
> Cloud (http://www.ibm.com/services/us/igs/cloud-development/). At IBM
> internally, FVD (a.k.a. ODS) has been widely demonstrated since June 2010.
> Recently, the FVD technical papers were completed and the source code was
> cleared for external release. Now we finally can share FVD with the
> community, and seek your valuable feedback and contributions. All related
> information is available at
> https://researcher.ibm.com/researcher/view_project.php?id=1852 , including
> a high-level overview of FVD, the source code, and the technical papers.
>
> The FVD patch also includes a fully automated testing framework that
> exercises QEMU block device drivers under stress load and extreme race
> conditions. Currently (as of January 2011), QCOW2 cannot pass the
> automated test. The symptom is that QCOW2 attempts to read beyond the end
> of the base image. QCOW2 experts please take a look at this "potential"
> bug.
>    

For any feature to be seriously considered for inclusion in QEMU, 
patches need to be posted to the mailing list against the latest git 
tree.  That's a pre-requisite for any real discussion.

There's a tremendous amount of desire to avoid further fragmentation of 
image formats.  Based on my limited understanding, I think FVD shares a 
lot in common with the COW format (block/cow.c).

But I think most of the advantages you mention could be considered as 
additions to either qcow2 or qed.  At any rate, the right way to have 
that discussion is in the form of patches on the ML.

Regards,

Anthony Liguori

> Best Regards,
> Chunqiang Tang
>
> Homepage: http://www.research.ibm.com/people/c/ctang
>
>    


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-04 21:44 [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249% Chunqiang Tang
  2011-01-05 17:29 ` Anthony Liguori
@ 2011-01-06  9:17 ` Stefan Hajnoczi
  2011-01-15  3:28   ` Chunqiang Tang
  1 sibling, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2011-01-06  9:17 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On Tue, Jan 4, 2011 at 9:44 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
> Happy new year! We would like to contribute a new year gift to the
> community.

We == IBM Research?

> The FVD patch also includes a fully automated testing framework that
> exercises QEMU block device drivers under stress load and extreme race
> conditions. Currently (as of January 2011), QCOW2 cannot pass the
> automated test. The symptom is that QCOW2 attempts to read beyond the end
> of the base image. QCOW2 experts please take a look at this "potential"
> bug.

The community block I/O test suite is qemu-iotests:
http://git.kernel.org/?p=linux/kernel/git/hch/qemu-iotests.git;a=summary

If you have tests that you'd like to contribute, please put them into
that framework so other developers can run them as part of their
regular testing.

Stefan


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-05 17:29 ` Anthony Liguori
@ 2011-01-14 20:56   ` Chunqiang Tang
  2011-01-19  1:12     ` Jamie Lokier
  2011-01-19 15:51     ` Christoph Hellwig
  0 siblings, 2 replies; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-14 20:56 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

> Based on my limited understanding, I think FVD shares a 
> lot in common with the COW format (block/cow.c).
> 
> But I think most of the advantages you mention could be considered as 
> additions to either qcow2 or qed.  At any rate, the right way to have 
> that discussion is in the form of patches on the ML.

FVD is much more advanced than block/cow.c. I would be happy to discuss 
possible leverage, but setting aside the details of QCOW2, QED, and FVD, 
let's start with a discussion of what is needed for the next-generation 
image format. 

First of all, of course, we need high performance. Through extensive 
benchmarking, I identified three major performance overheads in image 
formats. The numbers cited below are based on the PostMark benchmark. See 
the paper for more details,  
http://researcher.watson.ibm.com/researcher/files/us-ctang/FVD-cow.pdf .

P1) Increased disk seek distance caused by a compact image’s distorted 
data layout. Specifically, the average disk seek distance in QCOW2 is 460% 
longer than that in a RAW image.

P2) Overhead of storing an image on a host file system. Specifically, a 
RAW image stored on ext3 is 50-63% slower than a RAW image stored on a raw 
partition.

P3) Overhead in reading or updating an image format’s on-disk metadata. 
Due to this overhead, QCOW2 causes 45% more total disk I/Os (including 
I/Os for accessing both data and metadata) than FVD does.

For P1), I use the term compact image instead of sparse image, because a 
RAW image stored as a sparse file in ext3 is a sparse image, but is not a 
compact image. A compact image stores data in such a way that the file 
size of the image file is smaller than the size of the virtual disk 
perceived by the VM. QCOW2 is a compact image. The disadvantage of a 
compact image is that the data layout perceived by the guest OS differs 
from the actual layout on the physical disk, which defeats many 
optimizations in guest file systems. Consider one concrete example. When 
the guest VM issues a disk I/O request to the hypervisor using a virtual 
block address (VBA), QEMU’s block device driver translates the VBA into an 
image block address (IBA), which specifies where the requested data are 
stored in the image file, i.e., IBA is an offset in the image file. When a 
guest OS creates or resizes a file system, it writes out the file system 
metadata, which are all grouped together and assigned consecutive image 
block addresses (IBAs) by QCOW2, despite the fact that the metadata’s 
virtual block addresses (VBAs) are deliberately scattered for better 
reliability and locality, e.g., co-locating inodes and file content blocks 
in block groups. As a result, it may cause a long disk seek distance 
between accessing a file’s metadata and accessing the file’s content 
blocks. 

For P2), using a host file system is inefficient, because 1) historically 
file systems are optimized for small files rather than large images, and 
2) certain functions of a host file system are simply redundant with 
respect to the function of a compact image, e.g., performing storage 
allocation. Moreover, using a host file system not only adds overhead, but 
also introduces data integrity issues. Specifically, if I/O uses O_DSYNC, 
it may be too slow; if it uses O_DIRECT, it cannot guarantee data 
integrity in the event of a host crash. See 
http://lwn.net/Articles/348739/ . 
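
To make the trade-off concrete, here is a minimal sketch in C (not FVD 
code; the 512-byte sector size and the file paths are assumptions): 
O_DSYNC pushes every write to stable storage, which is safe but slow, 
whereas O_DIRECT bypasses the host page cache but needs aligned buffers 
and still leaves the disk's volatile write cache unflushed unless 
fdatasync() is issued.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum { SECTOR = 512 };          /* assumed logical sector size */

/* Safe but potentially slow: every write is forced to stable storage. */
static int write_sector_dsync(const char *path, const void *buf)
{
    int fd = open(path, O_WRONLY | O_DSYNC);
    if (fd < 0) return -1;
    ssize_t n = pwrite(fd, buf, SECTOR, 0);
    close(fd);
    return n == SECTOR ? 0 : -1;
}

/* Bypasses the page cache; buffer, size, and offset must be aligned.
 * The disk's volatile write cache is NOT flushed unless fdatasync() runs. */
static int write_sector_direct(const char *path, const void *data)
{
    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0) return -1;
    void *buf;
    if (posix_memalign(&buf, SECTOR, SECTOR)) { close(fd); return -1; }
    memcpy(buf, data, SECTOR);
    ssize_t n = pwrite(fd, buf, SECTOR, 0);
    int rc = (n == SECTOR && fdatasync(fd) == 0) ? 0 : -1;
    free(buf);
    close(fd);
    return rc;
}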

For P3), the overhead includes both reading and updating on-disk 
metadata. The former can be reduced by 
minimizing the size of metadata so that they can be easily cached in 
memory. Reducing the latter requires optimizations to avoid updating the 
on-disk metadata whenever possible, while not compromising data integrity 
in the event of a host crash. 

In addition to addressing the performance overheads caused by P1-P3, 
ideally the next-generation image format should meet the following 
functional requirements and perhaps beyond.

R1) Support storage over-commit.
R2) Support compact image, copy-on-write, copy-on-read, and adaptive 
prefetching.
R3) Allow eliminating the host file system to achieve high performance.
R4) Make all these features orthogonal, i.e., each feature can be enabled 
or disabled individually without affecting other features. The purpose is 
to support diverse use cases. For example, a copy-on-write image can use 
a RAW-image-like data layout to avoid the overhead associated with a 
compact image. 

Storage over-commit means that, e.g., a 100GB physical disk can be used to 
host 10 VMs, each with a 20GB virtual disk. This is possible because not 
every VM completely fills up its 20GB virtual disk. It is not mandatory to 
use compact image in order to support storage over-commit. For example, 
RAW images stored as sparse files on ext3 support storage over-commit. 
Copy-on-read and adaptive prefetching complement copy-on-write in certain 
use cases, e.g., in a Cloud where the backing image is stored on 
network-attached storage (NAS) while the copy-on-write image is stored on 
direct-attached storage (DAS). When the VM reads a block from the backing 
image, a copy of the data is saved in the copy-on-write image for later 
reuse. Adaptive prefetching finds resource idle times to copy from NAS to 
DAS parts of the image that have not been accessed by the VM before. 
Prefetching should be conservative: if the driver detects contention on 
any resource (including DAS, NAS, or the network), it pauses prefetching 
temporarily and resumes later when the congestion disappears.
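
As a rough illustration of this conservative policy (a sketch only; the 
thresholds and field names are hypothetical, not FVD's actual 
algorithm), a prefetcher can compare recently observed copy throughput 
against an idle-time baseline and back off when it drops, which 
indicates contention:

/* Sketch of conservative prefetch throttling; thresholds are made up. */
typedef struct Prefetcher {
    double baseline_mbps;     /* throughput measured when resources were idle */
    double recent_mbps;       /* throughput of the most recent prefetch reads */
    int    paused;
    long   resume_after_ms;   /* back-off delay before probing again          */
} Prefetcher;

static void prefetch_tick(Prefetcher *p, long elapsed_ms)
{
    if (p->paused) {
        p->resume_after_ms -= elapsed_ms;
        if (p->resume_after_ms <= 0) {
            p->paused = 0;                /* congestion assumed gone: resume */
        }
    } else if (p->recent_mbps < 0.5 * p->baseline_mbps) {
        /* Someone else (the VM, DAS, NAS, or network) needs the bandwidth. */
        p->paused = 1;
        p->resume_after_ms = 2000;        /* pause, then probe again later   */
    }
}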

Next, let me briefly describe how FVD is designed to address the 
performance issues P1-P3 and the functional requirements R1-R4. FVD has 
the following features.

F1) Use a bitmap to implement copy-on-write.
F2) Use a one-level lookup table to implement compact image.
F3) Use a journal to commit changes to the bitmap and the lookup table.
F4) Store a compact image on a logical volume to support storage 
over-commit, and to avoid the overhead and data integrity issues of a host 
file system.

For F1), a bit in the bitmap tracks the state of a block. The bit is 0 if 
the block is in the base image, and the bit is 1 if the block is in the 
FVD image. The default block size is 64KB, the same as in QCOW2. To 
represent the state of a 1TB base image, FVD only needs a 2MB bitmap, 
which can be easily cached in memory. This bitmap also implements 
copy-on-read and adaptive prefetching.
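
To make the arithmetic concrete, here is a minimal sketch (not the 
actual FVD code) of how a virtual block address maps to a bit in such a 
bitmap; 1TB / 64KB = 16M blocks = 16M bits = 2MB of bitmap:

#include <stdint.h>

#define BLOCK_SIZE (64 * 1024)

/* Returns 0 if the block still lives in the base image, 1 if it is in
 * the FVD image (i.e., it has been written or copied on read). */
static int block_in_fvd_image(const uint8_t *bitmap, uint64_t virtual_addr)
{
    uint64_t block = virtual_addr / BLOCK_SIZE;
    return (bitmap[block / 8] >> (block % 8)) & 1;
}

static void mark_block_in_fvd_image(uint8_t *bitmap, uint64_t virtual_addr)
{
    uint64_t block = virtual_addr / BLOCK_SIZE;
    bitmap[block / 8] |= 1 << (block % 8);
}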

For F2), one entry in the table maps the virtual disk address of a chunk 
to an offset in the FVD image where the chunk is stored. The default 
chunk size is 1MB, the same as in VirtualBox VDI (VMware VMDK and 
Microsoft VHD use a chunk size of 2MB). For a 1TB virtual disk, the 
lookup table is only 4MB. Because the table is so small, there is no 
need for a two-level lookup table as in QCOW2.
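
Again as a sketch (field names and the 4-byte entry encoding are chosen 
for illustration), a read splits the virtual block address into a chunk 
index and an offset within the chunk; with 1MB chunks, a 1TB disk needs 
1M entries of 4 bytes each, i.e., a 4MB table:

#include <stdint.h>

#define CHUNK_SIZE   (1024 * 1024)
#define CHUNK_UNUSED 0xFFFFFFFFu

/* Translate a virtual block address (VBA) into an image block address
 * (IBA), or return -1 if the chunk has not been allocated yet. */
static int64_t virtual_to_image_offset(const uint32_t *table, uint64_t vba)
{
    uint64_t chunk  = vba / CHUNK_SIZE;      /* index into the table    */
    uint64_t offset = vba % CHUNK_SIZE;      /* offset inside the chunk */

    if (table[chunk] == CHUNK_UNUSED) {
        return -1;
    }
    /* The entry stores the chunk's position in the image, in chunk units. */
    return (int64_t)table[chunk] * CHUNK_SIZE + offset;
}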

F1) and F2) are essential. They meet the requirement R4), i.e., the 
features of copy-on-write and compact image can be enabled individually. 
F1) and F2) are closest to the Microsoft Virtual Hard Disk (VHD) format, 
which also uses a bitmap and a one-level table. There are some key 
differences though. VHD partitions the bitmap and stores a fragment of the 
bitmap with every 2MB chunk. As a result, VHD does not meet the 
requirement R4, because it cannot provide a copy-on-write image with a 
RAW-image-like data layout. Also because of that, a bit in VHD can only 
represent the state of a 512-byte sector (if a bit represented a 64KB 
block, the chunk size would have to be 2GB, which is far too large and 
makes storage over-commit ineffective). For a 1TB image, the bitmap in 
VHD is 256MB, versus 2MB in FVD, which makes caching much harder. 

F3) uses a journal to commit metadata updates; this is not essential and 
there are alternative implementations. The journal does, however, provide 
benefits 
in addressing P3) (i.e., reducing metadata update overhead) and 
simplifying implementation. By default, the size of the journal is 16MB. 
When the bitmap and/or the lookup table are updated by a write, the 
changes are saved in the journal. When the journal is full, the entire 
bitmap and the entire lookup table are flushed to the disk, and the 
journal can be recycled for reuse. Because the bitmap and the lookup table 
are small, the flush is quick. The journal provides several benefits. 
First, updating both the bitmap and the lookup table requires only a 
single write to the journal. Second, K concurrent updates to any portions 
of the bitmap or the lookup table are converted to K sequential writes in the 
journal, and they can be merged into a single write by the host Linux 
kernel. Third, it increases concurrency by avoiding locking the bitmap or 
the lookup table. For example, updating one bit in the bitmap requires 
writing a 512-byte sector to the on-disk bitmap. This bitmap sector covers 
a total of 512*8*64KB = 256MB of data. That is, any two writes that 
target that 256MB region and require updating the bitmap cannot be 
processed concurrently. The journal solves this problem and eliminates 
the locking.
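
A minimal sketch of that journal logic (illustrative only; the record 
layout and the helpers flush_bitmap_and_table() and append_to_journal() 
are hypothetical stubs, not FVD code):

#include <stddef.h>
#include <stdint.h>

typedef struct Journal {
    uint64_t size;          /* e.g., 16MB                              */
    uint64_t used;          /* bytes of journal records written so far */
} Journal;

static int flush_bitmap_and_table(void) { return 0; }   /* stub */
static int append_to_journal(Journal *j, const void *rec, size_t len)
{
    (void)j; (void)rec; (void)len; return 0;             /* stub */
}

static int commit_metadata_update(Journal *j, const void *record, size_t len)
{
    if (j->used + len > j->size) {
        /* Journal full: the bitmap and lookup table are small, so writing
         * them out in their entirety is quick; the journal is then reused. */
        if (flush_bitmap_and_table() < 0) {
            return -1;
        }
        j->used = 0;
    }
    /* A single sequential journal write records changes to both the bitmap
     * and the lookup table; concurrent updates become back-to-back journal
     * writes, with no locking of the metadata structures themselves. */
    if (append_to_journal(j, record, len) < 0) {
        return -1;
    }
    j->used += len;
    return 0;
}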

For F4), it is actually quite straightforward to eliminate the host file 
system. The main thing that an image format needs from the host file 
system is to perform storage allocation. This function, however, is 
already performed by a compact image. Using a host file system simply ends 
up doing storage allocation twice, which requires updating on-disk 
metadata twice and introduces a distorted data layout twice. Therefore, 
if we migrate the necessary functions of a host file system into the 
image format (in other words, implement a mini file system in the image 
format), then we can get rid of the host file system. This is exactly what 
FVD does, by slightly enhancing the compact image function that is already 
there. FVD can manage incrementally added storage space, like ZFS and 
unlike ext2/3/4. For example, when FVD manages a 100GB virtual disk, it 
initially gets 5GB storage space from the logical volume manager and uses 
it to host many 1MB chunks. When the first 5GB is used up, FVD gets 
another 5GB to host more 1MB chunks, and so forth. Unlike QCOW2 and more 
like a file system, FVD does not always have to allocate a new chunk 
right after the previously allocated one. Instead, it may spread chunks 
across the storage space in order to mimic a RAW-image-like data layout. 
More details will be explained in follow-up emails. 
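
To give a flavor of such an allocation policy (a sketch only, not FVD's 
actual allocator), chunk placement can be made proportional to the 
chunk's virtual position, so the layout approximates a RAW image even 
though only part of the storage space has been obtained so far:

#include <stdint.h>

/* Place virtual chunk v (out of v_total) roughly where it would sit in a
 * RAW image, scaled to the chunks currently obtained from the volume
 * manager; fall back to the nearest free slot. Returns -1 when full. */
static int64_t allocate_chunk(uint8_t *used, uint64_t avail_chunks,
                              uint64_t v, uint64_t v_total)
{
    uint64_t start = v * avail_chunks / v_total;   /* proportional position */

    for (uint64_t d = 0; d < avail_chunks; d++) {
        uint64_t cand = (start + d) % avail_chunks;
        if (!used[cand]) {
            used[cand] = 1;
            return (int64_t)cand;
        }
    }
    return -1;   /* ask the volume manager for another 5GB, then retry */
}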

The description above is long but is still a summary. Please refer to more 
detailed information on the web site, 
http://researcher.watson.ibm.com/researcher/view_project.php?id=1852 . 
Hopefully I have given a summary of the problems, the requirements, and 
the solutions in FVD, which can serve as the basis for a productive 
discussion.


Regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-06  9:17 ` Stefan Hajnoczi
@ 2011-01-15  3:28   ` Chunqiang Tang
  2011-01-15 17:27     ` Stefan Weil
       [not found]     ` <AANLkTinw2S2dzKoeFK-dBP6b36J+VNLjb3f-vbkKm3Fz@mail.gmail.com>
  0 siblings, 2 replies; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-15  3:28 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel

> The community block I/O test suite is qemu-iotests:
> http://git.kernel.org/?p=linux/kernel/git/hch/qemu-iotests.git;a=summary
> If you have tests that you'd like to contribute, please put them into
> that framework so other developers can run them as part of their
> regular testing.

Hi Stefan,

What I described is not a qemu-io test case. I also use qemu-io, which is 
very helpful, but I observed that qemu-io has several limitations in 
discovering elusive bugs:

B1) qemu-io cannot trigger many race condition bugs, because it does not 
fully control the timing of events. For example, qemu-io cannot test this 
scenario: three concurrent writes a, b, and c are processed by 
bdrv_aio_writev() in the order of Pa, Pb, and Pc; their writes are 
actually persisted on disk in another order of Wc, Wa, and Wb; and finally 
their callbacks are invoked in yet another order of Vb, Vc, and Va. Some 
race condition bugs may exist in the code (e.g., inappropriate locking) 
because the code does not anticipate that these orderings of events are 
possible. This is just one example. In theory, there can be 100 concurrent 
reads or writes, and their events can happen in any permutation order. It 
is nearly impossible to manually generate test cases for all of them.

B2) Even if a race condition bug is triggered by chance, its behavior 
depends on subtle event timing that is hard to repeat and hence hard to 
debug. 

B3) With qemu-io, it is hard to test code paths that handle I/O failures. 
For example, a disk write may fail due to disk media error. Because these 
errors are rare, the failure handling code paths may never be tested, 
which for example may contain a null pointer bug that can crash the entire 
VM or gradually leak resources (e.g., memory) due to incomplete cleanup.

B4) qemu-io requires manually creating test cases, which is not only time 
consuming but also leads to low test coverage. This is because many bugs 
happen in scenarios that the developers do not anticipate, and hence they 
do not know how to create test cases for them in the first place. 

The FVD patch includes a new testing framework that addresses the above 
issues. This testing framework is orthogonal to FVD and can be used to 
test other block device drivers as well. This testing framework includes 
two components that can be used both separately and in combination.

T1) To address the problems B1-B3, I implemented an emulated disk in 
block/sim.c, which allows full control of event timings, either manually 
or automatically. Given the three concurrent writes example above, their 9 
events (Pa, Pb, Pc, Wa, Wb, Wc, Va, Vb, and Vc) can be precisely 
controlled to be executed in any given order. Moreover, the emulated disk 
can inject disk I/O errors in a controlled manner. For example, it can 
fail a specific read or write to test how the code handles that, or it can 
even fail as many as 90% of the reads/writes to test if the code has 
resource leaks. qemu-io is extended with a module qemu-io-sim.c to work 
with the emulated disk block/sim.c, so that the tester can use the qemu-io 
console to manually control the order of events or fail disk reads or 
writes.
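
The core idea can be pictured with a short sketch (not the actual 
block/sim.c code): instead of issuing I/O immediately, the simulated 
driver queues every event (request processing, disk write, callback) on 
a list, and the tester, or a seeded random number generator, decides 
which queued event fires next:

typedef enum { EV_PROCESS, EV_DISK_WRITE, EV_CALLBACK } EventType;

typedef struct SimEvent {
    EventType        type;
    int              request_id;    /* the outstanding request it belongs to */
    void           (*fire)(struct SimEvent *ev);
    struct SimEvent *next;
} SimEvent;

static SimEvent *pending;           /* queued, not-yet-executed events */

static void queue_event(SimEvent *ev)
{
    ev->next = pending;             /* nothing happens until the tester says so */
    pending  = ev;
}

static void fire_nth_pending_event(int n)
{
    SimEvent **p = &pending;
    while (n-- > 0 && *p) {
        p = &(*p)->next;
    }
    if (*p) {
        SimEvent *ev = *p;
        *p = ev->next;              /* unlink, then execute in the chosen order */
        ev->fire(ev);
    }
}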

T2) The solution in T1 still does not address the problem of B4), i.e., 
manually generating test cases is time consuming and yields low coverage. 
This problem is solved by a new testing tool called qemu-test. qemu-test 
can 1) automatically generate an unlimited number of randomized test cases 
that, e.g., execute 1,000 concurrent disk reads or writes on overlapping 
disk regions; 2) automatically generate the corresponding anticipated 
correct results, automatically run the tests, and automatically compare 
the actual test results with the anticipated correct results. Once it 
discovers a difference, which indicates a bug, it halts testing and waits 
for the developer to debug.  The randomized test cases created by 
qemu-test are controlled by a pseudo random number generator, and hence 
the behavior is completely repeatable. Therefore, once a bug is triggered, 
it can be repeated precisely, an unlimited number of times, to facilitate 
debugging, even if the bug occurs extremely rarely in real runs of a VM. 
qemu-test is fully automated. Once started, it can continuously 
run, e.g., for months to test an enormous number of test cases.

The implementation of qemu-test is actually not that complicated. It opens 
two virtual disks: the so-called truth image and test image. 
The truth image is served by a trivial synchronous block device driver so 
that its behavior is guaranteed to be correct. The test image is served by 
a real block device driver (e.g., FVD or QCOW2) that we want to test. 
qemu-test submits the same sequence of disk I/O requests (which is 
randomly generated) to the truth image and the test image, and expects 
that the two images' contents never diverge. Otherwise, it indicates a bug in 
the test image’s block device driver. qemu-test works with the emulated 
disk block/sim.c so that it can randomize event timings in a controlled 
manner and can inject disk I/O errors randomly. 
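
The main loop can be pictured roughly as follows (a sketch with 
hypothetical helper names and sizes, not the real qemu-test source): a 
seeded generator produces each request, the request goes to both images, 
and the contents are compared; reusing the seed replays the identical 
run:

#include <stdlib.h>

#define DISK_SIZE   (1024L * 1024 * 1024)    /* assumed 1GB test disk */
#define MAX_IO_SIZE (10L * 1024 * 1024)      /* assumed 10MB cap      */

/* Hypothetical helpers, declared only: the truth image uses a trivial
 * synchronous driver; the test image uses the driver under test. */
static void submit_to_truth_image(int is_write, long off, long len);
static void submit_to_test_image(int is_write, long off, long len);
static int  images_match(void);
static void report_bug(unsigned seed, int round);

static int run_random_test(unsigned seed, int rounds)
{
    srand(seed);                              /* same seed => identical run */

    for (int i = 0; i < rounds; i++) {
        long offset   = rand() % DISK_SIZE;
        long length   = rand() % MAX_IO_SIZE;
        int  is_write = rand() % 2;

        submit_to_truth_image(is_write, offset, length);
        submit_to_test_image(is_write, offset, length);

        if (!images_match()) {
            report_bug(seed, i);              /* replay later with this seed */
            return -1;
        }
    }
    return 0;
}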

I found qemu-test extremely powerful in discovering elusive bugs that I 
never anticipated, and using qemu-test is effortless. Whenever I completed 
some major code upgrade, I simply started qemu-test in the evening and 
came back in the morning to collect bugs, if any. Debugging them is also 
easy because the bugs are precisely repeatable even if they are hard to 
trigger.

As for the QCOW2 bug I mentioned previously, it can be triggered by 
test-qcow2.sh. A faster way to trigger it is to bypass those correct test 
runs by executing the commands below:

dd if=/dev/zero of=/var/ramdisk/truth.raw count=0 bs=1 seek=1155683840
dd if=/dev/zero of=/var/ramdisk/zero-500M.raw count=0 bs=1 seek=609064448
./qemu-img create -f qcow2 -b /var/ramdisk/zero-500M.raw \
    /var/ramdisk/test.qcow2 1155683840
./qemu-test --seed=116579177 --truth=/var/ramdisk/truth.raw \
    --test=/var/ramdisk/test.qcow2 --verify_write=true --compare_before=false \
    --compare_after=true --round=100000 --parallel=100 --io_size=10485760 \
    --fail_prob=0 --cancel_prob=0 --instant_qemubh=true

As for the FVD patch that includes the new testing framework, I tried to 
post it on the mailing list twice but it always got bounced back, either 
because the message is too big or because of a Notes client configuration 
issue. Until I figure it out, please download the FVD patch from 
https://researcher.ibm.com/researcher/files/us-ctang/FVD-01-14-2011.patch 
.

Best regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang



* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-15  3:28   ` Chunqiang Tang
@ 2011-01-15 17:27     ` Stefan Weil
  2011-01-20  2:59       ` Chunqiang Tang
       [not found]     ` <AANLkTinw2S2dzKoeFK-dBP6b36J+VNLjb3f-vbkKm3Fz@mail.gmail.com>
  1 sibling, 1 reply; 21+ messages in thread
From: Stefan Weil @ 2011-01-15 17:27 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Stefan Hajnoczi, qemu-devel

Am 15.01.2011 04:28, schrieb Chunqiang Tang:
>> The community block I/O test suite is qemu-iotests:
>> http://git.kernel.org/?p=linux/kernel/git/hch/qemu-iotests.git;a=summary
>> If you have tests that you'd like to contribute, please put them into
>> that framework so other developers can run them as part of their
>> regular testing.
>
> Hi Stefan,
>
> What I described is not a qemu-io test case. I also use qemu-io, which is
> very helpful, but I observed that qemu-io has several limitations in
> discovering elusive bugs:
>
> B1) qemu-io cannot trigger many race condition bugs, because it does not
> fully control the timing of events. For example, qemu-io cannot test this
> scenario: three concurrent writes a, b, and c are processed by
> bdrv_aio_writev() in the order of Pa, Pb, and Pc; their writes are
> actually persisted on disk in another order of Wc, Wa, and Wb; and 
> finally
> their callbacks are invoked in yet another order of Vb, Vc, and Va. Some
> race condition bugs may exist in the code (e.g., inappropriate locking),
> because it does not anticipate these orders of events are possible. This
> is just one example. In theory, there can be 100 concurrent reads or
> writes, and their events can happen in an arbitrary permutation order. It
> is nearly impossible to manually generating test cases for all of them.
>
> B2) Even if a race condition bug is triggered by chance, its behavior
> depends on subtle event timing that is hard to repeat and hence hard to
> debug.
>
> B3) With qemu-io, it is hard to test code paths that handle I/O failures.
> For example, a disk write may fail due to disk media error. Because these
> errors are rare, the failure handling code paths may never be tested,
> which for example may contain a null pointer bug that can crash the 
> entire
> VM or gradually leaks resources (e.g., memory) due to incomplete cleanup.
>
> B4) qemu-io requires manually creating test cases, which is not only time
> consuming but also leads a low coverage in testing. This is because many
> bugs happen in scenarios that the developers do not anticipate, and hence
> do not know how to create test cases in the first place.
>
> The FVD patch includes a new testing framework that addresses the above
> issues. This testing framework is orthogonal to FVD and can be used to
> test other block device drivers as well. This testing framework includes
> two components that can be used both separately and in a combination
>
> T1) To address the problems of B1- B3, I implemented an emulated disk in
> block/sim.c, which allows a full control of event timings, either 
> manually
> or automatically. Given the three concurrent writes example above, 
> their 9
> events (Pa, Pb, Pc, Wa, Wb, Wc, Va, Vb, and Vc) can be precisely
> controlled to be executed in any given order. Moreover, the emulated disk
> can inject disk I/O errors in a controlled manner. For example, it can
> fail a specific read or write to test how the code handles that, or it 
> can
> even fail as many as 90% of the reads/writes to test if the code has
> resource leaks. qemu-io is extended with a module qemu-io-sim.c to work
> with the emulated disk block/sim.c, so that the tester can use the 
> qemu-io
> console to manually control the order of events or fail disk reads or
> writes.
>
> T2) The solution in T1 still does not address the problem of B3), i.e.,
> manually generating test cases is time consuming and has a low coverage.
> This problem is solved by a new testing tool called qemu-test. qemu-test
> can 1) automatically generate an unlimited number of randomized test 
> cases
> that, e.g., execute 1,000 concurrent disk reads or writes on overlapping
> disk regions; 2) automatically generate the corresponding anticipated
> correct results, automatically run the tests, and automatically compare
> the actual test results with the anticipated correct results. Once it
> discovers a difference, which indicates a bug, it halts testing and waits
> for the developer to debug. The randomized test cases created by
> qemu-test are controlled by a pseudo random number generator, and hence
> the behavior is completely repeatable. Therefore, once a bug is 
> triggered,
> it can be precisely repeated for an unlimited number of times to
> facilitate debugging, even if this bug happens extremely rare in real 
> runs
> of a VM. qemu-test is fully automated. Once started, it can continuously
> run, e.g., for months to test an enormous number of test cases.
>
> The implementation of qemu-test is actually not that complicated. It 
> opens
> two virtual disks, the so-called truth image and test image, 
> respectively.
> The truth image is served by a trivial synchronous block device driver so
> that its behavior is guaranteed to be correct. The test image is served a
> real block device driver (e.g., FVD or QCOW2) that we want to test.
> qemu-test submits the same sequence of disk I/O requests (which is
> randomly generated) to the truth image and the test image, and expect 
> that
> the two images’ contents never diverge. Otherwise, it indicates a bug in
> the test image’s block device driver. qemu-test works with the emulated
> disk block/sim.c so that it can randomize event timings in a controlled
> manner and can inject disk I/O errors randomly.
>
> I found qemu-test extremely powerful in discovering elusive bugs that I
> never anticipated, and using qemu-test is effortless. Whenever I 
> completed
> some major code upgrade, I simply started qemu-test in the evening and
> came back in the morning to collect bugs, if any. Debugging them is also
> easy because the bugs are precisely repeatable even if they are hard to
> trigger.
>
> As for the QCOW2 bug I mentioned previously, it can be triggered by
> test-qcow2.sh. A faster way to trigger it is to bypass those correct test
> runs by executing the commands below:
>
> dd if=/dev/zero of=/var/ramdisk/truth.raw count=0 bs=1 seek=1155683840
> dd if=/dev/zero of=/var/ramdisk/zero-500M.raw count=0 bs=1 seek=609064448
> ./qemu-img create -f qcow2 -b /var/ramdisk/zero-500M.raw
> /var/ramdisk/test.qcow2 1155683840
> ./qemu-test --seed=116579177 --truth=/var/ramdisk/truth.raw
> --test=/var/ramdisk/test.qcow2 --verify_write=true --compare_before=false
> --compare_after=true --round=100000 --parallel=100 --io_size=10485760
> --fail_prob=0 --cancel_prob=0 --instant_qemubh=true
>
> As for the FVD patch that includes the new testing framework, I tried to
> post it on the mailing list twice but it always got bounced back, either
> because the message is too big or because of a Notes client configuration
> issue. Until I figure it out, please down the FVD patch from
> https://researcher.ibm.com/researcher/files/us-ctang/FVD-01-14-2011.patch
> .
>
> Best regards,
> ChunQiang (CQ) Tang, Ph.D.
> Homepage: http://www.research.ibm.com/people/c/ctang

Hi,

when I tried to use your patch, I found several problems:

* The patch does not apply cleanly to the latest QEMU.
   This is caused by recent changes in QEMU git master.

* The new code uses tabs instead of spaces (QEMU coding rules).

* Some lines of the new code end with blank characters.

* The patch adds empty lines at the end of some files.

The last two points are reported by newer versions of git
(which refuse to take such patches with the default setting).

Could you please update your patch to fix these issues?
I'd like to apply it to my QEMU code and try the new FVD.

If needed, I could also send your patch to qemu-devel.

Kind regards,
Stefan Weil


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
       [not found]     ` <AANLkTinw2S2dzKoeFK-dBP6b36J+VNLjb3f-vbkKm3Fz@mail.gmail.com>
@ 2011-01-17 10:37       ` Stefan Hajnoczi
  2011-01-18 20:35         ` Chunqiang Tang
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2011-01-17 10:37 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Kevin Wolf, qemu-devel

Resend because qemu-devel was dropped from CC.  Thanks for pointing it
out Kevin.

On Sat, Jan 15, 2011 at 12:25 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Sat, Jan 15, 2011 at 3:28 AM, Chunqiang Tang <ctang@us.ibm.com> wrote:
>> T1) To address the problems of B1- B3, I implemented an emulated disk in
>> block/sim.c, which allows a full control of event timings, either manually
>> or automatically. Given the three concurrent writes example above, their 9
>> events (Pa, Pb, Pc, Wa, Wb, Wc, Va, Vb, and Vc) can be precisely
>> controlled to be executed in any given order. Moreover, the emulated disk
>> can inject disk I/O errors in a controlled manner. For example, it can
>> fail a specific read or write to test how the code handles that, or it can
>> even fail as many as 90% of the reads/writes to test if the code has
>> resource leaks. qemu-io is extended with a module qemu-io-sim.c to work
>> with the emulated disk block/sim.c, so that the tester can use the qemu-io
>> console to manually control the order of events or fail disk reads or
>> writes.
>
> block/blkdebug.c already provides fault injection and is used in
> qemu-iotests test 026.  Using blkdebug it is possible to test specific
> error paths in image formats.  We should look at merging random
> failures ("fail as many as 90% of the reads/writes") into blkdebug.
>
>> T2) The solution in T1 still does not address the problem of B3), i.e.,
>> manually generating test cases is time consuming and has a low coverage.
>> This problem is solved by a new testing tool called qemu-test. qemu-test
>> can 1) automatically generate an unlimited number of randomized test cases
>> that, e.g., execute 1,000 concurrent disk reads or writes on overlapping
>> disk regions; 2) automatically generate the corresponding anticipated
>> correct results, automatically run the tests, and automatically compare
>> the actual test results with the anticipated correct results. Once it
>> discovers a difference, which indicates a bug, it halts testing and waits
>> for the developer to debug.  The randomized test cases created by
>> qemu-test are controlled by a pseudo random number generator, and hence
>> the behavior is completely repeatable. Therefore, once a bug is triggered,
>> it can be precisely repeated for an unlimited number of times to
>> facilitate debugging, even if this bug happens extremely rare in real runs
>> of a VM. qemu-test is fully automated. Once started, it can continuously
>> run, e.g., for months to test an enormous number of test cases.
>>
>> The implementation of qemu-test is actually not that complicated. It opens
>> two virtual disks, the so-called truth image and test image, respectively.
>> The truth image is served by a trivial synchronous block device driver so
>> that its behavior is guaranteed to be correct. The test image is served a
>> real block device driver (e.g., FVD or QCOW2) that we want to test.
>> qemu-test submits the same sequence of disk I/O requests (which is
>> randomly generated) to the truth image and the test image, and expect that
>> the two images’ contents never diverge. Otherwise, it indicates a bug in
>> the test image’s block device driver. qemu-test works with the emulated
>> disk block/sim.c so that it can randomize event timings in a controlled
>> manner and can inject disk I/O errors randomly.
>
> block/blkverify.c already provides I/O verification.  It mirrors
> writes to a raw file and compares the contents of read blocks to
> detect data integrity issues.  That's the same approach you have
> described.
>
>> I found qemu-test extremely powerful in discovering elusive bugs that I
>> never anticipated, and using qemu-test is effortless. Whenever I completed
>> some major code upgrade, I simply started qemu-test in the evening and
>> came back in the morning to collect bugs, if any. Debugging them is also
>> easy because the bugs are precisely repeatable even if they are hard to
>> trigger.
>
> Here are the unique features you've described beyond what qemu-io,
> blkdebug, and blkverify do:
>
> 1. New functionality
>  * Control over ordering of I/O request submission and completion.
>  * Random I/O generator (probably as new qemu-io command).
>
> 2. Enhancements to existing code:
>  * Random chance of failing I/O in blkdebug.
>
> Do you agree with this or are there other unique features which are
> beyond small enhancements to existing code?
>
> I think the best strategy is to consolidate these as incremental
> patches that can be reviewed and merged independently.
>
>> As for the FVD patch that includes the new testing framework, I tried to
>> post it on the mailing list twice but it always got bounced back, either
>> because the message is too big or because of a Notes client configuration
>> issue. Until I figure it out, please down the FVD patch from
>> https://researcher.ibm.com/researcher/files/us-ctang/FVD-01-14-2011.patch
>
> I'll send you my git-send-email config off-list.
>
> Stefan
>


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-17 10:37       ` Stefan Hajnoczi
@ 2011-01-18 20:35         ` Chunqiang Tang
  2011-01-19  0:59           ` Jamie Lokier
  0 siblings, 1 reply; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-18 20:35 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

> > Here are the unique features you've described beyond what qemu-io,
> > blkdebug, and blkverify do:
> >
> > 1. New functionality
> >  * Control over ordering of I/O request submission and completion.
> >  * Random I/O generator (probably as new qemu-io command).
> >
> > 2. Enhancements to existing code:
> >  * Random chance of failing I/O in blkdebug.
> >
> > Do you agree with this or are there other unique features which are
> > beyond small enhancements to existing code?
> >
> > I think the best strategy is to consolidate these as incremental
> > patches that can be reviewed and merged independently.

Hi Stefan,

I agree with the strategy you described. Among the things you summarized, 
'random chance of failing I/O in blkdebug' is probably trivial to add. 
'random I/O generator' (i.e., the currently stand-alone program qemu-test) 
may be foldable into qemu-io as a new command. Controlling I/O order and 
callback order is the most significant change, and it has already been 
integrated as several qemu-io commands in the FVD patch.

 The purpose of controlling I/O order and callback order is to test race 
conditions under concurrent requests. It is implemented as the “sim” 
driver in block/sim.c, by following an event-driven simulation approach 
and maintaining an outstanding event list. The “sim” driver can either 
remain standalone or be folded into blkdebug.c. The latter case would 
require significant changes to blkdebug.c. 

Doing both fault injection and verification together introduces some 
subtlety. For example, even under the random failure mode, two disk writes 
triggered by one VM-issued write must either fail together or succeed 
together. Otherwise, the truth image and the test image will diverge and 
verification won't succeed. Currently, qemu-test carefully works with the 
'sim' driver to guarantee those conditions. These conditions need to be 
retained after any code restructuring. 
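
One simple way to preserve that invariant (a sketch, not the actual 
qemu-test code) is to make the fail/succeed decision once per VM-issued 
request and have every physical write derived from that request consult 
the same decision:

#include <stdlib.h>

typedef struct VMRequest {
    int id;
    int inject_failure;    /* decided once, when the VM request arrives */
} VMRequest;

static void new_vm_request(VMRequest *req, int id, double fail_prob)
{
    req->id = id;
    req->inject_failure = ((double)rand() / RAND_MAX) < fail_prob;
}

/* All physical writes spawned by the same VM request share one fate. */
static int do_physical_write(const VMRequest *req /*, buf, offset, len */)
{
    if (req->inject_failure) {
        return -1;         /* every write of this request fails together */
    }
    /* ... perform the simulated write ... */
    return 0;
}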

Best regards,
Chunqiang Tang


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-18 20:35         ` Chunqiang Tang
@ 2011-01-19  0:59           ` Jamie Lokier
  2011-01-19 14:59             ` Chunqiang Tang
  0 siblings, 1 reply; 21+ messages in thread
From: Jamie Lokier @ 2011-01-19  0:59 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

Chunqiang Tang wrote:
> Doing both fault injection and verification together introduces some 
> subtlety. For example, even under the random failure mode, two disk writes 
> triggered by one VM-issued write must either fail together or succeed 
> together. Otherwise, the truth image and the test image will diverge and 
> verification won't succeed. Currently, qemu-test carefully works with the 
> 'sim' driver to guarantee those conditions. Those conditions need be 
> retained after code restructure. 

If the real backend is a host system file or device, and AIO or
multi-threaded writes are used, you can't depend on two parallel disk
writes (triggered by one VM-issued write) failing together or
succeeding together.  All you can do is look at the error code after
each operation completes, and use it to prevent issuing later
operations.  You can't stop the other parallel operations that are
already in progress.

Is that an issue in your design assumptions?

Thanks,
-- Jamie


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-14 20:56   ` Chunqiang Tang
@ 2011-01-19  1:12     ` Jamie Lokier
  2011-01-19  8:10       ` Stefan Hajnoczi
  2011-01-19 15:51     ` Christoph Hellwig
  1 sibling, 1 reply; 21+ messages in thread
From: Jamie Lokier @ 2011-01-19  1:12 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

Chunqiang Tang wrote:
> > Based on my limited understanding, I think FVD shares a 
> > lot in common with the COW format (block/cow.c).
> > 
> > But I think most of the advantages you mention could be considered as 
> > additions to either qcow2 or qed.  At any rate, the right way to have 
> > that discussion is in the form of patches on the ML.
> 
> FVD is much more advanced than block/cow.c. I would be happy to discuss 
> possible leverage, but setting aside the details of QCOW2, QED, and FVD, 
> let’s start with a discussion of what is needed for the next generation 
> image format. 

Thank you for the detailed description.

FVD looks quite good to me; it seems very simple yet performant at the
same time, due to its smart design.

> Moreover, using a host file system not only adds overhead, but 
> also introduces data integrity issues. Specifically, if I/Os uses O_DSYNC, 
> it may be too slow. If I/Os use O_DIRECT, it cannot guarantee data 
> integrity in the event of a host crash. See 
> http://lwn.net/Articles/348739/ . 

You have the same issue with O_DIRECT when using a raw disk device
too.  That is, O_DIRECT on a raw device does not guarantee integrity
in the event of a host crash either, for mostly the same reasons.

-- Jamie


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-19  1:12     ` Jamie Lokier
@ 2011-01-19  8:10       ` Stefan Hajnoczi
  2011-01-19 15:17         ` Chunqiang Tang
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2011-01-19  8:10 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On Wed, Jan 19, 2011 at 1:12 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Chunqiang Tang wrote:
>> Moreover, using a host file system not only adds overhead, but
>> also introduces data integrity issues. Specifically, if I/Os uses O_DSYNC,
>> it may be too slow. If I/Os use O_DIRECT, it cannot guarantee data
>> integrity in the event of a host crash. See
>> http://lwn.net/Articles/348739/ .
>
> You have the same issue with O_DIRECT when using a raw disk device
> too.  That is, O_DIRECT on a raw device does not guarantee integrity
> in the event of a host crash either, for mostly the same reasons.

QEMU has semantics that use O_DIRECT safely; there is no issue here.
When a drive is added with cache=none, QEMU not only uses O_DIRECT but
also advertises an enabled write cache to the guest.

The guest *must* flush the cache when it wants to ensure data is
stable.  In the event of a host crash, all, some, or none of the I/O
since the last flush may have made it to disk.  Each of these
possibilities is fair game since the guest may only depend on writes
being on disk if they completed and a successful flush was issued
afterwards.
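
The same contract applies to any application writing to a disk with a 
volatile write cache; as a plain POSIX analogy (not QEMU code), data may 
only be relied upon after a flush issued after the write has returned 
successfully:

#include <fcntl.h>
#include <unistd.h>

static int durable_write(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len) {
        return -1;
    }
    /* Until this flush succeeds, the data may or may not survive a crash;
     * after it succeeds, the data is guaranteed to be on stable storage. */
    return fdatasync(fd);
}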

Stefan


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-19  0:59           ` Jamie Lokier
@ 2011-01-19 14:59             ` Chunqiang Tang
  0 siblings, 0 replies; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-19 14:59 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

> > Doing both fault injection and verification together introduces some 
> > subtlety. For example, even under the random failure mode, two disk 
writes 
> > triggered by one VM-issued write must either fail together or succeed 
> > together. Otherwise, the truth image and the test image will diverge 
and 
> > verification won't succeed. Currently, qemu-test carefully works with 
the 
> > 'sim' driver to guarantee those conditions. Those conditions need be 
> > retained after code restructure. 
> 
> If the real backend is a host system file or device, and AIO or
> multi-threaded writes are used, you can't depend on two parallel disk
> writes (triggered by one VM-issued write) failing together or
> succeeding together.  All you can do is look at the error code after
> each operation completes, and use it to prevent issuing later
> operations.  You can't stop the other parallel operations that are
> already in progress.
> 
> Is that an issue in your design assumptions?

Your description of the problem is accurate, i.e., "if  AIO or 
multi-threaded writes are used, you can't stop the other parallel 
operations that are already in progress." As a result, a naive extension 
of blkverify to test concurrent requests would not work. The simulated 
block driver (block/sim.c) in the FVD patch uses neither AIO nor 
multi-threaded I/O as the backend. It instead uses a 'simulated' backend, 
which allows full control of I/O order and callback order, and can 
enforce that two parallel disk writes (triggered by one VM-issued write) 
either fail together or succeed together, along with some other 
properties, which makes testing more comprehensive and debugging easier. 


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-19  8:10       ` Stefan Hajnoczi
@ 2011-01-19 15:17         ` Chunqiang Tang
  2011-01-19 15:25           ` Christoph Hellwig
  2011-01-19 23:56           ` Jamie Lokier
  0 siblings, 2 replies; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-19 15:17 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel

> >> Moreover, using a host file system not only adds overhead, but
> >> also introduces data integrity issues. Specifically, if I/Os uses 
O_DSYNC,
> >> it may be too slow. If I/Os use O_DIRECT, it cannot guarantee data
> >> integrity in the event of a host crash. See
> >> http://lwn.net/Articles/348739/ .
> >
> > You have the same issue with O_DIRECT when using a raw disk device
> > too.  That is, O_DIRECT on a raw device does not guarantee integrity
> > in the event of a host crash either, for mostly the same reasons.
> 
> QEMU has semantics that use O_DIRECT safely; there is no issue here.
> When a drive is added with cache=none, QEMU not only uses O_DIRECT but
> also advertises an enabled write cache to the guest.
> 
> The guest *must* flush the cache when it wants to ensure data is
> stable.  In the event of a host crash, all, some, or none of the I/O
> since the last flush may have made it to disk.  Each of these
> possibilities is fair game since the guest may only depend on writes
> being on disk if they completed and a successful flush was issued
> afterwards.

Thank you both for the explanation, which is very helpful to me. With 
FVD's ability to eliminate the host file system and store the image on a 
logical volume, perhaps we can always use O_DSYNC, because there is 
little (or no?) LVM metadata that needs a flush on every write, and 
hence O_DSYNC does not add overhead? I am not certain about this and 
would appreciate confirmation. If this is true, the guest does not need 
to flush the cache. 


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-19 15:17         ` Chunqiang Tang
@ 2011-01-19 15:25           ` Christoph Hellwig
  2011-01-19 23:56           ` Jamie Lokier
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-01-19 15:25 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Stefan Hajnoczi, qemu-devel

On Wed, Jan 19, 2011 at 10:17:47AM -0500, Chunqiang Tang wrote:
> Thank both of you for the explanation, which is very helpful to me. With 
> FVD's capability of eliminating the host file system and storing the image 
> on a logical volume, then perhaps we can always use O_DSYNC, because there 
> is little (or no?) LVM metadata that needs a flush on every write and 
> hence O_DSYNC  does not add overhead? I am not certain on this, and need 
> help for confirmation. If this is true, the guest does not need to flush 
> the cache. 

O_DSYNC flushes the volatile write cache of the disk on every write,
which can be very inefficient.  In addition to that, image formats really
should obey the configurable caching settings qemu has; they exist for
a reason and should be handled uniformly across different image formats
and protocol drivers.


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-14 20:56   ` Chunqiang Tang
  2011-01-19  1:12     ` Jamie Lokier
@ 2011-01-19 15:51     ` Christoph Hellwig
  2011-01-19 16:21       ` Chunqiang Tang
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2011-01-19 15:51 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On Fri, Jan 14, 2011 at 03:56:00PM -0500, Chunqiang Tang wrote:
> P2) Overhead of storing an image on a host file system. Specifically, a 
> RAW image stored on ext3 is 50-63% slower than a RAW image stored on a raw 
> partition.

Sorry, benchmarking this against ext3 really doesn't matter.  Benchmark
it against xfs or ext4 with a preallocated image (fallocate or dd).

> For P1), I uses the term compact image instead of sparse image, because a 
> RAW image stored as a sparse file in ext3 is a sparse image, but is not a 
> compact image. A compact image stores data in such a way that the file 
> size of the image file is smaller than the size of the virtual disk 
> perceived by the VM. QCOW2 is a compact image. The disadvantage of a 
> compact image is that the data layout perceived by the guest OS differs 
> from the actual layout on the physical disk, which defeats many 
> optimizations in guest file systems.

It's something filesystems have to deal with.  Real storage is getting
increasingly virtualized.  While this didn't matter for the real high-end
storage, which has been doing this for a long time, it's getting more
and more exposed to the filesystem.  That includes LVM layouts and
thinly provisioned disk arrays, which are getting increasingly popular.
That doesn't mean the 64k (or, until recently, 4k) cluster size in qcow2
is a good idea; we'd want extents at least an order of magnitude or two
larger to perform well, but it means filesystems really do have to cope
with it.

> For P2), using a host file system is inefficient, because 1) historically 
> file systems are optimized for small files rather than large images,

I'm not sure where you are pulling this from, but it is absolutely not
correct.  Since the dawn of time there have been filesystems optimized
for small files, for larger or really large files, or trying to strike a
tradeoff in between.

> 2) certain functions of a host file system are simply redundant with 
> respect to the function of a compact image, e.g., performing storage 
> allocation. Moreover, using a host file system not only adds overhead, but 
> also introduces data integrity issues.

I/O into fully preallocated files uses exactly the same codepath as
doing I/O to the block device, except for an identity logical-to-physical
block mapping in the block device and a non-trivial one in the
filesystem.  Note that the block mapping is cached and does not affect
the performance.  I published the numbers for qemu in the various
caching modes and all major filesystems a while ago, so I'm not making
this up.
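
For reference, full preallocation can be done with fallocate(1), dd, or
directly from C; a minimal sketch (path and size are placeholders):

#include <fcntl.h>
#include <unistd.h>

/* Create a fully preallocated raw image so that later guest writes never
 * trigger block allocation in the host filesystem. */
static int create_preallocated_image(const char *path, off_t size)
{
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        return -1;
    }
    int rc = posix_fallocate(fd, 0, size);    /* returns 0 on success */
    close(fd);
    return rc == 0 ? 0 : -1;
}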


> Specifically, if I/Os uses O_DSYNC, 
> it may be too slow. If I/Os use O_DIRECT, it cannot guarantee data 
> integrity in the event of a host crash. See 
> http://lwn.net/Articles/348739/ . 

I/O to a block device does not guarantee data integrity without
O_DSYNC either.

> Storage over-commit means that, e.g., a 100GB physical disk can be used to 
> host 10 VMs, each with a 20GB virtual disk.

The current storage industry buzzword for that is thin provisioning.


* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-19 15:51     ` Christoph Hellwig
@ 2011-01-19 16:21       ` Chunqiang Tang
  2011-01-19 16:42         ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-19 16:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: qemu-devel

> It's something filesystems have to deal with.  Real storage is getting
> increasingly virtualized.  While this didn't matter for the real high-end
> storage, which has been doing this for a long time, it's getting more and
> more exposed to the filesystem.  That includes LVM layouts and thinly
> provisioned disk arrays, which are getting increasingly popular.  That
> doesn't mean the 64k (or, until recently, 4k) cluster size in qcow2 is a
> good idea; we'd want extents at least one or two orders of magnitude larger
> to perform well.  But it does mean filesystems really do have to cope with
> it.

Yes, a fundamental and optimal solution would be to change the guest file 
systems, but that is a much longer route: it requires introducing 
virtualization awareness into every guest file system, and it also requires 
changing the interface between the guest and the host.

> I/O into fully preallocated files uses exactly the same codepath as
> doing I/O to the block device, except for an identity logical-to-physical
> block mapping in the block device case and a non-trivial one in the
> filesystem case.  Note that the block mapping is cached and does not
> affect performance.  I've published the numbers for qemu in the various
> caching modes and all major filesystems a while ago, so I'm not making
> this up.

Preallocation is not a universal solution here, because it defeats the 
other goal: thin provisioning. Moreover, if preallocation is used, it 
works best with RAW images and makes a compact image unnecessary, which 
is exactly one goal of FVD, i.e., the option to disable the compact image 
data layout without giving up other features such as copy-on-write.

* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-19 16:21       ` Chunqiang Tang
@ 2011-01-19 16:42         ` Christoph Hellwig
  2011-01-19 17:08           ` Chunqiang Tang
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2011-01-19 16:42 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Christoph Hellwig, qemu-devel

On Wed, Jan 19, 2011 at 11:21:07AM -0500, Chunqiang Tang wrote:
> Yes, a fundamental and optimal solution would be to change the guest file 
> systems, but that is a much longer route: it requires introducing 
> virtualization awareness into every guest file system, and it also requires 
> changing the interface between the guest and the host.

Actually, current filesystems do pretty well on thinly provisioned
storage, as long as your extent size is not too small.  With extent sizes
in the 64M to 256M range there's almost no difference from non-virtualized
storage.

> Preallocation is not a universal solution here, because it defeats the 
> other goal: thin provisioning. Moreover, if preallocation is used, it 
> works best with RAW images and makes a compact image unnecessary, which 
> is exactly one goal of FVD, i.e., the option to disable the compact image 
> data layout without giving up other features such as copy-on-write.

Again, sparse images with a large enough allocation size give you almost
the same numbers as preallocated images.  I've been doing quite a lot of
work on TP support in QEMU.  Using an XFS filesystem to back the image
with an extent size hint in the above-mentioned range gives performance
within 1% of fully preallocated images, with the added benefit that the
space can be deallocated again through the SCSI WRITE_SAME or ATA TRIM
commands.
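
(For anyone who wants to try this, a rough sketch of setting the extent
size hint from C, roughly what `xfs_io -c "extsize 268435456" file` does
on a still-empty image file.  The header and flag names below come from
xfsprogs and may vary between versions; this is an assumption-laden
illustration, not the patches mentioned above:)

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>   /* XFS_IOC_FSGETXATTR/FSSETXATTR, struct fsxattr */

/* Set an extent size hint (in bytes, e.g. 256 << 20) on an empty file.
 * XFS then allocates space in chunks of that size as the file is written. */
int set_extsize_hint(const char *path, unsigned int bytes)
{
    struct fsxattr fsx;
    int ret, fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    ret = ioctl(fd, XFS_IOC_FSGETXATTR, &fsx);
    if (ret == 0) {
        fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;   /* honour fsx_extsize */
        fsx.fsx_extsize = bytes;
        ret = ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
    }
    close(fd);
    return ret;
}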

* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-19 16:42         ` Christoph Hellwig
@ 2011-01-19 17:08           ` Chunqiang Tang
  2011-01-19 17:25             ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-19 17:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: qemu-devel

> Actually, current filesystems do pretty well on thinly provisioned
> storage, as long as your extent size is not too small.  With extent sizes
> in the 64M to 256M range there's almost no difference from non-virtualized
> storage.
> 
> Again, sparse images with a large enough allocation size give you almost
> the same numbers as preallocated images.  I've been doing quite a lot of
> work on TP support in QEMU.  Using an XFS filesystem to back the image
> with an extent size hint in the above-mentioned range gives performance
> within 1% of fully preallocated images, with the added benefit that the
> space can be deallocated again through the SCSI WRITE_SAME or ATA TRIM
> commands.

These numbers are very interesting and I would like to read more. Are 
your detailed results accessible on the Internet? Do you have numbers on 
the impact of using a large extent size (64-256MB) on thin provisioning 
(since a guest file system's metadata are written first and are scattered 
in the virtual disk)? 

* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-19 17:08           ` Chunqiang Tang
@ 2011-01-19 17:25             ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-01-19 17:25 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: qemu-devel

On Wed, Jan 19, 2011 at 12:08:41PM -0500, Chunqiang Tang wrote:
> These numbers are very interesting and I would like to read more. Are 
> your detailed results accessible on the Internet? Do you have numbers on 
> the impact of using a large extent size (64-256MB) on thin provisioning 
> (since a guest file system's metadata are written first and are scattered 
> in the virtual disk)? 

It's still work in progress.  I will publish patches and numbers in a
few weeks.

* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-19 15:17         ` Chunqiang Tang
  2011-01-19 15:25           ` Christoph Hellwig
@ 2011-01-19 23:56           ` Jamie Lokier
  1 sibling, 0 replies; 21+ messages in thread
From: Jamie Lokier @ 2011-01-19 23:56 UTC (permalink / raw)
  To: Chunqiang Tang; +Cc: Stefan Hajnoczi, qemu-devel

Chunqiang Tang wrote:
> > >> Moreover, using a host file system not only adds overhead, but
> > >> also introduces data integrity issues. Specifically, if I/O uses O_DSYNC,
> > >> it may be too slow. If I/O uses O_DIRECT, it cannot guarantee data
> > >> integrity in the event of a host crash. See
> > >> http://lwn.net/Articles/348739/ .
> > >
> > > You have the same issue with O_DIRECT when using a raw disk device
> > > too.  That is, O_DIRECT on a raw device does not guarantee integrity
> > > in the event of a host crash either, for mostly the same reasons.
> > 
> > QEMU has semantics that use O_DIRECT safely; there is no issue here.
> > When a drive is added with cache=none, QEMU not only uses O_DIRECT but
> > also advertises an enabled write cache to the guest.
> > 
> > The guest *must* flush the cache when it wants to ensure data is
> > stable.  In the event of a host crash, all, some, or none of the I/O
> > since the last flush may have made it to disk.  Each of these
> > possibilities is fair game since the guest may only depend on writes
> > being on disk if they completed and a successful flush was issued
> > afterwards.
> 
Thank you both for the explanation, which is very helpful to me. With 
FVD's capability of eliminating the host file system and storing the image 
on a logical volume, perhaps we can always use O_DSYNC, because there is 
little (or no?) LVM metadata that needs a flush on every write, and hence 
O_DSYNC would not add overhead? I am not certain about this, and need 
help confirming it. If this is true, the guest does not need to flush 
the cache. 

I think O_DSYNC does not work as you might expect on raw disk devices
and logical volumes.

That doesn't mean you don't need something for crash durability!
Instead, you need to issue the disk cache flushes in whatever way works.

It actually has a very *high* overhead.

The overhead isn't from metadata - it is from needing to flush the
disk cache after every write, which prevents the disk from reordering
writes.

If you don't issue the flushes, and the physical device has a volatile
write cache, then you cannot guarantee integrity in the event of a
host crash.

This can make a filesystem faster than a raw disk or logical volume in
some configurations, if the filesystem journals data writes to limit
the seeking needed to commit durably.
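
(A minimal sketch of the difference being described, assuming fd is an
image file or block device opened without O_DSYNC; alignment, error
handling and short writes are ignored, so this is only an illustration:)

#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <unistd.h>

/* Per-write flushing, roughly what O_DSYNC gives you: the drive's cache
 * is drained after every single write, so nothing can be reordered or
 * merged and every write pays the flush latency. */
void write_each_durable(int fd, const char *buf, size_t len, int count)
{
    for (int i = 0; i < count; i++) {
        pwrite(fd, buf, len, (off_t)i * len);
        fdatasync(fd);
    }
}

/* Batch the writes, then flush once: the same data is durable after the
 * final fdatasync(), but the disk was free to reorder in the meantime.
 * This is essentially what a journaling filesystem does at commit time. */
void write_batch_then_flush(int fd, const char *buf, size_t len, int count)
{
    for (int i = 0; i < count; i++)
        pwrite(fd, buf, len, (off_t)i * len);
    fdatasync(fd);
}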

-- Jamie

* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
  2011-01-15 17:27     ` Stefan Weil
@ 2011-01-20  2:59       ` Chunqiang Tang
  0 siblings, 0 replies; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-20  2:59 UTC (permalink / raw)
  To: Stefan Weil; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

> when I tried to use your patch, I found several problems:
> 
> * The patch does not apply cleanly to latest QEMU.
>    This is caused by recent changes in QEMU git master.
> 
> * The new code uses tabs instead of spaces (QEMU coding rules).
> 
> * Some lines of the new code end with blank characters.
> 
> * The patch adds empty lines at the end of some files.
> 
> The last two points are reported by newer versions of git
> (which refuse to take such patches with the default setting).
> 
> Could you please update your patch to fix these issues?
> I'd like to apply it to my QEMU code and try the new FVD.

Thank you for the detailed instructions. I fixed all the issues above and 
posted the latest patches to the mailing list (see below). Please use this 
latest version, which also includes some other fixes. BTW, I found out 
that my previous patches were rejected by the mailing list because a 
single patch for FVD was too big.

http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg01948.html
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg01947.html
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg01950.html
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg01949.html
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg01951.html

Regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang

Thread overview: 21+ messages
2011-01-04 21:44 [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249% Chunqiang Tang
2011-01-05 17:29 ` Anthony Liguori
2011-01-14 20:56   ` Chunqiang Tang
2011-01-19  1:12     ` Jamie Lokier
2011-01-19  8:10       ` Stefan Hajnoczi
2011-01-19 15:17         ` Chunqiang Tang
2011-01-19 15:25           ` Christoph Hellwig
2011-01-19 23:56           ` Jamie Lokier
2011-01-19 15:51     ` Christoph Hellwig
2011-01-19 16:21       ` Chunqiang Tang
2011-01-19 16:42         ` Christoph Hellwig
2011-01-19 17:08           ` Chunqiang Tang
2011-01-19 17:25             ` Christoph Hellwig
2011-01-06  9:17 ` Stefan Hajnoczi
2011-01-15  3:28   ` Chunqiang Tang
2011-01-15 17:27     ` Stefan Weil
2011-01-20  2:59       ` Chunqiang Tang
     [not found]     ` <AANLkTinw2S2dzKoeFK-dBP6b36J+VNLjb3f-vbkKm3Fz@mail.gmail.com>
2011-01-17 10:37       ` Stefan Hajnoczi
2011-01-18 20:35         ` Chunqiang Tang
2011-01-19  0:59           ` Jamie Lokier
2011-01-19 14:59             ` Chunqiang Tang
