* [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Chunqiang Tang @ 2011-01-04 21:44 UTC
To: qemu-devel
Dear QEMU Community Members,
Happy new year! We would like to contribute a new-year gift to the
community.
As the community considers the next-generation image format for QEMU, we
hope we challenge ourselves hard enough to find the right solution for the
long term, rather than merely a convenient solution for the short term,
because an image format has long-term impact and is hard to change once
released. In this spirit, we would like to argue that QCOW2 and QED's use
of a two-level lookup table as the basis for implementing all features is
a fundamental obstacle to achieving high performance. Accordingly, we
advocate the newly developed Fast Virtual Disk (FVD) image format for
adoption into the QEMU mainline. FVD achieves the performance of a RAW
image running on a raw partition, while providing the rich features of a
compact image, copy-on-write, copy-on-read, and adaptive prefetching. FVD
is extensible and can accommodate additional features. Experiments show
that the throughput of FVD is 249% higher than that of QCOW2 when using
the PostMark benchmark to create files.
FVD came out of work done at the IBM T.J. Watson Research Center while
studying virtual disk issues during the development of the IBM Cloud
(http://www.ibm.com/services/us/igs/cloud-development/). Internally at
IBM, FVD (a.k.a. ODS) has been widely demonstrated since June 2010.
Recently, the FVD technical papers were completed and the source code was
cleared for external release. Now we can finally share FVD with the
community and seek your valuable feedback and contributions. All related
information is available at
https://researcher.ibm.com/researcher/view_project.php?id=1852 , including
a high-level overview of FVD, the source code, and the technical papers.
The FVD patch also includes a fully automated testing framework that
exercises QEMU block device drivers under stress load and extreme race
conditions. Currently (as of January 2011), QCOW2 cannot pass the
automated test: the symptom is that QCOW2 attempts to read beyond the end
of the base image. QCOW2 experts, please take a look at this "potential"
bug.
Best Regards,
Chunqiang Tang
Homepage: http://www.research.ibm.com/people/c/ctang
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Anthony Liguori @ 2011-01-05 17:29 UTC
To: Chunqiang Tang; +Cc: qemu-devel
Hi Chunqiang,
On 01/04/2011 03:44 PM, Chunqiang Tang wrote:
> [...] We would like to argue that QCOW2 and QED's use of a two-level
> lookup table as the basis for implementing all features is a fundamental
> obstacle to achieving high performance. Accordingly, we advocate the
> newly developed Fast Virtual Disk (FVD) image format for adoption into
> the QEMU mainline. FVD achieves the performance of a RAW image running
> on a raw partition, while providing the rich features of a compact
> image, copy-on-write, copy-on-read, and adaptive prefetching. [...]
For any feature to be seriously considered for inclusion in QEMU, patches
need to be posted to the mailing list against the latest git tree. That's
a prerequisite for any real discussion.
There's a tremendous amount of desire to avoid further fragmentation of
image formats. Based on my limited understanding, I think FVD shares a
lot in common with the COW format (block/cow.c).
But I think most of the advantages you mention could be considered as
additions to either qcow2 or qed. At any rate, the right way to have
that discussion is in the form of patches on the ML.
Regards,
Anthony Liguori
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Stefan Hajnoczi @ 2011-01-06 9:17 UTC
To: Chunqiang Tang; +Cc: qemu-devel
On Tue, Jan 4, 2011 at 9:44 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
> Happy new year! We would like to contribute a new year gift to the
> community.
We == IBM Research?
> The FVD patch also includes a fully automated testing framework that
> exercises QEMU block device drivers under stress load and extreme race
> conditions. Currently (as of January 2011), QCOW2 cannot pass the
> automated test. The symptom is that QCOW2 attempts to read beyond the end
> of the base image. QCOW2 experts please take a look at this "potential"
> bug.
The community block I/O test suite is qemu-iotests:
http://git.kernel.org/?p=linux/kernel/git/hch/qemu-iotests.git;a=summary
If you have tests that you'd like to contribute, please put them into
that framework so other developers can run them as part of their
regular testing.
Stefan
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Chunqiang Tang @ 2011-01-14 20:56 UTC
To: Anthony Liguori; +Cc: qemu-devel
> Based on my limited understanding, I think FVD shares a
> lot in common with the COW format (block/cow.c).
>
> But I think most of the advantages you mention could be considered as
> additions to either qcow2 or qed. At any rate, the right way to have
> that discussion is in the form of patches on the ML.
FVD is much more advanced than block/cow.c. I would be happy to discuss
possible leverage between the formats, but, setting aside the details of
QCOW2, QED, and FVD, let us start with a discussion of what is needed in a
next-generation image format.
First of all, of course, we need high performance. Through extensive
benchmarking, I identified three major performance overheads in image
formats. The numbers cited below are based on the PostMark benchmark; see
the paper for details:
http://researcher.watson.ibm.com/researcher/files/us-ctang/FVD-cow.pdf .
P1) Increased disk seek distance caused by a compact image’s distorted
data layout. Specifically, the average disk seek distance in QCOW2 is 460%
longer than that in a RAW image.
P2) Overhead of storing an image on a host file system. Specifically, a
RAW image stored on ext3 is 50-63% slower than a RAW image stored on a raw
partition.
P3) Overhead in reading or updating an image format’s on-disk metadata.
Due to this overhead, QCOW2 causes 45% more total disk I/Os (including
I/Os for accessing both data and metadata) than FVD does.
For P1), I use the term compact image instead of sparse image, because a
RAW image stored as a sparse file on ext3 is a sparse image but not a
compact image. A compact image stores data in such a way that the size of
the image file is smaller than the size of the virtual disk perceived by
the VM. QCOW2 is a compact image. The disadvantage of a compact image is
that the data layout perceived by the guest OS differs from the actual
layout on the physical disk, which defeats many optimizations in guest
file systems. Consider one concrete example. When the guest VM issues a
disk I/O request to the hypervisor using a virtual block address (VBA),
QEMU's block device driver translates the VBA into an image block address
(IBA), which specifies where the requested data are stored in the image
file, i.e., an IBA is an offset in the image file. When a guest OS creates
or resizes a file system, it writes out the file system metadata, which
are all grouped together and assigned consecutive IBAs by QCOW2, despite
the fact that the metadata's VBAs are deliberately scattered for better
reliability and locality, e.g., co-locating inodes and file content blocks
in block groups. As a result, accessing a file's metadata and then its
content blocks may incur a long disk seek.
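To make the translation concrete, here is a minimal sketch of the
VBA-to-IBA mapping that any compact image performs. The names and the
append-at-end allocation policy are hypothetical, not QCOW2 or FVD code;
the point is to show how IBAs come to reflect write order rather than the
guest's VBA order:

#include <stdint.h>

#define BLOCK_SIZE (64 * 1024)   /* allocation unit of the compact image */

/* One-level table: lookup[vba / BLOCK_SIZE] holds the IBA of that block,
 * or UINT64_MAX if the block has no space in the image file yet. */
static uint64_t *lookup;

/* Translate a virtual block address (guest view) into an image block
 * address (offset in the image file). New blocks are appended at the
 * current end of the file, so the physical order is the write order,
 * not the guest's VBA order -- the layout distortion described above. */
static uint64_t vba_to_iba(uint64_t vba, uint64_t *image_size)
{
    uint64_t idx = vba / BLOCK_SIZE;
    if (lookup[idx] == UINT64_MAX) {
        lookup[idx] = *image_size;     /* allocate at end of image file */
        *image_size += BLOCK_SIZE;
    }
    return lookup[idx] + vba % BLOCK_SIZE;
}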
For P2), using a host file system is inefficient because 1) historically,
file systems are optimized for small files rather than large images, and
2) certain functions of a host file system are simply redundant with
respect to the functions of a compact image, e.g., performing storage
allocation. Moreover, using a host file system not only adds overhead but
also introduces data integrity issues. Specifically, if I/O uses O_DSYNC,
it may be too slow; if it uses O_DIRECT, data integrity cannot be
guaranteed in the event of a host crash. See
http://lwn.net/Articles/348739/ .
P3) includes both the overhead of reading on-disk metadata and the
overhead of updating it. The former can be reduced by minimizing the size
of the metadata so that it is easily cached in memory. Reducing the latter
requires optimizations that avoid updating the on-disk metadata whenever
possible, without compromising data integrity in the event of a host
crash.
In addition to addressing the performance overheads caused by P1-P3,
ideally the next-generation image format should meet the following
functional requirements and perhaps beyond.
R1) Support storage over-commit.
R2) Support compact image, copy-on-write, copy-on-read, and adaptive
prefetching.
R3) Allow eliminating the host file system to achieve high performance.
R4) Make all these features orthogonal, i.e., each feature can be enabled
or disabled individually without affecting the others. The purpose is to
support diverse use cases. For example, a copy-on-write image can use a
RAW-image-like data layout to avoid the overhead associated with a compact
image.
Storage over-commit means that, e.g., a 100GB physical disk can be used to
host 10 VMs, each with a 20GB virtual disk. This is possible because not
every VM completely fills up its 20GB virtual disk. A compact image is not
mandatory for storage over-commit; for example, RAW images stored as
sparse files on ext3 also support it.
Copy-on-read and adaptive prefetching complement copy-on-write in certain
use cases, e.g., in a cloud where the backing image is stored on
network-attached storage (NAS) while the copy-on-write image is stored on
direct-attached storage (DAS). When the VM reads a block from the backing
image, a copy of the data is saved in the copy-on-write image for later
reuse. Adaptive prefetching uses resource idle times to copy, from NAS to
DAS, the parts of the image that the VM has not yet accessed. Prefetching
should be conservative: if the driver detects contention on any resource
(DAS, NAS, or the network), it pauses prefetching temporarily and resumes
when the congestion disappears, as sketched below.
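A minimal sketch of that pause/resume policy (the names and the
throughput-floor mechanism are hypothetical; the actual FVD policy is
described in the paper):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical throttle: pause prefetching when the observed copy
 * throughput drops below a floor, which suggests the VM or other
 * tenants are competing for DAS, NAS, or network bandwidth. */
struct prefetcher {
    uint64_t min_bytes_per_sec;   /* below this, assume contention */
    bool     paused;
};

static void prefetch_tick(struct prefetcher *p, uint64_t observed_bps)
{
    if (observed_bps < p->min_bytes_per_sec)
        p->paused = true;          /* contention detected: yield */
    else
        p->paused = false;         /* congestion gone: resume copying */
}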
Next, let me briefly describe how FVD is designed to address the
performance issues P1-P3 and the functional requirements R1-R4. FVD has
the following features.
F1) Use a bitmap to implement copy-on-write.
F2) Use a one-level lookup table to implement compact image.
F3) Use a journal to commit changes to the bitmap and the lookup table.
F4) Store a compact image on a logical volume to support storage
over-commit, and to avoid the overhead and data integrity issues of a host
file system.
For F1), a bit in the bitmap tracks the state of a block: the bit is 0 if
the block is in the base image, and 1 if the block is in the FVD image.
The default block size is 64KB, as in QCOW2. To represent the state of a
1TB base image, FVD needs only a 2MB bitmap, which is easily cached in
memory. The same bitmap also drives copy-on-read and adaptive prefetching.
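A minimal sketch of such a bitmap, with hypothetical helper names (not
the FVD code itself); one bit per 64KB block is what yields the 2MB
figure for a 1TB disk:

#include <stdint.h>

#define FVD_BLOCK_SIZE (64 * 1024)

/* One bit per 64KB block: bit clear = data lives in the base image,
 * bit set = data has been written (or copied on read) into the FVD
 * image. A 1TB base image needs 1TB / 64KB / 8 = 2MB of bitmap. */
static inline int block_in_fvd_image(const uint8_t *bitmap, uint64_t vba)
{
    uint64_t block = vba / FVD_BLOCK_SIZE;
    return (bitmap[block / 8] >> (block % 8)) & 1;
}

static inline void mark_block_copied(uint8_t *bitmap, uint64_t vba)
{
    uint64_t block = vba / FVD_BLOCK_SIZE;
    bitmap[block / 8] |= 1u << (block % 8);
}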
For F2), one entry in the table maps the virtual disk address of a chunk
to the offset in the FVD image where the chunk is stored. The default
chunk size is 1MB, the same as VirtualBox VDI (VMware VMDK and Microsoft
VHD use a chunk size of 2MB). For a 1TB virtual disk, the lookup table is
only 4MB. Because of this small size, there is no need for a two-level
lookup table as in QCOW2.
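The metadata sizes quoted above follow directly from the block and chunk
sizes; a self-contained check (illustrative arithmetic only, assuming the
4-byte table entries implied by the 4MB figure):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t disk   = 1ULL << 40;                /* 1TB virtual disk       */
    uint64_t bitmap = disk / (64 * 1024) / 8;    /* one bit per 64KB block */
    uint64_t table  = disk / (1024 * 1024) * 4;  /* 4 bytes per 1MB chunk  */
    /* Prints "bitmap: 2 MB, table: 4 MB". */
    printf("bitmap: %llu MB, table: %llu MB\n",
           (unsigned long long)(bitmap >> 20),
           (unsigned long long)(table >> 20));
    return 0;
}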
F1) and F2) are essential. They meet requirement R4), i.e., the
copy-on-write and compact-image features can be enabled individually. F1)
and F2) are closest to the Microsoft Virtual Hard Disk (VHD) format, which
also uses a bitmap and a one-level table. There are key differences,
though. VHD partitions the bitmap and stores a fragment of it with every
2MB chunk. As a result, VHD does not meet requirement R4, because it
cannot provide a copy-on-write image with a RAW-image-like data layout.
For the same reason, a bit in VHD can only represent the state of a
512-byte sector (if a bit represented a 64KB block, the chunk size would
have to be 2GB, which is far too large and makes storage over-commit
ineffective). For a 1TB image, the bitmap in VHD is 256MB, versus 2MB in
FVD, which makes caching much harder.
F3) uses a journal to commit metadata updates. This is not essential, and
alternative implementations exist, but it helps address P3) (i.e., it
reduces metadata-update overhead) and simplifies the implementation. By
default, the journal is 16MB. When the bitmap and/or the lookup table are
updated by a write, the changes are saved in the journal. When the journal
is full, the entire bitmap and the entire lookup table are flushed to
disk, and the journal can be recycled for reuse. Because the bitmap and
the lookup table are small, the flush is quick. The journal provides
several benefits. First, updating both the bitmap and the lookup table
requires only a single write to the journal. Second, K concurrent updates
to any portions of the bitmap or the lookup table become K sequential
writes in the journal, which the host Linux kernel can merge into a single
write. Third, it increases concurrency by avoiding locking of the bitmap
or the lookup table. For example, updating one bit in the bitmap requires
writing a 512-byte sector to the on-disk bitmap, and that bitmap sector
covers a total of 512*8*64KB = 256MB of data. Without the journal, any two
writes that target that 256MB of data and require updating the bitmap
could not be processed concurrently. The journal solves this problem and
eliminates the locking, as sketched below.
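A sketch of the journaling idea, with a hypothetical record layout (not
the on-disk format of FVD); a buffer stands in for the on-disk journal
region:

#include <stdint.h>
#include <string.h>

/* Hypothetical journal record: one record commits a bitmap change and a
 * lookup-table change together with a single sequential write. */
struct journal_record {
    uint64_t bitmap_byte_off;   /* which bitmap byte changed */
    uint8_t  bitmap_byte;       /* its new value             */
    uint64_t table_index;       /* which table entry changed */
    uint32_t table_entry;       /* its new value (IBA)       */
} __attribute__((packed));

#define JOURNAL_BYTES (16 * 1024 * 1024)

static uint8_t journal[JOURNAL_BYTES];
static size_t  journal_used;

/* Append a record; when the journal fills up, flush the (small) bitmap
 * and lookup table in full, then recycle the journal from the start. */
static void journal_commit(const struct journal_record *rec,
                           void (*flush_metadata)(void))
{
    if (journal_used + sizeof(*rec) > JOURNAL_BYTES) {
        flush_metadata();       /* write entire bitmap + table to disk */
        journal_used = 0;       /* journal can now be reused           */
    }
    memcpy(journal + journal_used, rec, sizeof(*rec));
    journal_used += sizeof(*rec);
}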
For F4), it is actually quite straightforward to eliminate the host file
system. The main thing an image format needs from the host file system is
storage allocation, a function already performed by a compact image. Using
a host file system simply ends up doing storage allocation twice, which
updates on-disk metadata twice and introduces a distorted data layout
twice. Therefore, if we migrate the necessary functions of a host file
system into the image format, in other words, implement a mini file system
inside the image format, we can get rid of the host file system. This is
exactly what FVD does, by slightly enhancing the compact-image function
that is already there. FVD can manage incrementally added storage space,
like ZFS and unlike ext2/3/4. For example, when FVD manages a 100GB
virtual disk, it initially gets 5GB of storage space from the logical
volume manager and uses it to host many 1MB chunks. When the first 5GB is
used up, FVD gets another 5GB to host more 1MB chunks, and so forth.
Unlike QCOW2 and more like a file system, FVD need not always allocate a
new chunk right after the previously allocated one. Instead, it may spread
the used chunks across the storage space in order to mimic a
RAW-image-like data layout. More details will be explained in follow-up
emails.
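As a rough illustration of such a placement policy (entirely
hypothetical; FVD's real allocator is described in the papers), a chunk
could be placed near its scaled RAW-image position, with the managed
space grown on demand:

#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SIZE (1024 * 1024)
#define SLICE_SIZE (5ULL << 30)        /* grow the volume 5GB at a time */

static uint64_t space_size;            /* bytes obtained from LVM so far */

static uint64_t place_chunk(uint64_t vba, uint64_t vdisk_size,
                            bool (*slot_free)(uint64_t iba),
                            void (*grow_space)(void))
{
    /* Ideal slot: the chunk's relative position in the virtual disk,
     * scaled into the space obtained from LVM so far. */
    uint64_t total  = vdisk_size / CHUNK_SIZE;
    uint64_t chunks = space_size / CHUNK_SIZE;
    uint64_t ideal  = (vba / CHUNK_SIZE) * chunks / total * CHUNK_SIZE;
    for (uint64_t tried = 0; tried < space_size; tried += CHUNK_SIZE) {
        uint64_t iba = (ideal + tried) % space_size;
        if (slot_free(iba))
            return iba;                /* place the chunk at this offset */
    }
    grow_space();                      /* all slots used: get another 5GB */
    space_size += SLICE_SIZE;
    return place_chunk(vba, vdisk_size, slot_free, grow_space);
}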
The description above is long, yet it is still only a summary. Please
refer to the detailed information on the web site:
http://researcher.watson.ibm.com/researcher/view_project.php?id=1852 .
Hopefully I have given a summary of the problems, the requirements, and
the solutions in FVD, which can serve as the basis for a productive
discussion.
Regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Chunqiang Tang @ 2011-01-15 3:28 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel
> The community block I/O test suite is qemu-iotests:
> http://git.kernel.org/?p=linux/kernel/git/hch/qemu-iotests.git;a=summary
> If you have tests that you'd like to contribute, please put them into
> that framework so other developers can run them as part of their
> regular testing.
Hi Stefan,
What I described is not a qemu-io test case. I also use qemu-io, which is
very helpful, but I observed that qemu-io has several limitations in
discovering elusive bugs:
B1) qemu-io cannot trigger many race-condition bugs, because it does not
fully control the timing of events. For example, qemu-io cannot test this
scenario: three concurrent writes a, b, and c are processed by
bdrv_aio_writev() in the order Pa, Pb, Pc; their writes are actually
persisted on disk in another order, Wc, Wa, Wb; and their callbacks are
invoked in yet another order, Vb, Vc, Va. Race-condition bugs may exist in
the code (e.g., inappropriate locking) because it does not anticipate that
these orderings are possible. This is just one example: in theory, there
can be 100 concurrent reads or writes, and their events can happen in an
arbitrary permutation order. It is nearly impossible to manually generate
test cases for all of them.
B2) Even if a race condition bug is triggered by chance, its behavior
depends on subtle event timing that is hard to repeat and hence hard to
debug.
B3) With qemu-io, it is hard to test code paths that handle I/O failures.
For example, a disk write may fail due to a disk media error. Because such
errors are rare, the failure-handling code paths may never be tested; they
may, for example, contain a null-pointer bug that can crash the entire VM,
or gradually leak resources (e.g., memory) due to incomplete cleanup.
B4) qemu-io requires manually creating test cases, which is not only time
consuming but also leads to low test coverage. This is because many bugs
happen in scenarios that the developers did not anticipate, and hence they
do not know to create test cases for them in the first place.
The FVD patch includes a new testing framework that addresses the above
issues. This testing framework is orthogonal to FVD and can be used to
test other block device drivers as well. It includes two components that
can be used separately or in combination.
T1) To address problems B1-B3, I implemented an emulated disk in
block/sim.c, which allows full control of event timing, either manually or
automatically. Given the three-concurrent-writes example above, the 9
events (Pa, Pb, Pc, Wa, Wb, Wc, Va, Vb, and Vc) can be precisely
controlled to execute in any given order. Moreover, the emulated disk can
inject disk I/O errors in a controlled manner. For example, it can fail a
specific read or write to test how the code handles that, or it can even
fail as many as 90% of the reads/writes to test whether the code leaks
resources. qemu-io is extended with a module, qemu-io-sim.c, that works
with the emulated disk in block/sim.c, so that the tester can use the
qemu-io console to manually control the order of events or fail disk reads
or writes.
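A stripped-down sketch of the event-driven idea (hypothetical names, not
the block/sim.c code): every step of every request becomes a pending
event, and the tester decides which fires next.

#include <stdlib.h>

enum ev_type { EV_SUBMIT, EV_DISK_DONE, EV_CALLBACK };

struct sim_event {
    enum ev_type type;
    int req_id;                        /* which I/O request this belongs to */
    void (*fire)(struct sim_event *);  /* advance the request one step      */
    struct sim_event *next;
};

static struct sim_event *pending;

/* Fire the n-th pending event (assumes at least one is pending), so any
 * interleaving of concurrent requests -- e.g., Pa Pb Pc / Wc Wa Wb /
 * Vb Vc Va -- can be forced deterministically. */
static void sim_fire_nth(int n)
{
    struct sim_event **pp = &pending;
    while (n-- > 0 && (*pp)->next)
        pp = &(*pp)->next;
    struct sim_event *ev = *pp;
    *pp = ev->next;                    /* unlink, then run the step */
    ev->fire(ev);
    free(ev);
}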
T2) The solution in T1 still does not address problem B4), i.e., that
manually generating test cases is time consuming and yields low coverage.
This problem is solved by a new testing tool called qemu-test. qemu-test
can 1) automatically generate an unlimited number of randomized test cases
that, e.g., execute 1,000 concurrent disk reads or writes on overlapping
disk regions; and 2) automatically generate the corresponding anticipated
correct results, run the tests, and compare the actual results with the
anticipated ones. Once it discovers a difference, which indicates a bug,
it halts testing and waits for the developer to debug. The randomized test
cases created by qemu-test are controlled by a pseudo-random number
generator, so the behavior is completely repeatable. Therefore, once a bug
is triggered, it can be repeated precisely an unlimited number of times to
facilitate debugging, even if the bug happens extremely rarely in real
runs of a VM. qemu-test is fully automated. Once started, it can run
continuously, e.g., for months, testing an enormous number of cases.
The implementation of qemu-test is actually not that complicated. It opens
two virtual disks, the so-called truth image and test image. The truth
image is served by a trivial synchronous block device driver so that its
behavior is guaranteed to be correct. The test image is served by the real
block device driver (e.g., FVD or QCOW2) that we want to test. qemu-test
submits the same randomly generated sequence of disk I/O requests to the
truth image and the test image, and expects that the two images' contents
never diverge; a divergence indicates a bug in the test image's block
device driver. qemu-test works with the emulated disk in block/sim.c so
that it can randomize event timings in a controlled manner and can inject
disk I/O errors randomly.
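A stripped-down sketch of that comparison loop (hypothetical names; it
omits the concurrency, timing control, and fault injection that qemu-test
layers on top):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Issue the same seeded-random requests against both images and demand
 * that their contents never diverge. */
extern void truth_io(int is_write, uint8_t *buf, uint64_t off, size_t len);
extern void test_io(int is_write, uint8_t *buf, uint64_t off, size_t len);

static void run_round(unsigned seed, uint64_t disk_size, int rounds)
{
    srand(seed);                      /* seeded: every run is repeatable */
    uint8_t buf[4096], check[4096];
    for (int i = 0; i < rounds; i++) {
        uint64_t off = (uint64_t)rand() % (disk_size - sizeof(buf));
        if (rand() % 2) {
            for (size_t j = 0; j < sizeof(buf); j++)
                buf[j] = rand();
            truth_io(1, buf, off, sizeof(buf));
            test_io(1, buf, off, sizeof(buf));
        } else {
            truth_io(0, buf, off, sizeof(buf));
            test_io(0, check, off, sizeof(check));
            assert(!memcmp(buf, check, sizeof(buf)));  /* divergence = bug */
        }
    }
}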
I found qemu-test extremely powerful in discovering elusive bugs that I
never anticipated, and using qemu-test is effortless. Whenever I completed
some major code upgrade, I simply started qemu-test in the evening and
came back in the morning to collect bugs, if any. Debugging them is also
easy because the bugs are precisely repeatable even if they are hard to
trigger.
As for the QCOW2 bug I mentioned previously, it can be triggered by
test-qcow2.sh. A faster way to trigger it is to bypass those correct test
runs by executing the commands below:
dd if=/dev/zero of=/var/ramdisk/truth.raw count=0 bs=1 seek=1155683840
dd if=/dev/zero of=/var/ramdisk/zero-500M.raw count=0 bs=1 seek=609064448
./qemu-img create -f qcow2 -b /var/ramdisk/zero-500M.raw
/var/ramdisk/test.qcow2 1155683840
./qemu-test --seed=116579177 --truth=/var/ramdisk/truth.raw
--test=/var/ramdisk/test.qcow2 --verify_write=true --compare_before=false
--compare_after=true --round=100000 --parallel=100 --io_size=10485760
--fail_prob=0 --cancel_prob=0 --instant_qemubh=true
As for the FVD patch that includes the new testing framework, I tried to
post it on the mailing list twice, but it always got bounced back, either
because the message is too big or because of a Notes client configuration
issue. Until I figure it out, please download the FVD patch from
https://researcher.ibm.com/researcher/files/us-ctang/FVD-01-14-2011.patch
Best regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Stefan Weil @ 2011-01-15 17:27 UTC
To: Chunqiang Tang; +Cc: Stefan Hajnoczi, qemu-devel
On 15.01.2011 04:28, Chunqiang Tang wrote:
> [...] As for the FVD patch that includes the new testing framework, I
> tried to post it on the mailing list twice, but it always got bounced
> back, either because the message is too big or because of a Notes client
> configuration issue. Until I figure it out, please download the FVD
> patch from
> https://researcher.ibm.com/researcher/files/us-ctang/FVD-01-14-2011.patch
Hi,
when I tried to use your patch, I found several problems:
* The patch does not apply cleanly to the latest QEMU.
This is caused by recent changes in QEMU git master.
* The new code uses tabs instead of spaces (QEMU coding rules).
* Some lines of the new code end with blank characters.
* The patch adds empty lines at the end of some files.
The last two points are reported by newer versions of git
(which refuse to take such patches with the default settings).
Could you please update your patch to fix these issues?
I'd like to apply it to my QEMU code and try the new FVD.
If needed, I could also send your patch to qemu-devel.
Kind regards,
Stefan Weil
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Stefan Hajnoczi @ 2011-01-17 10:37 UTC
To: Chunqiang Tang; +Cc: Kevin Wolf, qemu-devel
Resend because qemu-devel was dropped from CC. Thanks for pointing it
out, Kevin.
On Sat, Jan 15, 2011 at 12:25 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Sat, Jan 15, 2011 at 3:28 AM, Chunqiang Tang <ctang@us.ibm.com> wrote:
>> T1) To address the problems of B1- B3, I implemented an emulated disk in
>> block/sim.c, which allows a full control of event timings, either manually
>> or automatically. Given the three concurrent writes example above, their 9
>> events (Pa, Pb, Pc, Wa, Wb, Wc, Va, Vb, and Vc) can be precisely
>> controlled to be executed in any given order. Moreover, the emulated disk
>> can inject disk I/O errors in a controlled manner. For example, it can
>> fail a specific read or write to test how the code handles that, or it can
>> even fail as many as 90% of the reads/writes to test if the code has
>> resource leaks. qemu-io is extended with a module qemu-io-sim.c to work
>> with the emulated disk block/sim.c, so that the tester can use the qemu-io
>> console to manually control the order of events or fail disk reads or
>> writes.
>
> block/blkdebug.c already provides fault injection and is used in
> qemu-iotests test 026. Using blkdebug it is possible to test specific
> error paths in image formats. We should look at merging random
> failures ("fail as many as 90% of the reads/writes") into blkdebug.
>
>> T2) The solution in T1 still does not address problem B4), i.e., that
>> manually generating test cases is time consuming and yields low coverage.
>> This problem is solved by a new testing tool called qemu-test. qemu-test
>> can 1) automatically generate an unlimited number of randomized test cases
>> that, e.g., execute 1,000 concurrent disk reads or writes on overlapping
>> disk regions; 2) automatically generate the corresponding anticipated
>> correct results, automatically run the tests, and automatically compare
>> the actual test results with the anticipated correct results. Once it
>> discovers a difference, which indicates a bug, it halts testing and waits
>> for the developer to debug. The randomized test cases created by
>> qemu-test are controlled by a pseudo random number generator, and hence
>> the behavior is completely repeatable. Therefore, once a bug is triggered,
>> it can be precisely repeated for an unlimited number of times to
>> facilitate debugging, even if this bug happens extremely rare in real runs
>> of a VM. qemu-test is fully automated. Once started, it can continuously
>> run, e.g., for months to test an enormous number of test cases.
>>
>> The implementation of qemu-test is actually not that complicated. It opens
>> two virtual disks, the so-called truth image and test image, respectively.
>> The truth image is served by a trivial synchronous block device driver so
>> that its behavior is guaranteed to be correct. The test image is served a
>> real block device driver (e.g., FVD or QCOW2) that we want to test.
>> qemu-test submits the same sequence of disk I/O requests (which is
>> randomly generated) to the truth image and the test image, and expect that
>> the two images’ contents never diverge. Otherwise, it indicates a bug in
>> the test image’s block device driver. qemu-test works with the emulated
>> disk block/sim.c so that it can randomize event timings in a controlled
>> manner and can inject disk I/O errors randomly.
>
> block/blkverify.c already provides I/O verification. It mirrors
> writes to a raw file and compares the contents of read blocks to
> detect data integrity issues. That's the same approach you have
> described.
>
>> I found qemu-test extremely powerful in discovering elusive bugs that I
>> never anticipated, and using qemu-test is effortless. Whenever I completed
>> some major code upgrade, I simply started qemu-test in the evening and
>> came back in the morning to collect bugs, if any. Debugging them is also
>> easy because the bugs are precisely repeatable even if they are hard to
>> trigger.
>
> Here are the unique features you've described beyond what qemu-io,
> blkdebug, and blkverify do:
>
> 1. New functionality
> * Control over ordering of I/O request submission and completion.
> * Random I/O generator (probably as new qemu-io command).
>
> 2. Enhancements to existing code:
> * Random chance of failing I/O in blkdebug.
>
> Do you agree with this or are there other unique features which are
> beyond small enhancements to existing code?
>
> I think the best strategy is to consolidate these as incremental
> patches that can be reviewed and merged independently.
>
>> As for the FVD patch that includes the new testing framework, I tried to
>> post it on the mailing list twice but it always got bounced back, either
>> because the message is too big or because of a Notes client configuration
>> issue. Until I figure it out, please down the FVD patch from
>> https://researcher.ibm.com/researcher/files/us-ctang/FVD-01-14-2011.patch
>
> I'll send you my git-send-email config off-list.
>
> Stefan
>
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Chunqiang Tang @ 2011-01-18 20:35 UTC
To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel
> > Here are the unique features you've described beyond what qemu-io,
> > blkdebug, and blkverify do:
> >
> > 1. New functionality
> > * Control over ordering of I/O request submission and completion.
> > * Random I/O generator (probably as new qemu-io command).
> >
> > 2. Enhancements to existing code:
> > * Random chance of failing I/O in blkdebug.
> >
> > Do you agree with this or are there other unique features which are
> > beyond small enhancements to existing code?
> >
> > I think the best strategy is to consolidate these as incremental
> > patches that can be reviewed and merged independently.
Hi Stefan,
I agree with the strategy you described. Among the things you summarized,
'random chance of failing I/O in blkdebug' is probably trivial to add. The
'random I/O generator' (i.e., the currently stand-alone program qemu-test)
could probably be folded in as a qemu-io command. Controlling I/O order
and callback order is the most significant change; it is already
integrated as several qemu-io commands in the FVD patch.
The purpose of controlling I/O order and callback order is to test race
conditions under concurrent requests. It is implemented as the “sim”
driver in block/sim.c, by following an event-driven simulation approach
and maintaining an outstanding event list. The “sim” driver can either
remain standalone or be folded into blkdebug.c. The latter case would
require significant changes to blkdebug.c.
Doing both fault injection and verification together introduces some
subtlety. For example, even under the random failure mode, two disk writes
triggered by one VM-issued write must either fail together or succeed
together; otherwise, the truth image and the test image will diverge and
verification won't succeed. Currently, qemu-test carefully works with the
'sim' driver to guarantee those conditions, and they need to be retained
after any code restructuring. A sketch of the idea follows.
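A minimal sketch of such correlated fault injection (hypothetical names,
not the qemu-test code): the fate of a VM request is decided once, and
every derived physical write shares it.

#include <stdbool.h>
#include <stdlib.h>

struct vm_request {
    unsigned id;
    bool fail;           /* decided once, when the VM request arrives */
};

static struct vm_request new_vm_write(unsigned id, double fail_prob)
{
    struct vm_request r = { id, drand48() < fail_prob };
    return r;
}

/* Every metadata/data write derived from this VM request asks here,
 * instead of rolling its own independent dice, so all of them fail
 * together or succeed together. */
static bool should_fail(const struct vm_request *r)
{
    return r->fail;
}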
Best regards,
Chunqiang Tang
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Jamie Lokier @ 2011-01-19 0:59 UTC
To: Chunqiang Tang; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel
Chunqiang Tang wrote:
> Doing both fault injection and verification together introduces some
> subtlety. For example, even under the random failure mode, two disk writes
> triggered by one VM-issued write must either fail together or succeed
> together. Otherwise, the truth image and the test image will diverge and
> verification won't succeed. Currently, qemu-test carefully works with the
> 'sim' driver to guarantee those conditions, and they need to be retained
> after any code restructuring.
If the real backend is a host system file or device, and AIO or
multi-threaded writes are used, you can't depend on two parallel disk
writes (triggered by one VM-issued write) failing together or
succeeding together. All you can do is look at the error code after
each operation completes, and use it to prevent issuing later
operations. You can't stop the other parallel operations that are
already in progress.
Is that an issue in your design assumptions?
Thanks,
-- Jamie
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Jamie Lokier @ 2011-01-19 1:12 UTC
To: Chunqiang Tang; +Cc: qemu-devel
Chunqiang Tang wrote:
> > Based on my limited understanding, I think FVD shares a
> > lot in common with the COW format (block/cow.c).
> >
> > But I think most of the advantages you mention could be considered as
> > additions to either qcow2 or qed. At any rate, the right way to have
> > that discussion is in the form of patches on the ML.
>
> FVD is much more advanced than block/cow.c. I would be happy to discuss
> possible leverage, but setting aside the details of QCOW2, QED, and FVD,
> let’s start with a discussion of what is needed for the next generation
> image format.
Thank you for the detailed description.
FVD looks quite good to me; it seems simple yet performant, thanks to its
smart design.
> Moreover, using a host file system not only adds overhead but also
> introduces data integrity issues. Specifically, if I/O uses O_DSYNC, it
> may be too slow; if it uses O_DIRECT, data integrity cannot be guaranteed
> in the event of a host crash. See http://lwn.net/Articles/348739/ .
You have the same issue with O_DIRECT when using a raw disk device
too. That is, O_DIRECT on a raw device does not guarantee integrity
in the event of a host crash either, for mostly the same reasons.
-- Jamie
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Stefan Hajnoczi @ 2011-01-19 8:10 UTC
To: Chunqiang Tang; +Cc: qemu-devel
On Wed, Jan 19, 2011 at 1:12 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Chunqiang Tang wrote:
>> Moreover, using a host file system not only adds overhead but also
>> introduces data integrity issues. Specifically, if I/O uses O_DSYNC, it
>> may be too slow; if it uses O_DIRECT, data integrity cannot be guaranteed
>> in the event of a host crash. See http://lwn.net/Articles/348739/ .
>
> You have the same issue with O_DIRECT when using a raw disk device
> too. That is, O_DIRECT on a raw device does not guarantee integrity
> in the event of a host crash either, for mostly the same reasons.
QEMU has semantics that use O_DIRECT safely; there is no issue here.
When a drive is added with cache=none, QEMU not only uses O_DIRECT but
also advertises an enabled write cache to the guest.
The guest *must* flush the cache when it wants to ensure data is
stable. In the event of a host crash, all, some, or none of the I/O
since the last flush may have made it to disk. Each of these
possibilities is fair game since the guest may only depend on writes
being on disk if they completed and a successful flush was issued
afterwards.
Stefan
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Chunqiang Tang @ 2011-01-19 14:59 UTC
To: Jamie Lokier; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel
> > Doing both fault injection and verification together introduces some
> > subtlety. For example, even under the random failure mode, two disk
> > writes triggered by one VM-issued write must either fail together or
> > succeed together; otherwise, the truth image and the test image will
> > diverge and verification won't succeed. Currently, qemu-test carefully
> > works with the 'sim' driver to guarantee those conditions, and they
> > need to be retained after any code restructuring.
>
> If the real backend is a host system file or device, and AIO or
> multi-threaded writes are used, you can't depend on two parallel disk
> writes (triggered by one VM-issued write) failing together or
> succeeding together. All you can do is look at the error code after
> each operation completes, and use it to prevent issuing later
> operations. You can't stop the other parallel operations that are
> already in progress.
>
> Is that an issue in your design assumptions?
Your description of the problem is accurate, i.e., "if AIO or
multi-threaded writes are used, you can't stop the other parallel
operations that are already in progress." As a result, a naive extension
of blkverify to test concurrent requests would not work. The simulated
block driver (block/sim.c) in the FVD patch uses neither AIO nor
multi-threaded I/O as the backend. It instead uses a 'simulated' backend,
which allows full control of I/O order and callback order, can enforce
that two parallel disk writes (triggered by one VM-issued write) either
fail together or succeed together, and guarantees some other properties as
well, which makes testing more comprehensive and debugging easier.
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Chunqiang Tang @ 2011-01-19 15:17 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel
> >> Moreover, using a host file system not only adds overhead but also
> >> introduces data integrity issues. Specifically, if I/O uses O_DSYNC,
> >> it may be too slow; if it uses O_DIRECT, data integrity cannot be
> >> guaranteed in the event of a host crash. See
> >> http://lwn.net/Articles/348739/ .
> >
> > You have the same issue with O_DIRECT when using a raw disk device
> > too. That is, O_DIRECT on a raw device does not guarantee integrity
> > in the event of a host crash either, for mostly the same reasons.
>
> QEMU has semantics that use O_DIRECT safely; there is no issue here.
> When a drive is added with cache=none, QEMU not only uses O_DIRECT but
> also advertises an enabled write cache to the guest.
>
> The guest *must* flush the cache when it wants to ensure data is
> stable. In the event of a host crash, all, some, or none of the I/O
> since the last flush may have made it to disk. Each of these
> possibilities is fair game since the guest may only depend on writes
> being on disk if they completed and a successful flush was issued
> afterwards.
Thanks to both of you for the explanation, which is very helpful to me.
With FVD's capability of eliminating the host file system and storing the
image on a logical volume, perhaps we can always use O_DSYNC, because
there is little (or no?) LVM metadata that needs a flush on every write,
and hence O_DSYNC would add no overhead? I am not certain about this and
would appreciate confirmation. If it is true, the guest does not need to
flush the cache.
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Christoph Hellwig @ 2011-01-19 15:25 UTC
To: Chunqiang Tang; +Cc: Stefan Hajnoczi, qemu-devel
On Wed, Jan 19, 2011 at 10:17:47AM -0500, Chunqiang Tang wrote:
> Thanks to both of you for the explanation, which is very helpful to me.
> With FVD's capability of eliminating the host file system and storing the
> image on a logical volume, perhaps we can always use O_DSYNC, because
> there is little (or no?) LVM metadata that needs a flush on every write,
> and hence O_DSYNC would add no overhead? I am not certain about this and
> would appreciate confirmation. If it is true, the guest does not need to
> flush the cache.
O_DSYNC flushes the volatile write cache of the disk on every write, which
can be very inefficient. In addition, image formats really should obey the
configurable caching settings QEMU has; they exist for a reason and should
be handled uniformly across different image formats and protocol drivers.
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Christoph Hellwig @ 2011-01-19 15:51 UTC
To: Chunqiang Tang; +Cc: qemu-devel
On Fri, Jan 14, 2011 at 03:56:00PM -0500, Chunqiang Tang wrote:
> P2) Overhead of storing an image on a host file system. Specifically, a
> RAW image stored on ext3 is 50-63% slower than a RAW image stored on a raw
> partition.
Sorry, benchmarking this against ext3 really doesn't matter. Benchmark
it against xfs or ext4 with a preallocated image (fallocate or dd).
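For concreteness, one way to produce the preallocated image Christoph
suggests (illustrative only; `dd if=/dev/zero` or the fallocate command
work equally well):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("disk.raw", O_CREAT | O_WRONLY, 0644);
    /* Reserve the full 20GB up front, so later guest writes never
     * trigger storage allocation in the host file system. */
    if (fd < 0 || posix_fallocate(fd, 0, 20LL << 30) != 0)
        return EXIT_FAILURE;
    close(fd);
    return EXIT_SUCCESS;
}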
> For P1), I use the term compact image instead of sparse image, because a
> RAW image stored as a sparse file in ext3 is a sparse image, but is not a
> compact image. A compact image stores data in such a way that the file
> size of the image file is smaller than the size of the virtual disk
> perceived by the VM. QCOW2 is a compact image. The disadvantage of a
> compact image is that the data layout perceived by the guest OS differs
> from the actual layout on the physical disk, which defeats many
> optimizations in guest file systems.
It's something filesystems have to deal with. Real storage is getting
increasingly virtualized. While this didn't matter for the real high-end
storage, which has been doing this for a long time, it's getting more and
more exposed to the filesystem. That includes LVM layouts and thinly
provisioned disk arrays, which are getting increasingly popular. That
doesn't mean the 64k (or, until recently, 4k) cluster size in qcow2 is a
good idea; we'd want extents at least a magnitude or two larger to perform
well. But it means filesystems really do have to cope with it.
> For P2), using a host file system is inefficient, because 1) historically
> file systems are optimized for small files rather than large images,
I'm not sure what hole you're pulling this bullshit out of, but it is
absolutely not correct. Since the dawn of time you have had filesystems
optimized for small files, filesystems optimized for large or really large
files, and filesystems trying to strike a tradeoff in between.
> 2) certain functions of a host file system are simply redundant with
> respect to the function of a compact image, e.g., performing storage
> allocation. Moreover, using a host file system not only adds overhead, but
> also introduces data integrity issues.
I/O into fully preallocated files uses exactly the same code path as I/O
to the block device, except for an identity logical-to-physical block
mapping in the block device and a non-trivial one in the filesystem. Note
that the block mapping is cached and does not affect performance. I've
published the numbers for qemu in the various caching modes and all major
filesystems a while ago, so I'm not making this up.
> Specifically, if I/O uses O_DSYNC,
> it may be too slow; if it uses O_DIRECT, data integrity cannot be
> guaranteed in the event of a host crash. See
> http://lwn.net/Articles/348739/ .
I/O to block devices does not guarantee data integrity without O_DSYNC
either.
> Storage over-commit means that, e.g., a 100GB physical disk can be used to
> host 10 VMs, each with a 20GB virtual disk.
The current storage-industry buzzword for that is thin provisioning.
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
From: Chunqiang Tang @ 2011-01-19 16:21 UTC
To: Christoph Hellwig; +Cc: qemu-devel
> It's something filesystems have to deal with. Real storage is getting
> increasingly virtualized. While this didn't matter for the real high
> end storage which has been doing this for a long time it's getting more
> and more exposed to the filesystem. That includes LVM layouts and
> thinly provisioned disk arrays, which are getting increasingly popular.
> That doesn't matter the 64k (or until recently 4k) cluster size in qcow2
> is a good idea, we'd want at least a magnitude or two larger extents
> to perform well, but it means filesystems really do have to cope with
> it.
Yes, a fundamental and optimal solution would be changing the guest file
systems, but that is a much longer route: it requires introducing
virtualization awareness into every guest file system, and it also
requires changing the interface between the guest and the host.
> I/O into fully preallocated files uses exactly the same codepath as
> doing I/O to the block device, except for a identify logical to physical
> block mapping in the block device and a non-trivial one in the
> filesysgem. Note that the block mapping is cached and does not affect
> the performance. I've published the numbers for qemu in the various
> caching modes and all major filesystems a while ago, so I'm not making
> this up.
Preallocation is not a universal solution here, because it defeats the
other goal: thin provisioning. Moreover, if preallocation is used, it
works best with RAW images and makes a compact image unnecessary, which is
exactly one goal of FVD, i.e., optionally disabling the compact-image data
layout without giving up other features, e.g., copy-on-write.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
2011-01-19 16:21 ` Chunqiang Tang
@ 2011-01-19 16:42 ` Christoph Hellwig
2011-01-19 17:08 ` Chunqiang Tang
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2011-01-19 16:42 UTC (permalink / raw)
To: Chunqiang Tang; +Cc: Christoph Hellwig, qemu-devel
On Wed, Jan 19, 2011 at 11:21:07AM -0500, Chunqiang Tang wrote:
> Yes, a fundamental and optimal solution would be to change guest file
> systems, but it would take a much longer route to introduce
> virtualization awareness into all guest file systems, and it also requires
> changing the interface between the guest and the host.
Actually current filesystems do pretty well on thinly provisioned
storage, as long as your extent size is not too small. Starting from
extent sizes in the 64M to 256M range there's almost no difference to
non-virtualized storage.
> Preallocation is not a universal solution here, because it defeats
> the other goal: thin provisioning. Moreover, if preallocation is used, it
> works best for RAW images and makes a compact image unnecessary,
> which is exactly one goal of FVD, i.e., optionally disabling the
> compact image data layout without giving up other features, e.g.,
> copy-on-write.
Again, sparse images with a large enough allocation size give you almost
the same numbers as preallocated images. I've been doing quite a lot of
work on TP support in QEMU. Using an XFS filesystem to back the image
with an extent size hint in the above-mentioned range gives performance
within 1% of fully preallocated images, with the added benefit of
allowing the space to be deallocated again through the SCSI WRITE_SAME
or ATA TRIM commands.
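For readers who want to try this, a minimal sketch of both mechanisms,
assuming a reasonably current Linux kernel and an XFS-backed file; the
file name, hint size, and punched range are illustrative, and the
generic FS_IOC_FSSETXATTR names are as spelled in current linux/fs.h
(older headers used the XFS_IOC_* names):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("disk.img", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Ask XFS to allocate in 128 MB extents, in the 64M-256M range
         * mentioned above. The hint must be set while the file is empty. */
        struct fsxattr fsx;
        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) != 0) { perror("FSGETXATTR"); return 1; }
        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
        fsx.fsx_extsize = 128u * 1024 * 1024;
        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) != 0) { perror("FSSETXATTR"); return 1; }

        /* Host-side deallocation: when the guest discards a range
         * (ATA TRIM / SCSI unmap), punch the corresponding hole in
         * the image file. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, 128u * 1024 * 1024) != 0) {
            perror("fallocate");
            return 1;
        }
        close(fd);
        return 0;
    }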
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
2011-01-19 16:42 ` Christoph Hellwig
@ 2011-01-19 17:08 ` Chunqiang Tang
2011-01-19 17:25 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-19 17:08 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: qemu-devel
> Actually current filesystems do pretty well on thinly provisioned
> storage, as long as your extent size is not too small. Starting from
> extent sizes in the 64M to 256M range there's almost no difference to
> non-virtualized storage.
>
> Again, sparse images with a large enough allocation size give you almost
> the same numbers as preallocated images. I've been doing quite a lot of
> work on TP support in QEMU. Using an XFS filesystem to back the image
> with an extent size hint in the above-mentioned range gives performance
> within 1% of fully preallocated images, with the added benefit of
> allowing the space to be deallocated again through the SCSI WRITE_SAME
> or ATA TRIM commands.
These numbers are very interesting and I would like to read more. Are
your detailed results accessible on the Internet? Do you have numbers on
the impact of using a large extent size (64-256MB) on thin provisioning
(since a guest file system's metadata are written first and are scattered
in the virtual disk)?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
2011-01-19 17:08 ` Chunqiang Tang
@ 2011-01-19 17:25 ` Christoph Hellwig
0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2011-01-19 17:25 UTC (permalink / raw)
To: Chunqiang Tang; +Cc: qemu-devel
On Wed, Jan 19, 2011 at 12:08:41PM -0500, Chunqiang Tang wrote:
> These numbers are very interesting and I would like to read more. Are
> your detailed results accessible on the Internet? Do you have numbers on
> the impact of using a large extent size (64-256MB) on thin provisioning
> (since a guest file system's metadata are written first and are scattered
> in the virtual disk)?
It's still work in progress. I will publish patches and numbers in a
few weeks.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
2011-01-19 15:17 ` Chunqiang Tang
2011-01-19 15:25 ` Christoph Hellwig
@ 2011-01-19 23:56 ` Jamie Lokier
1 sibling, 0 replies; 21+ messages in thread
From: Jamie Lokier @ 2011-01-19 23:56 UTC (permalink / raw)
To: Chunqiang Tang; +Cc: Stefan Hajnoczi, qemu-devel
Chunqiang Tang wrote:
> > >> Moreover, using a host file system not only adds overhead, but
> > >> also introduces data integrity issues. Specifically, if I/Os use
> > >> O_DSYNC, it may be too slow. If I/Os use O_DIRECT, it cannot
> > >> guarantee data integrity in the event of a host crash. See
> > >> http://lwn.net/Articles/348739/ .
> > >
> > > You have the same issue with O_DIRECT when using a raw disk device
> > > too. That is, O_DIRECT on a raw device does not guarantee integrity
> > > in the event of a host crash either, for mostly the same reasons.
> >
> > QEMU has semantics that use O_DIRECT safely; there is no issue here.
> > When a drive is added with cache=none, QEMU not only uses O_DIRECT but
> > also advertises an enabled write cache to the guest.
> >
> > The guest *must* flush the cache when it wants to ensure data is
> > stable. In the event of a host crash, all, some, or none of the I/O
> > since the last flush may have made it to disk. Each of these
> > possibilities is fair game since the guest may only depend on writes
> > being on disk if they completed and a successful flush was issued
> > afterwards.
>
> Thank you both for the explanation, which is very helpful to me. With
> FVD's capability of eliminating the host file system and storing the image
> on a logical volume, perhaps we can always use O_DSYNC, because there
> is little (or no?) LVM metadata that needs a flush on every write, and
> hence O_DSYNC does not add overhead? I am not certain about this and need
> help confirming it. If this is true, the guest does not need to flush
> the cache.
I think O_DSYNC does not work as you might expect on raw disk devices
and logical volumes.
That doesn't mean you don't need something for crash durability!
Instead, you need to issue the disk cache flushes in whatever way works.
It actually has a very *high* overhead.
The overhead isn't from metadata - it is from needing to flush the
disk cache after every write, which prevents the disk from reordering
writes.
If you don't issue the flushes, and the physical device has a volatile
write cache, then you cannot guarantee integrity in the event of a
host crash.
This can make a filesystem faster than a raw disk or logical volume in
some configurations, if the filesystem journals data writes to limit
the seeking needed to commit durably.
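To make the flush discipline concrete, a minimal sketch assuming Linux;
the logical-volume path is a placeholder and the batch size is
arbitrary:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/vg0/lv0", O_WRONLY);   /* placeholder LV path */
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        memset(buf, 0xAB, sizeof(buf));

        /* A batch of writes that the kernel and the disk are free to
         * reorder among themselves; none of them is durable yet. */
        for (int i = 0; i < 8; i++) {
            if (pwrite(fd, buf, sizeof(buf), (off_t)i * sizeof(buf))
                    != (ssize_t)sizeof(buf)) {
                perror("pwrite");
                return 1;
            }
        }

        /* One cache flush at the guest's stable point; issuing this
         * after every single write is what makes O_DSYNC-like behavior
         * slow on a raw device or logical volume. */
        if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

        close(fd);
        return 0;
    }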
-- Jamie
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
2011-01-15 17:27 ` Stefan Weil
@ 2011-01-20 2:59 ` Chunqiang Tang
0 siblings, 0 replies; 21+ messages in thread
From: Chunqiang Tang @ 2011-01-20 2:59 UTC (permalink / raw)
To: Stefan Weil; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel
> when I tried to use your patch, I found several problems:
>
> * The patch does not apply cleanly to the latest QEMU.
> This is caused by recent changes in QEMU git master.
>
> * The new code uses tabs instead of spaces (QEMU coding rules).
>
> * Some lines of the new code end with blank characters.
>
> * The patch adds empty lines at the end of some files.
>
> The last two points are reported by newer versions of git
> (which refuse to take such patches with the default setting).
>
> Could you please update your patch to fix those issues?
> I'd like to apply it to my QEMU code and try the new FVD.
Thank you for the detailed instructions. I fixed all the issues above and
posted the latest patches to the mailing list (see below). Please use this
latest version, which also includes some other fixes. BTW, I found out
that my previous patches were rejected by the mailing list because a
single patch for FVD was too big.
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg01948.html
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg01947.html
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg01950.html
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg01949.html
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg01951.html
Regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2011-01-20 3:00 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-04 21:44 [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249% Chunqiang Tang
2011-01-05 17:29 ` Anthony Liguori
2011-01-14 20:56 ` Chunqiang Tang
2011-01-19 1:12 ` Jamie Lokier
2011-01-19 8:10 ` Stefan Hajnoczi
2011-01-19 15:17 ` Chunqiang Tang
2011-01-19 15:25 ` Christoph Hellwig
2011-01-19 23:56 ` Jamie Lokier
2011-01-19 15:51 ` Christoph Hellwig
2011-01-19 16:21 ` Chunqiang Tang
2011-01-19 16:42 ` Christoph Hellwig
2011-01-19 17:08 ` Chunqiang Tang
2011-01-19 17:25 ` Christoph Hellwig
2011-01-06 9:17 ` Stefan Hajnoczi
2011-01-15 3:28 ` Chunqiang Tang
2011-01-15 17:27 ` Stefan Weil
2011-01-20 2:59 ` Chunqiang Tang
[not found] ` <AANLkTinw2S2dzKoeFK-dBP6b36J+VNLjb3f-vbkKm3Fz@mail.gmail.com>
2011-01-17 10:37 ` Stefan Hajnoczi
2011-01-18 20:35 ` Chunqiang Tang
2011-01-19 0:59 ` Jamie Lokier
2011-01-19 14:59 ` Chunqiang Tang