qemu-devel.nongnu.org archive mirror
From: Alberto Garcia <berto@igalia.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org, Brian Foster <bfoster@redhat.com>,
	qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
Date: Fri, 21 Aug 2020 18:09:17 +0200	[thread overview]
Message-ID: <w51pn7khtg2.fsf@maestria.local.igalia.com> (raw)
In-Reply-To: <20200820215811.GC7941@dread.disaster.area>

On Thu 20 Aug 2020 11:58:11 PM CEST, Dave Chinner wrote:
>> The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
>> the host (on an xfs or ext4 filesystem as the table above shows), and
>> it is attached to QEMU using a virtio-blk-pci device:
>> 
>>    -drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M
>
> You're not using AIO on this image file, so it can't do
> concurrent IO. What happens when you add "aio=native" to this?

I sent the results in a reply to Brian.

>> cache=none means that the image is opened with O_DIRECT and
>> l2-cache-size is large enough so QEMU is able to cache all the
>> relevant qcow2 metadata in memory.
>
> What happens when you just use a sparse file (i.e. a raw image) with
> aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
> sparse files so using qcow2 to provide sparse image file support is
> largely an unnecessary layer of indirection and overhead...
>
> And with XFS, you don't need qcow2 for snapshots either because you
> can use reflink copies to take an atomic copy-on-write snapshot of the
> raw image file... (assuming you made the xfs filesystem with reflink
> support (which is the TOT default now)).

To be clear, I'm not trying to advocate for or against qcow2 on xfs, we
were just analyzing different allocation strategies for qcow2 and we
came across these results which we don't quite understand.

>> 1) off: for every write request QEMU initializes the cluster (64KB)
>>         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
>> 
>> 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
>>         of the cluster with zeroes.
>> 
>> 3) metadata: all clusters were allocated when the image was created
>>         but they are sparse, QEMU only writes the 4KB of data.
>> 
>> 4) falloc: all clusters were allocated with fallocate() when the image
>>         was created, QEMU only writes 4KB of data.
>> 
>> 5) full: all clusters were allocated by writing zeroes to all of them
>>         when the image was created, QEMU only writes 4KB of data.
>> 
>> As I said in a previous message I'm not familiar with xfs, but the
>> parts that I don't understand are
>> 
>>    - Why is (4) slower than (1)?
>
> Because fallocate() is a full IO serialisation barrier at the
> filesystem level. If you do:
>
> fallocate(whole file)
> <IO>
> <IO>
> <IO>
> .....
>
> The IO can run concurrently and does not serialise against anything in
> the filesystem except unwritten extent conversions at IO completion
> (see answer to next question!)
>
> However, if you just use (4) you get:
>
> falloc(64k)
>   <wait for inflight IO to complete>
>   <allocates 64k as unwritten>
> <4k io>
>   ....
> falloc(64k)
>   <wait for inflight IO to complete>
>   ....
>   <4k IO completes, converts 4k to written>
>   <allocates 64k as unwritten>
> <4k io>

I think Brian already pointed this out, but in scenario (4) the whole
image is preallocated up front with falloc(25GB); only then is QEMU
launched and the actual 4k IO requests start to happen.

So I would expect that after falloc(25GB) all clusters are initialized
and the end result would be closer to a full preallocation (i.e. writing
25GB worth of zeroes to disk).

> IOWs, typical "write once" benchmark testing indicates the *worst*
> performance you are going to see. As the guest filesystem ages and
> initialises more of the underlying image file, it will get faster, not
> slower.

Yes, that's clear: once everything is allocated it is fast (and really
much faster on xfs than on ext4). What we are trying to optimize in
qcow2 is precisely the allocation of new clusters.

Berto


