Re: [Qemu-devel] [PATCH][RFC][0/2] REF+/REF- optimization

From: Kevin Wolf <kwolf@redhat.com>
To: Frediano Ziglio <freddy77@gmail.com>
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH][RFC][0/2] REF+/REF- optimization
Date: Wed, 14 Sep 2011 11:10:38 +0200	[thread overview]
Message-ID: <4E706F8E.4060101@redhat.com> (raw)
In-Reply-To: <CAHt6W4cEGRmweASGkOJBiZNmj-Ddnagx9EH6kPTFKi8JBwswmQ@mail.gmail.com>

Am 13.09.2011 15:36, schrieb Frediano Ziglio:
> 2011/9/13 Kevin Wolf <kwolf@redhat.com>:
>> Am 13.09.2011 09:53, schrieb Frediano Ziglio:
>>> These patches try to trade-off between leaks and speed for clusters
>>> refcounts.
>>>
>>> Refcount increments (REF+ or refp) are handled in a different way from
>>> decrements (REF- or refm). The reason it that posting or not flushing
>>> a REF- cause "just" a leak while posting a REF+ cause a corruption.
>>>
>>> To optimize REF- I just used an array to store offsets then when a
>>> flush is requested or array reach a limit (currently 1022) the array
>>> is sorted and written to disk. I use an array with offset instead of
>>> ranges to support compression (an offset could appear multiple times
>>> in the array).
>>> I consider this patch quite ready.
>>
>> Ok, first of all let's clarify what this optimises. I don't think it
>> changes anything at all for the writeback cache modes, because these
>> already do most operations in memory only. So this must be about
>> optimising some operations with cache=writethrough. REF- isn't about
>> normal cluster allocation, it is about COW with internal snapshots or
>> bdrv_discard. Do you have benchmarks for any of them?
>>
>> I strongly disagree with your approach for REF-. We already have a
>> cache, and introducing a second one sounds like a bad idea. I think we
>> could get a very similar effect if we introduced a
>> qcow2_cache_entry_mark_dirty_wb() that marks a given refcount block as
>> dirty, but at the same time tells the cache that even in write-through
>> mode it can still treat this block as write-back. This should require
>> much less code changes.
>>
> 
> Yes, mainly optimize for writethrough. I did not test with writeback
> but should improve even this (I think here you have some flush to keep
> consistency).
> I'll try to write a qcow2_cache_entry_mark_dirty_wb patch and test it.

Great, thanks!

>> But let's measure the effects first, I suspect that for cluster
>> allocation it doesn't help much because every REF- comes with a REF+.
>>
> 
> That's 50% of effort if REF- clusters are far from REF+ :)

I would expect that the next REF+ allocates exactly the REF- cluster.
But you still have a point, we save the write on REF- and combine it
with the REF+ write.

>>> To optimize REF+ I mark a range as allocated and use this range to
>>> get new ones (avoiding writing refcount to disk). When a flush is
>>> requested or in some situations (like snapshot) this cache is disabled
>>> and flushed (written as REF-).
>>> I do not consider this patch ready, it works and pass all io-tests
>>> but for instance I would avoid allocating new clusters for refcount
>>> during preallocation.
>>
>> The only question here is if improving cache=writethrough cluster
>> allocation performance is worth the additional complexity in the already
>> complex refcounting code.
>>
> 
> I didn't see this optimization as a second level cache, but yes, for
> REF- is a second cache.
> 
>> The alternative that was discussed before is the dirty bit approach that
>> is used in QED and would allow us to use writeback for all refcount
>> blocks, regardless of REF- or REF+. It would be an easier approach
>> requiring less code changes, but it comes with the cost of requiring an
>> fsck after a qemu crash.
>>
> 
> I was thinking about changing the header magic first time we change
> refcount in order to mark image as dirty so newer Qemu recognize the
> flag while former one does not recognize image. Obviously reverting
> magic on image close.

We've discussed this idea before and I think it wasn't considered a
great idea to automagically change the header in an incompatible way.
But we can always say that for improved performance you need to upgrade
your image to qcow2 v3.

>>> End speed up is quite visible allocating clusters (more then 20%).
>>
>> What benchmark do you use for testing this?
>>
>> Kevin
>>
> 
> Currently I'm using bonnie++ but I noted similar improves with iozone.
> The test script format an image then launch a Linux machine which run
> a script and save result to a file.
> The test image is seems by this virtual machine as a separate disk.
> The file on hist reside in a separate LV.
> I got quite consistent results (of course not working on the machine
> while testing, is not actually dedicated to this job).
> 
> Actually I'm running the test (added a test working in a snapshot image).

Okay. Let me guess the remaining variables: The image is on an ext4 host
filesystem, you use cache=writethrough and virtio-blk. You don't use
backing files, compression and encryption. For your tests with internal
snapshots you have exactly one internal snapshot that is taken
immediately before the benchmark. Oh, and not to forget, KVM is enabled.

Are these assumptions correct?

Kevin