Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes

From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org
Cc: kwolf@redhat.com, jsnow@redhat.com, qemu-devel@nongnu.org,
	ehabkost@redhat.com, crosa@redhat.com
Subject: Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
Date: Mon, 15 Mar 2021 10:58:54 +0100	[thread overview]
Message-ID: <6056196d-a0cc-7de2-5d6f-b223fdee98ff@redhat.com> (raw)
In-Reply-To: <d5acfe9d-2095-a601-20b7-bd0b677df68a@virtuozzo.com>

On 12.03.21 19:43, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 21:15, Max Reitz wrote:
>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>> Compressed writes are unaligned to 512, which works very slow in
>>> O_DIRECT mode. Let's use the cache.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/coroutines.h     |   3 +
>>>   block/qcow2.h          |   4 ++
>>>   block/qcow2-refcount.c |  10 +++
>>>   block/qcow2.c          | 158 ++++++++++++++++++++++++++++++++++++++---
>>>   4 files changed, 164 insertions(+), 11 deletions(-)

[...]

>>> @@ -2699,6 +2796,12 @@ static void qcow2_close(BlockDriverState *bs)
>>>           qcow2_inactivate(bs);
>>>       }
>>> +    /*
>>> +     * Cache should be flushed in qcow2_inactivate() and should be 
>>> empty in
>>> +     * inactive mode. So we are safe to free it.
>>> +     */
>>> +    seqcache_free(s->compressed_cache);
>>> +
>>>       cache_clean_timer_del(bs);
>>>       qcow2_cache_destroy(s->l2_table_cache);
>>>       qcow2_cache_destroy(s->refcount_block_cache);
>>> @@ -4558,18 +4661,42 @@ 
>>> qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>>>           goto fail;
>>>       }
>>> -    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
>>> +    if (s->compressed_cache) {
>>
>> Why is this conditional?
> 
> We don't have compressed_cache for non o_direct.

Oh right.

>>> +        /*
>>> +         * It's important to do seqcache_write() in the same 
>>> critical section
>>> +         * (by s->lock) as qcow2_alloc_compressed_cluster_offset(), 
>>> so that the
>>> +         * cache is filled sequentially.
>>> +         */
>>
>> Yes.
>>
>>> +        seqcache_write(s->compressed_cache, cluster_offset, out_len, 
>>> out_buf);
>>> -    qemu_co_mutex_unlock(&s->lock);
>>> +        qemu_co_mutex_unlock(&s->lock);
>>> -    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>>> -    ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, 
>>> out_buf, 0);
>>> +        ret = qcow2_co_compressed_flush_one(bs, false);
>>
>> The qcow2 doc says a compressed cluster can span multiple host 
>> clusters.  I don’t know whether that can happen with this driver, but 
>> if it does, wouldn’t that mean we’d need to flush two clusters here?  
>> Oh, no, never mind.  Only the first one would be finished and thus 
>> flushed, not the second one.
>>
>> I could have now removed the above paragraph, but it made me think, so 
>> I kept it:
>>
>> Hm.  Actually, if we unconditionally flush here, doesn’t that mean 
>> that we’ll never have a finished cluster in the cache for longer than 
>> the span between the seqcache_write() and this 
>> qcow2_co_compressed_flush_one()?  I.e., the 
>> qcow2_co_flush_compressed_cache() is supposed to never flush any 
>> finished cluster, but only the currently active unfinished cluster (if 
>> there is one), right?
> 
> Hmm. Maybe if we have parallel write and flush requests, it's a kind of 
> race condition: may be flush will flush both finished and unfinished 
> cluster, maybe write will flush the finished cluster and flush will 
> flush only unfinished one.. Moreover we may have several parallel 
> requests, so they make several finished clusters, and sudden flush will 
> flush them all.

OK.  I was mostly asking because I was wondering how much you expect the 
cache to be filled, i.e., how much you expect the read cache to help.

[...]

>>> @@ -4681,10 +4808,19 @@ qcow2_co_preadv_compressed(BlockDriverState *bs,
>>>       out_buf = qemu_blockalign(bs, s->cluster_size);
>>> -    BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
>>> -    ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
>>> -    if (ret < 0) {
>>> -        goto fail;
>>> +    /*
>>> +     * seqcache_read may return less bytes than csize, as csize may 
>>> exceed
>>> +     * actual compressed data size. So we are OK if seqcache_read 
>>> returns
>>> +     * something > 0.
>>
>> I was about to ask what happens when a compressed cluster spans two 
>> host clusters (I could have imagined that in theory the second one 
>> could have been discarded, but not the first one, so reading from the 
>> cache would really be short -- we would have needed to check that we 
>> only fell short in the range of 512 bytes, not more).
>>
>> But then I realized that in this version of the series, all finished 
>> clusters are immediately discarded and only the current unfinished one 
>> is kept.  Does it even make sense to try seqcache_read() here, then?
> 
> Hmm. Not immediately, but after flush. An flush is not under mutex. So 
> in theory at some moment we may have several finished clusters 
> "in-flight". And your question make sense. The cache supports reading 
> from consequitive clusters. But we also should support here reading one 
> part of data from disk and another from the cache to be safe.

The question is whether it really makes sense to even have a 
seqcache_read() path when in reality it’s probably never accessed.  I 
mean, besides the fact that it seems based purely on chance whether a 
read might fetch something from the cache even while we’re writing, in 
practice I don’t know any case where we’d write to and read from a 
compressed qcow2 image at the same time.  (I don’t know what you’re 
doing with the 'compress' filter, though.)

Max