Re: [PATCH v2 1/4] qemu-img: implement compare --stat

From: Hanna Reitz <hreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org
Cc: kwolf@redhat.com, eblake@redhat.com, qemu-devel@nongnu.org,
	nsoffer@redhat.com, nikita.lapshin@virtuozzo.com, den@openvz.org,
	jsnow@redhat.com
Subject: Re: [PATCH v2 1/4] qemu-img: implement compare --stat
Date: Wed, 27 Oct 2021 10:35:44 +0200	[thread overview]
Message-ID: <6b016523-87d5-f17e-8f1c-2e6ccf0dce0b@redhat.com> (raw)
In-Reply-To: <f63b4410-2c3d-d078-26e5-e891bfbb61e5@virtuozzo.com>

On 26.10.21 11:18, Vladimir Sementsov-Ogievskiy wrote:
> 26.10.2021 11:47, Hanna Reitz wrote:
>> On 26.10.21 09:53, Vladimir Sementsov-Ogievskiy wrote:
>>> 25.10.2021 19:40, Hanna Reitz wrote:
>>>> On 21.10.21 12:12, Vladimir Sementsov-Ogievskiy wrote:
>>>>> With new option qemu-img compare will not stop at first mismatch, but
>>>>> instead calculate statistics: how many clusters with different data,
>>>>> how many clusters with equal data, how many clusters were unallocated
>>>>> but become data and so on.
>>>>>
>>>>> We compare images chunk by chunk. Chunk size depends on what
>>>>> block_status returns for both images. It may return less than cluster
>>>>> (remember about qcow2 subclusters), it may return more than 
>>>>> cluster (if
>>>>> several consecutive clusters share same status). Finally images may
>>>>> have different cluster sizes. This all leads to ambiguity in how to
>>>>> finally compare the data.
>>>>>
>>>>> What we can say for sure is that, when we compare two qcow2 images 
>>>>> with
>>>>> same cluster size, we should compare clusters with data separately.
>>>>> Otherwise, if we for example compare 10 consecutive clusters of data
>>>>> where only one byte differs we'll report 10 different clusters.
>>>>> Expected result in this case is 1 different cluster and 9 equal ones.
>>>>>
>>>>> So, to serve this case and just to have some defined rule let's do 
>>>>> the
>>>>> following:
>>>>>
>>>>> 1. Select some block-size for compare procedure. In this commit it 
>>>>> must
>>>>>     be specified by user, next commit will add some automatic 
>>>>> logic and
>>>>>     make --block-size optional.
>>>>>
>>>>> 2. Go chunk-by-chunk using block_status as we do now with only one
>>>>>     differency:
>>>>>     If block_status() returns DATA region that intersects block-size
>>>>>     aligned boundary, crop this region at this boundary.
>>>>>
>>>>> This way it's still possible to compare less than cluster and report
>>>>> subcluster-level accuracy, but we newer compare more than one cluster
>>>>> of data.
>>>>>
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>>>>> <vsementsov@virtuozzo.com>
>>>>> ---
>>>>>   docs/tools/qemu-img.rst |  18 +++-
>>>>>   qemu-img.c              | 206 
>>>>> +++++++++++++++++++++++++++++++++++++---
>>>>>   qemu-img-cmds.hx        |   4 +-
>>>>>   3 files changed, 212 insertions(+), 16 deletions(-)
>>>>
>>>> Looks good to me overall!  Just some technical comments below.
>>>>
>>>>> diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
>>>>> index d58980aef8..21164253d4 100644
>>>>> --- a/docs/tools/qemu-img.rst
>>>>> +++ b/docs/tools/qemu-img.rst
>>>>> @@ -159,6 +159,18 @@ Parameters to compare subcommand:
>>>>>     Strict mode - fail on different image size or sector allocation
>>>>> +.. option:: --stat
>>>>> +
>>>>> +  Instead of exit on first mismatch compare the whole images and 
>>>>> print
>>>>> +  statistics on amount of different pairs of clusters, based on 
>>>>> their
>>>>> +  block-status and are they equal or not.
>>>>
>>>> I’d phrase this as:
>>>>
>>>> Instead of exiting on the first mismatch, compare the whole images 
>>>> and print statistics on how much they differ in terms of block 
>>>> status (i.e. are blocks allocated or not, do they contain data, are 
>>>> they marked as containing only zeroes) and block content (a block 
>>>> of data that contains only zero still has the same content as a 
>>>> marked-zero block).
>>>
>>> For me the rest starting from "and block content" sounds unclear, 
>>> seems doesn't add any information to previous (i.e. are blocks 
>>> allocated ...)
>>
>> By “block content” I meant what you said by “equal or not”, i.e. what 
>> is returned when reading from the block.
>>
>> Now that I think about it again, I believe we should go with your 
>> original “equal or not”, though, because that reflects what qemu-img 
>> --stat prints, like so perhaps:
>>
>> Instead of exiting on the first mismatch, compare the whole images 
>> and print statistics on the amount of different pairs of blocks, 
>> based on their block status and whether they are equal or not.
>
> OK
>
>>
>> I’d still like to add something like what I had in parentheses, 
>> though, because as a user, I’d find the “block status” and “equal or 
>> not” terms to be a bit handwave-y.  I don’t think “block status” is a 
>> common term in our documentation, so I wanted to add some examples; 
>> and I wanted to show by example that “equal” blocks don’t need to 
>> have the same block status.
>
> Actually, I think, that the resulting tool is anyway 
> developer-oriented, to use it you should know what this block status 
> really means.. May be just s/block status/block type/, will it be more 
> human friendly?

Oh, OK.  I think I’d prefer “block status” still, because that’s what we 
use in the code.

Hanna