qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* backing chain & block status & filters
@ 2020-04-28  8:55 Vladimir Sementsov-Ogievskiy
  2020-04-28 11:08 ` Max Reitz
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-28  8:55 UTC (permalink / raw)
  To: qemu block; +Cc: Kevin Wolf, Andrey Shinkevich, qemu-devel, Max Reitz

Hi!

I wanted to resend my "[PATCH 0/4] fix & merge block_status_above and is_allocated_above", and returned to all the inconsistencies about block-status. I keep in mind Max's series about child-access functions, and Andrey's work about using COR filter in block-stream, which depends on Max's series (because, without them COR fitler with file child breaks backing chains).. And, it seems that it's better to discuss some questions before resending.

First, problems about block-status:

1. We consider ALLOCATED = ZERO | DATA, and documented as follows:

    * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
    * BDRV_BLOCK_ZERO: offset reads as zero
    * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing raw data
    * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
    *                       layer rather than any backing, set by block layer

This actually means, that we should always have BDRV_BLOCK_ALLOCATED for formats which doesn't support backing. So, all such format drivers must return ZERO or DATA (or both?), yes?. Seems file-posix does so, but, for example, iscsi - doesn't.

2. ZERO. The meaning differs a bit for generic block_status and for drivers.. I think, we at least should document it like this:

BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
BDRV_BLOCK_ZERO: if driver return ZERO, than the region is allocated at this layer and read as ZERO. If generic block_status returns ZERO, it only mean that it reads as zero, but the region may be allocated on underlying level.

3. bdi.unallocated_blocks_are_zero

I think it's very bad, that we have formats, that supports backing, but doesn't report bdi.unallocated_blocks_are_zero as true. It means that UNALLOCATED region reads as zero if we have short backing file, and not as zero if we remove this short backing file. I can live with it but this is weird logic. These bad drivers are qcow (not qcow2), parallels and vmdk. I hope, they actually just forget to set unallocated_blocks_are_zero to true.

Next. But what about drivers which doesn't support backing? As we considered above, they should always return ZERO or DATA, as everything is allocated in this backing-chain level (last level, of course).. So again unallocated_blocks_are_zero is meaningless. So, I think, that driver which doesn't support backings, should be fixed to return always ZERO or DATA, than we don't need this unallocated_blocks_are_zero at all.

3. Short backing files in allocated_above: we must consider space after EOF as ALLOCATED, if short backing file is inside requested backing-chain part, as it produced exactly because of this short file (and we never go to backing). (current realization of allocated_above is buggy, see my outdated series "[PATCH 0/4] fix & merge block_status_above and is_allocated_above")

4. Long ago we've discussed problems about BDRV_BLOCK_RAW, when we have a backing chain of non-backing child.. I just remember that we didn't reach the consensus.

5. Filters.. OK we have two functions for them: bdrv_co_block_status_from_file and bdrv_co_block_status_from_backing. I think both are wrong:

bdrv_co_block_status_from_file leads to problem [4], when we can report UNALLOCATED, which refers not to the current backing chain, but to sub backing chain of file child, which is inconsistent with block_status_above and is_allocated_above iteration.

bdrv_co_block_status_from_backing is also is not consistent with block_status_above iteration.. At least at leads to querying the same node twice.

=================

So, about filters and backing chains. Keeping (OK, just, trying to keep) all these things in mind, I think that it's better to keep backing chains exactly *backing* chains, so that "backing" child is the only "not own" child of the node. So, its close to current behavior and we don't need child-access functions. Then how filters should work:

Filter with backing child, should always return UNALLOCATED (i.e. no DATA, no ZERO), it is honest: everything is on the other level of backing chain.

Filter with file child should always return BDRV_BLOCK_DATA | BDRV_BLOCK_RECURSE, to show that:
1. everything is allocated in *this* level of backing chain
2. filter is too lazy to dig in it's file child (and, maybe the whole sub-tree of it) and asks generic layer to do it by itself, if it wants zeroes.

Then, of course, if we want some filter to be inside backing chain, it should have not "file" child but "backing". For this, we may support in current public filter both variants: backing or file, as user prefer. I.e., filter is opened either with file option or with backing and operate correspondingly. And newer filters (like backup-top) may support only backing variants.

=================

So, I propose to complicate a bit code of old file-based filters, to support backing child (which may be done on demand, when needed. For example, Virtuozzo now need only COR filter (and only to satisfy iotests, actually)), and keep generic code about block-status and handling backing-chains simpler (and simpler to document).

Another benefit is that block-graph is already visible through QAPI, and it would be good to keep backing link meaning, not adding the thing that for filters "file" link behaves as "backing".

So again, I'm for limiting backing-behavior to bs->backing child only (as it currently behaves). What do you think?

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28  8:55 backing chain & block status & filters Vladimir Sementsov-Ogievskiy
@ 2020-04-28 11:08 ` Max Reitz
  2020-04-28 11:28   ` Kevin Wolf
  2020-04-28 14:51   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 2 replies; 16+ messages in thread
From: Max Reitz @ 2020-04-28 11:08 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu block
  Cc: Kevin Wolf, Andrey Shinkevich, qemu-devel


[-- Attachment #1.1: Type: text/plain, Size: 6699 bytes --]

On 28.04.20 10:55, Vladimir Sementsov-Ogievskiy wrote:
> Hi!
> 
> I wanted to resend my "[PATCH 0/4] fix & merge block_status_above and
> is_allocated_above", and returned to all the inconsistencies about
> block-status. I keep in mind Max's series about child-access functions,
> and Andrey's work about using COR filter in block-stream, which depends
> on Max's series (because, without them COR fitler with file child breaks
> backing chains).. And, it seems that it's better to discuss some
> questions before resending.
> 
> First, problems about block-status:
> 
> 1. We consider ALLOCATED = ZERO | DATA, and documented as follows:
> 
>    * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
>    * BDRV_BLOCK_ZERO: offset reads as zero
>    * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing
> raw data
>    * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>    *                       layer rather than any backing, set by block
> layer
> 
> This actually means, that we should always have BDRV_BLOCK_ALLOCATED for
> formats which doesn't support backing. So, all such format drivers must
> return ZERO or DATA (or both?), yes?. Seems file-posix does so, but, for
> example, iscsi - doesn't.

Hm.  I could imagine that there are formats that have non-zero holes
(e.g. 0xff or just garbage).  It would be a bit wrong for them to return
ZERO or DATA then.

But OTOH we don’t care about such cases, do we?  We need to know whether
ranges are zero, data, or unallocated.  If they aren’t zero, we only
care about whether reading from it will return data from this layer or not.

So I suppose that anything that doesn’t support backing files (or
filtered children) should always return ZERO and/or DATA.

> 2. ZERO. The meaning differs a bit for generic block_status and for
> drivers.. I think, we at least should document it like this:
> 
> BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
> BDRV_BLOCK_ZERO: if driver return ZERO, than the region is allocated at
> this layer and read as ZERO. If generic block_status returns ZERO, it
> only mean that it reads as zero, but the region may be allocated on
> underlying level.

Hm.  What does that mean?

One of the problems is that “allocated” has two meanings:
(1) reading data returns data defined at this backing layer,
(2) actually allocated, i.e. takes up space on the file represented by
this BDS.

As far as I understand, we actually don’t care about (2) in the context
of block_status, but just about (1).

So if a layer returns ZERO, it is by definition (1)-allocated.  (It
isn’t necessarily (2)-allocated.)

> 3. bdi.unallocated_blocks_are_zero
> 
> I think it's very bad, that we have formats, that supports backing, but
> doesn't report bdi.unallocated_blocks_are_zero as true. It means that
> UNALLOCATED region reads as zero if we have short backing file, and not
> as zero if we remove this short backing file.

What do you mean by “remove this short backing file”?  Because generally
one can’t just drop a backing file.

So maybe a case like block-stream?  Wouldn’t that be a bug in
block-stream them, i.e. shouldn’t it stream zeros after the end of the
backing file?

> I can live with it but
> this is weird logic. These bad drivers are qcow (not qcow2), parallels
> and vmdk. I hope, they actually just forget to set
> unallocated_blocks_are_zero to true.

qcow definitely sounds like it.

> Next. But what about drivers which doesn't support backing? As we
> considered above, they should always return ZERO or DATA, as everything
> is allocated in this backing-chain level (last level, of course).. So
> again unallocated_blocks_are_zero is meaningless. So, I think, that
> driver which doesn't support backings, should be fixed to return always
> ZERO or DATA, than we don't need this unallocated_blocks_are_zero at all.

Agreed.

> 3.

The second 3.? :)

> Short backing files in allocated_above: we must consider space after
> EOF as ALLOCATED, if short backing file is inside requested
> backing-chain part, as it produced exactly because of this short file
> (and we never go to backing).

Sounds correct.

> (current realization of allocated_above is
> buggy, see my outdated series "[PATCH 0/4] fix & merge
> block_status_above and is_allocated_above")
> 
> 4. Long ago we've discussed problems about BDRV_BLOCK_RAW, when we have
> a backing chain of non-backing child.. I just remember that we didn't
> reach the consensus.

Possible? :)

> 5. Filters.. OK we have two functions for them:
> bdrv_co_block_status_from_file and bdrv_co_block_status_from_backing. I
> think both are wrong:
> 
> bdrv_co_block_status_from_file leads to problem [4], when we can report
> UNALLOCATED, which refers not to the current backing chain, but to sub
> backing chain of file child, which is inconsistent with
> block_status_above and is_allocated_above iteration.
> 
> bdrv_co_block_status_from_backing is also is not consistent with
> block_status_above iteration.. At least at leads to querying the same
> node twice.

Well, yes.

> So, about filters and backing chains. Keeping (OK, just, trying to keep)
> all these things in mind, I think that it's better to keep backing
> chains exactly *backing* chains, so that "backing" child is the only
> "not own" child of the node. So, its close to current behavior and we
> don't need child-access functions. Then how filters should work:
> 
> Filter with backing child, should always return UNALLOCATED (i.e. no
> DATA, no ZERO), it is honest: everything is on the other level of
> backing chain.

I disagree, because filters with or without backing children should work
exactly the same way and just not appear in the backing chain.

> Filter with file child should always return BDRV_BLOCK_DATA |
> BDRV_BLOCK_RECURSE, to show that:
> 1. everything is allocated in *this* level of backing chain
> 2. filter is too lazy to dig in it's file child (and, maybe the whole
> sub-tree of it) and asks generic layer to do it by itself, if it wants
> zeroes.
> 
> Then, of course, if we want some filter to be inside backing chain, it
> should have not "file" child but "backing". For this, we may support in
> current public filter both variants: backing or file, as user prefer.
> I.e., filter is opened either with file option or with backing and
> operate correspondingly. And newer filters (like backup-top) may support
> only backing variants.

I disagree again for the same reason.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28 11:08 ` Max Reitz
@ 2020-04-28 11:28   ` Kevin Wolf
  2020-04-28 15:13     ` Vladimir Sementsov-Ogievskiy
  2020-04-28 14:51   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 16+ messages in thread
From: Kevin Wolf @ 2020-04-28 11:28 UTC (permalink / raw)
  To: Max Reitz
  Cc: Andrey Shinkevich, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu block

[-- Attachment #1: Type: text/plain, Size: 2434 bytes --]

Am 28.04.2020 um 13:08 hat Max Reitz geschrieben:
> On 28.04.20 10:55, Vladimir Sementsov-Ogievskiy wrote:
> > Hi!
> > 
> > I wanted to resend my "[PATCH 0/4] fix & merge block_status_above and
> > is_allocated_above", and returned to all the inconsistencies about
> > block-status. I keep in mind Max's series about child-access functions,
> > and Andrey's work about using COR filter in block-stream, which depends
> > on Max's series (because, without them COR fitler with file child breaks
> > backing chains).. And, it seems that it's better to discuss some
> > questions before resending.
> > 
> > First, problems about block-status:
> > 
> > 1. We consider ALLOCATED = ZERO | DATA, and documented as follows:
> > 
> >    * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
> >    * BDRV_BLOCK_ZERO: offset reads as zero
> >    * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing
> > raw data
> >    * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
> >    *                       layer rather than any backing, set by block
> > layer
> > 
> > This actually means, that we should always have BDRV_BLOCK_ALLOCATED for
> > formats which doesn't support backing. So, all such format drivers must
> > return ZERO or DATA (or both?), yes?. Seems file-posix does so, but, for
> > example, iscsi - doesn't.
> 
> Hm.  I could imagine that there are formats that have non-zero holes
> (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
> ZERO or DATA then.
> 
> But OTOH we don’t care about such cases, do we?  We need to know whether
> ranges are zero, data, or unallocated.  If they aren’t zero, we only
> care about whether reading from it will return data from this layer or not.
> 
> So I suppose that anything that doesn’t support backing files (or
> filtered children) should always return ZERO and/or DATA.

I'm not sure I agree with the notion that everything should be
BDRV_BLOCK_ALLOCATED at the lowest layer. It's not what it means today
at least. If we want to change this, we will have to check all callers
of bdrv_is_allocated() and friends who might use this to find holes in
the file.

Basically, the way bdrv_is_allocated() works today is that we assume an
implicit zeroed backing layer even for block drivers that don't support
backing files.

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28 11:08 ` Max Reitz
  2020-04-28 11:28   ` Kevin Wolf
@ 2020-04-28 14:51   ` Vladimir Sementsov-Ogievskiy
  2020-04-30 19:12     ` Vladimir Sementsov-Ogievskiy
  2020-05-07 12:58     ` Max Reitz
  1 sibling, 2 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-28 14:51 UTC (permalink / raw)
  To: Max Reitz, qemu block; +Cc: Kevin Wolf, Andrey Shinkevich, qemu-devel

28.04.2020 14:08, Max Reitz wrote:
> On 28.04.20 10:55, Vladimir Sementsov-Ogievskiy wrote:
>> Hi!
>>
>> I wanted to resend my "[PATCH 0/4] fix & merge block_status_above and
>> is_allocated_above", and returned to all the inconsistencies about
>> block-status. I keep in mind Max's series about child-access functions,
>> and Andrey's work about using COR filter in block-stream, which depends
>> on Max's series (because, without them COR fitler with file child breaks
>> backing chains).. And, it seems that it's better to discuss some
>> questions before resending.
>>
>> First, problems about block-status:
>>
>> 1. We consider ALLOCATED = ZERO | DATA, and documented as follows:
>>
>>     * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
>>     * BDRV_BLOCK_ZERO: offset reads as zero
>>     * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing
>> raw data
>>     * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>     *                       layer rather than any backing, set by block
>> layer
>>
>> This actually means, that we should always have BDRV_BLOCK_ALLOCATED for
>> formats which doesn't support backing. So, all such format drivers must
>> return ZERO or DATA (or both?), yes?. Seems file-posix does so, but, for
>> example, iscsi - doesn't.
> 
> Hm.  I could imagine that there are formats that have non-zero holes
> (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
> ZERO or DATA then.
> 
> But OTOH we don’t care about such cases, do we?  We need to know whether
> ranges are zero, data, or unallocated.  If they aren’t zero, we only
> care about whether reading from it will return data from this layer or not.
> 
> So I suppose that anything that doesn’t support backing files (or
> filtered children) should always return ZERO and/or DATA.
> 
>> 2. ZERO. The meaning differs a bit for generic block_status and for
>> drivers.. I think, we at least should document it like this:
>>
>> BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
>> BDRV_BLOCK_ZERO: if driver return ZERO, than the region is allocated at
>> this layer and read as ZERO. If generic block_status returns ZERO, it
>> only mean that it reads as zero, but the region may be allocated on
>> underlying level.
> 
> Hm.  What does that mean?
> 
> One of the problems is that “allocated” has two meanings:
> (1) reading data returns data defined at this backing layer,
> (2) actually allocated, i.e. takes up space on the file represented by
> this BDS.
> 
> As far as I understand, we actually don’t care about (2) in the context
> of block_status, but just about (1).
> 
> So if a layer returns ZERO, it is by definition (1)-allocated.  (It
> isn’t necessarily (2)-allocated.)
> 
>> 3. bdi.unallocated_blocks_are_zero
>>
>> I think it's very bad, that we have formats, that supports backing, but
>> doesn't report bdi.unallocated_blocks_are_zero as true. It means that
>> UNALLOCATED region reads as zero if we have short backing file, and not
>> as zero if we remove this short backing file.
> 
> What do you mean by “remove this short backing file”?  Because generally
> one can’t just drop a backing file.
> 
> So maybe a case like block-stream?  Wouldn’t that be a bug in
> block-stream them, i.e. shouldn’t it stream zeros after the end of the
> backing file?
> 
>> I can live with it but
>> this is weird logic. These bad drivers are qcow (not qcow2), parallels
>> and vmdk. I hope, they actually just forget to set
>> unallocated_blocks_are_zero to true.
> 
> qcow definitely sounds like it.
> 
>> Next. But what about drivers which doesn't support backing? As we
>> considered above, they should always return ZERO or DATA, as everything
>> is allocated in this backing-chain level (last level, of course).. So
>> again unallocated_blocks_are_zero is meaningless. So, I think, that
>> driver which doesn't support backings, should be fixed to return always
>> ZERO or DATA, than we don't need this unallocated_blocks_are_zero at all.
> 
> Agreed.
> 
>> 3.
> 
> The second 3.? :)
> 
>> Short backing files in allocated_above: we must consider space after
>> EOF as ALLOCATED, if short backing file is inside requested
>> backing-chain part, as it produced exactly because of this short file
>> (and we never go to backing).
> 
> Sounds correct.
> 
>> (current realization of allocated_above is
>> buggy, see my outdated series "[PATCH 0/4] fix & merge
>> block_status_above and is_allocated_above")
>>
>> 4. Long ago we've discussed problems about BDRV_BLOCK_RAW, when we have
>> a backing chain of non-backing child.. I just remember that we didn't
>> reach the consensus.
> 
> Possible? :)
> 
>> 5. Filters.. OK we have two functions for them:
>> bdrv_co_block_status_from_file and bdrv_co_block_status_from_backing. I
>> think both are wrong:
>>
>> bdrv_co_block_status_from_file leads to problem [4], when we can report
>> UNALLOCATED, which refers not to the current backing chain, but to sub
>> backing chain of file child, which is inconsistent with
>> block_status_above and is_allocated_above iteration.
>>
>> bdrv_co_block_status_from_backing is also is not consistent with
>> block_status_above iteration.. At least at leads to querying the same
>> node twice.
> 
> Well, yes.
> 
>> So, about filters and backing chains. Keeping (OK, just, trying to keep)
>> all these things in mind, I think that it's better to keep backing
>> chains exactly *backing* chains, so that "backing" child is the only
>> "not own" child of the node. So, its close to current behavior and we
>> don't need child-access functions. Then how filters should work:
>>
>> Filter with backing child, should always return UNALLOCATED (i.e. no
>> DATA, no ZERO), it is honest: everything is on the other level of
>> backing chain.
> 
> I disagree, because filters with or without backing children should work
> exactly the same way and just not appear in the backing chain.

They work the same way in my variant too. The only difference is that if you use file-child-based filters, you can't do stream/commit around them. I just think, that binding the concept to the "backing" link of the node is simpler and more generic. In blockdev era, when all nodes will be named and libvirt will take care of all nodes including filters, we will not need any filter-skipping magic, libvirt will directly point to exact nodes it means. And we can deprecate "file" children of existing filters, to finally reach simple architecture with simple backing chain of nodes (any nodes). And after deprecating old filename-based interfaces, we'll drop all filter-skipping magic..

> 
>> Filter with file child should always return BDRV_BLOCK_DATA |
>> BDRV_BLOCK_RECURSE, to show that:
>> 1. everything is allocated in *this* level of backing chain
>> 2. filter is too lazy to dig in it's file child (and, maybe the whole
>> sub-tree of it) and asks generic layer to do it by itself, if it wants
>> zeroes.
>>
>> Then, of course, if we want some filter to be inside backing chain, it
>> should have not "file" child but "backing". For this, we may support in
>> current public filter both variants: backing or file, as user prefer.
>> I.e., filter is opened either with file option or with backing and
>> operate correspondingly. And newer filters (like backup-top) may support
>> only backing variants.
> 
> I disagree again for the same reason.
> 
> Max
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28 11:28   ` Kevin Wolf
@ 2020-04-28 15:13     ` Vladimir Sementsov-Ogievskiy
  2020-04-28 16:18       ` Eric Blake
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-28 15:13 UTC (permalink / raw)
  To: Kevin Wolf, Max Reitz; +Cc: Andrey Shinkevich, qemu-devel, qemu block

28.04.2020 14:28, Kevin Wolf wrote:
> Am 28.04.2020 um 13:08 hat Max Reitz geschrieben:
>> On 28.04.20 10:55, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi!
>>>
>>> I wanted to resend my "[PATCH 0/4] fix & merge block_status_above and
>>> is_allocated_above", and returned to all the inconsistencies about
>>> block-status. I keep in mind Max's series about child-access functions,
>>> and Andrey's work about using COR filter in block-stream, which depends
>>> on Max's series (because, without them COR fitler with file child breaks
>>> backing chains).. And, it seems that it's better to discuss some
>>> questions before resending.
>>>
>>> First, problems about block-status:
>>>
>>> 1. We consider ALLOCATED = ZERO | DATA, and documented as follows:
>>>
>>>     * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
>>>     * BDRV_BLOCK_ZERO: offset reads as zero
>>>     * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing
>>> raw data
>>>     * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>>     *                       layer rather than any backing, set by block
>>> layer
>>>
>>> This actually means, that we should always have BDRV_BLOCK_ALLOCATED for
>>> formats which doesn't support backing. So, all such format drivers must
>>> return ZERO or DATA (or both?), yes?. Seems file-posix does so, but, for
>>> example, iscsi - doesn't.
>>
>> Hm.  I could imagine that there are formats that have non-zero holes
>> (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
>> ZERO or DATA then.
>>
>> But OTOH we don’t care about such cases, do we?  We need to know whether
>> ranges are zero, data, or unallocated.  If they aren’t zero, we only
>> care about whether reading from it will return data from this layer or not.
>>
>> So I suppose that anything that doesn’t support backing files (or
>> filtered children) should always return ZERO and/or DATA.
> 
> I'm not sure I agree with the notion that everything should be
> BDRV_BLOCK_ALLOCATED at the lowest layer. It's not what it means today
> at least. If we want to change this, we will have to check all callers
> of bdrv_is_allocated() and friends who might use this to find holes in
> the file.

Yes. Because they are doing incorrect (or at least undocumented and unreliable) thing.

> 
> Basically, the way bdrv_is_allocated() works today is that we assume an
> implicit zeroed backing layer even for block drivers that don't support
> backing files.

But read doesn't work so: it will read data from the bottom layer, not from
this implicit zeroed backing layer. And it is inconsistent. On read data
comes exactly from this layer, not from its implicit backing. So it should
return BDRV_BLOCK_ALLOCATED, accordingly to its definition..

Or, we should at least document current behavior:

   BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
   layer rather than any backing, set by block. Attention: it may not be set
   for drivers without backing support, still data is of course read from
   this layer. Note, that for such drivers BDRV_BLOCK_ALLOCATED may mean
   allocation on fs level, which occupies real space on disk.. So, for such drivers

   ZERO | ALLOCATED means that, read as zero, data may be allocated on fs, or
   (most probably) not,
   don't look at ALLOCATED flag, as it is added by generic layer for another logic,
   not related to fs-allocation.

   0 means that, most probably, data doesn't occupy space on fs, zero-status is
   unknown (most probably non-zero)

  

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28 15:13     ` Vladimir Sementsov-Ogievskiy
@ 2020-04-28 16:18       ` Eric Blake
  2020-04-28 16:46         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Blake @ 2020-04-28 16:18 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Kevin Wolf, Max Reitz
  Cc: Andrey Shinkevich, qemu-devel, qemu block

On 4/28/20 10:13 AM, Vladimir Sementsov-Ogievskiy wrote:

>>> Hm.  I could imagine that there are formats that have non-zero holes
>>> (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
>>> ZERO or DATA then.
>>>
>>> But OTOH we don’t care about such cases, do we?  We need to know whether
>>> ranges are zero, data, or unallocated.  If they aren’t zero, we only
>>> care about whether reading from it will return data from this layer 
>>> or not.
>>>
>>> So I suppose that anything that doesn’t support backing files (or
>>> filtered children) should always return ZERO and/or DATA.
>>
>> I'm not sure I agree with the notion that everything should be
>> BDRV_BLOCK_ALLOCATED at the lowest layer. It's not what it means today
>> at least. If we want to change this, we will have to check all callers
>> of bdrv_is_allocated() and friends who might use this to find holes in
>> the file.
> 
> Yes. Because they are doing incorrect (or at least undocumented and 
> unreliable) thing.

Here's some previous mails discussing the same question about what 
block_status should actually mean.  At the time, I was so scared of the 
prospect of something breaking if I changed things that I ended up 
keeping status quo, so here we are revisiting the topic several years 
later, still asking the same questions.

https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg00069.html
https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03757.html

> 
>>
>> Basically, the way bdrv_is_allocated() works today is that we assume an
>> implicit zeroed backing layer even for block drivers that don't support
>> backing files.
> 
> But read doesn't work so: it will read data from the bottom layer, not from
> this implicit zeroed backing layer. And it is inconsistent. On read data
> comes exactly from this layer, not from its implicit backing. So it should
> return BDRV_BLOCK_ALLOCATED, accordingly to its definition..
> 
> Or, we should at least document current behavior:
> 
>    BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>    layer rather than any backing, set by block. Attention: it may not be 
> set
>    for drivers without backing support, still data is of course read from
>    this layer. Note, that for such drivers BDRV_BLOCK_ALLOCATED may mean
>    allocation on fs level, which occupies real space on disk.. So, for 
> such drivers
> 
>    ZERO | ALLOCATED means that, read as zero, data may be allocated on 
> fs, or
>    (most probably) not,
>    don't look at ALLOCATED flag, as it is added by generic layer for 
> another logic,
>    not related to fs-allocation.
> 
>    0 means that, most probably, data doesn't occupy space on fs, 
> zero-status is
>    unknown (most probably non-zero)
> 

That may be right in describing the current situation, but again, needs 
a GOOD audit of what we are actually using it for, and whether it is 
what we really WANT to be using it for.  If we're going to 
audit/refactor the code, we might as well get semantics that are 
actually useful, rather than painfully contorted to documentation that 
happens to match our current contorted code.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28 16:18       ` Eric Blake
@ 2020-04-28 16:46         ` Vladimir Sementsov-Ogievskiy
  2020-04-28 18:37           ` Kevin Wolf
  2020-04-28 19:44           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 2 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-28 16:46 UTC (permalink / raw)
  To: Eric Blake, Kevin Wolf, Max Reitz
  Cc: Andrey Shinkevich, qemu-devel, qemu block

28.04.2020 19:18, Eric Blake wrote:
> On 4/28/20 10:13 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>>> Hm.  I could imagine that there are formats that have non-zero holes
>>>> (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
>>>> ZERO or DATA then.
>>>>
>>>> But OTOH we don’t care about such cases, do we?  We need to know whether
>>>> ranges are zero, data, or unallocated.  If they aren’t zero, we only
>>>> care about whether reading from it will return data from this layer or not.
>>>>
>>>> So I suppose that anything that doesn’t support backing files (or
>>>> filtered children) should always return ZERO and/or DATA.
>>>
>>> I'm not sure I agree with the notion that everything should be
>>> BDRV_BLOCK_ALLOCATED at the lowest layer. It's not what it means today
>>> at least. If we want to change this, we will have to check all callers
>>> of bdrv_is_allocated() and friends who might use this to find holes in
>>> the file.
>>
>> Yes. Because they are doing incorrect (or at least undocumented and unreliable) thing.
> 
> Here's some previous mails discussing the same question about what block_status should actually mean.  At the time, I was so scared of the prospect of something breaking if I changed things that I ended up keeping status quo, so here we are revisiting the topic several years later, still asking the same questions.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg00069.html
> https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03757.html
> 
>>
>>>
>>> Basically, the way bdrv_is_allocated() works today is that we assume an
>>> implicit zeroed backing layer even for block drivers that don't support
>>> backing files.
>>
>> But read doesn't work so: it will read data from the bottom layer, not from
>> this implicit zeroed backing layer. And it is inconsistent. On read data
>> comes exactly from this layer, not from its implicit backing. So it should
>> return BDRV_BLOCK_ALLOCATED, accordingly to its definition..
>>
>> Or, we should at least document current behavior:
>>
>>    BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>    layer rather than any backing, set by block. Attention: it may not be set
>>    for drivers without backing support, still data is of course read from
>>    this layer. Note, that for such drivers BDRV_BLOCK_ALLOCATED may mean
>>    allocation on fs level, which occupies real space on disk.. So, for such drivers
>>
>>    ZERO | ALLOCATED means that, read as zero, data may be allocated on fs, or
>>    (most probably) not,
>>    don't look at ALLOCATED flag, as it is added by generic layer for another logic,
>>    not related to fs-allocation.
>>
>>    0 means that, most probably, data doesn't occupy space on fs, zero-status is
>>    unknown (most probably non-zero)
>>
> 
> That may be right in describing the current situation, but again, needs a GOOD audit of what we are actually using it for, and whether it is what we really WANT to be using it for.  If we're going to audit/refactor the code, we might as well get semantics that are actually useful, rather than painfully contorted to documentation that happens to match our current contorted code.
> 

Honest enough:) I'll try to make a table.

I don't think that reporting fs-allocation status is a bad thing. But I'm sure that it should be separated from backing-chain-allocated concept.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28 16:46         ` Vladimir Sementsov-Ogievskiy
@ 2020-04-28 18:37           ` Kevin Wolf
  2020-04-28 19:44           ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 16+ messages in thread
From: Kevin Wolf @ 2020-04-28 18:37 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Andrey Shinkevich, qemu-devel, qemu block, Max Reitz

Am 28.04.2020 um 18:46 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 28.04.2020 19:18, Eric Blake wrote:
> > On 4/28/20 10:13 AM, Vladimir Sementsov-Ogievskiy wrote:
> > 
> > > > > Hm.  I could imagine that there are formats that have non-zero holes
> > > > > (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
> > > > > ZERO or DATA then.
> > > > > 
> > > > > But OTOH we don’t care about such cases, do we?  We need to know whether
> > > > > ranges are zero, data, or unallocated.  If they aren’t zero, we only
> > > > > care about whether reading from it will return data from this layer or not.
> > > > > 
> > > > > So I suppose that anything that doesn’t support backing files (or
> > > > > filtered children) should always return ZERO and/or DATA.
> > > > 
> > > > I'm not sure I agree with the notion that everything should be
> > > > BDRV_BLOCK_ALLOCATED at the lowest layer. It's not what it means today
> > > > at least. If we want to change this, we will have to check all callers
> > > > of bdrv_is_allocated() and friends who might use this to find holes in
> > > > the file.
> > > 
> > > Yes. Because they are doing incorrect (or at least undocumented and unreliable) thing.
> > 
> > Here's some previous mails discussing the same question about what block_status should actually mean.  At the time, I was so scared of the prospect of something breaking if I changed things that I ended up keeping status quo, so here we are revisiting the topic several years later, still asking the same questions.
> > 
> > https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg00069.html
> > https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03757.html
> > 
> > > 
> > > > 
> > > > Basically, the way bdrv_is_allocated() works today is that we assume an
> > > > implicit zeroed backing layer even for block drivers that don't support
> > > > backing files.
> > > 
> > > But read doesn't work so: it will read data from the bottom layer, not from
> > > this implicit zeroed backing layer. And it is inconsistent. On read data
> > > comes exactly from this layer, not from its implicit backing. So it should
> > > return BDRV_BLOCK_ALLOCATED, accordingly to its definition..
> > > 
> > > Or, we should at least document current behavior:
> > > 
> > >    BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
> > >    layer rather than any backing, set by block. Attention: it may not be set
> > >    for drivers without backing support, still data is of course read from
> > >    this layer. Note, that for such drivers BDRV_BLOCK_ALLOCATED may mean
> > >    allocation on fs level, which occupies real space on disk.. So, for such drivers
> > > 
> > >    ZERO | ALLOCATED means that, read as zero, data may be allocated on fs, or
> > >    (most probably) not,
> > >    don't look at ALLOCATED flag, as it is added by generic layer for another logic,
> > >    not related to fs-allocation.
> > > 
> > >    0 means that, most probably, data doesn't occupy space on fs, zero-status is
> > >    unknown (most probably non-zero)
> > > 
> > 
> > That may be right in describing the current situation, but again,
> > needs a GOOD audit of what we are actually using it for, and whether
> > it is what we really WANT to be using it for.  If we're going to
> > audit/refactor the code, we might as well get semantics that are
> > actually useful, rather than painfully contorted to documentation
> > that happens to match our current contorted code.
> > 
> 
> Honest enough:) I'll try to make a table.
> 
> I don't think that reporting fs-allocation status is a bad thing. But
> I'm sure that it should be separated from backing-chain-allocated
> concept.

I think we could easily agree on what would be a good concept.

My concern is just that existing code probably uses existing semantics
and not what we consider more logical now. So if we change it, we must
make sure that we change all places that expect the old semantics.

Kevin



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28 16:46         ` Vladimir Sementsov-Ogievskiy
  2020-04-28 18:37           ` Kevin Wolf
@ 2020-04-28 19:44           ` Vladimir Sementsov-Ogievskiy
  2020-04-29  9:15             ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-28 19:44 UTC (permalink / raw)
  To: Eric Blake, Kevin Wolf, Max Reitz
  Cc: Andrey Shinkevich, qemu-devel, qemu block

[-- Attachment #1: Type: text/plain, Size: 3661 bytes --]

28.04.2020 19:46, Vladimir Sementsov-Ogievskiy wrote:
> 28.04.2020 19:18, Eric Blake wrote:
>> On 4/28/20 10:13 AM, Vladimir Sementsov-Ogievskiy wrote:
>>
>>>>> Hm.  I could imagine that there are formats that have non-zero holes
>>>>> (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
>>>>> ZERO or DATA then.
>>>>>
>>>>> But OTOH we don’t care about such cases, do we?  We need to know whether
>>>>> ranges are zero, data, or unallocated.  If they aren’t zero, we only
>>>>> care about whether reading from it will return data from this layer or not.
>>>>>
>>>>> So I suppose that anything that doesn’t support backing files (or
>>>>> filtered children) should always return ZERO and/or DATA.
>>>>
>>>> I'm not sure I agree with the notion that everything should be
>>>> BDRV_BLOCK_ALLOCATED at the lowest layer. It's not what it means today
>>>> at least. If we want to change this, we will have to check all callers
>>>> of bdrv_is_allocated() and friends who might use this to find holes in
>>>> the file.
>>>
>>> Yes. Because they are doing incorrect (or at least undocumented and unreliable) thing.
>>
>> Here's some previous mails discussing the same question about what block_status should actually mean.  At the time, I was so scared of the prospect of something breaking if I changed things that I ended up keeping status quo, so here we are revisiting the topic several years later, still asking the same questions.
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg00069.html
>> https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03757.html
>>
>>>
>>>>
>>>> Basically, the way bdrv_is_allocated() works today is that we assume an
>>>> implicit zeroed backing layer even for block drivers that don't support
>>>> backing files.
>>>
>>> But read doesn't work so: it will read data from the bottom layer, not from
>>> this implicit zeroed backing layer. And it is inconsistent. On read data
>>> comes exactly from this layer, not from its implicit backing. So it should
>>> return BDRV_BLOCK_ALLOCATED, accordingly to its definition..
>>>
>>> Or, we should at least document current behavior:
>>>
>>>    BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>>    layer rather than any backing, set by block. Attention: it may not be set
>>>    for drivers without backing support, still data is of course read from
>>>    this layer. Note, that for such drivers BDRV_BLOCK_ALLOCATED may mean
>>>    allocation on fs level, which occupies real space on disk.. So, for such drivers
>>>
>>>    ZERO | ALLOCATED means that, read as zero, data may be allocated on fs, or
>>>    (most probably) not,
>>>    don't look at ALLOCATED flag, as it is added by generic layer for another logic,
>>>    not related to fs-allocation.
>>>
>>>    0 means that, most probably, data doesn't occupy space on fs, zero-status is
>>>    unknown (most probably non-zero)
>>>
>>
>> That may be right in describing the current situation, but again, needs a GOOD audit of what we are actually using it for, and whether it is what we really WANT to be using it for.  If we're going to audit/refactor the code, we might as well get semantics that are actually useful, rather than painfully contorted to documentation that happens to match our current contorted code.
>>
> 
> Honest enough:) I'll try to make a table.
> 
> I don't think that reporting fs-allocation status is a bad thing. But I'm sure that it should be separated from backing-chain-allocated concept.
> 

As a first step, I've don brief analysis of .bdrv_co_block_status of drivers (attached)

-- 
Best regards,
Vladimir

[-- Attachment #2: block-status-report --]
[-- Type: text/plain, Size: 7213 bytes --]

= Filter functions =
mirror,commit,backup filters: bdrv_co_block_status_from_backing
blklogwrites,compress,COR,throttle: bdrv_co_block_status_from_file

    return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID
    - semantics of BDRV_BLOCK_RAW is unclean, behavior is broken

blkdebug: blkdebug_co_block_status
    - actually, uses bdrv_co_block_status_from_file,
    after additional blkdebug-related things not influincing the result

= raw =
raw: raw_co_block_status
    return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID
    - semantics of BDRV_BLOCK_RAW is unclean, behavior is broken

Summary: we need to fix BDRV_BLOCK_RAW-recursion semantics to not interfere with block_status_above/is_allocated_above loops.

= Format drivers with supports_backing = true =

qed: bdrv_qed_co_block_status
    bdi->unallocated_blocks_are_zero = true;
    
    0 - go to backing
    ZERO - metadata-zero
    DATA | OFFSET_VALID with @map set and @file = file-child : metadata-allocated-data
     
parallels: parallels_co_block_status
    unallocated_blocks_are_zero is unset, but they are actually read as zero if no backing

    0 - go to backing
    DATA | OFFSET_VALID with @map set and @file = file-child : metadat-allocated-data

qcow2: qcow2_co_block_status
    bdi->unallocated_blocks_are_zero = true;

    ZERO | OFFSET_VALID with @map set and @file = something : metadata-allocated-zero
    DATA | OFFSET_VALID with @map set and @file = something : metadata-allocated-data
    RECURSE | DATA | OFFSET_VALID with @map set and @file = file-child : metadata-allocated-data (hint, that you'd better dig zeroes in *file)

    ZERO : metadata-zero

    DATA : compressed clusters. So, it's metadata-allocated-data, but no offset we can provide

    0: go to backing

qcow: qcow_co_block_status
    unallocated_blocks_are_zero is unset, but they are actually read as zero if no backing

    0: go to backing
    DATA: compressed
    DATA | OFFSET_VALID with @map set and @file = file-child : metadata-allocated-data

vmdk: vmdk_co_block_status
    unallocated_blocks_are_zero is unset, but they are actually read as zero if no backing

    0: go to backing
    ZERO: metadata-zero
    DATA with @file set to something: compressed
    DATA | OFFSET_VALID  with @map and @file set : metadata-allocated-data
    RECURSE | DATA | OFFSET_VALID  with @map and @file set : metadata-allocated-data (with hint)

Summary:
  So, obviously unallocated_blocks_are_zero is not needed, as all they behave so

  common semantics:
  just corresponds to specification:
  DATA | ZERO means backing-chain-allocated-at-this-level
  ZERO means reads as zero
  OFFSET_VALID means that it may be read from @file (which should be child of this) at @map offset
  additional RECURSE is hint to dig zeroes in @file of @map offset (if it is not set, digging is most probably useless)

  of course, nothing here about fs-allocation status or occupying space on disk

= Format drivers with supports_backing = false =

vdi: vdi_co_block_status
    bdi->unallocated_blocks_are_zero = true;

    0: metadata-unallocated.. 
       in this case vid_co_preadv zeroize qiov. So, it actually should be ZERO
    DATA | OFFSET_VALID with @map set and @file = file-child : metadata-allocated-data
    RECURSE | DATA | OFFSET_VALID with @map set and @file = file-child : metadata-allocated-data (hint, that you'd better dig zeroes in *file)

vpc: vpc_co_block_status
    bdi->unallocated_blocks_are_zero = true;
    
    0: unallocated?
       in this case vpc_co_preadv zeroize qiov. So, it actually should be ZERO
    DATA | OFFSET_VALID with @map set and @file = file-child : metadata-allocated-data
    RECURSE | DATA | OFFSET_VALID with @map set and @file = file-child : metadata-allocated-data (hint, that you'd better dig zeroes in *file)

Summary:
  Again, unallocated_blocks_are_zero is always true, so it's not actually needed
  Moreover, I think drivers should return ZERO by themselves
  So common semantics should be the same as for backing-supporting formats, except for they must always set at least one of DATA and ZERO flags

  of course, nothing here about fs-allocation status or occupying space on disk

== Protocol drivers, with no children ==

iscsi, iser: iscsi_co_block_status
    bdi->unallocated_blocks_are_zero = iscsilun->lbprz;
    but block-status never return 0, if iscsilun->lbprz

    always: OFFSET_VALID with @map set and @file = this

    also:
    DATA : normal-data
    0 - fs-unallocated, not zero
    ZERO - fs-unallocated, zero

nbd: nbd_client_co_block_status

    special case:

        ZERO: for offset > nbd EOF (which may be not sector-aligned)

        I think, no reason to not report OFFSET_VALID combination,
        as byte-accurate EOF is actually internal nbd driver information

    otherwise:

        always: OFFSET_VALID with @map=@offset and @file = this

        also:
        DATA: if feature unsupported, so unknown
        any combination of DATA and ZERO, mapping NBD protocol,
        !DATA means: most probably doesn't occupy space on fs
        ZERO means: reads as zero

null-co, null-aio: null_co_block_status

    Hmm, is it a protocol driver? :)

    always: OFFSET_VALID with @map set and @file = this

    also:
    ZERO - if s->read_zeroes (read will zeroize qiov)
    0 - otherwise (read does nothing)

gluster: qemu_gluster_co_block_status

    always: OFFSET_VALID with @map=@offset and @file=this

    also:
    DATA: if not want_zero, or unknown, or known-data - normal-data
    ZERO: fs-unallocated-zero

file-posix: raw_co_block_status
bdi->unallocated_blocks_are_zero = s->discard_zeroes;
but it never report unallocated :)

    always: OFFSET_VALID with @map=@offset and @file=this

    also:
    DATA: if not want_zero, or unknown, or known-data - normal-data
    ZERO: fs-unallocated-zero

sheepdog: sd_co_block_status

    Hmm. sheepdog is a dead project, yes? Should we deprecate it?

    0: fs-unallocated
    DATA | OFFSET_VALID with @map=@offset and @file=this

vvfat: vvfat_co_block_status

    always: DATA

Summary:
   First, unallocated_blocks_are_zero is obviously unused.
  
   Second, I think all should always return at least
   OFFSET_VALID with @map=@offset and @file=this, as
   it's exactly where from we read in terms of block-layer.
   And good to be consistent about it.

   Next, OK, seems, most of drivers wants to report fs-allocation/disk-occupying status,

   0 - most probably doesn't occupy space, not-zero
   ZERO - most probably doesn't occupy space, zero
   DATA - most probably occupy space, but also is a safe default
   DATA | ZERO - most probably occupy space, but for some reason known zeroes. Only nbd can theoretically do it.


= Testing =

test: bdrv_test_co_block_status

    always 0.  test_sync_op_block_status() tests different scenarios
    of bdrv_is_allocated. If we decide to change the behavior, not a
    problem to change test results and bdrv_test_co_block_status()

= Drivers without block_status =
A lot of them. Not a problem. Worth mentioning vhdx:
vhdx
    bdi->unallocated_blocks_are_zero =
        (s->params.data_bits & VHDX_PARAMS_HAS_PARENT) == 0;
    - it is not used, as DATA|ALLOCATED is set for drivers without block_status

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28 19:44           ` Vladimir Sementsov-Ogievskiy
@ 2020-04-29  9:15             ` Vladimir Sementsov-Ogievskiy
  2020-04-29 10:50               ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-29  9:15 UTC (permalink / raw)
  To: Eric Blake, Kevin Wolf, Max Reitz
  Cc: Andrey Shinkevich, qemu-devel, qemu block

[-- Attachment #1: Type: text/plain, Size: 3851 bytes --]

28.04.2020 22:44, Vladimir Sementsov-Ogievskiy wrote:
> 28.04.2020 19:46, Vladimir Sementsov-Ogievskiy wrote:
>> 28.04.2020 19:18, Eric Blake wrote:
>>> On 4/28/20 10:13 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>>>>> Hm.  I could imagine that there are formats that have non-zero holes
>>>>>> (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
>>>>>> ZERO or DATA then.
>>>>>>
>>>>>> But OTOH we don’t care about such cases, do we?  We need to know whether
>>>>>> ranges are zero, data, or unallocated.  If they aren’t zero, we only
>>>>>> care about whether reading from it will return data from this layer or not.
>>>>>>
>>>>>> So I suppose that anything that doesn’t support backing files (or
>>>>>> filtered children) should always return ZERO and/or DATA.
>>>>>
>>>>> I'm not sure I agree with the notion that everything should be
>>>>> BDRV_BLOCK_ALLOCATED at the lowest layer. It's not what it means today
>>>>> at least. If we want to change this, we will have to check all callers
>>>>> of bdrv_is_allocated() and friends who might use this to find holes in
>>>>> the file.
>>>>
>>>> Yes. Because they are doing incorrect (or at least undocumented and unreliable) thing.
>>>
>>> Here's some previous mails discussing the same question about what block_status should actually mean.  At the time, I was so scared of the prospect of something breaking if I changed things that I ended up keeping status quo, so here we are revisiting the topic several years later, still asking the same questions.
>>>
>>> https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg00069.html
>>> https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03757.html
>>>
>>>>
>>>>>
>>>>> Basically, the way bdrv_is_allocated() works today is that we assume an
>>>>> implicit zeroed backing layer even for block drivers that don't support
>>>>> backing files.
>>>>
>>>> But read doesn't work so: it will read data from the bottom layer, not from
>>>> this implicit zeroed backing layer. And it is inconsistent. On read data
>>>> comes exactly from this layer, not from its implicit backing. So it should
>>>> return BDRV_BLOCK_ALLOCATED, accordingly to its definition..
>>>>
>>>> Or, we should at least document current behavior:
>>>>
>>>>    BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>>>    layer rather than any backing, set by block. Attention: it may not be set
>>>>    for drivers without backing support, still data is of course read from
>>>>    this layer. Note, that for such drivers BDRV_BLOCK_ALLOCATED may mean
>>>>    allocation on fs level, which occupies real space on disk.. So, for such drivers
>>>>
>>>>    ZERO | ALLOCATED means that, read as zero, data may be allocated on fs, or
>>>>    (most probably) not,
>>>>    don't look at ALLOCATED flag, as it is added by generic layer for another logic,
>>>>    not related to fs-allocation.
>>>>
>>>>    0 means that, most probably, data doesn't occupy space on fs, zero-status is
>>>>    unknown (most probably non-zero)
>>>>
>>>
>>> That may be right in describing the current situation, but again, needs a GOOD audit of what we are actually using it for, and whether it is what we really WANT to be using it for.  If we're going to audit/refactor the code, we might as well get semantics that are actually useful, rather than painfully contorted to documentation that happens to match our current contorted code.
>>>
>>
>> Honest enough:) I'll try to make a table.
>>
>> I don't think that reporting fs-allocation status is a bad thing. But I'm sure that it should be separated from backing-chain-allocated concept.
>>
> 
> As a first step, I've don brief analysis of .bdrv_co_block_status of drivers (attached)
> 

As a second step, here is brief analysis of all block_status usage

-- 
Best regards,
Vladimir

[-- Attachment #2: block-status-usage-report --]
[-- Type: text/plain, Size: 4098 bytes --]

Public interface of block-status is:

    bdrv_block_status
    bdrv_block_status_above
    bdrv_is_allocated
    bdrv_is_allocated_above


= bdrv_block_status =

bdrv_make_zero: works on current level of backing-chain, want's to skip zeroes, not interested in @map and @file

img convert: convert_iteration_sectors: wants to distinguish ZERO, DATA and go-to-backing. It also tries to not write zeroes, if have short backing file, but does it a bit wrong. Treats unallocated as DATA if no backing.

img-map: get_block_status: distinguish ZERO, DATA and go-to-backing. Count depth of the backing. Just reports final ZERO and DATA. So, fs-unallocated thing is reported to user

= bdrv_block_status_above =

block-copy: block_copy_block_status: wants two things:
  
  1. skip go-to-backing holes in top layer for top mode
  2. do write_zero for ZERO areas

mirror: call on the whole backing chain
   - for DATA (and for DATA|ZERO which is bad) do just copy
   - for ZERO do just ZERO
   - for 0 (which means that bottom layer doesn't report that unallocated are zero) does DISCARD (which is most-probably zeroing) - absolutely wrong thing

qcow2: is_zero: call on the whole backing chain, want's just to check is reads-as-zero or not.

qcow2: qcow2_measure: call on the whole backing chain:
   - skip ZERO
   - count clusters with both DATA and ALLOCATED set. Hmm. ALLOCATED is always set for DATA. Seems the function actually tries to calculate disk occupation, assuming that BDRV_BLOCK_ALLOCATED helps in it, but it actually doesn't..

   I think, correct solution is to support offset and bytes in bdrv_measure, and split it from block_status. Then qcow2_measure will just recursively call bdrv_measure on its children. This would be clean.

nbd: nbd_co_send_sparse_read: call on the whole backing chain:
   - wants to distinguish zeroes

nbd: blockstatus_to_extents: call on the whole backing chain:
   !ALLOCATED -> NBD_HOLE
   ZERO -> NBD_ZERO

   So, we report HOLE only if it's not BDRV_BLOCK_ALLOCATED on any layer.. That's wrong. I think, we should report HOLE in a lot more cases. Actually, when not occupy real space on disk.

img-compare: call on the whole backing chain:
  - do not compare zeroes
  - do not compare if both report unallocated.. it's actually not correct for protocols which reports fs-unallocated-non-zeroes. As reads may differ actually. Still, read from fs-unallocated area is not guaranteed to return same thing each time, yes? At least, null-co doesn't guarantee it :).. So, it may be correct to skip these areas. Or may be better to always report them different??
  - consider data-zeroes equal to unallocated.. it's definitely not correct for protocols which reports fs-unallocated-non-zeroes

  I think, img-compare must only consider zero/non-zero, and don't touch other block-status features. Otherwise it's a mess

img-convert: convert_iteration_sectors: call on the whole backing chain: already described in bdrv_block_status section

= bdrv_is_allocated =

Obvious thing for backing-chain related operation (still wrong that some protocol drivers may return fs-unallocated and it is treated as go-to-backing areas):
    block-copy, commit, copy-on-read, stream, img-rebase

Others:
vvfat: o_O it has qcow child.. and operates like self is a backing of this child. But yes, it just uses bdrv_is_allocated to understand is chunk is rewritten in qcow.

migration/block: skip unallocated for top mode (shared_base, as it called here)

io-alloc: just report number of allocated in top layer

io-map: map_is_allocated: same thing as io-alloc, but report chunks

test_sync_op_block_status: just check what it returns

= bdrv_is_allocated_above =

Obvious usage for backing-chain related: commit, mirror, stream, img-rebase. Wrong for fs-unallocated-non-zero reporting drivers

Others:
qcow2: is_unallocated: call for the whole backing chain. Used to check is-zero.. Wrong for fs-unallocated-non-zero reporting drivers, and may be more efficient if consider also ZERO status.. but in some smart-fast way.

replication: allocated or not in backing-chain: common case

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-29  9:15             ` Vladimir Sementsov-Ogievskiy
@ 2020-04-29 10:50               ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-29 10:50 UTC (permalink / raw)
  To: Eric Blake, Kevin Wolf, Max Reitz
  Cc: Andrey Shinkevich, qemu-devel, qemu block

Full summary

What drivers returns?

Filters and raw are just broken together with their BDRV_BLOCK_RAW.


Format drivers behaves as follows (except for backing-not-supporting, which should be fixed):

0 - go to backing (not-backing supporting will never return it, backing-supporting will return it even if there is not backing file, but read is guaranteed to return zeroes, if there is no backing file at the moment)
ZERO - zero at this level
DATA - data at this level
DATA | ZERO actually never returned.

Protocol drivers

0 - fs-unallocated, read may return any garbage (null-co, iscsi)

from SCSI:
Logical Block Provisioning Read Zeros (LBPRZ) bit
1     If the logical block provisioning read zeros (LBPRZ) bit is set to one, then, for an unmapped LBA specified by a read operation, the deviceserver shall send user data with all bits set to zero to the data-in buffer.
0     If the TPRZ bit is set to zero, then, for an unmapped LBA specified by a read operation, the device server may send user data with all bitsset to any value to the data-in buffer.

so, yes, this 0 actually have significant meaning. And null-co matches it too. And nbd should be fixed to match it, I think.

ZERO - fs-unallocated, reads as zero
DATA - fs-allocated data or safe default
ZERO | DATA - only nbd may return it, and it seems wrong. let's fix it.

So, seems, we may can with it as is. The only thing to be documented is meaning of zero status returned by driver:

for backing-supporting it means go-to-backing, with a guarantee to read zeroes if there is no backing file, for backing-not-supporting it means most probably not occupy disk space, read returns garbage. And format drivers without backing support should never return 0.

===================

How it is used? And here we definitely fail. As there are a lot of places where 0 return of drivers is misused: we consider it fs-unallocated, but it may be just go-to-backing, we consider it go-to-backing, but it may be fs-unallocated.

let's go again through our public API

= bdrv_block_status =

img-convert and bdrv_make_zero wants only zero and go-to-backing information
img-map is better to distinguish fs-unallocated and go-to-backing and report them in different way.

= bdrv_block_status_above =

most callers needs only zero and go-to-backing information, others are doing wrong things and should be rewritten anyway

= bdrv_is_allocated =

most callers need only go-to-backing information.

io-alloc io-map - are reporting utilities, and they probably want to show fs-unallocated as well... but we never documented, how actually img-map, io-map and io-alloc should work and what they are trying to say :). Anyway, they may use same interface as img-map

= bdrv_is_allocated_above =

callers only need go-to-backing information

OK, sounds good. So, for most things we only need zero/data/go-to-backing information. fs-unallocated should be treated as DATA, it's garbage, but on read we will directly read this garbage, so we should consider it DATA.

And only for reporting through img-map, io-alloc and io-map we may want fs-unallocated information.
Do we actually need it?

Basing on this, I think that there should be four block-status types:

ZERO: unrelated to backing, reads as zero from this layer
DATA: unrelated to backing, reads from this layer, may be non-zero
BACKING: go to backing for information, reads from backing as well
RAW: I'm a filter or 'raw' driver, I don't know what to do. I give you my child and offset in it, take care of it. Be careful: it may be backing child or not, don't break your backing-chain loop on me!

So we may require that at most one of DATA, ZERO, RAW is set. And if nothing is set it means BACKING. And if we want to report fs-unallocated things, we just need additional flag for it, like BDRV_FS_UNALLOCATED_GARBAGE, which may be combined with DATA type chunks.

Or, may be even better, to split type from flags: block_status will return type, which is one of these four types, and other things (RECURSE, EOF, OFFSET_VALID, FS_UNALLOCATED_GARBAGE) goes to additional flags out-argument.

Note also, that the only user of @map and @file parameters is img-map. And all other callers have to pass NULL, NULL. I think, this definitely should be refactored.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28 14:51   ` Vladimir Sementsov-Ogievskiy
@ 2020-04-30 19:12     ` Vladimir Sementsov-Ogievskiy
  2020-05-01  3:04       ` Andrey Shinkevich
  2020-05-07 12:58     ` Max Reitz
  1 sibling, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-30 19:12 UTC (permalink / raw)
  To: Max Reitz, qemu block; +Cc: Kevin Wolf, Andrey Shinkevich, qemu-devel

28.04.2020 17:51, Vladimir Sementsov-Ogievskiy wrote:
> 28.04.2020 14:08, Max Reitz wrote:
>> On 28.04.20 10:55, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi!
>>>
>>> I wanted to resend my "[PATCH 0/4] fix & merge block_status_above and
>>> is_allocated_above", and returned to all the inconsistencies about
>>> block-status. I keep in mind Max's series about child-access functions,
>>> and Andrey's work about using COR filter in block-stream, which depends
>>> on Max's series (because, without them COR fitler with file child breaks
>>> backing chains).. And, it seems that it's better to discuss some
>>> questions before resending.
>>>
>>> First, problems about block-status:
>>>
>>> 1. We consider ALLOCATED = ZERO | DATA, and documented as follows:
>>>
>>>     * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
>>>     * BDRV_BLOCK_ZERO: offset reads as zero
>>>     * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing
>>> raw data
>>>     * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>>     *                       layer rather than any backing, set by block
>>> layer
>>>
>>> This actually means, that we should always have BDRV_BLOCK_ALLOCATED for
>>> formats which doesn't support backing. So, all such format drivers must
>>> return ZERO or DATA (or both?), yes?. Seems file-posix does so, but, for
>>> example, iscsi - doesn't.
>>
>> Hm.  I could imagine that there are formats that have non-zero holes
>> (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
>> ZERO or DATA then.
>>
>> But OTOH we don’t care about such cases, do we?  We need to know whether
>> ranges are zero, data, or unallocated.  If they aren’t zero, we only
>> care about whether reading from it will return data from this layer or not.
>>
>> So I suppose that anything that doesn’t support backing files (or
>> filtered children) should always return ZERO and/or DATA.
>>
>>> 2. ZERO. The meaning differs a bit for generic block_status and for
>>> drivers.. I think, we at least should document it like this:
>>>
>>> BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
>>> BDRV_BLOCK_ZERO: if driver return ZERO, than the region is allocated at
>>> this layer and read as ZERO. If generic block_status returns ZERO, it
>>> only mean that it reads as zero, but the region may be allocated on
>>> underlying level.
>>
>> Hm.  What does that mean?
>>
>> One of the problems is that “allocated” has two meanings:
>> (1) reading data returns data defined at this backing layer,
>> (2) actually allocated, i.e. takes up space on the file represented by
>> this BDS.
>>
>> As far as I understand, we actually don’t care about (2) in the context
>> of block_status, but just about (1).
>>
>> So if a layer returns ZERO, it is by definition (1)-allocated.  (It
>> isn’t necessarily (2)-allocated.)
>>
>>> 3. bdi.unallocated_blocks_are_zero
>>>
>>> I think it's very bad, that we have formats, that supports backing, but
>>> doesn't report bdi.unallocated_blocks_are_zero as true. It means that
>>> UNALLOCATED region reads as zero if we have short backing file, and not
>>> as zero if we remove this short backing file.
>>
>> What do you mean by “remove this short backing file”?  Because generally
>> one can’t just drop a backing file.
>>
>> So maybe a case like block-stream?  Wouldn’t that be a bug in
>> block-stream them, i.e. shouldn’t it stream zeros after the end of the
>> backing file?
>>
>>> I can live with it but
>>> this is weird logic. These bad drivers are qcow (not qcow2), parallels
>>> and vmdk. I hope, they actually just forget to set
>>> unallocated_blocks_are_zero to true.
>>
>> qcow definitely sounds like it.
>>
>>> Next. But what about drivers which doesn't support backing? As we
>>> considered above, they should always return ZERO or DATA, as everything
>>> is allocated in this backing-chain level (last level, of course).. So
>>> again unallocated_blocks_are_zero is meaningless. So, I think, that
>>> driver which doesn't support backings, should be fixed to return always
>>> ZERO or DATA, than we don't need this unallocated_blocks_are_zero at all.
>>
>> Agreed.
>>
>>> 3.
>>
>> The second 3.? :)
>>
>>> Short backing files in allocated_above: we must consider space after
>>> EOF as ALLOCATED, if short backing file is inside requested
>>> backing-chain part, as it produced exactly because of this short file
>>> (and we never go to backing).
>>
>> Sounds correct.
>>
>>> (current realization of allocated_above is
>>> buggy, see my outdated series "[PATCH 0/4] fix & merge
>>> block_status_above and is_allocated_above")
>>>
>>> 4. Long ago we've discussed problems about BDRV_BLOCK_RAW, when we have
>>> a backing chain of non-backing child.. I just remember that we didn't
>>> reach the consensus.
>>
>> Possible? :)
>>
>>> 5. Filters.. OK we have two functions for them:
>>> bdrv_co_block_status_from_file and bdrv_co_block_status_from_backing. I
>>> think both are wrong:
>>>
>>> bdrv_co_block_status_from_file leads to problem [4], when we can report
>>> UNALLOCATED, which refers not to the current backing chain, but to sub
>>> backing chain of file child, which is inconsistent with
>>> block_status_above and is_allocated_above iteration.
>>>
>>> bdrv_co_block_status_from_backing is also is not consistent with
>>> block_status_above iteration.. At least at leads to querying the same
>>> node twice.
>>
>> Well, yes.
>>
>>> So, about filters and backing chains. Keeping (OK, just, trying to keep)
>>> all these things in mind, I think that it's better to keep backing
>>> chains exactly *backing* chains, so that "backing" child is the only
>>> "not own" child of the node. So, its close to current behavior and we
>>> don't need child-access functions. Then how filters should work:
>>>
>>> Filter with backing child, should always return UNALLOCATED (i.e. no
>>> DATA, no ZERO), it is honest: everything is on the other level of
>>> backing chain.
>>
>> I disagree, because filters with or without backing children should work
>> exactly the same way and just not appear in the backing chain.
> 
> They work the same way in my variant too. The only difference is that if you use file-child-based filters, you can't do stream/commit around them. I just think, that binding the concept to the "backing" link of the node is simpler and more generic. In blockdev era, when all nodes will be named and libvirt will take care of all nodes including filters, we will not need any filter-skipping magic, libvirt will directly point to exact nodes it means. And we can deprecate "file" children of existing filters, to finally reach simple architecture with simple backing chain of nodes (any nodes). And after deprecating old filename-based interfaces, we'll drop all filter-skipping magic..

What do you think?

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-30 19:12     ` Vladimir Sementsov-Ogievskiy
@ 2020-05-01  3:04       ` Andrey Shinkevich
  2020-05-06  5:56         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 16+ messages in thread
From: Andrey Shinkevich @ 2020-05-01  3:04 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Max Reitz, qemu block
  Cc: Kevin Wolf, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1378 bytes --]

Sounds good to me generally.
Also, we need to identify the filter by its node name when the file names of a node and of the filter above it are the same. And what about automatically generated node name for the filter? We will want to pass it to the stream routine.

Andrey

________________________________
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Sent: Thursday, April 30, 2020 10:12 PM
To: Max Reitz <mreitz@redhat.com>; qemu block <qemu-block@nongnu.org>
Cc: Kevin Wolf <kwolf@redhat.com>; qemu-devel <qemu-devel@nongnu.org>; Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Subject: Re: backing chain & block status & filters


>
> .... The only difference is that if you use file-child-based filters, you can't do stream/commit around them. I just think, that binding the concept to the "backing" link of the node is simpler and more generic. In blockdev era, when all nodes will be named and libvirt will take care of all nodes including filters, we will not need any filter-skipping magic, libvirt will directly point to exact nodes it means. And we can deprecate "file" children of existing filters, to finally reach simple architecture with simple backing chain of nodes (any nodes). And after deprecating old filename-based interfaces, we'll drop all filter-skipping magic..

What do you think?

--
Best regards,
Vladimir

[-- Attachment #2: Type: text/html, Size: 2506 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-05-01  3:04       ` Andrey Shinkevich
@ 2020-05-06  5:56         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-06  5:56 UTC (permalink / raw)
  To: Andrey Shinkevich, Max Reitz, qemu block; +Cc: Kevin Wolf, qemu-devel

01.05.2020 6:04, Andrey Shinkevich wrote:
> Sounds good to me generally.
> Also, we need to identify the filter by its node name when the file names of a node and of the filter above it are the same. And what about automatically generated node name for the filter? We will want to pass it to the stream routine.
> 

As I understand, there should not be auto-generated node-names in blockdev era. All nodes are named by libvirt with full-control on them.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-04-28 14:51   ` Vladimir Sementsov-Ogievskiy
  2020-04-30 19:12     ` Vladimir Sementsov-Ogievskiy
@ 2020-05-07 12:58     ` Max Reitz
  2020-05-07 19:34       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 16+ messages in thread
From: Max Reitz @ 2020-05-07 12:58 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu block
  Cc: Kevin Wolf, Andrey Shinkevich, qemu-devel


[-- Attachment #1.1: Type: text/plain, Size: 8317 bytes --]

On 28.04.20 16:51, Vladimir Sementsov-Ogievskiy wrote:
> 28.04.2020 14:08, Max Reitz wrote:
>> On 28.04.20 10:55, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi!
>>>
>>> I wanted to resend my "[PATCH 0/4] fix & merge block_status_above and
>>> is_allocated_above", and returned to all the inconsistencies about
>>> block-status. I keep in mind Max's series about child-access functions,
>>> and Andrey's work about using COR filter in block-stream, which depends
>>> on Max's series (because, without them COR fitler with file child breaks
>>> backing chains).. And, it seems that it's better to discuss some
>>> questions before resending.
>>>
>>> First, problems about block-status:
>>>
>>> 1. We consider ALLOCATED = ZERO | DATA, and documented as follows:
>>>
>>>     * BDRV_BLOCK_DATA: allocation for data at offset is tied to this
>>> layer
>>>     * BDRV_BLOCK_ZERO: offset reads as zero
>>>     * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing
>>> raw data
>>>     * BDRV_BLOCK_ALLOCATED: the content of the block is determined by
>>> this
>>>     *                       layer rather than any backing, set by block
>>> layer
>>>
>>> This actually means, that we should always have BDRV_BLOCK_ALLOCATED for
>>> formats which doesn't support backing. So, all such format drivers must
>>> return ZERO or DATA (or both?), yes?. Seems file-posix does so, but, for
>>> example, iscsi - doesn't.
>>
>> Hm.  I could imagine that there are formats that have non-zero holes
>> (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
>> ZERO or DATA then.
>>
>> But OTOH we don’t care about such cases, do we?  We need to know whether
>> ranges are zero, data, or unallocated.  If they aren’t zero, we only
>> care about whether reading from it will return data from this layer or
>> not.
>>
>> So I suppose that anything that doesn’t support backing files (or
>> filtered children) should always return ZERO and/or DATA.
>>
>>> 2. ZERO. The meaning differs a bit for generic block_status and for
>>> drivers.. I think, we at least should document it like this:
>>>
>>> BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
>>> BDRV_BLOCK_ZERO: if driver return ZERO, than the region is allocated at
>>> this layer and read as ZERO. If generic block_status returns ZERO, it
>>> only mean that it reads as zero, but the region may be allocated on
>>> underlying level.
>>
>> Hm.  What does that mean?
>>
>> One of the problems is that “allocated” has two meanings:
>> (1) reading data returns data defined at this backing layer,
>> (2) actually allocated, i.e. takes up space on the file represented by
>> this BDS.
>>
>> As far as I understand, we actually don’t care about (2) in the context
>> of block_status, but just about (1).
>>
>> So if a layer returns ZERO, it is by definition (1)-allocated.  (It
>> isn’t necessarily (2)-allocated.)
>>
>>> 3. bdi.unallocated_blocks_are_zero
>>>
>>> I think it's very bad, that we have formats, that supports backing, but
>>> doesn't report bdi.unallocated_blocks_are_zero as true. It means that
>>> UNALLOCATED region reads as zero if we have short backing file, and not
>>> as zero if we remove this short backing file.
>>
>> What do you mean by “remove this short backing file”?  Because generally
>> one can’t just drop a backing file.
>>
>> So maybe a case like block-stream?  Wouldn’t that be a bug in
>> block-stream them, i.e. shouldn’t it stream zeros after the end of the
>> backing file?
>>
>>> I can live with it but
>>> this is weird logic. These bad drivers are qcow (not qcow2), parallels
>>> and vmdk. I hope, they actually just forget to set
>>> unallocated_blocks_are_zero to true.
>>
>> qcow definitely sounds like it.
>>
>>> Next. But what about drivers which doesn't support backing? As we
>>> considered above, they should always return ZERO or DATA, as everything
>>> is allocated in this backing-chain level (last level, of course).. So
>>> again unallocated_blocks_are_zero is meaningless. So, I think, that
>>> driver which doesn't support backings, should be fixed to return always
>>> ZERO or DATA, than we don't need this unallocated_blocks_are_zero at
>>> all.
>>
>> Agreed.
>>
>>> 3.
>>
>> The second 3.? :)
>>
>>> Short backing files in allocated_above: we must consider space after
>>> EOF as ALLOCATED, if short backing file is inside requested
>>> backing-chain part, as it produced exactly because of this short file
>>> (and we never go to backing).
>>
>> Sounds correct.
>>
>>> (current realization of allocated_above is
>>> buggy, see my outdated series "[PATCH 0/4] fix & merge
>>> block_status_above and is_allocated_above")
>>>
>>> 4. Long ago we've discussed problems about BDRV_BLOCK_RAW, when we have
>>> a backing chain of non-backing child.. I just remember that we didn't
>>> reach the consensus.
>>
>> Possible? :)
>>
>>> 5. Filters.. OK we have two functions for them:
>>> bdrv_co_block_status_from_file and bdrv_co_block_status_from_backing. I
>>> think both are wrong:
>>>
>>> bdrv_co_block_status_from_file leads to problem [4], when we can report
>>> UNALLOCATED, which refers not to the current backing chain, but to sub
>>> backing chain of file child, which is inconsistent with
>>> block_status_above and is_allocated_above iteration.
>>>
>>> bdrv_co_block_status_from_backing is also is not consistent with
>>> block_status_above iteration.. At least at leads to querying the same
>>> node twice.
>>
>> Well, yes.
>>
>>> So, about filters and backing chains. Keeping (OK, just, trying to keep)
>>> all these things in mind, I think that it's better to keep backing
>>> chains exactly *backing* chains, so that "backing" child is the only
>>> "not own" child of the node. So, its close to current behavior and we
>>> don't need child-access functions. Then how filters should work:
>>>
>>> Filter with backing child, should always return UNALLOCATED (i.e. no
>>> DATA, no ZERO), it is honest: everything is on the other level of
>>> backing chain.
>>
>> I disagree, because filters with or without backing children should work
>> exactly the same way and just not appear in the backing chain.
> 
> They work the same way in my variant too. The only difference is that if
> you use file-child-based filters, you can't do stream/commit around
> them.

Which is a bug that’s been know for a long time, and the primary cause
for the “Deal with filters” series.

> I just think, that binding the concept to the "backing" link of
> the node is simpler and more generic. In blockdev era, when all nodes
> will be named and libvirt will take care of all nodes including filters,
> we will not need any filter-skipping magic, libvirt will directly point
> to exact nodes it means. And we can deprecate "file" children of
> existing filters,

One thing to note is that all user-creatable filter drivers currently
use “file”, so this would change them all.

The idea of just using backing for filters just always seems to me like
trying to take the easy way out.  It seems to me like many things around
filters are broken because we weren’t careful enough when introducing
them (mostly a combination of “let’s see whether it just works” and “it
seems to mostly work”), and the “Deal with filters” series attempts to
rectify that.  As evidenced by the reviews that required even more
preliminary work (like the still on-list and under-discussion
BdrvChildRole series), there is really a ton of design that should be
improved.

That makes me very hesitant to just switch filters over to backing,
because I feel that might alleviate some of the most pressing symptoms,
while not addressing the underlying issues.  And without symptoms that
hurt, nobody might want to fix the issues.  Basically, it feels like
more of the same “Let’s try whether it mostly works” and “it seems to
mostly work”.

(Also, my naïve feeling is that if just treating backing and file the
same for filter drivers, the deal with filters series would be ten
patches long and v1 would have been uncontroversial.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: backing chain & block status & filters
  2020-05-07 12:58     ` Max Reitz
@ 2020-05-07 19:34       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-07 19:34 UTC (permalink / raw)
  To: Max Reitz, qemu block; +Cc: Kevin Wolf, Andrey Shinkevich, qemu-devel

07.05.2020 15:58, Max Reitz wrote:
> On 28.04.20 16:51, Vladimir Sementsov-Ogievskiy wrote:
>> 28.04.2020 14:08, Max Reitz wrote:
>>> On 28.04.20 10:55, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi!
>>>>
>>>> I wanted to resend my "[PATCH 0/4] fix & merge block_status_above and
>>>> is_allocated_above", and returned to all the inconsistencies about
>>>> block-status. I keep in mind Max's series about child-access functions,
>>>> and Andrey's work about using COR filter in block-stream, which depends
>>>> on Max's series (because, without them COR fitler with file child breaks
>>>> backing chains).. And, it seems that it's better to discuss some
>>>> questions before resending.
>>>>
>>>> First, problems about block-status:
>>>>
>>>> 1. We consider ALLOCATED = ZERO | DATA, and documented as follows:
>>>>
>>>>      * BDRV_BLOCK_DATA: allocation for data at offset is tied to this
>>>> layer
>>>>      * BDRV_BLOCK_ZERO: offset reads as zero
>>>>      * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing
>>>> raw data
>>>>      * BDRV_BLOCK_ALLOCATED: the content of the block is determined by
>>>> this
>>>>      *                       layer rather than any backing, set by block
>>>> layer
>>>>
>>>> This actually means, that we should always have BDRV_BLOCK_ALLOCATED for
>>>> formats which doesn't support backing. So, all such format drivers must
>>>> return ZERO or DATA (or both?), yes?. Seems file-posix does so, but, for
>>>> example, iscsi - doesn't.
>>>
>>> Hm.  I could imagine that there are formats that have non-zero holes
>>> (e.g. 0xff or just garbage).  It would be a bit wrong for them to return
>>> ZERO or DATA then.
>>>
>>> But OTOH we don’t care about such cases, do we?  We need to know whether
>>> ranges are zero, data, or unallocated.  If they aren’t zero, we only
>>> care about whether reading from it will return data from this layer or
>>> not.
>>>
>>> So I suppose that anything that doesn’t support backing files (or
>>> filtered children) should always return ZERO and/or DATA.
>>>
>>>> 2. ZERO. The meaning differs a bit for generic block_status and for
>>>> drivers.. I think, we at least should document it like this:
>>>>
>>>> BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
>>>> BDRV_BLOCK_ZERO: if driver return ZERO, than the region is allocated at
>>>> this layer and read as ZERO. If generic block_status returns ZERO, it
>>>> only mean that it reads as zero, but the region may be allocated on
>>>> underlying level.
>>>
>>> Hm.  What does that mean?
>>>
>>> One of the problems is that “allocated” has two meanings:
>>> (1) reading data returns data defined at this backing layer,
>>> (2) actually allocated, i.e. takes up space on the file represented by
>>> this BDS.
>>>
>>> As far as I understand, we actually don’t care about (2) in the context
>>> of block_status, but just about (1).
>>>
>>> So if a layer returns ZERO, it is by definition (1)-allocated.  (It
>>> isn’t necessarily (2)-allocated.)
>>>
>>>> 3. bdi.unallocated_blocks_are_zero
>>>>
>>>> I think it's very bad, that we have formats, that supports backing, but
>>>> doesn't report bdi.unallocated_blocks_are_zero as true. It means that
>>>> UNALLOCATED region reads as zero if we have short backing file, and not
>>>> as zero if we remove this short backing file.
>>>
>>> What do you mean by “remove this short backing file”?  Because generally
>>> one can’t just drop a backing file.
>>>
>>> So maybe a case like block-stream?  Wouldn’t that be a bug in
>>> block-stream them, i.e. shouldn’t it stream zeros after the end of the
>>> backing file?
>>>
>>>> I can live with it but
>>>> this is weird logic. These bad drivers are qcow (not qcow2), parallels
>>>> and vmdk. I hope, they actually just forget to set
>>>> unallocated_blocks_are_zero to true.
>>>
>>> qcow definitely sounds like it.
>>>
>>>> Next. But what about drivers which doesn't support backing? As we
>>>> considered above, they should always return ZERO or DATA, as everything
>>>> is allocated in this backing-chain level (last level, of course).. So
>>>> again unallocated_blocks_are_zero is meaningless. So, I think, that
>>>> driver which doesn't support backings, should be fixed to return always
>>>> ZERO or DATA, than we don't need this unallocated_blocks_are_zero at
>>>> all.
>>>
>>> Agreed.
>>>
>>>> 3.
>>>
>>> The second 3.? :)
>>>
>>>> Short backing files in allocated_above: we must consider space after
>>>> EOF as ALLOCATED, if short backing file is inside requested
>>>> backing-chain part, as it produced exactly because of this short file
>>>> (and we never go to backing).
>>>
>>> Sounds correct.
>>>
>>>> (current realization of allocated_above is
>>>> buggy, see my outdated series "[PATCH 0/4] fix & merge
>>>> block_status_above and is_allocated_above")
>>>>
>>>> 4. Long ago we've discussed problems about BDRV_BLOCK_RAW, when we have
>>>> a backing chain of non-backing child.. I just remember that we didn't
>>>> reach the consensus.
>>>
>>> Possible? :)
>>>
>>>> 5. Filters.. OK we have two functions for them:
>>>> bdrv_co_block_status_from_file and bdrv_co_block_status_from_backing. I
>>>> think both are wrong:
>>>>
>>>> bdrv_co_block_status_from_file leads to problem [4], when we can report
>>>> UNALLOCATED, which refers not to the current backing chain, but to sub
>>>> backing chain of file child, which is inconsistent with
>>>> block_status_above and is_allocated_above iteration.
>>>>
>>>> bdrv_co_block_status_from_backing is also is not consistent with
>>>> block_status_above iteration.. At least at leads to querying the same
>>>> node twice.
>>>
>>> Well, yes.
>>>
>>>> So, about filters and backing chains. Keeping (OK, just, trying to keep)
>>>> all these things in mind, I think that it's better to keep backing
>>>> chains exactly *backing* chains, so that "backing" child is the only
>>>> "not own" child of the node. So, its close to current behavior and we
>>>> don't need child-access functions. Then how filters should work:
>>>>
>>>> Filter with backing child, should always return UNALLOCATED (i.e. no
>>>> DATA, no ZERO), it is honest: everything is on the other level of
>>>> backing chain.
>>>
>>> I disagree, because filters with or without backing children should work
>>> exactly the same way and just not appear in the backing chain.
>>
>> They work the same way in my variant too. The only difference is that if
>> you use file-child-based filters, you can't do stream/commit around
>> them.
> 
> Which is a bug that’s been know for a long time, and the primary cause
> for the “Deal with filters” series.
> 
>> I just think, that binding the concept to the "backing" link of
>> the node is simpler and more generic. In blockdev era, when all nodes
>> will be named and libvirt will take care of all nodes including filters,
>> we will not need any filter-skipping magic, libvirt will directly point
>> to exact nodes it means. And we can deprecate "file" children of
>> existing filters,
> 
> One thing to note is that all user-creatable filter drivers currently
> use “file”, so this would change them all.
> 
> The idea of just using backing for filters just always seems to me like
> trying to take the easy way out.  It seems to me like many things around
> filters are broken because we weren’t careful enough when introducing
> them (mostly a combination of “let’s see whether it just works” and “it
> seems to mostly work”), and the “Deal with filters” series attempts to
> rectify that.  As evidenced by the reviews that required even more
> preliminary work (like the still on-list and under-discussion
> BdrvChildRole series), there is really a ton of design that should be
> improved.
> 
> That makes me very hesitant to just switch filters over to backing,
> because I feel that might alleviate some of the most pressing symptoms,
> while not addressing the underlying issues.  And without symptoms that
> hurt, nobody might want to fix the issues.  Basically, it feels like
> more of the same “Let’s try whether it mostly works” and “it seems to
> mostly work”.
> 
> (Also, my naïve feeling is that if just treating backing and file the
> same for filter drivers, the deal with filters series would be ten
> patches long and v1 would have been uncontroversial.)
> 

Hmm. OK, I don't know. Probably it's not a big difference: use CAF vs
move filters to backing child. And anyway we may move them later.
Probably it's simpler to fix all the issues, if move the filters to
backing child first, but we already have you series and rebasing it
on another concept is an extra work, not worth doing.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-05-07 19:36 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-28  8:55 backing chain & block status & filters Vladimir Sementsov-Ogievskiy
2020-04-28 11:08 ` Max Reitz
2020-04-28 11:28   ` Kevin Wolf
2020-04-28 15:13     ` Vladimir Sementsov-Ogievskiy
2020-04-28 16:18       ` Eric Blake
2020-04-28 16:46         ` Vladimir Sementsov-Ogievskiy
2020-04-28 18:37           ` Kevin Wolf
2020-04-28 19:44           ` Vladimir Sementsov-Ogievskiy
2020-04-29  9:15             ` Vladimir Sementsov-Ogievskiy
2020-04-29 10:50               ` Vladimir Sementsov-Ogievskiy
2020-04-28 14:51   ` Vladimir Sementsov-Ogievskiy
2020-04-30 19:12     ` Vladimir Sementsov-Ogievskiy
2020-05-01  3:04       ` Andrey Shinkevich
2020-05-06  5:56         ` Vladimir Sementsov-Ogievskiy
2020-05-07 12:58     ` Max Reitz
2020-05-07 19:34       ` Vladimir Sementsov-Ogievskiy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).