On 23.05.19 17:08, Vladimir Sementsov-Ogievskiy wrote: > 23.05.2019 17:49, Max Reitz wrote: >> On 17.05.19 13:50, Vladimir Sementsov-Ogievskiy wrote: >>> 07.05.2019 18:13, Max Reitz wrote: >>>> On 07.05.19 15:30, Vladimir Sementsov-Ogievskiy wrote: >>>>> 10.04.2019 23:20, Max Reitz wrote: >>>>>> What bs->file and bs->backing mean depends on the node. For filter >>>>>> nodes, both signify a node that will eventually receive all R/W >>>>>> accesses. For format nodes, bs->file contains metadata and data, and >>>>>> bs->backing will not receive writes -- instead, writes are COWed to >>>>>> bs->file. Usually. >>>>>> >>>>>> In any case, it is not trivial to guess what a child means exactly with >>>>>> our currently limited form of expression. It is better to introduce >>>>>> some functions that actually guarantee a meaning: >>>>>> >>>>>> - bdrv_filtered_cow_child() will return the child that receives requests >>>>>> filtered through COW. That is, reads may or may not be forwarded >>>>>> (depending on the overlay's allocation status), but writes never go to >>>>>> this child. >>>>>> >>>>>> - bdrv_filtered_rw_child() will return the child that receives requests >>>>>> filtered through some very plain process. Reads and writes issued to >>>>>> the parent will go to the child as well (although timing, etc. may be >>>>>> modified). >>>>>> >>>>>> - All drivers but quorum (but quorum is pretty opaque to the general >>>>>> block layer anyway) always only have one of these children: All read >>>>>> requests must be served from the filtered_rw_child (if it exists), so >>>>>> if there was a filtered_cow_child in addition, it would not receive >>>>>> any requests at all. >>>>>> (The closest here is mirror, where all requests are passed on to the >>>>>> source, but with write-blocking, write requests are "COWed" to the >>>>>> target. But that just means that the target is a special child that >>>>>> cannot be introspected by the generic block layer functions, and that >>>>>> source is a filtered_rw_child.) >>>>>> Therefore, we can also add bdrv_filtered_child() which returns that >>>>>> one child (or NULL, if there is no filtered child). >>>>>> >>>>>> Also, many places in the current block layer should be skipping filters >>>>>> (all filters or just the ones added implicitly, it depends) when going >>>>>> through a block node chain. They do not do that currently, but this >>>>>> patch makes them. >>>>>> >>>>>> One example for this is qemu-img map, which should skip filters and only >>>>>> look at the COW elements in the graph. The change to iotest 204's >>>>>> reference output shows how using blkdebug on top of a COW node used to >>>>>> make qemu-img map disregard the rest of the backing chain, but with this >>>>>> patch, the allocation in the base image is reported correctly. >>>>>> >>>>>> Furthermore, a note should be made that sometimes we do want to access >>>>>> bs->backing directly. This is whenever the operation in question is not >>>>>> about accessing the COW child, but the "backing" child, be it COW or >>>>>> not. This is the case in functions such as bdrv_open_backing_file() or >>>>>> whenever we have to deal with the special behavior of @backing as a >>>>>> blockdev option, which is that it does not default to null like all >>>>>> other child references do. >>>>>> >>>>>> Finally, the query functions (query-block and query-named-block-nodes) >>>>>> are modified to return any filtered child under "backing", not just >>>>>> bs->backing or COW children. This is so that filters do not interrupt >>>>>> the reported backing chain. This changes the output of iotest 184, as >>>>>> the throttled node now appears as a backing child. >>>>>> >>>>>> Signed-off-by: Max Reitz >>>>>> --- >>>>>> qapi/block-core.json | 4 + >>>>>> include/block/block.h | 1 + >>>>>> include/block/block_int.h | 40 +++++-- >>>>>> block.c | 210 +++++++++++++++++++++++++++------ >>>>>> block/backup.c | 8 +- >>>>>> block/block-backend.c | 16 ++- >>>>>> block/commit.c | 33 +++--- >>>>>> block/io.c | 45 ++++--- >>>>>> block/mirror.c | 21 ++-- >>>>>> block/qapi.c | 30 +++-- >>>>>> block/stream.c | 13 +- >>>>>> blockdev.c | 88 +++++++++++--- >>>>>> migration/block-dirty-bitmap.c | 4 +- >>>>>> nbd/server.c | 6 +- >>>>>> qemu-img.c | 29 ++--- >>>>>> tests/qemu-iotests/184.out | 7 +- >>>>>> tests/qemu-iotests/204.out | 1 + >>>>>> 17 files changed, 411 insertions(+), 145 deletions(-) >>>>>> >>>>>> diff --git a/qapi/block-core.json b/qapi/block-core.json >>>>>> index 7ccbfff9d0..dbd9286e4a 100644 >>>>>> --- a/qapi/block-core.json >>>>>> +++ b/qapi/block-core.json >>>>>> @@ -2502,6 +2502,10 @@ >>>>>> # On successful completion the image file is updated to drop the backing file >>>>>> # and the BLOCK_JOB_COMPLETED event is emitted. >>>>>> # >>>>>> +# In case @device is a filter node, block-stream modifies the first non-filter >>>>>> +# overlay node below it to point to base's backing node (or NULL if @base was >>>>>> +# not specified) instead of modifying @device itself. >>>>>> +# >>>>> >>>>> Is it necessary, why we can't keep it as is, modifying exactly device node? May be, >>>>> user wants to use filter in stream process, throttling for example. >>>> >>>> That wouldn't make any sense. Say you have this configuration: >>>> >>>> throttle -> top -> base >>>> >>>> Now you stream from base to throttle. The data goes from base through >>>> throttle to top. You propose to then make throttle point to base: >>>> >>>> throttle -> base >>>> >>>> This will discard all the data in top. >>>> >>>> Filters don’t store any data. You need to keep the top data storing >>>> image, i.e. the first non-filter overlay. >>> >>> Ah, yes, good reason. >>> >>>> >>>>>> # @job-id: identifier for the newly-created block job. If >>>>>> # omitted, the device name will be used. (Since 2.7) >>>>>> # >>>> >>>> [...] >>>> >>>>>> @@ -2345,7 +2347,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd, >>>>>> bool update_inherits_from = bdrv_chain_contains(bs, backing_hd) && >>>>>> bdrv_inherits_from_recursive(backing_hd, bs); >>>>>> >>>>>> - if (bdrv_is_backing_chain_frozen(bs, backing_bs(bs), errp)) { >>>>>> + if (bdrv_is_backing_chain_frozen(bs, child_bs(bs->backing), errp)) { >>>>> >>>>> If we support file-filters for frozen backing chain, could it go through file child here? >>>>> Hmm, only in case when we are going to set backing hd for file-filter.. Hmm, could filter have >>>>> both file and backing children? >>>> >>>> No. A filter passes through data from its children, so it can only have >>>> a single child, or it is quorum. >>>> >>>> The file/backing combination is reserved for COW overlays. file is >>>> where the current layer’s data is, backing is the filtered child. >>> >>> My backup-top has two children - backing and target.. So, I think, we can state that >>> filter should not have both file and backing children, but may have any other special >>> children he wants, invisible for backing-child/file-child generic logic. >> >> Ah, yes, sorry, that’s what I meant. A filter can have only a single >> filtered child, but other than that, they’re free to have whatever. >> >> [...] >> >>>>> Here we don't want to check the chain, we exactly want to check backing link, so it should be >>>>> something like >>>>> >>>>> if (bs->backing && bs->backing->frozen) { >>>>> error_setg("backig exists and frozen!"); >>>>> return; >>>>> } >>>>> >>>>> >>>>> Hmm, on the other hand, if we have frozen backing chain, going through file child, we must not add >>>>> backing child to the node with file child, as it will change backing chain (which by default goes >>>>> through backing).. >>>>> >>>>> Anyway, we don't need to check the whole backing chain, as we may find other frozen backing subchain, >>>>> far away of bs.. So, we possibly want to check >>>>> >>>>> if (bdrv_filtered_child(bs) && bdrv_filtered_child(bs)->frozed) { >>>>> ERROR >>>>> } >>>>> >>>>> >>>>> .... >>>>> >>>>> also, we'll need to check for frozen file child, when we want to replace it. >>>> >>>> I don’t quite understand. It sounds to me like you’re saying we don’t >>>> need to check the whole chain here but just the immediate child. But >>>> isn’t that true regardless of this series? >>> >>> If we restrict adding backing child to filter with file child, all becomes simpler and seems to be correct. >> >> OK. :-) >> >>> Should we add check for frozen file child to bdrv_replace_child() ? >> >> Argh. You mean move it from bdrv_set_backing_hd()? That actually makes >> a lot of sense to me. The problem is that bdrv_replace_child() >> currently cannot return an error, which may be a problem for >> bdrv_detach_child(). Hm. But that’s effectively only called from >> functions where the child is unref’d, and you have to know that your own >> child is not frozen before you unref it. So I guess we should be good >> to pass an &error_abort there. >> >> [...] >> >>>>>> @@ -2208,7 +2218,7 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs, >>>>>> bool first = true; >>>>>> >>>>>> assert(bs != base); >>>>>> - for (p = bs; p != base; p = backing_bs(p)) { >>>>>> + for (p = bs; p != base; p = bdrv_filtered_bs(p)) { >>>>>> ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map, >>>>>> file); >>>>> >>>>> Interesting that for filters who use bdrv_co_block_status_from_backing and >>>>> bdrv_co_block_status_from_file we will finally call .bdrv_co_block_status of >>>>> underalying real node two or more times.. It's not wrong but obviously not optimal. >>>> >>>> Hm. If @p is a filter, we could skip straight to *file. Would that work? >>> >>> No, as file may be not in backing chain: >>> >>> filter [A] >>> | >>> v >>> qcow2 -> file [B] >>> | >>> v >>> qcow2 >>> >>> So, we shouldn't redirect the whole loop to file.. >> >> But qcow2 is not a filter. I meant skipping to *file only if the >> current node is a filter. And I don’t mean bs->file, I mean *file -- >> like, what bdrv_co_block_status() returns. > > Me too. But as I understand, if we call bdrv_block_status on filter [A], > resulting *file returned by bdrv_co_block_status() will point to file [B] > due to recursion in bdrv_co_block_status. Crap, you’re right. Hm. I guess I’ll just ignore this performance problem for now, then. The best thing I can think of would be to turn @want_zero into a flag mask and then add a flag for “Please do not recurse through filters, just give me BDRV_BLOCK_RAW”. >> You say in your other mail that filters can have an own implementation >> of .bdrv_co_block_status(), but I don’t think that makes sense, >> actually. They should always pass the status of their filtered child. >> >> blkdebug is the only filter I know that has an own implementation, and >> the only thing besides passing the data through is add an alignment >> assertion. If it simplifies everything else, I’m very much willing to >> break that. > > Agree that assertion is a bad reason to not implement some clean generic > logic. OK. Max