All of lore.kernel.org
 help / color / mirror / Atom feed
From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	"qemu-block@nongnu.org" <qemu-block@nongnu.org>
Cc: Kevin Wolf <kwolf@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions
Date: Tue, 7 May 2019 17:13:21 +0200	[thread overview]
Message-ID: <5794529f-9451-4dd9-c509-07df5cefdead@redhat.com> (raw)
In-Reply-To: <344eec5c-8908-7b32-5d5f-61911253a621@virtuozzo.com>

[-- Attachment #1: Type: text/plain, Size: 16148 bytes --]

On 07.05.19 15:30, Vladimir Sementsov-Ogievskiy wrote:
> 10.04.2019 23:20, Max Reitz wrote:
>> What bs->file and bs->backing mean depends on the node.  For filter
>> nodes, both signify a node that will eventually receive all R/W
>> accesses.  For format nodes, bs->file contains metadata and data, and
>> bs->backing will not receive writes -- instead, writes are COWed to
>> bs->file.  Usually.
>>
>> In any case, it is not trivial to guess what a child means exactly with
>> our currently limited form of expression.  It is better to introduce
>> some functions that actually guarantee a meaning:
>>
>> - bdrv_filtered_cow_child() will return the child that receives requests
>>    filtered through COW.  That is, reads may or may not be forwarded
>>    (depending on the overlay's allocation status), but writes never go to
>>    this child.
>>
>> - bdrv_filtered_rw_child() will return the child that receives requests
>>    filtered through some very plain process.  Reads and writes issued to
>>    the parent will go to the child as well (although timing, etc. may be
>>    modified).
>>
>> - All drivers but quorum (but quorum is pretty opaque to the general
>>    block layer anyway) always only have one of these children: All read
>>    requests must be served from the filtered_rw_child (if it exists), so
>>    if there was a filtered_cow_child in addition, it would not receive
>>    any requests at all.
>>    (The closest here is mirror, where all requests are passed on to the
>>    source, but with write-blocking, write requests are "COWed" to the
>>    target.  But that just means that the target is a special child that
>>    cannot be introspected by the generic block layer functions, and that
>>    source is a filtered_rw_child.)
>>    Therefore, we can also add bdrv_filtered_child() which returns that
>>    one child (or NULL, if there is no filtered child).
>>
>> Also, many places in the current block layer should be skipping filters
>> (all filters or just the ones added implicitly, it depends) when going
>> through a block node chain.  They do not do that currently, but this
>> patch makes them.
>>
>> One example for this is qemu-img map, which should skip filters and only
>> look at the COW elements in the graph.  The change to iotest 204's
>> reference output shows how using blkdebug on top of a COW node used to
>> make qemu-img map disregard the rest of the backing chain, but with this
>> patch, the allocation in the base image is reported correctly.
>>
>> Furthermore, a note should be made that sometimes we do want to access
>> bs->backing directly.  This is whenever the operation in question is not
>> about accessing the COW child, but the "backing" child, be it COW or
>> not.  This is the case in functions such as bdrv_open_backing_file() or
>> whenever we have to deal with the special behavior of @backing as a
>> blockdev option, which is that it does not default to null like all
>> other child references do.
>>
>> Finally, the query functions (query-block and query-named-block-nodes)
>> are modified to return any filtered child under "backing", not just
>> bs->backing or COW children.  This is so that filters do not interrupt
>> the reported backing chain.  This changes the output of iotest 184, as
>> the throttled node now appears as a backing child.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   qapi/block-core.json           |   4 +
>>   include/block/block.h          |   1 +
>>   include/block/block_int.h      |  40 +++++--
>>   block.c                        | 210 +++++++++++++++++++++++++++------
>>   block/backup.c                 |   8 +-
>>   block/block-backend.c          |  16 ++-
>>   block/commit.c                 |  33 +++---
>>   block/io.c                     |  45 ++++---
>>   block/mirror.c                 |  21 ++--
>>   block/qapi.c                   |  30 +++--
>>   block/stream.c                 |  13 +-
>>   blockdev.c                     |  88 +++++++++++---
>>   migration/block-dirty-bitmap.c |   4 +-
>>   nbd/server.c                   |   6 +-
>>   qemu-img.c                     |  29 ++---
>>   tests/qemu-iotests/184.out     |   7 +-
>>   tests/qemu-iotests/204.out     |   1 +
>>   17 files changed, 411 insertions(+), 145 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 7ccbfff9d0..dbd9286e4a 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2502,6 +2502,10 @@
>>   # On successful completion the image file is updated to drop the backing file
>>   # and the BLOCK_JOB_COMPLETED event is emitted.
>>   #
>> +# In case @device is a filter node, block-stream modifies the first non-filter
>> +# overlay node below it to point to base's backing node (or NULL if @base was
>> +# not specified) instead of modifying @device itself.
>> +#
> 
> Is it necessary, why we can't keep it as is, modifying exactly device node? May be,
> user wants to use filter in stream process, throttling for example.

That wouldn't make any sense.  Say you have this configuration:

throttle -> top -> base

Now you stream from base to throttle.  The data goes from base through
throttle to top.  You propose to then make throttle point to base:

throttle -> base

This will discard all the data in top.

Filters don’t store any data.  You need to keep the top data storing
image, i.e. the first non-filter overlay.

>>   # @job-id: identifier for the newly-created block job. If
>>   #          omitted, the device name will be used. (Since 2.7)
>>   #

[...]

>> @@ -2345,7 +2347,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
>>       bool update_inherits_from = bdrv_chain_contains(bs, backing_hd) &&
>>           bdrv_inherits_from_recursive(backing_hd, bs);
>>   
>> -    if (bdrv_is_backing_chain_frozen(bs, backing_bs(bs), errp)) {
>> +    if (bdrv_is_backing_chain_frozen(bs, child_bs(bs->backing), errp)) {
> 
> If we support file-filters for frozen backing chain, could it go through file child here?
> Hmm, only in case when we are going to set backing hd for file-filter.. Hmm, could filter have
> both file and backing children?

No.  A filter passes through data from its children, so it can only have
a single child, or it is quorum.

The file/backing combination is reserved for COW overlays.  file is
where the current layer’s data is, backing is the filtered child.

> Your new API don't restrict it, and choses backing as a default
> in this case in bdrv_filtered_rw_child(), so, I assume you suppose possibility of it.

I can add an assertion against it if you’d like.

> Here we don't want to check the chain, we exactly want to check backing link, so it should be
> something like
> 
> if (bs->backing && bs->backing->frozen) {
>     error_setg("backig exists and frozen!");
>     return;
> }
> 
> 
> Hmm, on the other hand, if we have frozen backing chain, going through file child, we must not add
> backing child to the node with file child, as it will change backing chain (which by default goes
> through backing)..
> 
> Anyway, we don't need to check the whole backing chain, as we may find other frozen backing subchain,
> far away of bs.. So, we possibly want to check
> 
> if (bdrv_filtered_child(bs) && bdrv_filtered_child(bs)->frozed) {
>    ERROR
> }
> 
> 
> ....
> 
> also, we'll need to check for frozen file child, when we want to replace it.

I don’t quite understand.  It sounds to me like you’re saying we don’t
need to check the whole chain here but just the immediate child.  But
isn’t that true regardless of this series?

>>           return;
>>       }
>>   

[...]

>> @@ -3634,7 +3639,7 @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
>>        * its metadata. Otherwise the 'backing' option can be omitted.
>>        */
>>       if (drv->supports_backing && reopen_state->backing_missing &&
>> -        (backing_bs(reopen_state->bs) || reopen_state->bs->backing_file[0])) {
>> +        (reopen_state->bs->backing || reopen_state->bs->backing_file[0])) {
> 
> and if we skip implicit filters in bdrv_backing_chain_next(), shouldn't we skip them
> here too?

This is a check whether it is mandatory for the user to specify the
'backing' key when reopening @bs.  It is mandatory if it currently has a
backing node, or if it should get a backing file by default because the
image header says so.

We don’t care about any node in particular.  The question is just “Is
there a backing node?”.  So I don’t see how skipping filters would
change anything.

>>           error_setg(errp, "backing is missing for '%s'",
>>                      reopen_state->bs->node_name);
>>           ret = -EINVAL;
>> @@ -3779,7 +3784,7 @@ void bdrv_reopen_commit(BDRVReopenState *reopen_state)
>>        * from bdrv_set_backing_hd()) has the new values.
>>        */
>>       if (reopen_state->replace_backing_bs) {
>> -        BlockDriverState *old_backing_bs = backing_bs(bs);
>> +        BlockDriverState *old_backing_bs = child_bs(bs->backing);
>>           assert(!old_backing_bs || !old_backing_bs->implicit);
>>           /* Abort the permission update on the backing bs we're detaching */
>>           if (old_backing_bs) {
>> @@ -4203,8 +4208,8 @@ int bdrv_change_backing_file(BlockDriverState *bs,
>>   BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
>>                                       BlockDriverState *bs)
>>   {
>> -    while (active && bs != backing_bs(active)) {
>> -        active = backing_bs(active);
>> +    while (active && bs != bdrv_filtered_bs(active)) {
> 
> need to adjust comment to the function then, as we may find file-based-overlay, not backing.

Yes, true.

>> +        active = bdrv_filtered_bs(active);
>>       }
>>   
>>       return active;

[...]

>> @@ -4494,10 +4497,14 @@ bool bdrv_is_sg(BlockDriverState *bs)
>>   
>>   bool bdrv_is_encrypted(BlockDriverState *bs)
>>   {
>> -    if (bs->backing && bs->backing->bs->encrypted) {
>> +    BlockDriverState *filtered = bdrv_filtered_bs(bs);
>> +    if (bs->encrypted) {
>> +        return true;
>> +    }
>> +    if (filtered && bdrv_is_encrypted(filtered)) {
>>           return true;
>>       }
>> -    return bs->encrypted;
>> +    return false;
>>   }
> 
> one backing child -> recursion through extended backing chain

Yes, but isn’t that what we want?

[...]

>> @@ -4866,20 +4887,24 @@ int bdrv_has_zero_init_1(BlockDriverState *bs)
>>   
>>   int bdrv_has_zero_init(BlockDriverState *bs)
>>   {
>> +    BlockDriverState *filtered;
>> +
>>       if (!bs->drv) {
>>           return 0;
>>       }
>>   
>>       /* If BS is a copy on write image, it is initialized to
>>          the contents of the base image, which may not be zeroes.  */
>> -    if (bs->backing) {
>> +    if (bdrv_filtered_cow_child(bs)) {
>>           return 0;
>>       }
>>       if (bs->drv->bdrv_has_zero_init) {
>>           return bs->drv->bdrv_has_zero_init(bs);
>>       }
>> -    if (bs->file && bs->drv->is_filter) {
>> -        return bdrv_has_zero_init(bs->file->bs);
>> +
>> +    filtered = bdrv_filtered_rw_bs(bs);
>> +    if (filtered) {
>> +        return bdrv_has_zero_init(filtered);
>>       }
> 
> add recursion for filters

Not really, we had that before.

>>   
>>       /* safe default */

[...]

>> diff --git a/block/backup.c b/block/backup.c
>> index 9988753249..9c08353b23 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
>> @@ -577,6 +577,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>>       int64_t len;
>>       BlockDriverInfo bdi;
>>       BackupBlockJob *job = NULL;
>> +    bool target_does_cow;
>>       int ret;
>>   
>>       assert(bs);
>> @@ -671,8 +672,9 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>>       /* If there is no backing file on the target, we cannot rely on COW if our
>>        * backup cluster size is smaller than the target cluster size. Even for
>>        * targets with a backing file, try to avoid COW if possible. */
>> +    target_does_cow = bdrv_filtered_cow_child(target);
> 
> So, you excluded false-positive case when target is backing-filter. I think, we'd better skip
> filters here:
> 
> target_does_cow = bdrv_filtered_cow_child(bdrv_skip_rw_filters(target))

Sounds correct, yes.

I suppose we then need a fix for compression, too, though.  Currently,
the code checks whether the target driver offers
.bdrv_co_pwritev_compressed().  First, we should do the same there and
skip filters.  But second, in block/io.c, we then need to also skip
filters for compressed writes (by issuing normal writes to them with the
BDRV_REQ_WRITE_COMPRESSED flag set).

>>       ret = bdrv_get_info(target, &bdi);
>> -    if (ret == -ENOTSUP && !target->backing) {
>> +    if (ret == -ENOTSUP && !target_does_cow) {
>>           /* Cluster size is not defined */
>>           warn_report("The target block device doesn't provide "
>>                       "information about the block size and it doesn't have a "

[...]

>> diff --git a/block/io.c b/block/io.c
>> index dfc153b8d8..83c2b6b46a 100644
>> --- a/block/io.c
>> +++ b/block/io.c

[...]

>> @@ -2208,7 +2218,7 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
>>       bool first = true;
>>   
>>       assert(bs != base);
>> -    for (p = bs; p != base; p = backing_bs(p)) {
>> +    for (p = bs; p != base; p = bdrv_filtered_bs(p)) {
>>           ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
>>                                      file);
> 
> Interesting that for filters who use bdrv_co_block_status_from_backing and
> bdrv_co_block_status_from_file we will finally call .bdrv_co_block_status of
> underalying real node two or more times.. It's not wrong but obviously not optimal.

Hm.  If @p is a filter, we could skip straight to *file.  Would that work?

[...]

>> diff --git a/block/mirror.c b/block/mirror.c
>> index 8b2404051f..80cef587f0 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -660,8 +660,9 @@ static int mirror_exit_common(Job *job)
>>                               &error_abort);
>>       if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
>>           BlockDriverState *backing = s->is_none_mode ? src : s->base;
>> -        if (backing_bs(target_bs) != backing) {
>> -            bdrv_set_backing_hd(target_bs, backing, &local_err);
>> +        if (bdrv_backing_chain_next(target_bs) != backing) {
>> +            bdrv_set_backing_hd(bdrv_skip_rw_filters(target_bs), backing,
> 
> hmm, here you support filters above target_bs ...
> 
>> +                                &local_err);
>>               if (local_err) {
>>                   error_report_err(local_err);
>>                   ret = -EPERM;
>> @@ -711,7 +712,7 @@ static int mirror_exit_common(Job *job)
>>       block_job_remove_all_bdrv(bjob);
>>       bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
>>                               &error_abort);
>> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
>> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>>   
>>       /* We just changed the BDS the job BB refers to (with either or both of the
>>        * bdrv_replace_node() calls), so switch the BB back so the cleanup does
>> @@ -903,7 +904,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
>>       } else {
>>           s->target_cluster_size = BDRV_SECTOR_SIZE;
>>       }
>> -    if (backing_filename[0] && !target_bs->backing &&
>> +    if (backing_filename[0] && !bdrv_filtered_cow_child(target_bs) &&
> 
> ... and here - not

Hm, yes, I think that is a mistake.  I’ll try to fix it.

> [stopped here for now]

Thanks so far! :-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2019-05-07 15:15 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-10 20:20 [Qemu-devel] [PATCH v4 00/11] block: Deal with filters Max Reitz
2019-04-10 20:20 ` Max Reitz
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 01/11] block: Mark commit and mirror as filter drivers Max Reitz
2019-04-10 20:20   ` Max Reitz
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions Max Reitz
2019-04-10 20:20   ` Max Reitz
2019-04-16 10:02   ` Vladimir Sementsov-Ogievskiy
2019-04-17 16:22     ` Max Reitz
2019-04-18  8:36       ` Vladimir Sementsov-Ogievskiy
2019-04-24 15:23         ` Max Reitz
2019-04-19 10:23       ` Vladimir Sementsov-Ogievskiy
2019-04-24 16:36         ` Max Reitz
2019-05-07  9:32           ` Vladimir Sementsov-Ogievskiy
2019-05-07 13:15             ` Max Reitz
2019-05-07 13:33               ` Vladimir Sementsov-Ogievskiy
2019-05-31 16:26     ` Max Reitz
2019-05-31 17:02       ` Max Reitz
2019-05-07 13:30   ` Vladimir Sementsov-Ogievskiy
2019-05-07 15:13     ` Max Reitz [this message]
2019-05-17 11:50       ` Vladimir Sementsov-Ogievskiy
2019-05-23 14:49         ` Max Reitz
2019-05-23 15:08           ` Vladimir Sementsov-Ogievskiy
2019-05-23 15:56             ` Max Reitz
2019-05-17 14:50   ` Vladimir Sementsov-Ogievskiy
2019-05-23 17:27     ` Max Reitz
2019-05-24  8:12       ` Vladimir Sementsov-Ogievskiy
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 03/11] block: Storage child access function Max Reitz
2019-04-10 20:20   ` Max Reitz
2019-05-20 10:41   ` Vladimir Sementsov-Ogievskiy
2019-05-28 18:09     ` Max Reitz
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 04/11] block: Inline bdrv_co_block_status_from_*() Max Reitz
2019-04-10 20:20   ` Max Reitz
2019-05-21  8:57   ` Vladimir Sementsov-Ogievskiy
2019-05-28 17:58     ` Max Reitz
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 05/11] block: Fix check_to_replace_node() Max Reitz
2019-04-10 20:20   ` Max Reitz
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 06/11] iotests: Add tests for mirror @replaces loops Max Reitz
2019-04-10 20:20   ` Max Reitz
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 07/11] block: Leave BDS.backing_file constant Max Reitz
2019-04-10 20:20   ` Max Reitz
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 08/11] iotests: Add filter commit test cases Max Reitz
2019-04-10 20:20   ` Max Reitz
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 09/11] iotests: Add filter mirror " Max Reitz
2019-04-10 20:20   ` Max Reitz
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 10/11] iotests: Add test for commit in sub directory Max Reitz
2019-04-10 20:20   ` Max Reitz
2019-04-10 20:20 ` [Qemu-devel] [PATCH v4 11/11] iotests: Test committing to overridden backing Max Reitz
2019-04-10 20:20   ` Max Reitz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5794529f-9451-4dd9-c509-07df5cefdead@redhat.com \
    --to=mreitz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=vsementsov@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.