* [Qemu-devel] [PATCH v6 00/42] block: Deal with filters
@ 2019-08-09 16:13 Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 01/42] block: Mark commit and mirror as filter drivers Max Reitz
                   ` (41 more replies)
  0 siblings, 42 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Hi,

When we introduced filters, we did it a bit casually.  Sure, we talked a
lot about them before, but that was mostly discussion about where
implicit filters should be added to the graph (note that we currently
only have two implicit filters, those being mirror and commit).  But in
the end, we really just designated some drivers as filters (Quorum,
blkdebug, etc.) and added some specifically (throttle, COR), without
really looking through the block layer to see where issues might occur.

It turns out vast areas of the block layer just don’t know about filters
and cannot really handle them.  Many cases will work in practice; in
others, well, too bad: you cannot use some feature because some part
deep inside the block layer looks at your filters and thinks they are
format nodes.

This is one reason why this series is needed.  Over time (since v1), a
second reason has made its way in:

bs->file is not necessarily the place where a node’s data is stored.
qcow2 now has external data files, and currently there is no way for the
general block layer to know that the data is not stored in bs->file.
Right now, I do not think that has any real consequences (all functions
that need access to the actual data storage file should only do so as a
fallback if the driver does not provide some functionality, but qcow2
should provide it all), but it still shows that we need some way to let
the general block layer know about such data files.  (Also, I will need
this for v1 of my “Inquire images’ rotational info” series.)

I won’t go on and on about this series; I think the patches pretty much
speak for themselves by now.  If the cover letter gets too long, nobody
reads it anyway (see previous versions).


*** I’ve based this series on John’s bitmap branch, which I’ve rebased
    on my block-next branch. ***

I’ve pushed the patches here:

  https://git.xanclic.moe/XanClic/qemu child-access-functions-v6
  https://github.com/XanClic/qemu child-access-functions-v6

I’ve also pushed the base branch to each of those repos; it’s
“child-access-functions-base”.


v6:
- Patch 9: Rename *freeze_backing_chain (etc.) to *freeze_chain (etc.)
- Patch 10: Drop bdrv_is_encrypted() instead of fixing it the wrong way
- Patch 15: Add a comment on why this works
- Patch 16: Just flush all children instead of one
- Patch 20: We have to snapshot all non-backing children, so both
            metadata and storage children; for simplification, just
            disallow the fallback path if there is more than one such
            child
- Patch 22: bdrv_get_allocated_file_size() should report all children’s
            sizes, not just the primary or storage child’s
- Patch 24: Make query-blockstats report any filtered child under
            “backing”, too
- Patch 25:
  - bdrv_is_allocated_above()’s new @include_base parameter makes things
    a bit simpler
  - Forbid mirroring to a filter on top of the base
- Patch 27: bdrv_is_allocated_above()’s new @include_base parameter
            makes things a bit simpler
- Patch 28: Requires a few changes to keep stream independent of the
            base node
- Patch 30:
  - Sprinkle in a few bdrv_skip_implicit_filters()s
  - Conflicts with the convert --salvage patches


git backport-diff against v5:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/42:[----] [-C] 'block: Mark commit and mirror as filter drivers'
002/42:[----] [--] 'copy-on-read: Support compressed writes'
003/42:[----] [--] 'throttle: Support compressed writes'
004/42:[----] [--] 'block: Add child access functions'
005/42:[----] [--] 'block: Add chain helper functions'
006/42:[----] [--] 'qcow2: Implement .bdrv_storage_child()'
007/42:[----] [--] 'block: *filtered_cow_child() for *has_zero_init()'
008/42:[----] [--] 'block: bdrv_set_backing_hd() is about bs->backing'
009/42:[0078] [FC] 'block: Include filters when freezing backing chain'
010/42:[down] 'block: Drop bdrv_is_encrypted()'
011/42:[----] [-C] 'block: Add bdrv_supports_compressed_writes()'
012/42:[----] [--] 'block: Use bdrv_filtered_rw* where obvious'
013/42:[----] [-C] 'block: Use CAFs in block status functions'
014/42:[----] [--] 'block: Use CAFs when working with backing chains'
015/42:[0017] [FC] 'block: Re-evaluate backing file handling in reopen'
016/42:[down] 'block: Flush all children in generic code'
017/42:[----] [--] 'block: Use CAFs in bdrv_refresh_limits()'
018/42:[----] [--] 'block: Use CAFs in bdrv_refresh_filename()'
019/42:[----] [--] 'block: Use CAF in bdrv_co_rw_vmstate()'
020/42:[down] 'block/snapshot: Fix fallback'
021/42:[----] [--] 'block: Use CAFs for debug breakpoints'
022/42:[down] 'block: Fix bdrv_get_allocated_file_size's fallback'
023/42:[----] [--] 'blockdev: Use CAF in external_snapshot_prepare()'
024/42:[0012] [FC] 'block: Use child access functions for QAPI queries'
025/42:[0019] [FC] 'mirror: Deal with filters'
026/42:[----] [--] 'backup: Deal with filters'
027/42:[0025] [FC] 'commit: Deal with filters'
028/42:[0048] [FC] 'stream: Deal with filters'
029/42:[----] [--] 'nbd: Use CAF when looking for dirty bitmap'
030/42:[0035] [FC] 'qemu-img: Use child access functions'
031/42:[----] [--] 'block: Drop backing_bs()'
032/42:[----] [-C] 'block: Make bdrv_get_cumulative_perm() public'
033/42:[----] [--] 'blockdev: Fix active commit choice'
034/42:[----] [-C] 'block: Inline bdrv_co_block_status_from_*()'
035/42:[----] [--] 'block: Fix check_to_replace_node()'
036/42:[----] [--] 'iotests: Add tests for mirror @replaces loops'
037/42:[----] [-C] 'block: Leave BDS.backing_file constant'
038/42:[----] [--] 'iotests: Let complete_and_wait() work with commit'
039/42:[----] [--] 'iotests: Add filter commit test cases'
040/42:[----] [--] 'iotests: Add filter mirror test cases'
041/42:[----] [--] 'iotests: Add test for commit in sub directory'
042/42:[----] [--] 'iotests: Test committing to overridden backing'


Max Reitz (42):
  block: Mark commit and mirror as filter drivers
  copy-on-read: Support compressed writes
  throttle: Support compressed writes
  block: Add child access functions
  block: Add chain helper functions
  qcow2: Implement .bdrv_storage_child()
  block: *filtered_cow_child() for *has_zero_init()
  block: bdrv_set_backing_hd() is about bs->backing
  block: Include filters when freezing backing chain
  block: Drop bdrv_is_encrypted()
  block: Add bdrv_supports_compressed_writes()
  block: Use bdrv_filtered_rw* where obvious
  block: Use CAFs in block status functions
  block: Use CAFs when working with backing chains
  block: Re-evaluate backing file handling in reopen
  block: Flush all children in generic code
  block: Use CAFs in bdrv_refresh_limits()
  block: Use CAFs in bdrv_refresh_filename()
  block: Use CAF in bdrv_co_rw_vmstate()
  block/snapshot: Fix fallback
  block: Use CAFs for debug breakpoints
  block: Fix bdrv_get_allocated_file_size's fallback
  blockdev: Use CAF in external_snapshot_prepare()
  block: Use child access functions for QAPI queries
  mirror: Deal with filters
  backup: Deal with filters
  commit: Deal with filters
  stream: Deal with filters
  nbd: Use CAF when looking for dirty bitmap
  qemu-img: Use child access functions
  block: Drop backing_bs()
  block: Make bdrv_get_cumulative_perm() public
  blockdev: Fix active commit choice
  block: Inline bdrv_co_block_status_from_*()
  block: Fix check_to_replace_node()
  iotests: Add tests for mirror @replaces loops
  block: Leave BDS.backing_file constant
  iotests: Let complete_and_wait() work with commit
  iotests: Add filter commit test cases
  iotests: Add filter mirror test cases
  iotests: Add test for commit in sub directory
  iotests: Test committing to overridden backing

 qapi/block-core.json          |   4 +
 include/block/block.h         |  13 +-
 include/block/block_int.h     | 109 ++++---
 block.c                       | 552 ++++++++++++++++++++++++++++------
 block/backup.c                |   9 +-
 block/blkdebug.c              |   7 +-
 block/blklogwrites.c          |   1 -
 block/block-backend.c         |  16 +-
 block/commit.c                | 107 +++++--
 block/copy-on-read.c          |  13 +-
 block/io.c                    | 117 ++++---
 block/mirror.c                | 124 ++++++--
 block/qapi.c                  |  48 +--
 block/qcow2.c                 |   9 +
 block/snapshot.c              | 100 ++++--
 block/stream.c                |  52 ++--
 block/throttle.c              |  11 +-
 blockdev.c                    | 139 +++++++--
 nbd/server.c                  |   6 +-
 qemu-img.c                    |  33 +-
 tests/qemu-iotests/020        |  36 +++
 tests/qemu-iotests/020.out    |  10 +
 tests/qemu-iotests/040        | 238 +++++++++++++++
 tests/qemu-iotests/040.out    |   4 +-
 tests/qemu-iotests/041        | 270 ++++++++++++++++-
 tests/qemu-iotests/041.out    |   4 +-
 tests/qemu-iotests/184.out    |   7 +-
 tests/qemu-iotests/191.out    |   1 -
 tests/qemu-iotests/204.out    |   1 +
 tests/qemu-iotests/228        |   6 +-
 tests/qemu-iotests/228.out    |   6 +-
 tests/qemu-iotests/245        |   4 +-
 tests/qemu-iotests/iotests.py |  10 +-
 33 files changed, 1682 insertions(+), 385 deletions(-)

-- 
2.21.0




* [Qemu-devel] [PATCH v6 01/42] block: Mark commit and mirror as filter drivers
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 02/42] copy-on-read: Support compressed writes Max Reitz
                   ` (40 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

The commit and mirror block nodes are filters, so they should be marked
as such.  (Strictly speaking, BDS.is_filter's documentation states that
a filter's child must be bs->file.  The following patch will relax this
restriction, however.)

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/commit.c | 2 ++
 block/mirror.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/block/commit.c b/block/commit.c
index 408ae15389..19915603ae 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -254,6 +254,8 @@ static BlockDriver bdrv_commit_top = {
     .bdrv_co_block_status       = bdrv_co_block_status_from_backing,
     .bdrv_refresh_filename      = bdrv_commit_top_refresh_filename,
     .bdrv_child_perm            = bdrv_commit_top_child_perm,
+
+    .is_filter                  = true,
 };
 
 void commit_start(const char *job_id, BlockDriverState *bs,
diff --git a/block/mirror.c b/block/mirror.c
index 2b870683f1..a8f2d7a305 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1507,6 +1507,8 @@ static BlockDriver bdrv_mirror_top = {
     .bdrv_refresh_filename      = bdrv_mirror_top_refresh_filename,
     .bdrv_child_perm            = bdrv_mirror_top_child_perm,
     .bdrv_refresh_limits        = bdrv_mirror_top_refresh_limits,
+
+    .is_filter                  = true,
 };
 
 static BlockJob *mirror_start_job(
-- 
2.21.0




* [Qemu-devel] [PATCH v6 02/42] copy-on-read: Support compressed writes
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 01/42] block: Mark commit and mirror as filter drivers Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 03/42] throttle: " Max Reitz
                   ` (39 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/copy-on-read.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 6631f30205..16bdf630b6 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -113,6 +113,16 @@ static int coroutine_fn cor_co_pdiscard(BlockDriverState *bs,
 }
 
 
+static int coroutine_fn cor_co_pwritev_compressed(BlockDriverState *bs,
+                                                  uint64_t offset,
+                                                  uint64_t bytes,
+                                                  QEMUIOVector *qiov)
+{
+    return bdrv_co_pwritev(bs->file, offset, bytes, qiov,
+                           BDRV_REQ_WRITE_COMPRESSED);
+}
+
+
 static void cor_eject(BlockDriverState *bs, bool eject_flag)
 {
     bdrv_eject(bs->file->bs, eject_flag);
@@ -145,6 +155,7 @@ static BlockDriver bdrv_copy_on_read = {
     .bdrv_co_pwritev                    = cor_co_pwritev,
     .bdrv_co_pwrite_zeroes              = cor_co_pwrite_zeroes,
     .bdrv_co_pdiscard                   = cor_co_pdiscard,
+    .bdrv_co_pwritev_compressed         = cor_co_pwritev_compressed,
 
     .bdrv_eject                         = cor_eject,
     .bdrv_lock_medium                   = cor_lock_medium,
-- 
2.21.0




* [Qemu-devel] [PATCH v6 03/42] throttle: Support compressed writes
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 01/42] block: Mark commit and mirror as filter drivers Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 02/42] copy-on-read: Support compressed writes Max Reitz
@ 2019-08-09 16:13 ` " Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 04/42] block: Add child access functions Max Reitz
                   ` (38 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/throttle.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/block/throttle.c b/block/throttle.c
index 0349f42257..958a2bcfa6 100644
--- a/block/throttle.c
+++ b/block/throttle.c
@@ -153,6 +153,15 @@ static int coroutine_fn throttle_co_pdiscard(BlockDriverState *bs,
     return bdrv_co_pdiscard(bs->file, offset, bytes);
 }
 
+static int coroutine_fn throttle_co_pwritev_compressed(BlockDriverState *bs,
+                                                       uint64_t offset,
+                                                       uint64_t bytes,
+                                                       QEMUIOVector *qiov)
+{
+    return throttle_co_pwritev(bs, offset, bytes, qiov,
+                               BDRV_REQ_WRITE_COMPRESSED);
+}
+
 static int throttle_co_flush(BlockDriverState *bs)
 {
     return bdrv_co_flush(bs->file->bs);
@@ -251,6 +260,7 @@ static BlockDriver bdrv_throttle = {
 
     .bdrv_co_pwrite_zeroes              =   throttle_co_pwrite_zeroes,
     .bdrv_co_pdiscard                   =   throttle_co_pdiscard,
+    .bdrv_co_pwritev_compressed         =   throttle_co_pwritev_compressed,
 
     .bdrv_recurse_is_first_non_filter   =   throttle_recurse_is_first_non_filter,
 
-- 
2.21.0




* [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (2 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 03/42] throttle: " Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:56   ` Eric Blake
  2019-09-04 16:16   ` Kevin Wolf
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 05/42] block: Add chain helper functions Max Reitz
                   ` (37 subsequent siblings)
  41 siblings, 2 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

There are BDS children that the general block layer code can access,
namely bs->file and bs->backing.  Since the introduction of filters and
external data files, their meaning is not quite clear.  bs->backing can
be a COW source, or it can be an R/W-filtered child; bs->file can be an
R/W-filtered child, it can be data and metadata storage, or it can be
just metadata storage.

This overloading really is not helpful.  This patch adds functions that
retrieve the correct child for each exact purpose.  Later patches in
this series will make use of them.  Doing so will allow us to handle
filter nodes and external data files in a meaningful way.
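
As a rough illustration of the distinction (a minimal sketch, not the
code added by this patch): the model below uses a hypothetical Node
struct in place of BlockDriverState/BdrvChild, and a hypothetical
data_file field in place of the .bdrv_storage_child() driver callback.
It only shows which link each purpose resolves to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, stripped-down stand-in for a block node.  Only the
 * fields needed to show the lookup rules are modelled.
 */
typedef struct Node Node;
struct Node {
    bool is_filter;   /* models bs->drv->is_filter */
    Node *file;       /* models bs->file */
    Node *backing;    /* models bs->backing */
    Node *data_file;  /* assumption: models a driver-provided storage child */
};

/* COW source: only non-filter nodes have one, and it is the backing link. */
static inline Node *filtered_cow(const Node *n)
{
    return (n && !n->is_filter) ? n->backing : NULL;
}

/* R/W-filtered child: only filters have one; either link may carry it. */
static inline Node *filtered_rw(const Node *n)
{
    if (!n || !n->is_filter) {
        return NULL;
    }
    /* A filter may use only one of the two links */
    assert(!(n->backing && n->file));
    return n->backing ? n->backing : n->file;
}

/* Metadata child: filters have no metadata; other nodes use the file link. */
static inline Node *metadata_child(const Node *n)
{
    return (n && !n->is_filter) ? n->file : NULL;
}

/* Storage child: driver override first, then R/W child, then the file link. */
static inline Node *storage_child(const Node *n)
{
    Node *rw;

    if (!n) {
        return NULL;
    }
    if (n->data_file) {
        return n->data_file;
    }
    rw = filtered_rw(n);
    return rw ? rw : n->file;
}
```

For a format node with both links set, filtered_cow() yields the
backing node and filtered_rw() yields NULL; for a filter it is the
other way around.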

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block_int.h | 57 ++++++++++++++++++++--
 block.c                   | 99 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 153 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 60d9261f8e..6c60abc4c3 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -90,9 +90,11 @@ struct BlockDriver {
     int instance_size;
 
     /* set to true if the BlockDriver is a block filter. Block filters pass
-     * certain callbacks that refer to data (see block.c) to their bs->file if
-     * the driver doesn't implement them. Drivers that do not wish to forward
-     * must implement them and return -ENOTSUP.
+     * certain callbacks that refer to data (see block.c) to their bs->file
+     * or bs->backing (whichever one exists) if the driver doesn't implement
+     * them. Drivers that do not wish to forward must implement them and return
+     * -ENOTSUP.
+     * Note that filters are not allowed to modify data.
      */
     bool is_filter;
     /* for snapshots block filter like Quorum can implement the
@@ -562,6 +564,13 @@ struct BlockDriver {
      * If this pointer is NULL, the array is considered empty.
      * "filename" and "driver" are always considered strong. */
     const char *const *strong_runtime_opts;
+
+    /**
+     * Return the data storage child, if there is exactly one.  If
+     * this function is not implemented, the block layer will assume
+     * bs->file to be this child.
+     */
+    BdrvChild *(*bdrv_storage_child)(BlockDriverState *bs);
 };
 
 typedef struct BlockLimits {
@@ -1276,4 +1285,46 @@ int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, uint64_t src_offset,
 
 int refresh_total_sectors(BlockDriverState *bs, int64_t hint);
 
+BdrvChild *bdrv_filtered_cow_child(BlockDriverState *bs);
+BdrvChild *bdrv_filtered_rw_child(BlockDriverState *bs);
+BdrvChild *bdrv_filtered_child(BlockDriverState *bs);
+BdrvChild *bdrv_metadata_child(BlockDriverState *bs);
+BdrvChild *bdrv_storage_child(BlockDriverState *bs);
+BdrvChild *bdrv_primary_child(BlockDriverState *bs);
+
+static inline BlockDriverState *child_bs(BdrvChild *child)
+{
+    return child ? child->bs : NULL;
+}
+
+static inline BlockDriverState *bdrv_filtered_cow_bs(BlockDriverState *bs)
+{
+    return child_bs(bdrv_filtered_cow_child(bs));
+}
+
+static inline BlockDriverState *bdrv_filtered_rw_bs(BlockDriverState *bs)
+{
+    return child_bs(bdrv_filtered_rw_child(bs));
+}
+
+static inline BlockDriverState *bdrv_filtered_bs(BlockDriverState *bs)
+{
+    return child_bs(bdrv_filtered_child(bs));
+}
+
+static inline BlockDriverState *bdrv_metadata_bs(BlockDriverState *bs)
+{
+    return child_bs(bdrv_metadata_child(bs));
+}
+
+static inline BlockDriverState *bdrv_storage_bs(BlockDriverState *bs)
+{
+    return child_bs(bdrv_storage_child(bs));
+}
+
+static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
+{
+    return child_bs(bdrv_primary_child(bs));
+}
+
 #endif /* BLOCK_INT_H */
diff --git a/block.c b/block.c
index aae3417dd5..f6c5f8c3eb 100644
--- a/block.c
+++ b/block.c
@@ -6556,3 +6556,102 @@ bool bdrv_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
 
     return drv->bdrv_can_store_new_dirty_bitmap(bs, name, granularity, errp);
 }
+
+/*
+ * Return the child that @bs acts as an overlay for, and from which data may be
+ * copied in COW or COR operations.  Usually this is the backing file.
+ */
+BdrvChild *bdrv_filtered_cow_child(BlockDriverState *bs)
+{
+    if (!bs || !bs->drv) {
+        return NULL;
+    }
+
+    if (bs->drv->is_filter) {
+        return NULL;
+    }
+
+    return bs->backing;
+}
+
+/*
+ * If @bs acts as a pass-through filter for one of its children,
+ * return that child.  "Pass-through" means that write operations to
+ * @bs are forwarded to that child instead of triggering COW.
+ */
+BdrvChild *bdrv_filtered_rw_child(BlockDriverState *bs)
+{
+    if (!bs || !bs->drv) {
+        return NULL;
+    }
+
+    if (!bs->drv->is_filter) {
+        return NULL;
+    }
+
+    /* Only one of @backing or @file may be used */
+    assert(!(bs->backing && bs->file));
+
+    return bs->backing ?: bs->file;
+}
+
+/*
+ * Return any filtered child, independently of how it reacts to write
+ * accesses and whether data is copied onto this BDS through COR.
+ */
+BdrvChild *bdrv_filtered_child(BlockDriverState *bs)
+{
+    BdrvChild *cow_child = bdrv_filtered_cow_child(bs);
+    BdrvChild *rw_child = bdrv_filtered_rw_child(bs);
+
+    /* There can only be one filtered child at a time */
+    assert(!(cow_child && rw_child));
+
+    return cow_child ?: rw_child;
+}
+
+/*
+ * Return the child that stores the metadata for this node.
+ */
+BdrvChild *bdrv_metadata_child(BlockDriverState *bs)
+{
+    if (!bs || !bs->drv) {
+        return NULL;
+    }
+
+    /* Filters do not have metadata */
+    if (bs->drv->is_filter) {
+        return NULL;
+    }
+
+    return bs->file;
+}
+
+/*
+ * Return the child that stores the data that is allocated on this
+ * node.  This may or may not include metadata.
+ */
+BdrvChild *bdrv_storage_child(BlockDriverState *bs)
+{
+    if (!bs || !bs->drv) {
+        return NULL;
+    }
+
+    if (bs->drv->bdrv_storage_child) {
+        return bs->drv->bdrv_storage_child(bs);
+    }
+
+    return bdrv_filtered_rw_child(bs) ?: bs->file;
+}
+
+/*
+ * Return the primary child of this node: For filters, that is the
+ * filtered child.  For other nodes, that is usually the child storing
+ * metadata.
+ * (A generally more helpful description is that this is (usually) the
+ * child that has the same filename as @bs.)
+ */
+BdrvChild *bdrv_primary_child(BlockDriverState *bs)
+{
+    return bdrv_filtered_rw_child(bs) ?: bs->file;
+}
-- 
2.21.0




* [Qemu-devel] [PATCH v6 05/42] block: Add chain helper functions
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (3 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 04/42] block: Add child access functions Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 17:01   ` Eric Blake
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 06/42] qcow2: Implement .bdrv_storage_child() Max Reitz
                   ` (36 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Add some helper functions for skipping filters in a chain of block
nodes.
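
To make the intended behavior concrete, here is a minimal sketch on a
toy model (not the code added by this patch).  The Node struct and its
single child link are assumptions made for brevity: child stands for
the R/W-filtered child of a filter and for the COW backing node of
anything else.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy node; not QEMU's BlockDriverState. */
typedef struct Node Node;
struct Node {
    bool is_filter;  /* models bs->drv->is_filter */
    bool implicit;   /* models bs->implicit */
    Node *child;     /* R/W child for filters, COW backing otherwise */
};

/*
 * Walk down through filters.  With stop_on_explicit set, stop at the
 * first node that was not added implicitly.
 */
static inline Node *skip_filters(Node *n, bool stop_on_explicit)
{
    while (n && n->is_filter && n->child) {
        if (stop_on_explicit && !n->implicit) {
            break;
        }
        n = n->child;
    }
    return n;
}

static inline Node *skip_implicit_filters(Node *n)
{
    return skip_filters(n, true);
}

static inline Node *skip_rw_filters(Node *n)
{
    return skip_filters(n, false);
}

/* First non-filter backing node of the first non-filter node. */
static inline Node *backing_chain_next(Node *n)
{
    Node *first = skip_rw_filters(n);
    Node *cow = (first && !first->is_filter) ? first->child : NULL;

    return skip_rw_filters(cow);
}
```

Given a chain "implicit filter -> explicit filter -> top -> filter ->
base", skip_implicit_filters() stops at the explicit filter,
skip_rw_filters() stops at top, and backing_chain_next() returns base.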

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block_int.h |  3 +++
 block.c                   | 55 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 6c60abc4c3..5bec3361fd 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1291,6 +1291,9 @@ BdrvChild *bdrv_filtered_child(BlockDriverState *bs);
 BdrvChild *bdrv_metadata_child(BlockDriverState *bs);
 BdrvChild *bdrv_storage_child(BlockDriverState *bs);
 BdrvChild *bdrv_primary_child(BlockDriverState *bs);
+BlockDriverState *bdrv_skip_implicit_filters(BlockDriverState *bs);
+BlockDriverState *bdrv_skip_rw_filters(BlockDriverState *bs);
+BlockDriverState *bdrv_backing_chain_next(BlockDriverState *bs);
 
 static inline BlockDriverState *child_bs(BdrvChild *child)
 {
diff --git a/block.c b/block.c
index f6c5f8c3eb..bfa5e27850 100644
--- a/block.c
+++ b/block.c
@@ -6655,3 +6655,58 @@ BdrvChild *bdrv_primary_child(BlockDriverState *bs)
 {
     return bdrv_filtered_rw_child(bs) ?: bs->file;
 }
+
+static BlockDriverState *bdrv_skip_filters(BlockDriverState *bs,
+                                           bool stop_on_explicit_filter)
+{
+    BdrvChild *filtered;
+
+    if (!bs) {
+        return NULL;
+    }
+
+    while (!(stop_on_explicit_filter && !bs->implicit)) {
+        filtered = bdrv_filtered_rw_child(bs);
+        if (!filtered) {
+            break;
+        }
+        bs = filtered->bs;
+    }
+    /*
+     * Note that this treats nodes with bs->drv == NULL as not being
+     * R/W filters (bs->drv == NULL should be replaced by something
+     * else anyway).
+     * The advantage of this behavior is that this function will thus
+     * always return a non-NULL value (given a non-NULL @bs).
+     */
+
+    return bs;
+}
+
+/*
+ * Return the first BDS that has not been added implicitly or that
+ * does not have an RW-filtered child down the chain starting from @bs
+ * (including @bs itself).
+ */
+BlockDriverState *bdrv_skip_implicit_filters(BlockDriverState *bs)
+{
+    return bdrv_skip_filters(bs, true);
+}
+
+/*
+ * Return the first BDS that does not have an RW-filtered child down
+ * the chain starting from @bs (including @bs itself).
+ */
+BlockDriverState *bdrv_skip_rw_filters(BlockDriverState *bs)
+{
+    return bdrv_skip_filters(bs, false);
+}
+
+/*
+ * For a backing chain, return the first non-filter backing image of
+ * the first non-filter image.
+ */
+BlockDriverState *bdrv_backing_chain_next(BlockDriverState *bs)
+{
+    return bdrv_skip_rw_filters(bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs)));
+}
-- 
2.21.0




* [Qemu-devel] [PATCH v6 06/42] qcow2: Implement .bdrv_storage_child()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (4 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 05/42] block: Add chain helper functions Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 17:07   ` Eric Blake
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 07/42] block: *filtered_cow_child() for *has_zero_init() Max Reitz
                   ` (35 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 039bdc2f7e..f8570d6210 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -5086,6 +5086,13 @@ void qcow2_signal_corruption(BlockDriverState *bs, bool fatal, int64_t offset,
     s->signaled_corruption = true;
 }
 
+static BdrvChild *qcow2_storage_child(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+
+    return s->data_file;
+}
+
 static QemuOptsList qcow2_create_opts = {
     .name = "qcow2-create-opts",
     .head = QTAILQ_HEAD_INITIALIZER(qcow2_create_opts.head),
@@ -5232,6 +5239,8 @@ BlockDriver bdrv_qcow2 = {
     .bdrv_reopen_bitmaps_rw = qcow2_reopen_bitmaps_rw,
     .bdrv_can_store_new_dirty_bitmap = qcow2_can_store_new_dirty_bitmap,
     .bdrv_remove_persistent_dirty_bitmap = qcow2_remove_persistent_dirty_bitmap,
+
+    .bdrv_storage_child = qcow2_storage_child,
 };
 
 static void bdrv_qcow2_init(void)
-- 
2.21.0




* [Qemu-devel] [PATCH v6 07/42] block: *filtered_cow_child() for *has_zero_init()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (5 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 06/42] qcow2: Implement .bdrv_storage_child() Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 08/42] block: bdrv_set_backing_hd() is about bs->backing Max Reitz
                   ` (34 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

bdrv_has_zero_init() and the related bdrv_unallocated_blocks_are_zero()
should use bdrv_filtered_cow_child() if they want to check whether the
given BDS has a COW backing file.
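
The difference only shows for filters that attach their child through
bs->backing (such as the commit and mirror filter nodes).  A minimal
sketch of the two checks on a hypothetical two-field Node model, not
the actual block layer code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy node; not QEMU's BlockDriverState. */
typedef struct Node Node;
struct Node {
    bool is_filter;  /* models bs->drv->is_filter */
    Node *backing;   /* models bs->backing */
};

/* Old check: any backing link is taken to be a COW source. */
static inline bool has_backing_link(const Node *n)
{
    return n->backing != NULL;
}

/* New check: only a non-filter node with a backing link is a COW overlay. */
static inline bool has_cow_backing(const Node *n)
{
    return !n->is_filter && n->backing != NULL;
}
```

For a filter on top of a standalone image, has_backing_link() reports a
COW backing file although there is none, while has_cow_backing() does
not; for a real overlay both agree.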

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index bfa5e27850..486c75d847 100644
--- a/block.c
+++ b/block.c
@@ -5063,7 +5063,7 @@ int bdrv_has_zero_init(BlockDriverState *bs)
 
     /* If BS is a copy on write image, it is initialized to
        the contents of the base image, which may not be zeroes.  */
-    if (bs->backing) {
+    if (bdrv_filtered_cow_child(bs)) {
         return 0;
     }
     if (bs->drv->bdrv_has_zero_init) {
@@ -5081,7 +5081,7 @@ bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs)
 {
     BlockDriverInfo bdi;
 
-    if (bs->backing) {
+    if (bdrv_filtered_cow_child(bs)) {
         return false;
     }
 
-- 
2.21.0




* [Qemu-devel] [PATCH v6 08/42] block: bdrv_set_backing_hd() is about bs->backing
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (6 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 07/42] block: *filtered_cow_child() for *has_zero_init() Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain Max Reitz
                   ` (33 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

bdrv_set_backing_hd() is a function that explicitly cares about the
bs->backing child.  Highlight that in its description and use
child_bs(bs->backing) instead of backing_bs(bs) to make it more obvious.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 486c75d847..adf82efb0e 100644
--- a/block.c
+++ b/block.c
@@ -2539,7 +2539,7 @@ static bool bdrv_inherits_from_recursive(BlockDriverState *child,
 }
 
 /*
- * Sets the backing file link of a BDS. A new reference is created; callers
+ * Sets the bs->backing link of a BDS. A new reference is created; callers
  * which don't need their own reference any more must call bdrv_unref().
  */
 void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
@@ -2548,7 +2548,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
     bool update_inherits_from = bdrv_chain_contains(bs, backing_hd) &&
         bdrv_inherits_from_recursive(backing_hd, bs);
 
-    if (bdrv_is_backing_chain_frozen(bs, backing_bs(bs), errp)) {
+    if (bdrv_is_backing_chain_frozen(bs, child_bs(bs->backing), errp)) {
         return;
     }
 
-- 
2.21.0




* [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (7 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 08/42] block: bdrv_set_backing_hd() is about bs->backing Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-10 13:32   ` Vladimir Sementsov-Ogievskiy
  2019-09-05 13:05   ` Kevin Wolf
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 10/42] block: Drop bdrv_is_encrypted() Max Reitz
                   ` (32 subsequent siblings)
  41 siblings, 2 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

In order to make filters work in backing chains, the associated
functions must be able to deal with them and freeze all filter links, be
they COW or R/W filter links.

In the process, rename these functions to reflect that they now act on
generalized chains of filter nodes instead of backing chains alone.

While at it, add some comments that note which functions require their
caller to ensure that a given child link is not frozen, and how the
callers do so.
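
As a rough illustration of the check-then-freeze scheme described above
(not part of the patch), here is a minimal standalone sketch.  Node,
Link, filtered_link(), chain_is_frozen(), freeze_chain() and
unfreeze_chain() are hypothetical stand-ins for BlockDriverState,
BdrvChild and the bdrv_*_chain() functions; the sketch assumes a node
has at most one COW or R/W filter link, and it leaves out error
reporting and the never_freeze check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; these are NOT QEMU's real types. */
typedef struct Node Node;

typedef struct Link {
    Node *node;   /* node this link points to */
    bool frozen;  /* true while the link must not be changed */
} Link;

struct Node {
    Link *cow;    /* COW link (a format node's backing link), or NULL */
    Link *rw;     /* R/W filter link (a filter's child link), or NULL */
};

/* Return the link that continues the chain (assumed to be at most one). */
static Link *filtered_link(Node *n)
{
    assert(!(n->cow && n->rw));
    return n->cow ? n->cow : n->rw;
}

/* True if any COW or R/W filter link between @top and @base is frozen. */
static bool chain_is_frozen(Node *top, Node *base)
{
    Node *i = top;

    while (i != base) {
        Link *l = filtered_link(i);
        if (!l) {
            break;      /* end of chain reached (only valid if @base is NULL) */
        }
        if (l->frozen) {
            return true;
        }
        i = l->node;
    }
    return false;
}

/* Freeze all links between @top and @base; modify nothing on failure. */
static int freeze_chain(Node *top, Node *base)
{
    Node *i = top;

    if (chain_is_frozen(top, base)) {
        return -1;
    }
    while (i != base) {
        Link *l = filtered_link(i);
        if (!l) {
            break;
        }
        l->frozen = true;
        i = l->node;
    }
    return 0;
}

/* Unfreeze all links between @top and @base; all of them must be frozen. */
static void unfreeze_chain(Node *top, Node *base)
{
    Node *i = top;

    while (i != base) {
        Link *l = filtered_link(i);
        if (!l) {
            break;
        }
        assert(l->frozen);
        l->frozen = false;
        i = l->node;
    }
}
```

Checking the whole chain before setting any flag is what lets a failed
freeze leave every link untouched.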

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block.h | 10 +++---
 block.c               | 81 +++++++++++++++++++++++++------------------
 block/commit.c        |  8 ++---
 block/mirror.c        |  4 +--
 block/stream.c        |  8 ++---
 5 files changed, 62 insertions(+), 49 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 50a07c1c33..f6f09b95cd 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -364,11 +364,11 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
 BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
                                     BlockDriverState *bs);
 BlockDriverState *bdrv_find_base(BlockDriverState *bs);
-bool bdrv_is_backing_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
-                                  Error **errp);
-int bdrv_freeze_backing_chain(BlockDriverState *bs, BlockDriverState *base,
-                              Error **errp);
-void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState *base);
+bool bdrv_is_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
+                          Error **errp);
+int bdrv_freeze_chain(BlockDriverState *bs, BlockDriverState *base,
+                      Error **errp);
+void bdrv_unfreeze_chain(BlockDriverState *bs, BlockDriverState *base);
 
 
 typedef struct BdrvCheckResult {
diff --git a/block.c b/block.c
index adf82efb0e..650c00d182 100644
--- a/block.c
+++ b/block.c
@@ -2303,12 +2303,15 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
  * If @new_bs is not NULL, bdrv_check_perm() must be called beforehand, as this
  * function uses bdrv_set_perm() to update the permissions according to the new
  * reference that @new_bs gets.
+ *
+ * Callers must ensure that child->frozen is false.
  */
 static void bdrv_replace_child(BdrvChild *child, BlockDriverState *new_bs)
 {
     BlockDriverState *old_bs = child->bs;
     uint64_t perm, shared_perm;
 
+    /* Asserts that child->frozen == false */
     bdrv_replace_child_noperm(child, new_bs);
 
     /*
@@ -2468,6 +2471,7 @@ static void bdrv_detach_child(BdrvChild *child)
     g_free(child);
 }
 
+/* Callers must ensure that child->frozen is false. */
 void bdrv_root_unref_child(BdrvChild *child)
 {
     BlockDriverState *child_bs;
@@ -2477,10 +2481,6 @@ void bdrv_root_unref_child(BdrvChild *child)
     bdrv_unref(child_bs);
 }
 
-/**
- * Clear all inherits_from pointers from children and grandchildren of
- * @root that point to @root, where necessary.
- */
 static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child)
 {
     BdrvChild *c;
@@ -2505,6 +2505,7 @@ static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child)
     }
 }
 
+/* Callers must ensure that child->frozen is false. */
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child)
 {
     if (child == NULL) {
@@ -2548,7 +2549,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
     bool update_inherits_from = bdrv_chain_contains(bs, backing_hd) &&
         bdrv_inherits_from_recursive(backing_hd, bs);
 
-    if (bdrv_is_backing_chain_frozen(bs, child_bs(bs->backing), errp)) {
+    if (bdrv_is_chain_frozen(bs, child_bs(bs->backing), errp)) {
         return;
     }
 
@@ -2557,6 +2558,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
     }
 
     if (bs->backing) {
+        /* Cannot be frozen, we checked that above */
         bdrv_unref_child(bs, bs->backing);
     }
 
@@ -3674,8 +3676,7 @@ static int bdrv_reopen_parse_backing(BDRVReopenState *reopen_state,
             return -EPERM;
         }
         /* Check if the backing link that we want to replace is frozen */
-        if (bdrv_is_backing_chain_frozen(overlay_bs, backing_bs(overlay_bs),
-                                         errp)) {
+        if (bdrv_is_chain_frozen(overlay_bs, backing_bs(overlay_bs), errp)) {
             return -EPERM;
         }
         reopen_state->replace_backing_bs = true;
@@ -4029,6 +4030,7 @@ static void bdrv_close(BlockDriverState *bs)
 
     if (bs->drv) {
         if (bs->drv->bdrv_close) {
+            /* Must unfreeze all children, so bdrv_unref_child() works */
             bs->drv->bdrv_close(bs);
         }
         bs->drv = NULL;
@@ -4398,20 +4400,22 @@ BlockDriverState *bdrv_find_base(BlockDriverState *bs)
 }
 
 /*
- * Return true if at least one of the backing links between @bs and
- * @base is frozen. @errp is set if that's the case.
+ * Return true if at least one of the (COW and R/W) filter links
+ * between @bs and @base is frozen. @errp is set if that's the case.
  * @base must be reachable from @bs, or NULL.
  */
-bool bdrv_is_backing_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
-                                  Error **errp)
+bool bdrv_is_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
+                          Error **errp)
 {
     BlockDriverState *i;
+    BdrvChild *child;
+
+    for (i = bs; i != base; i = child_bs(child)) {
+        child = bdrv_filtered_child(i);
 
-    for (i = bs; i != base; i = backing_bs(i)) {
-        if (i->backing && i->backing->frozen) {
+        if (child && child->frozen) {
             error_setg(errp, "Cannot change '%s' link from '%s' to '%s'",
-                       i->backing->name, i->node_name,
-                       backing_bs(i)->node_name);
+                       child->name, i->node_name, child->bs->node_name);
             return true;
         }
     }
@@ -4420,32 +4424,35 @@ bool bdrv_is_backing_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
 }
 
 /*
- * Freeze all backing links between @bs and @base.
+ * Freeze all (COW and R/W) filter links between @bs and @base.
  * If any of the links is already frozen the operation is aborted and
  * none of the links are modified.
  * @base must be reachable from @bs, or NULL.
  * Returns 0 on success. On failure returns < 0 and sets @errp.
  */
-int bdrv_freeze_backing_chain(BlockDriverState *bs, BlockDriverState *base,
-                              Error **errp)
+int bdrv_freeze_chain(BlockDriverState *bs, BlockDriverState *base,
+                      Error **errp)
 {
     BlockDriverState *i;
+    BdrvChild *child;
 
-    if (bdrv_is_backing_chain_frozen(bs, base, errp)) {
+    if (bdrv_is_chain_frozen(bs, base, errp)) {
         return -EPERM;
     }
 
-    for (i = bs; i != base; i = backing_bs(i)) {
-        if (i->backing && backing_bs(i)->never_freeze) {
+    for (i = bs; i != base; i = child_bs(child)) {
+        child = bdrv_filtered_child(i);
+        if (child && child->bs->never_freeze) {
             error_setg(errp, "Cannot freeze '%s' link to '%s'",
-                       i->backing->name, backing_bs(i)->node_name);
+                       child->name, child->bs->node_name);
             return -EPERM;
         }
     }
 
-    for (i = bs; i != base; i = backing_bs(i)) {
-        if (i->backing) {
-            i->backing->frozen = true;
+    for (i = bs; i != base; i = child_bs(child)) {
+        child = bdrv_filtered_child(i);
+        if (child) {
+            child->frozen = true;
         }
     }
 
@@ -4453,18 +4460,21 @@ int bdrv_freeze_backing_chain(BlockDriverState *bs, BlockDriverState *base,
 }
 
 /*
- * Unfreeze all backing links between @bs and @base. The caller must
- * ensure that all links are frozen before using this function.
+ * Unfreeze all (COW and R/W) filter links between @bs and @base. The
+ * caller must ensure that all links are frozen before using this
+ * function.
  * @base must be reachable from @bs, or NULL.
  */
-void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState *base)
+void bdrv_unfreeze_chain(BlockDriverState *bs, BlockDriverState *base)
 {
     BlockDriverState *i;
+    BdrvChild *child;
 
-    for (i = bs; i != base; i = backing_bs(i)) {
-        if (i->backing) {
-            assert(i->backing->frozen);
-            i->backing->frozen = false;
+    for (i = bs; i != base; i = child_bs(child)) {
+        child = bdrv_filtered_child(i);
+        if (child) {
+            assert(child->frozen);
+            child->frozen = false;
         }
     }
 }
@@ -4567,8 +4577,11 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
             }
         }
 
-        /* Do the actual switch in the in-memory graph.
-         * Completes bdrv_check_update_perm() transaction internally. */
+        /*
+         * Do the actual switch in the in-memory graph.
+         * Completes bdrv_check_update_perm() transaction internally.
+         * c->frozen is false, we have checked that above.
+         */
         bdrv_ref(base);
         bdrv_replace_child(c, base);
         bdrv_unref(top);
diff --git a/block/commit.c b/block/commit.c
index 19915603ae..5a7672c7c7 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -68,7 +68,7 @@ static int commit_prepare(Job *job)
 {
     CommitBlockJob *s = container_of(job, CommitBlockJob, common.job);
 
-    bdrv_unfreeze_backing_chain(s->commit_top_bs, s->base_bs);
+    bdrv_unfreeze_chain(s->commit_top_bs, s->base_bs);
     s->chain_frozen = false;
 
     /* Remove base node parent that still uses BLK_PERM_WRITE/RESIZE before
@@ -88,7 +88,7 @@ static void commit_abort(Job *job)
     BlockDriverState *top_bs = blk_bs(s->top);
 
     if (s->chain_frozen) {
-        bdrv_unfreeze_backing_chain(s->commit_top_bs, s->base_bs);
+        bdrv_unfreeze_chain(s->commit_top_bs, s->base_bs);
     }
 
     /* Make sure commit_top_bs and top stay around until bdrv_replace_node() */
@@ -331,7 +331,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
         }
     }
 
-    if (bdrv_freeze_backing_chain(commit_top_bs, base, errp) < 0) {
+    if (bdrv_freeze_chain(commit_top_bs, base, errp) < 0) {
         goto fail;
     }
     s->chain_frozen = true;
@@ -372,7 +372,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 
 fail:
     if (s->chain_frozen) {
-        bdrv_unfreeze_backing_chain(commit_top_bs, base);
+        bdrv_unfreeze_chain(commit_top_bs, base);
     }
     if (s->base) {
         blk_unref(s->base);
diff --git a/block/mirror.c b/block/mirror.c
index a8f2d7a305..54bafdf176 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -633,7 +633,7 @@ static int mirror_exit_common(Job *job)
     s->prepared = true;
 
     if (bdrv_chain_contains(src, target_bs)) {
-        bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs);
+        bdrv_unfreeze_chain(mirror_top_bs, target_bs);
     }
 
     bdrv_release_dirty_bitmap(src, s->dirty_bitmap);
@@ -1707,7 +1707,7 @@ static BlockJob *mirror_start_job(
             }
         }
 
-        if (bdrv_freeze_backing_chain(mirror_top_bs, target, errp) < 0) {
+        if (bdrv_freeze_chain(mirror_top_bs, target, errp) < 0) {
             goto fail;
         }
     }
diff --git a/block/stream.c b/block/stream.c
index 6ac1e7bec4..4c8b89884a 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -54,7 +54,7 @@ static void stream_abort(Job *job)
 
     if (s->chain_frozen) {
         BlockJob *bjob = &s->common;
-        bdrv_unfreeze_backing_chain(blk_bs(bjob->blk), s->bottom);
+        bdrv_unfreeze_chain(blk_bs(bjob->blk), s->bottom);
     }
 }
 
@@ -67,7 +67,7 @@ static int stream_prepare(Job *job)
     Error *local_err = NULL;
     int ret = 0;
 
-    bdrv_unfreeze_backing_chain(bs, s->bottom);
+    bdrv_unfreeze_chain(bs, s->bottom);
     s->chain_frozen = false;
 
     if (bs->backing) {
@@ -233,7 +233,7 @@ void stream_start(const char *job_id, BlockDriverState *bs,
     int basic_flags = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED;
     BlockDriverState *bottom = bdrv_find_overlay(bs, base);
 
-    if (bdrv_freeze_backing_chain(bs, bottom, errp) < 0) {
+    if (bdrv_freeze_chain(bs, bottom, errp) < 0) {
         return;
     }
 
@@ -284,5 +284,5 @@ fail:
     if (bs_read_only) {
         bdrv_reopen_set_read_only(bs, true, NULL);
     }
-    bdrv_unfreeze_backing_chain(bs, bottom);
+    bdrv_unfreeze_chain(bs, bottom);
 }
-- 
2.21.0




* [Qemu-devel] [PATCH v6 10/42] block: Drop bdrv_is_encrypted()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (8 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-10 13:42   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 11/42] block: Add bdrv_supports_compressed_writes() Max Reitz
                   ` (31 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

The original purpose of bdrv_is_encrypted() was to inquire whether a BDS
can be used without the user entering a password or not.  It has not
been used for that purpose for quite some time.

Actually, it is not even fit for that purpose, because to answer that
question, it would have to recursively query all of the given node's
children.

So now we have to decide in which direction we want to fix
bdrv_is_encrypted(): Recursively query all children, or drop it and just
use bs->encrypted to get the current node's status?

Nowadays, its only purpose is to report through bdrv_query_image_info()
whether the given image is encrypted or not.  For this purpose, it is
probably more interesting to see whether a given node itself is
encrypted or not (otherwise, a management application cannot discern for
certain which nodes are really encrypted and which just have encrypted
children).  So drop bdrv_is_encrypted() and let bdrv_query_image_info()
read bs->encrypted directly.
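
To make the difference concrete (illustration only, not part of the
patch), the sketch below contrasts the two answers on a toy model.
Node, old_is_encrypted() and node_is_encrypted() are hypothetical
names; old_is_encrypted() mirrors the logic of the removed helper,
which looked at the node and its direct backing node only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node; NOT QEMU's BlockDriverState. */
typedef struct Node {
    bool encrypted;        /* this node itself is encrypted */
    struct Node *backing;  /* direct backing node, or NULL */
} Node;

/* Old answer: true if the node or its direct backing node is encrypted. */
static bool old_is_encrypted(const Node *n)
{
    assert(n != NULL);
    if (n->backing && n->backing->encrypted) {
        return true;
    }
    return n->encrypted;
}

/* New answer: only the node's own status. */
static bool node_is_encrypted(const Node *n)
{
    assert(n != NULL);
    return n->encrypted;
}
```

With an unencrypted overlay on an encrypted base, the old helper
reports the overlay as encrypted while the node-local query does not,
which is the ambiguity described above.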

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block.h | 1 -
 block.c               | 8 --------
 block/qapi.c          | 2 +-
 3 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index f6f09b95cd..9cfe77abaf 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -487,7 +487,6 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it);
 void bdrv_next_cleanup(BdrvNextIterator *it);
 
 BlockDriverState *bdrv_next_monitor_owned(BlockDriverState *bs);
-bool bdrv_is_encrypted(BlockDriverState *bs);
 void bdrv_iterate_format(void (*it)(void *opaque, const char *name),
                          void *opaque, bool read_only);
 const char *bdrv_get_node_name(const BlockDriverState *bs);
diff --git a/block.c b/block.c
index 650c00d182..415c555bf5 100644
--- a/block.c
+++ b/block.c
@@ -4696,14 +4696,6 @@ bool bdrv_is_sg(BlockDriverState *bs)
     return bs->sg;
 }
 
-bool bdrv_is_encrypted(BlockDriverState *bs)
-{
-    if (bs->backing && bs->backing->bs->encrypted) {
-        return true;
-    }
-    return bs->encrypted;
-}
-
 const char *bdrv_get_format_name(BlockDriverState *bs)
 {
     return bs->drv ? bs->drv->format_name : NULL;
diff --git a/block/qapi.c b/block/qapi.c
index 15f1030264..9a185cba48 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -281,7 +281,7 @@ void bdrv_query_image_info(BlockDriverState *bs,
     info->virtual_size    = size;
     info->actual_size     = bdrv_get_allocated_file_size(bs);
     info->has_actual_size = info->actual_size >= 0;
-    if (bdrv_is_encrypted(bs)) {
+    if (bs->encrypted) {
         info->encrypted = true;
         info->has_encrypted = true;
     }
-- 
2.21.0




* [Qemu-devel] [PATCH v6 11/42] block: Add bdrv_supports_compressed_writes()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (9 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 10/42] block: Drop bdrv_is_encrypted() Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-09-05 13:11   ` Kevin Wolf
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 12/42] block: Use bdrv_filtered_rw* where obvious Max Reitz
                   ` (30 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Filters cannot compress data themselves, but they still have to
implement .bdrv_co_pwritev_compressed() (otherwise they cannot forward
compressed writes).  Therefore, checking whether
bs->drv->bdrv_co_pwritev_compressed is non-NULL is not sufficient to
know whether the node can actually handle compressed writes.  The new
bdrv_supports_compressed_writes() looks down the filter chain to see
whether there is a non-filter that can actually convert the compressed
writes into compressed data (and thus normal writes).
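
For illustration only (not part of the patch), the recursion can be
shown on a toy model.  Node, has_compressed_write_cb and
supports_compressed_writes() are hypothetical names: the flag stands in
for a non-NULL compressed-write callback in the driver, and @filtered
stands in for the node's R/W-filtered child.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node; NOT QEMU's BlockDriverState. */
typedef struct Node {
    bool has_compressed_write_cb;  /* driver implements the callback */
    struct Node *filtered;         /* child of a filter node, else NULL */
} Node;

/*
 * A node supports compressed writes only if it implements the callback
 * and, when it is a filter, the node below it supports them as well.
 */
static bool supports_compressed_writes(const Node *n)
{
    assert(n != NULL);
    if (!n->has_compressed_write_cb) {
        return false;
    }
    if (n->filtered) {
        /* Filters only forward the request, so ask the child. */
        return supports_compressed_writes(n->filtered);
    }
    return true;
}
```

A filter that implements the callback on top of a node that does not
therefore reports false, even though a plain non-NULL check on the
filter alone would have said yes.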

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block.h |  1 +
 block.c               | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/block/block.h b/include/block/block.h
index 9cfe77abaf..6ba853fb90 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -487,6 +487,7 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it);
 void bdrv_next_cleanup(BdrvNextIterator *it);
 
 BlockDriverState *bdrv_next_monitor_owned(BlockDriverState *bs);
+bool bdrv_supports_compressed_writes(BlockDriverState *bs);
 void bdrv_iterate_format(void (*it)(void *opaque, const char *name),
                          void *opaque, bool read_only);
 const char *bdrv_get_node_name(const BlockDriverState *bs);
diff --git a/block.c b/block.c
index 415c555bf5..029c809a8e 100644
--- a/block.c
+++ b/block.c
@@ -4696,6 +4696,28 @@ bool bdrv_is_sg(BlockDriverState *bs)
     return bs->sg;
 }
 
+/**
+ * Return whether the given node supports compressed writes.
+ */
+bool bdrv_supports_compressed_writes(BlockDriverState *bs)
+{
+    BlockDriverState *filtered = bdrv_filtered_rw_bs(bs);
+
+    if (!bs->drv || !bs->drv->bdrv_co_pwritev_compressed) {
+        return false;
+    }
+
+    if (filtered) {
+        /*
+         * Filters can only forward compressed writes, so we have to
+         * check the child.
+         */
+        return bdrv_supports_compressed_writes(filtered);
+    }
+
+    return true;
+}
+
 const char *bdrv_get_format_name(BlockDriverState *bs)
 {
     return bs->drv ? bs->drv->format_name : NULL;
-- 
2.21.0




* [Qemu-devel] [PATCH v6 12/42] block: Use bdrv_filtered_rw* where obvious
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (10 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 11/42] block: Add bdrv_supports_compressed_writes() Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 13/42] block: Use CAFs in block status functions Max Reitz
                   ` (29 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Places that use patterns like

    if (bs->drv->is_filter && bs->file) {
        ... something about bs->file->bs ...
    }

should be

    BlockDriverState *filtered = bdrv_filtered_rw_bs(bs);
    if (filtered) {
        ... something about @filtered ...
    }

instead.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block.c    | 23 +++++++++++++++--------
 block/io.c |  5 +++--
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/block.c b/block.c
index 029c809a8e..86b84bea21 100644
--- a/block.c
+++ b/block.c
@@ -556,11 +556,12 @@ int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp)
 int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverState *filtered = bdrv_filtered_rw_bs(bs);
 
     if (drv && drv->bdrv_probe_blocksizes) {
         return drv->bdrv_probe_blocksizes(bs, bsz);
-    } else if (drv && drv->is_filter && bs->file) {
-        return bdrv_probe_blocksizes(bs->file->bs, bsz);
+    } else if (filtered) {
+        return bdrv_probe_blocksizes(filtered, bsz);
     }
 
     return -ENOTSUP;
@@ -575,11 +576,12 @@ int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
 int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverState *filtered = bdrv_filtered_rw_bs(bs);
 
     if (drv && drv->bdrv_probe_geometry) {
         return drv->bdrv_probe_geometry(bs, geo);
-    } else if (drv && drv->is_filter && bs->file) {
-        return bdrv_probe_geometry(bs->file->bs, geo);
+    } else if (filtered) {
+        return bdrv_probe_geometry(filtered, geo);
     }
 
     return -ENOTSUP;
@@ -5084,6 +5086,8 @@ int bdrv_has_zero_init_1(BlockDriverState *bs)
 
 int bdrv_has_zero_init(BlockDriverState *bs)
 {
+    BlockDriverState *filtered;
+
     if (!bs->drv) {
         return 0;
     }
@@ -5096,8 +5100,10 @@ int bdrv_has_zero_init(BlockDriverState *bs)
     if (bs->drv->bdrv_has_zero_init) {
         return bs->drv->bdrv_has_zero_init(bs);
     }
-    if (bs->file && bs->drv->is_filter) {
-        return bdrv_has_zero_init(bs->file->bs);
+
+    filtered = bdrv_filtered_rw_bs(bs);
+    if (filtered) {
+        return bdrv_has_zero_init(filtered);
     }
 
     /* safe default */
@@ -5142,8 +5148,9 @@ int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
         return -ENOMEDIUM;
     }
     if (!drv->bdrv_get_info) {
-        if (bs->file && drv->is_filter) {
-            return bdrv_get_info(bs->file->bs, bdi);
+        BlockDriverState *filtered = bdrv_filtered_rw_bs(bs);
+        if (filtered) {
+            return bdrv_get_info(filtered, bdi);
         }
         return -ENOTSUP;
     }
diff --git a/block/io.c b/block/io.c
index 06305c6ea6..4d6cf4b3c2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3186,8 +3186,9 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset,
     }
 
     if (!drv->bdrv_co_truncate) {
-        if (bs->file && drv->is_filter) {
-            ret = bdrv_co_truncate(bs->file, offset, prealloc, errp);
+        BdrvChild *filtered = bdrv_filtered_rw_child(bs);
+        if (filtered) {
+            ret = bdrv_co_truncate(filtered, offset, prealloc, errp);
             goto out;
         }
         error_setg(errp, "Image format driver does not support resize");
-- 
2.21.0




* [Qemu-devel] [PATCH v6 13/42] block: Use CAFs in block status functions
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (11 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 12/42] block: Use bdrv_filtered_rw* where obvious Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains Max Reitz
                   ` (28 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Use the child access functions in the block status inquiry functions as
appropriate.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/block/io.c b/block/io.c
index 4d6cf4b3c2..c5a8e3e6a3 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2180,11 +2180,12 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
         ret |= BDRV_BLOCK_ALLOCATED;
     } else if (want_zero) {
+        BlockDriverState *cow_bs = bdrv_filtered_cow_bs(bs);
+
         if (bdrv_unallocated_blocks_are_zero(bs)) {
             ret |= BDRV_BLOCK_ZERO;
-        } else if (bs->backing) {
-            BlockDriverState *bs2 = bs->backing->bs;
-            int64_t size2 = bdrv_getlength(bs2);
+        } else if (cow_bs) {
+            int64_t size2 = bdrv_getlength(cow_bs);
 
             if (size2 >= 0 && offset >= size2) {
                 ret |= BDRV_BLOCK_ZERO;
@@ -2250,7 +2251,7 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
     bool first = true;
 
     assert(bs != base);
-    for (p = bs; p != base; p = backing_bs(p)) {
+    for (p = bs; p != base; p = bdrv_filtered_bs(p)) {
         ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
                                    file);
         if (ret < 0) {
@@ -2336,7 +2337,7 @@ int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
 int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
                       int64_t *pnum, int64_t *map, BlockDriverState **file)
 {
-    return bdrv_block_status_above(bs, backing_bs(bs),
+    return bdrv_block_status_above(bs, bdrv_filtered_bs(bs),
                                    offset, bytes, pnum, map, file);
 }
 
@@ -2346,9 +2347,9 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
     int ret;
     int64_t dummy;
 
-    ret = bdrv_common_block_status_above(bs, backing_bs(bs), false, offset,
-                                         bytes, pnum ? pnum : &dummy, NULL,
-                                         NULL);
+    ret = bdrv_common_block_status_above(bs, bdrv_filtered_bs(bs), false,
+                                         offset, bytes, pnum ? pnum : &dummy,
+                                         NULL, NULL);
     if (ret < 0) {
         return ret;
     }
@@ -2411,7 +2412,7 @@ int bdrv_is_allocated_above(BlockDriverState *top,
             break;
         }
 
-        intermediate = backing_bs(intermediate);
+        intermediate = bdrv_filtered_bs(intermediate);
     }
 
     *pnum = n;
-- 
2.21.0




* [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (12 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 13/42] block: Use CAFs in block status functions Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-10 15:19   ` Vladimir Sementsov-Ogievskiy
  2019-09-05 14:05   ` Kevin Wolf
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 15/42] block: Re-evaluate backing file handling in reopen Max Reitz
                   ` (27 subsequent siblings)
  41 siblings, 2 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Use child access functions when iterating through backing chains so
filters do not break the chain.
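
A minimal sketch of what "not breaking the chain" means for an overlay
lookup (illustration only, not part of the patch): Node,
skip_rw_filters(), backing_chain_next() and find_overlay() are
hypothetical stand-ins, and the sketch assumes that the next element of
a backing chain is the node's COW child with any R/W filters on top of
it skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node; NOT QEMU's BlockDriverState. */
typedef struct Node {
    bool is_filter;      /* true: @child is an R/W-filtered child;
                            false: @child is the COW backing node */
    struct Node *child;  /* single child, or NULL */
} Node;

/* Step over R/W filters until a non-filter node (or NULL) is reached. */
static Node *skip_rw_filters(Node *n)
{
    while (n && n->is_filter) {
        n = n->child;
    }
    return n;
}

/* Next non-filter node below the non-filter node @n (assumed semantics). */
static Node *backing_chain_next(Node *n)
{
    assert(n != NULL && !n->is_filter);
    return skip_rw_filters(n->child);
}

/* Find the node whose backing node is @bs, ignoring filters in between. */
static Node *find_overlay(Node *active, Node *bs)
{
    bs = skip_rw_filters(bs);
    active = skip_rw_filters(active);

    while (active) {
        Node *next = backing_chain_next(active);
        if (bs == next) {
            return active;
        }
        active = next;
    }
    return NULL;
}
```

With a filter sitting between an overlay and its base, looking up either
the base or the filter yields the same overlay, whereas a walk that
follows backing links only would stop at the filter.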

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 40 ++++++++++++++++++++++++++++------------
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index 86b84bea21..42abbaf0ba 100644
--- a/block.c
+++ b/block.c
@@ -4376,7 +4376,8 @@ int bdrv_change_backing_file(BlockDriverState *bs,
 }
 
 /*
- * Finds the image layer in the chain that has 'bs' as its backing file.
+ * Finds the image layer in the chain that has 'bs' (or a filter on
+ * top of it) as its backing file.
  *
  * active is the current topmost image.
  *
@@ -4388,11 +4389,18 @@ int bdrv_change_backing_file(BlockDriverState *bs,
 BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
                                     BlockDriverState *bs)
 {
-    while (active && bs != backing_bs(active)) {
-        active = backing_bs(active);
+    bs = bdrv_skip_rw_filters(bs);
+    active = bdrv_skip_rw_filters(active);
+
+    while (active) {
+        BlockDriverState *next = bdrv_backing_chain_next(active);
+        if (bs == next) {
+            return active;
+        }
+        active = next;
     }
 
-    return active;
+    return NULL;
 }
 
 /* Given a BDS, searches for the base layer. */
@@ -4544,9 +4552,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
      * other intermediate nodes have been dropped.
      * If 'top' is an implicit node (e.g. "commit_top") we should skip
      * it because no one inherits from it. We use explicit_top for that. */
-    while (explicit_top && explicit_top->implicit) {
-        explicit_top = backing_bs(explicit_top);
-    }
+    explicit_top = bdrv_skip_implicit_filters(explicit_top);
     update_inherits_from = bdrv_inherits_from_recursive(base, explicit_top);
 
     /* success - we can delete the intermediate states, and link top->base */
@@ -5014,7 +5020,7 @@ BlockDriverState *bdrv_lookup_bs(const char *device,
 bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base)
 {
     while (top && top != base) {
-        top = backing_bs(top);
+        top = bdrv_filtered_bs(top);
     }
 
     return top != NULL;
@@ -5253,7 +5259,17 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
 
     is_protocol = path_has_protocol(backing_file);
 
-    for (curr_bs = bs; curr_bs->backing; curr_bs = curr_bs->backing->bs) {
+    /*
+     * Being largely a legacy function, skip any filters here
+     * (because filters do not have normal filenames, so they cannot
+     * match anyway; and allowing json:{} filenames is a bit out of
+     * scope).
+     */
+    for (curr_bs = bdrv_skip_rw_filters(bs);
+         bdrv_filtered_cow_child(curr_bs) != NULL;
+         curr_bs = bdrv_backing_chain_next(curr_bs))
+    {
+        BlockDriverState *bs_below = bdrv_backing_chain_next(curr_bs);
 
         /* If either of the filename paths is actually a protocol, then
          * compare unmodified paths; otherwise make paths relative */
@@ -5261,7 +5277,7 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
             char *backing_file_full_ret;
 
             if (strcmp(backing_file, curr_bs->backing_file) == 0) {
-                retval = curr_bs->backing->bs;
+                retval = bs_below;
                 break;
             }
             /* Also check against the full backing filename for the image */
@@ -5271,7 +5287,7 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
                 bool equal = strcmp(backing_file, backing_file_full_ret) == 0;
                 g_free(backing_file_full_ret);
                 if (equal) {
-                    retval = curr_bs->backing->bs;
+                    retval = bs_below;
                     break;
                 }
             }
@@ -5297,7 +5313,7 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
             g_free(filename_tmp);
 
             if (strcmp(backing_file_full, filename_full) == 0) {
-                retval = curr_bs->backing->bs;
+                retval = bs_below;
                 break;
             }
         }
-- 
2.21.0




* [Qemu-devel] [PATCH v6 15/42] block: Re-evaluate backing file handling in reopen
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (13 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-10 16:05   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code Max Reitz
                   ` (26 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Reopening a node's backing child needs a bit of special handling because
the "backing" child has different defaults than all other children
(among other things).  Adding filter support here is a bit more
difficult than just using the child access functions.  In fact, we often
have to directly use bs->backing because these functions are about the
"backing" child (which may or may not be the COW backing file).

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 45 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 7 deletions(-)

diff --git a/block.c b/block.c
index 42abbaf0ba..064cc99857 100644
--- a/block.c
+++ b/block.c
@@ -3660,25 +3660,56 @@ static int bdrv_reopen_parse_backing(BDRVReopenState *reopen_state,
         }
     }
 
+    /*
+     * Ensure that @bs can really handle backing files, because we are
+     * about to give it one (or swap the existing one)
+     */
+    if (bs->drv->is_filter) {
+        /* Filters always have a file or a backing child */
+        if (!bs->backing) {
+            error_setg(errp, "'%s' is a %s filter node that does not support a "
+                       "backing child", bs->node_name, bs->drv->format_name);
+            return -EINVAL;
+        }
+    } else if (!bs->drv->supports_backing) {
+        error_setg(errp, "Driver '%s' of node '%s' does not support backing "
+                   "files", bs->drv->format_name, bs->node_name);
+        return -EINVAL;
+    }
+
     /*
      * Find the "actual" backing file by skipping all links that point
      * to an implicit node, if any (e.g. a commit filter node).
+     * We cannot use any of the bdrv_skip_*() functions here because
+     * those return the first explicit node, while we are looking for
+     * its overlay here.
      */
     overlay_bs = bs;
-    while (backing_bs(overlay_bs) && backing_bs(overlay_bs)->implicit) {
-        overlay_bs = backing_bs(overlay_bs);
+    while (bdrv_filtered_bs(overlay_bs) &&
+           bdrv_filtered_bs(overlay_bs)->implicit)
+    {
+        overlay_bs = bdrv_filtered_bs(overlay_bs);
     }
 
     /* If we want to replace the backing file we need some extra checks */
-    if (new_backing_bs != backing_bs(overlay_bs)) {
+    if (new_backing_bs != bdrv_filtered_bs(overlay_bs)) {
         /* Check for implicit nodes between bs and its backing file */
         if (bs != overlay_bs) {
             error_setg(errp, "Cannot change backing link if '%s' has "
                        "an implicit backing file", bs->node_name);
             return -EPERM;
         }
-        /* Check if the backing link that we want to replace is frozen */
-        if (bdrv_is_chain_frozen(overlay_bs, backing_bs(overlay_bs), errp)) {
+        /*
+         * Check if the backing link that we want to replace is frozen.
+         * Note that
+         * bdrv_filtered_child(overlay_bs) == overlay_bs->backing,
+         * because we know that overlay_bs == bs, and that @bs
+         * either is an R/W filter that uses ->backing or a COW format
+         * with bs->drv->supports_backing == true.
+         */
+        if (bdrv_is_chain_frozen(overlay_bs, child_bs(overlay_bs->backing),
+                                 errp))
+        {
             return -EPERM;
         }
         reopen_state->replace_backing_bs = true;
@@ -3829,7 +3860,7 @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
      * its metadata. Otherwise the 'backing' option can be omitted.
      */
     if (drv->supports_backing && reopen_state->backing_missing &&
-        (backing_bs(reopen_state->bs) || reopen_state->bs->backing_file[0])) {
+        (reopen_state->bs->backing || reopen_state->bs->backing_file[0])) {
         error_setg(errp, "backing is missing for '%s'",
                    reopen_state->bs->node_name);
         ret = -EINVAL;
@@ -3974,7 +4005,7 @@ void bdrv_reopen_commit(BDRVReopenState *reopen_state)
      * from bdrv_set_backing_hd()) has the new values.
      */
     if (reopen_state->replace_backing_bs) {
-        BlockDriverState *old_backing_bs = backing_bs(bs);
+        BlockDriverState *old_backing_bs = child_bs(bs->backing);
         assert(!old_backing_bs || !old_backing_bs->implicit);
         /* Abort the permission update on the backing bs we're detaching */
         if (old_backing_bs) {
-- 
2.21.0




* [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (14 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 15/42] block: Re-evaluate backing file handling in reopen Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-10 15:36   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 17/42] block: Use CAFs in bdrv_refresh_limits() Max Reitz
                   ` (25 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

If the driver does not implement .bdrv_co_flush(), so that
bdrv_co_flush() itself has to flush the children of the given node, it
should flush not just bs->file->bs but all children.

In any case, the BLKDBG_EVENT() should be emitted on the primary child,
because that is where a blkdebug node would be if there is any.
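
As a rough illustration of the "flush every child, keep the first
error" pattern, here is a minimal sketch on a toy tree.  ToyNode,
toy_flush() and the fixed-size child array are hypothetical stand-ins,
not QEMU's BlockDriverState or bdrv_co_flush():

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy node; it only models what the sketch needs. */
typedef struct ToyNode {
    struct ToyNode *children[4];
    size_t n_children;
    int flush_result;   /* what this node's own flush reports (0 = ok) */
    int flushed;        /* how many times this node has been flushed */
} ToyNode;

/*
 * Flush @node and then every one of its children.  All children are
 * visited even after a failure; the first error seen is returned.
 */
static int toy_flush(ToyNode *node)
{
    int ret = node->flush_result;
    size_t i;

    assert(node->n_children <= 4);
    node->flushed++;

    for (i = 0; i < node->n_children; i++) {
        int this_child_ret = toy_flush(node->children[i]);

        if (!ret) {
            ret = this_child_ret;
        }
    }

    return ret;
}
```

Continuing after a failed child is a design choice of this sketch (and
of the loop in the patch): later children still get their data written
back, and only the first error is reported to the caller.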

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/io.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index c5a8e3e6a3..bcc770d336 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2572,6 +2572,8 @@ static void coroutine_fn bdrv_flush_co_entry(void *opaque)
 
 int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
 {
+    BdrvChild *primary_child = bdrv_primary_child(bs);
+    BdrvChild *child;
     int current_gen;
     int ret = 0;
 
@@ -2601,7 +2603,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
     }
 
     /* Write back cached data to the OS even with cache=unsafe */
-    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_OS);
+    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_OS);
     if (bs->drv->bdrv_co_flush_to_os) {
         ret = bs->drv->bdrv_co_flush_to_os(bs);
         if (ret < 0) {
@@ -2611,15 +2613,15 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
 
     /* But don't actually force it to the disk with cache=unsafe */
     if (bs->open_flags & BDRV_O_NO_FLUSH) {
-        goto flush_parent;
+        goto flush_children;
     }
 
     /* Check if we really need to flush anything */
     if (bs->flushed_gen == current_gen) {
-        goto flush_parent;
+        goto flush_children;
     }
 
-    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
+    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_DISK);
     if (!bs->drv) {
         /* bs->drv->bdrv_co_flush() might have ejected the BDS
          * (even in case of apparent success) */
@@ -2663,8 +2665,17 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
     /* Now flush the underlying protocol.  It will also have BDRV_O_NO_FLUSH
      * in the case of cache=unsafe, so there are no useless flushes.
      */
-flush_parent:
-    ret = bs->file ? bdrv_co_flush(bs->file->bs) : 0;
+flush_children:
+    ret = 0;
+    QLIST_FOREACH(child, &bs->children, next) {
+        int this_child_ret;
+
+        this_child_ret = bdrv_co_flush(child->bs);
+        if (!ret) {
+            ret = this_child_ret;
+        }
+    }
+
 out:
     /* Notify any pending flushes that we have completed */
     if (ret == 0) {
-- 
2.21.0




* [Qemu-devel] [PATCH v6 17/42] block: Use CAFs in bdrv_refresh_limits()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (15 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 18/42] block: Use CAFs in bdrv_refresh_filename() Max Reitz
                   ` (24 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index bcc770d336..dca4689b2f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -135,6 +135,8 @@ static void bdrv_merge_limits(BlockLimits *dst, const BlockLimits *src)
 void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverState *storage_bs = bdrv_storage_bs(bs);
+    BlockDriverState *cow_bs = bdrv_filtered_cow_bs(bs);
     Error *local_err = NULL;
 
     memset(&bs->bl, 0, sizeof(bs->bl));
@@ -148,13 +150,13 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
                                 drv->bdrv_aio_preadv) ? 1 : 512;
 
     /* Take some limits from the children as a default */
-    if (bs->file) {
-        bdrv_refresh_limits(bs->file->bs, &local_err);
+    if (storage_bs) {
+        bdrv_refresh_limits(storage_bs, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
         }
-        bdrv_merge_limits(&bs->bl, &bs->file->bs->bl);
+        bdrv_merge_limits(&bs->bl, &storage_bs->bl);
     } else {
         bs->bl.min_mem_alignment = 512;
         bs->bl.opt_mem_alignment = getpagesize();
@@ -163,13 +165,13 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
         bs->bl.max_iov = IOV_MAX;
     }
 
-    if (bs->backing) {
-        bdrv_refresh_limits(bs->backing->bs, &local_err);
+    if (cow_bs) {
+        bdrv_refresh_limits(cow_bs, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
         }
-        bdrv_merge_limits(&bs->bl, &bs->backing->bs->bl);
+        bdrv_merge_limits(&bs->bl, &cow_bs->bl);
     }
 
     /* Then let the driver override it */
-- 
2.21.0




* [Qemu-devel] [PATCH v6 18/42] block: Use CAFs in bdrv_refresh_filename()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (16 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 17/42] block: Use CAFs in bdrv_refresh_limits() Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-10 16:22   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 19/42] block: Use CAF in bdrv_co_rw_vmstate() Max Reitz
                   ` (23 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

bdrv_refresh_filename() and the somewhat related bdrv_dirname() should
look to the primary child when they wish to copy the underlying file's
filename.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/block.c b/block.c
index 064cc99857..a467b175c6 100644
--- a/block.c
+++ b/block.c
@@ -6432,6 +6432,7 @@ void bdrv_refresh_filename(BlockDriverState *bs)
 {
     BlockDriver *drv = bs->drv;
     BdrvChild *child;
+    BlockDriverState *primary_child_bs;
     QDict *opts;
     bool backing_overridden;
     bool generate_json_filename; /* Whether our default implementation should
@@ -6500,20 +6501,30 @@ void bdrv_refresh_filename(BlockDriverState *bs)
     qobject_unref(bs->full_open_options);
     bs->full_open_options = opts;
 
+    primary_child_bs = bdrv_primary_bs(bs);
+
     if (drv->bdrv_refresh_filename) {
         /* Obsolete information is of no use here, so drop the old file name
          * information before refreshing it */
         bs->exact_filename[0] = '\0';
 
         drv->bdrv_refresh_filename(bs);
-    } else if (bs->file) {
-        /* Try to reconstruct valid information from the underlying file */
+    } else if (primary_child_bs) {
+        /*
+         * Try to reconstruct valid information from the underlying
+         * file -- this only works for format nodes (filter nodes
+         * cannot be probed and as such must be selected by the user
+         * either through an options dict, or through a special
+         * filename which the filter driver must construct in its
+         * .bdrv_refresh_filename() implementation).
+         */
 
         bs->exact_filename[0] = '\0';
 
         /*
          * We can use the underlying file's filename if:
          * - it has a filename,
+         * - the current BDS is not a filter,
          * - the file is a protocol BDS, and
          * - opening that file (as this BDS's format) will automatically create
          *   the BDS tree we have right now, that is:
@@ -6522,11 +6533,11 @@ void bdrv_refresh_filename(BlockDriverState *bs)
          *   - no non-file child of this BDS has been overridden by the user
          *   Both of these conditions are represented by generate_json_filename.
          */
-        if (bs->file->bs->exact_filename[0] &&
-            bs->file->bs->drv->bdrv_file_open &&
-            !generate_json_filename)
+        if (primary_child_bs->exact_filename[0] &&
+            primary_child_bs->drv->bdrv_file_open &&
+            !drv->is_filter && !generate_json_filename)
         {
-            strcpy(bs->exact_filename, bs->file->bs->exact_filename);
+            strcpy(bs->exact_filename, primary_child_bs->exact_filename);
         }
     }
 
@@ -6543,6 +6554,7 @@ void bdrv_refresh_filename(BlockDriverState *bs)
 char *bdrv_dirname(BlockDriverState *bs, Error **errp)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverState *child_bs;
 
     if (!drv) {
         error_setg(errp, "Node '%s' is ejected", bs->node_name);
@@ -6553,8 +6565,9 @@ char *bdrv_dirname(BlockDriverState *bs, Error **errp)
         return drv->bdrv_dirname(bs, errp);
     }
 
-    if (bs->file) {
-        return bdrv_dirname(bs->file->bs, errp);
+    child_bs = bdrv_primary_bs(bs);
+    if (child_bs) {
+        return bdrv_dirname(child_bs, errp);
     }
 
     bdrv_refresh_filename(bs);
-- 
2.21.0




* [Qemu-devel] [PATCH v6 19/42] block: Use CAF in bdrv_co_rw_vmstate()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (17 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 18/42] block: Use CAFs in bdrv_refresh_filename() Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback Max Reitz
                   ` (22 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

If a node whose driver does not provide VM state functions has a
metadata child, the VM state should probably go to that child; if the
node is a filter, the VM state should probably go to its filtered
child.  It follows that we should generally go down to the primary
child.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index dca4689b2f..e222d91893 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2434,6 +2434,7 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                    bool is_read)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverState *child_bs = bdrv_primary_bs(bs);
     int ret = -ENOTSUP;
 
     bdrv_inc_in_flight(bs);
@@ -2446,8 +2447,8 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
         } else {
             ret = drv->bdrv_save_vmstate(bs, qiov, pos);
         }
-    } else if (bs->file) {
-        ret = bdrv_co_rw_vmstate(bs->file->bs, qiov, pos, is_read);
+    } else if (child_bs) {
+        ret = bdrv_co_rw_vmstate(child_bs, qiov, pos, is_read);
     }
 
     bdrv_dec_in_flight(bs);
-- 
2.21.0




* [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (18 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 19/42] block: Use CAF in bdrv_co_rw_vmstate() Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-10 16:34   ` Vladimir Sementsov-Ogievskiy
  2019-09-10 11:56   ` Kevin Wolf
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 21/42] block: Use CAFs for debug breakpoints Max Reitz
                   ` (21 subsequent siblings)
  41 siblings, 2 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

If the top node's driver does not provide snapshot functionality and we
want to fall back to a node down the chain, we need to snapshot all
non-COW children.  For simplicity's sake, just do not fall back if there
is more than one such child.
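
The selection rule can be sketched on a toy tree as follows.  ToyNode,
its cow_child field and toy_snapshot_fallback() are hypothetical names
used only for illustration; the real code works on BdrvChild lists:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy node; cow_child is NULL if there is no COW child. */
typedef struct ToyNode {
    struct ToyNode *children[4];
    size_t n_children;
    struct ToyNode *cow_child;
} ToyNode;

/*
 * Return the only non-COW child of @node, or NULL if there is no such
 * child or if there is more than one (no safe single fallback).
 */
static ToyNode *toy_snapshot_fallback(const ToyNode *node)
{
    ToyNode *candidate = NULL;
    size_t i;

    assert(node->n_children <= 4);

    for (i = 0; i < node->n_children; i++) {
        ToyNode *child = node->children[i];

        if (child == node->cow_child) {
            /* COW children need not be included in snapshots */
            continue;
        }
        if (candidate) {
            /* Second non-COW child: refuse to pick one */
            return NULL;
        }
        candidate = child;
    }

    return candidate;
}
```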

bdrv_snapshot_goto() becomes a bit weird because we may have to redirect
the actual child pointer, so it only works if the fallback child is
bs->file or bs->backing (and then we have to find out which it is).

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/snapshot.c | 100 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 79 insertions(+), 21 deletions(-)

diff --git a/block/snapshot.c b/block/snapshot.c
index f2f48f926a..35403c167f 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -146,6 +146,32 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
     return ret;
 }
 
+/**
+ * Return the child BDS to which we can fall back if the given BDS
+ * does not support snapshots.
+ * Return NULL if there is no BDS to (safely) fall back to.
+ */
+static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs)
+{
+    BlockDriverState *child_bs = NULL;
+    BdrvChild *child;
+
+    QLIST_FOREACH(child, &bs->children, next) {
+        if (child == bdrv_filtered_cow_child(bs)) {
+            /* Ignore: COW children need not be included in snapshots */
+            continue;
+        }
+
+        if (child_bs) {
+            /* Cannot fall back to a single child if there are multiple */
+            return NULL;
+        }
+        child_bs = child->bs;
+    }
+
+    return child_bs;
+}
+
 int bdrv_can_snapshot(BlockDriverState *bs)
 {
     BlockDriver *drv = bs->drv;
@@ -154,8 +180,9 @@ int bdrv_can_snapshot(BlockDriverState *bs)
     }
 
     if (!drv->bdrv_snapshot_create) {
-        if (bs->file != NULL) {
-            return bdrv_can_snapshot(bs->file->bs);
+        BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
+        if (fallback_bs) {
+            return bdrv_can_snapshot(fallback_bs);
         }
         return 0;
     }
@@ -167,14 +194,15 @@ int bdrv_snapshot_create(BlockDriverState *bs,
                          QEMUSnapshotInfo *sn_info)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
     if (!drv) {
         return -ENOMEDIUM;
     }
     if (drv->bdrv_snapshot_create) {
         return drv->bdrv_snapshot_create(bs, sn_info);
     }
-    if (bs->file) {
-        return bdrv_snapshot_create(bs->file->bs, sn_info);
+    if (fallback_bs) {
+        return bdrv_snapshot_create(fallback_bs, sn_info);
     }
     return -ENOTSUP;
 }
@@ -184,6 +212,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
                        Error **errp)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverState *fallback_bs;
     int ret, open_ret;
 
     if (!drv) {
@@ -204,39 +233,66 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
         return ret;
     }
 
-    if (bs->file) {
-        BlockDriverState *file;
-        QDict *options = qdict_clone_shallow(bs->options);
+    fallback_bs = bdrv_snapshot_fallback(bs);
+    if (fallback_bs) {
+        QDict *options;
         QDict *file_options;
         Error *local_err = NULL;
+        bool is_backing_child;
+        BdrvChild **child_pointer;
+
+        /*
+         * We need a pointer to the fallback child pointer, so let us
+         * see whether the child is referenced by a field in the BDS
+         * object.
+         */
+        if (bs->file && fallback_bs == bs->file->bs) {
+            is_backing_child = false;
+            child_pointer = &bs->file;
+        } else if (bs->backing && fallback_bs == bs->backing->bs) {
+            is_backing_child = true;
+            child_pointer = &bs->backing;
+        } else {
+            /*
+             * The fallback child is not referenced by a field in the
+             * BDS object.  We cannot go on then.
+             */
+            error_setg(errp, "Block driver does not support snapshots");
+            return -ENOTSUP;
+        }
+
+        options = qdict_clone_shallow(bs->options);
 
-        file = bs->file->bs;
         /* Prevent it from getting deleted when detached from bs */
-        bdrv_ref(file);
+        bdrv_ref(fallback_bs);
 
-        qdict_extract_subqdict(options, &file_options, "file.");
+        qdict_extract_subqdict(options, &file_options,
+                               is_backing_child ? "backing." : "file.");
         qobject_unref(file_options);
-        qdict_put_str(options, "file", bdrv_get_node_name(file));
+        qdict_put_str(options, is_backing_child ? "backing" : "file",
+                      bdrv_get_node_name(fallback_bs));
 
         if (drv->bdrv_close) {
             drv->bdrv_close(bs);
         }
-        bdrv_unref_child(bs, bs->file);
-        bs->file = NULL;
 
-        ret = bdrv_snapshot_goto(file, snapshot_id, errp);
+        assert(fallback_bs == (*child_pointer)->bs);
+        bdrv_unref_child(bs, *child_pointer);
+        *child_pointer = NULL;
+
+        ret = bdrv_snapshot_goto(fallback_bs, snapshot_id, errp);
         open_ret = drv->bdrv_open(bs, options, bs->open_flags, &local_err);
         qobject_unref(options);
         if (open_ret < 0) {
-            bdrv_unref(file);
+            bdrv_unref(fallback_bs);
             bs->drv = NULL;
             /* A bdrv_snapshot_goto() error takes precedence */
             error_propagate(errp, local_err);
             return ret < 0 ? ret : open_ret;
         }
 
-        assert(bs->file->bs == file);
-        bdrv_unref(file);
+        assert(fallback_bs == (*child_pointer)->bs);
+        bdrv_unref(fallback_bs);
         return ret;
     }
 
@@ -272,6 +328,7 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
                          Error **errp)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
     int ret;
 
     if (!drv) {
@@ -288,8 +345,8 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
 
     if (drv->bdrv_snapshot_delete) {
         ret = drv->bdrv_snapshot_delete(bs, snapshot_id, name, errp);
-    } else if (bs->file) {
-        ret = bdrv_snapshot_delete(bs->file->bs, snapshot_id, name, errp);
+    } else if (fallback_bs) {
+        ret = bdrv_snapshot_delete(fallback_bs, snapshot_id, name, errp);
     } else {
         error_setg(errp, "Block format '%s' used by device '%s' "
                    "does not support internal snapshot deletion",
@@ -305,14 +362,15 @@ int bdrv_snapshot_list(BlockDriverState *bs,
                        QEMUSnapshotInfo **psn_info)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
     if (!drv) {
         return -ENOMEDIUM;
     }
     if (drv->bdrv_snapshot_list) {
         return drv->bdrv_snapshot_list(bs, psn_info);
     }
-    if (bs->file) {
-        return bdrv_snapshot_list(bs->file->bs, psn_info);
+    if (fallback_bs) {
+        return bdrv_snapshot_list(fallback_bs, psn_info);
     }
     return -ENOTSUP;
 }
-- 
2.21.0




* [Qemu-devel] [PATCH v6 21/42] block: Use CAFs for debug breakpoints
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (19 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback Max Reitz
                   ` (20 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

When looking for a blkdebug node (which implements debug breakpoints),
use bdrv_primary_bs() to iterate through the graph, because that is
where a blkdebug node would be.
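
The lookup amounts to following one pointer per node until some node
implements the callback.  A minimal sketch, assuming a toy node type
with a single "primary" link; ToyNode, toy_count_hit() and
toy_debug_breakpoint() are hypothetical and not part of QEMU:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef struct ToyNode ToyNode;

/* Hypothetical toy node with at most one primary child. */
struct ToyNode {
    ToyNode *primary;                                  /* NULL at the bottom */
    int (*breakpoint)(ToyNode *node, const char *tag); /* NULL if unsupported */
    int hits;                                          /* breakpoints set here */
};

/* Example handler standing in for a debug driver's implementation. */
static int toy_count_hit(ToyNode *node, const char *tag)
{
    assert(tag != NULL);
    node->hits++;
    return 0;
}

/*
 * Walk down the primary children until a node implements the
 * breakpoint callback; fail with -ENOTSUP if none does.
 */
static int toy_debug_breakpoint(ToyNode *node, const char *tag)
{
    while (node && !node->breakpoint) {
        node = node->primary;
    }

    if (node) {
        return node->breakpoint(node, tag);
    }

    return -ENOTSUP;
}
```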

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index a467b175c6..1070aa1ba9 100644
--- a/block.c
+++ b/block.c
@@ -5218,7 +5218,7 @@ int bdrv_debug_breakpoint(BlockDriverState *bs, const char *event,
                           const char *tag)
 {
     while (bs && bs->drv && !bs->drv->bdrv_debug_breakpoint) {
-        bs = bs->file ? bs->file->bs : NULL;
+        bs = bdrv_primary_bs(bs);
     }
 
     if (bs && bs->drv && bs->drv->bdrv_debug_breakpoint) {
@@ -5231,7 +5231,7 @@ int bdrv_debug_breakpoint(BlockDriverState *bs, const char *event,
 int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag)
 {
     while (bs && bs->drv && !bs->drv->bdrv_debug_remove_breakpoint) {
-        bs = bs->file ? bs->file->bs : NULL;
+        bs = bdrv_primary_bs(bs);
     }
 
     if (bs && bs->drv && bs->drv->bdrv_debug_remove_breakpoint) {
@@ -5244,7 +5244,7 @@ int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag)
 int bdrv_debug_resume(BlockDriverState *bs, const char *tag)
 {
     while (bs && (!bs->drv || !bs->drv->bdrv_debug_resume)) {
-        bs = bs->file ? bs->file->bs : NULL;
+        bs = bdrv_primary_bs(bs);
     }
 
     if (bs && bs->drv && bs->drv->bdrv_debug_resume) {
@@ -5257,7 +5257,7 @@ int bdrv_debug_resume(BlockDriverState *bs, const char *tag)
 bool bdrv_debug_is_suspended(BlockDriverState *bs, const char *tag)
 {
     while (bs && bs->drv && !bs->drv->bdrv_debug_is_suspended) {
-        bs = bs->file ? bs->file->bs : NULL;
+        bs = bdrv_primary_bs(bs);
     }
 
     if (bs && bs->drv && bs->drv->bdrv_debug_is_suspended) {
-- 
2.21.0




* [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (20 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 21/42] block: Use CAFs for debug breakpoints Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-10 16:41   ` Vladimir Sementsov-Ogievskiy
  2019-09-10 14:52   ` Kevin Wolf
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 23/42] blockdev: Use CAF in external_snapshot_prepare() Max Reitz
                   ` (19 subsequent siblings)
  41 siblings, 2 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

If the driver does not implement bdrv_get_allocated_file_size(), we
should fall back to summing up the allocated sizes of all non-COW
children instead of just using that of bs->file.
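
A minimal sketch of that summation on a toy tree, under the assumption
that leaf nodes report their own size (standing in for a driver that
does implement the callback).  ToyNode, own_size and
toy_allocated_size() are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy node; cow_child is NULL if there is no COW child. */
typedef struct ToyNode {
    struct ToyNode *children[4];
    size_t n_children;
    struct ToyNode *cow_child;
    int64_t own_size;   /* used by leaves only; negative means error */
} ToyNode;

/*
 * Sum the allocated size over all non-COW children.  The first
 * negative (error) value from a child is returned unchanged.
 */
static int64_t toy_allocated_size(const ToyNode *node)
{
    int64_t total_size = 0;
    size_t i;

    assert(node->n_children <= 4);

    if (node->n_children == 0) {
        return node->own_size;
    }

    for (i = 0; i < node->n_children; i++) {
        const ToyNode *child = node->children[i];
        int64_t child_size;

        if (child == node->cow_child) {
            /* Ignore COW backing files */
            continue;
        }

        child_size = toy_allocated_size(child);
        if (child_size < 0) {
            return child_size;
        }
        total_size += child_size;
    }

    return total_size;
}
```

With a format node that has a metadata file, an external data file and
a COW backing file, only the first two contribute to the total.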

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 1070aa1ba9..6e1ddab056 100644
--- a/block.c
+++ b/block.c
@@ -4650,9 +4650,27 @@ int64_t bdrv_get_allocated_file_size(BlockDriverState *bs)
     if (drv->bdrv_get_allocated_file_size) {
         return drv->bdrv_get_allocated_file_size(bs);
     }
-    if (bs->file) {
-        return bdrv_get_allocated_file_size(bs->file->bs);
+
+    if (!QLIST_EMPTY(&bs->children)) {
+        BdrvChild *child;
+        int64_t child_size, total_size = 0;
+
+        QLIST_FOREACH(child, &bs->children, next) {
+            if (child == bdrv_filtered_cow_child(bs)) {
+                /* Ignore COW backing files */
+                continue;
+            }
+
+            child_size = bdrv_get_allocated_file_size(child->bs);
+            if (child_size < 0) {
+                return child_size;
+            }
+            total_size += child_size;
+        }
+
+        return total_size;
     }
+
     return -ENOTSUP;
 }
 
-- 
2.21.0




* [Qemu-devel] [PATCH v6 23/42] blockdev: Use CAF in external_snapshot_prepare()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (21 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-09-10 15:02   ` Kevin Wolf
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 24/42] block: Use child access functions for QAPI queries Max Reitz
                   ` (18 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

This allows us to differentiate between filters and nodes with COW
backing files: Filters cannot be used as overlays at all (for this
function).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
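For illustration only (not part of the patch): a small sketch of the
order of the two checks. ToyNode and toy_check_overlay() are
hypothetical; the point is that a filter is rejected outright before the
"already has a backing image" test is reached, so the second message is
reserved for nodes that really have a COW backing child.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; NOT QEMU's BlockDriverState. */
typedef struct ToyNode {
    bool is_filter;               /* stands in for drv->is_filter */
    struct ToyNode *cow_child;    /* COW backing child, or NULL */
} ToyNode;

typedef enum {
    TOY_OVERLAY_OK,
    TOY_OVERLAY_IS_FILTER,        /* "Filters cannot be used as overlays" */
    TOY_OVERLAY_HAS_BACKING       /* "The overlay already has a backing image" */
} ToyOverlayCheck;

ToyOverlayCheck toy_check_overlay(const ToyNode *new_bs)
{
    assert(new_bs != NULL);

    if (new_bs->is_filter) {
        return TOY_OVERLAY_IS_FILTER;
    }
    if (new_bs->cow_child != NULL) {
        return TOY_OVERLAY_HAS_BACKING;
    }
    return TOY_OVERLAY_OK;
}
```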
 blockdev.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/blockdev.c b/blockdev.c
index 29c6c6044a..c540802127 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1664,7 +1664,12 @@ static void external_snapshot_prepare(BlkActionState *common,
         goto out;
     }
 
-    if (state->new_bs->backing != NULL) {
+    if (state->new_bs->drv->is_filter) {
+        error_setg(errp, "Filters cannot be used as overlays");
+        goto out;
+    }
+
+    if (bdrv_filtered_cow_child(state->new_bs)) {
         error_setg(errp, "The overlay already has a backing image");
         goto out;
     }
-- 
2.21.0




* [Qemu-devel] [PATCH v6 24/42] block: Use child access functions for QAPI queries
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (22 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 23/42] blockdev: Use CAF in external_snapshot_prepare() Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-10 16:57   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters Max Reitz
                   ` (17 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

query-block, query-named-block-nodes, and query-blockstats now return
any filtered child under "backing", not just bs->backing or COW
children.  This is so that filters do not interrupt the reported backing
chain.  This changes the output for iotest 184, as the throttled node
now appears as a backing child.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
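For illustration only (not part of the patch): a toy model of why
"backing_file_depth" changes from 0 to 1 in iotest 184. ToyNode and
toy_backing_depth() are hypothetical; filtered_child stands for whatever
bdrv_filtered_child() would return, i.e. either a COW backing child or
the child of a filter such as throttle. Skipping of implicit filters is
deliberately left out of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; NOT QEMU's BlockDriverState. */
typedef struct ToyNode {
    const char *drv;                  /* driver name, for readability only */
    struct ToyNode *filtered_child;   /* COW or filter child, or NULL */
} ToyNode;

/* Count how many "backing" levels a query would report below @bs. */
int toy_backing_depth(const ToyNode *bs)
{
    int depth = 0;

    assert(bs != NULL);

    while (bs->filtered_child != NULL) {
        depth++;
        bs = bs->filtered_child;
    }
    return depth;
}
```

In this model, a throttle node on top of a null-co node has depth 1,
which corresponds to the updated reference output.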
 block/qapi.c               | 39 +++++++++++++++++++++++---------------
 tests/qemu-iotests/184.out |  7 ++++++-
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 9a185cba48..4f59ac1c0f 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -156,9 +156,13 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
             return NULL;
         }
 
-        if (bs0->drv && bs0->backing) {
+        if (bs0->drv && bdrv_filtered_child(bs0)) {
+            /*
+             * Put any filtered child here (for backwards compatibility to when
+             * we put bs0->backing here, which might be any filtered child).
+             */
             info->backing_file_depth++;
-            bs0 = bs0->backing->bs;
+            bs0 = bdrv_filtered_bs(bs0);
             (*p_image_info)->has_backing_image = true;
             p_image_info = &((*p_image_info)->backing_image);
         } else {
@@ -167,9 +171,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
 
         /* Skip automatically inserted nodes that the user isn't aware of for
          * query-block (blk != NULL), but not for query-named-block-nodes */
-        while (blk && bs0->drv && bs0->implicit) {
-            bs0 = backing_bs(bs0);
-            assert(bs0);
+        if (blk) {
+            bs0 = bdrv_skip_implicit_filters(bs0);
         }
     }
 
@@ -354,9 +357,9 @@ static void bdrv_query_info(BlockBackend *blk, BlockInfo **p_info,
     BlockDriverState *bs = blk_bs(blk);
     char *qdev;
 
-    /* Skip automatically inserted nodes that the user isn't aware of */
-    while (bs && bs->drv && bs->implicit) {
-        bs = backing_bs(bs);
+    if (bs) {
+        /* Skip automatically inserted nodes that the user isn't aware of */
+        bs = bdrv_skip_implicit_filters(bs);
     }
 
     info->device = g_strdup(blk_name(blk));
@@ -513,6 +516,7 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, BlockBackend *blk)
 static BlockStats *bdrv_query_bds_stats(BlockDriverState *bs,
                                         bool blk_level)
 {
+    BlockDriverState *storage_bs, *filtered_bs;
     BlockStats *s = NULL;
 
     s = g_malloc0(sizeof(*s));
@@ -525,9 +529,8 @@ static BlockStats *bdrv_query_bds_stats(BlockDriverState *bs,
     /* Skip automatically inserted nodes that the user isn't aware of in
      * a BlockBackend-level command. Stay at the exact node for a node-level
      * command. */
-    while (blk_level && bs->drv && bs->implicit) {
-        bs = backing_bs(bs);
-        assert(bs);
+    if (blk_level) {
+        bs = bdrv_skip_implicit_filters(bs);
     }
 
     if (bdrv_get_node_name(bs)[0]) {
@@ -537,14 +540,20 @@ static BlockStats *bdrv_query_bds_stats(BlockDriverState *bs,
 
     s->stats->wr_highest_offset = stat64_get(&bs->wr_highest_offset);
 
-    if (bs->file) {
+    storage_bs = bdrv_storage_bs(bs);
+    if (storage_bs) {
         s->has_parent = true;
-        s->parent = bdrv_query_bds_stats(bs->file->bs, blk_level);
+        s->parent = bdrv_query_bds_stats(storage_bs, blk_level);
     }
 
-    if (blk_level && bs->backing) {
+    filtered_bs = bdrv_filtered_bs(bs);
+    if (blk_level && filtered_bs) {
+        /*
+         * Put any filtered child here (for backwards compatibility to when
+         * we put bs->backing here, which might be any filtered child).
+         */
         s->has_backing = true;
-        s->backing = bdrv_query_bds_stats(bs->backing->bs, blk_level);
+        s->backing = bdrv_query_bds_stats(filtered_bs, blk_level);
     }
 
     return s;
diff --git a/tests/qemu-iotests/184.out b/tests/qemu-iotests/184.out
index 3deb3cfb94..1d61f7e224 100644
--- a/tests/qemu-iotests/184.out
+++ b/tests/qemu-iotests/184.out
@@ -27,6 +27,11 @@ Testing:
             "iops_rd": 0,
             "detect_zeroes": "off",
             "image": {
+                "backing-image": {
+                    "virtual-size": 1073741824,
+                    "filename": "null-co://",
+                    "format": "null-co"
+                },
                 "virtual-size": 1073741824,
                 "filename": "json:{\"throttle-group\": \"group0\", \"driver\": \"throttle\", \"file\": {\"driver\": \"null-co\"}}",
                 "format": "throttle"
@@ -34,7 +39,7 @@ Testing:
             "iops_wr": 0,
             "ro": false,
             "node-name": "throttle0",
-            "backing_file_depth": 0,
+            "backing_file_depth": 1,
             "drv": "throttle",
             "iops": 0,
             "bps_wr": 0,
-- 
2.21.0




* [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (23 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 24/42] block: Use child access functions for QAPI queries Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-12 11:09   ` Vladimir Sementsov-Ogievskiy
                     ` (2 more replies)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 26/42] backup: " Max Reitz
                   ` (16 subsequent siblings)
  41 siblings, 3 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

This includes some permission limiting (for example, we only need to
take the RESIZE permission for active commits where the base is smaller
than the top).

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
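For illustration only (not part of the patch): a sketch of the
permission limiting on the target, as I read mirror_start_job() after
this change. The TOY_PERM_* flags and toy_mirror_target_perms() are
hypothetical stand-ins for the BLK_PERM_* constants and the inline code;
the bit values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags; the values do NOT match QEMU's BLK_PERM_*. */
enum {
    TOY_PERM_WRITE     = 1 << 0,
    TOY_PERM_RESIZE    = 1 << 1,
    TOY_PERM_GRAPH_MOD = 1 << 2
};

/*
 * Permissions the job takes on the target:
 *  - WRITE always;
 *  - RESIZE only for an active commit (target is in the source's
 *    backing chain) whose target is smaller than the source;
 *  - GRAPH_MOD only if the backing chain will be changed on completion.
 */
uint64_t toy_mirror_target_perms(bool target_is_backing,
                                 int64_t bs_size, int64_t target_size,
                                 bool changes_backing_chain)
{
    uint64_t perms = TOY_PERM_WRITE;

    assert(bs_size >= 0 && target_size >= 0);

    if (target_is_backing && target_size < bs_size) {
        perms |= TOY_PERM_RESIZE;
    }
    if (changes_backing_chain) {
        perms |= TOY_PERM_GRAPH_MOD;
    }
    return perms;
}
```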
 block/mirror.c | 117 ++++++++++++++++++++++++++++++++++++++-----------
 blockdev.c     |  47 +++++++++++++++++---
 2 files changed, 131 insertions(+), 33 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 54bafdf176..6ddbfb9708 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -42,6 +42,7 @@ typedef struct MirrorBlockJob {
     BlockBackend *target;
     BlockDriverState *mirror_top_bs;
     BlockDriverState *base;
+    BlockDriverState *base_overlay;
 
     /* The name of the graph node to replace */
     char *replaces;
@@ -665,8 +666,10 @@ static int mirror_exit_common(Job *job)
                              &error_abort);
     if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
         BlockDriverState *backing = s->is_none_mode ? src : s->base;
-        if (backing_bs(target_bs) != backing) {
-            bdrv_set_backing_hd(target_bs, backing, &local_err);
+        BlockDriverState *unfiltered_target = bdrv_skip_rw_filters(target_bs);
+
+        if (bdrv_filtered_cow_bs(unfiltered_target) != backing) {
+            bdrv_set_backing_hd(unfiltered_target, backing, &local_err);
             if (local_err) {
                 error_report_err(local_err);
                 ret = -EPERM;
@@ -715,7 +718,7 @@ static int mirror_exit_common(Job *job)
      * valid.
      */
     block_job_remove_all_bdrv(bjob);
-    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
+    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
 
     /* We just changed the BDS the job BB refers to (with either or both of the
      * bdrv_replace_node() calls), so switch the BB back so the cleanup does
@@ -812,7 +815,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
             return 0;
         }
 
-        ret = bdrv_is_allocated_above(bs, base, false, offset, bytes, &count);
+        ret = bdrv_is_allocated_above(bs, s->base_overlay, true, offset, bytes,
+                                      &count);
         if (ret < 0) {
             return ret;
         }
@@ -908,7 +912,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
     } else {
         s->target_cluster_size = BDRV_SECTOR_SIZE;
     }
-    if (backing_filename[0] && !target_bs->backing &&
+    if (backing_filename[0] && !bdrv_backing_chain_next(target_bs) &&
         s->granularity < s->target_cluster_size) {
         s->buf_size = MAX(s->buf_size, s->target_cluster_size);
         s->cow_bitmap = bitmap_new(length);
@@ -1088,8 +1092,9 @@ static void mirror_complete(Job *job, Error **errp)
     if (s->backing_mode == MIRROR_OPEN_BACKING_CHAIN) {
         int ret;
 
-        assert(!target->backing);
-        ret = bdrv_open_backing_file(target, NULL, "backing", errp);
+        assert(!bdrv_backing_chain_next(target));
+        ret = bdrv_open_backing_file(bdrv_skip_rw_filters(target), NULL,
+                                     "backing", errp);
         if (ret < 0) {
             return;
         }
@@ -1531,8 +1536,8 @@ static BlockJob *mirror_start_job(
     MirrorBlockJob *s;
     MirrorBDSOpaque *bs_opaque;
     BlockDriverState *mirror_top_bs;
-    bool target_graph_mod;
     bool target_is_backing;
+    uint64_t target_perms, target_shared_perms;
     Error *local_err = NULL;
     int ret;
 
@@ -1551,7 +1556,7 @@ static BlockJob *mirror_start_job(
         buf_size = DEFAULT_MIRROR_BUF_SIZE;
     }
 
-    if (bs == target) {
+    if (bdrv_skip_rw_filters(bs) == bdrv_skip_rw_filters(target)) {
         error_setg(errp, "Can't mirror node into itself");
         return NULL;
     }
@@ -1615,15 +1620,50 @@ static BlockJob *mirror_start_job(
      * In the case of active commit, things look a bit different, though,
      * because the target is an already populated backing file in active use.
      * We can allow anything except resize there.*/
+
+    target_perms = BLK_PERM_WRITE;
+    target_shared_perms = BLK_PERM_WRITE_UNCHANGED;
+
     target_is_backing = bdrv_chain_contains(bs, target);
-    target_graph_mod = (backing_mode != MIRROR_LEAVE_BACKING_CHAIN);
+    if (target_is_backing) {
+        int64_t bs_size, target_size;
+        bs_size = bdrv_getlength(bs);
+        if (bs_size < 0) {
+            error_setg_errno(errp, -bs_size,
+                             "Could not inquire top image size");
+            goto fail;
+        }
+
+        target_size = bdrv_getlength(target);
+        if (target_size < 0) {
+            error_setg_errno(errp, -target_size,
+                             "Could not inquire base image size");
+            goto fail;
+        }
+
+        if (target_size < bs_size) {
+            target_perms |= BLK_PERM_RESIZE;
+        }
+
+        target_shared_perms |= BLK_PERM_CONSISTENT_READ
+                            |  BLK_PERM_WRITE
+                            |  BLK_PERM_GRAPH_MOD;
+    } else if (bdrv_chain_contains(bs, bdrv_skip_rw_filters(target))) {
+        /*
+         * We may want to allow this in the future, but it would
+         * require taking some extra care.
+         */
+        error_setg(errp, "Cannot mirror to a filter on top of a node in the "
+                   "source's backing chain");
+        goto fail;
+    }
+
+    if (backing_mode != MIRROR_LEAVE_BACKING_CHAIN) {
+        target_perms |= BLK_PERM_GRAPH_MOD;
+    }
+
     s->target = blk_new(s->common.job.aio_context,
-                        BLK_PERM_WRITE | BLK_PERM_RESIZE |
-                        (target_graph_mod ? BLK_PERM_GRAPH_MOD : 0),
-                        BLK_PERM_WRITE_UNCHANGED |
-                        (target_is_backing ? BLK_PERM_CONSISTENT_READ |
-                                             BLK_PERM_WRITE |
-                                             BLK_PERM_GRAPH_MOD : 0));
+                        target_perms, target_shared_perms);
     ret = blk_insert_bs(s->target, target, errp);
     if (ret < 0) {
         goto fail;
@@ -1647,6 +1687,7 @@ static BlockJob *mirror_start_job(
     s->backing_mode = backing_mode;
     s->copy_mode = copy_mode;
     s->base = base;
+    s->base_overlay = bdrv_find_overlay(bs, base);
     s->granularity = granularity;
     s->buf_size = ROUND_UP(buf_size, granularity);
     s->unmap = unmap;
@@ -1693,15 +1734,39 @@ static BlockJob *mirror_start_job(
     /* In commit_active_start() all intermediate nodes disappear, so
      * any jobs in them must be blocked */
     if (target_is_backing) {
-        BlockDriverState *iter;
-        for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) {
-            /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
-             * ourselves at s->base (if writes are blocked for a node, they are
-             * also blocked for its backing file). The other options would be a
-             * second filter driver above s->base (== target). */
+        BlockDriverState *iter, *filtered_target;
+        uint64_t iter_shared_perms;
+
+        /*
+         * The topmost node with
+         * bdrv_skip_rw_filters(filtered_target) == bdrv_skip_rw_filters(target)
+         */
+        filtered_target = bdrv_filtered_cow_bs(bdrv_find_overlay(bs, target));
+
+        assert(bdrv_skip_rw_filters(filtered_target) ==
+               bdrv_skip_rw_filters(target));
+
+        /*
+         * XXX BLK_PERM_WRITE needs to be allowed so we don't block
+         * ourselves at s->base (if writes are blocked for a node, they are
+         * also blocked for its backing file). The other options would be a
+         * second filter driver above s->base (== target).
+         */
+        iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
+
+        for (iter = bdrv_filtered_bs(bs); iter != target;
+             iter = bdrv_filtered_bs(iter))
+        {
+            if (iter == filtered_target) {
+                /*
+                 * From here on, all nodes are filters on the base.
+                 * This allows us to share BLK_PERM_CONSISTENT_READ.
+                 */
+                iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
+            }
+
             ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
-                                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
-                                     errp);
+                                     iter_shared_perms, errp);
             if (ret < 0) {
                 goto fail;
             }
@@ -1737,7 +1802,7 @@ fail:
     bs_opaque->stop = true;
     bdrv_child_refresh_perms(mirror_top_bs, mirror_top_bs->backing,
                              &error_abort);
-    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
+    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
 
     bdrv_unref(mirror_top_bs);
 
@@ -1764,7 +1829,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
         return;
     }
     is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
-    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
+    base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
     mirror_start_job(job_id, bs, creation_flags, target, replaces,
                      speed, granularity, buf_size, backing_mode,
                      on_source_error, on_target_error, unmap, NULL, NULL,
diff --git a/blockdev.c b/blockdev.c
index c540802127..c451f553f7 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3851,7 +3851,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
         return;
     }
 
-    if (!bs->backing && sync == MIRROR_SYNC_MODE_TOP) {
+    if (!bdrv_backing_chain_next(bs) && sync == MIRROR_SYNC_MODE_TOP) {
         sync = MIRROR_SYNC_MODE_FULL;
     }
 
@@ -3900,7 +3900,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
 
 void qmp_drive_mirror(DriveMirror *arg, Error **errp)
 {
-    BlockDriverState *bs;
+    BlockDriverState *bs, *unfiltered_bs;
     BlockDriverState *source, *target_bs;
     AioContext *aio_context;
     BlockMirrorBackingMode backing_mode;
@@ -3909,6 +3909,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
     int flags;
     int64_t size;
     const char *format = arg->format;
+    const char *replaces_node_name = NULL;
     int ret;
 
     bs = qmp_get_root_bs(arg->device, errp);
@@ -3921,6 +3922,16 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
         return;
     }
 
+    /*
+     * If the user has not instructed us otherwise, we should let the
+     * block job run from @bs (thus taking into account all filters on
+     * it) but replace @unfiltered_bs when it finishes (thus not
+     * removing those filters).
+     * (And if there are any explicit filters, we should assume the
+     *  user knows how to use the @replaces option.)
+     */
+    unfiltered_bs = bdrv_skip_implicit_filters(bs);
+
     aio_context = bdrv_get_aio_context(bs);
     aio_context_acquire(aio_context);
 
@@ -3934,8 +3945,14 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
     }
 
     flags = bs->open_flags | BDRV_O_RDWR;
-    source = backing_bs(bs);
+    source = bdrv_filtered_cow_bs(unfiltered_bs);
     if (!source && arg->sync == MIRROR_SYNC_MODE_TOP) {
+        if (bdrv_filtered_bs(unfiltered_bs)) {
+            /* @unfiltered_bs is an explicit filter */
+            error_setg(errp, "Cannot perform sync=top mirror through an "
+                       "explicitly added filter node on the source");
+            goto out;
+        }
         arg->sync = MIRROR_SYNC_MODE_FULL;
     }
     if (arg->sync == MIRROR_SYNC_MODE_NONE) {
@@ -3954,6 +3971,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
                              " named node of the graph");
             goto out;
         }
+        replaces_node_name = arg->replaces;
+    } else if (unfiltered_bs != bs) {
+        replaces_node_name = unfiltered_bs->node_name;
     }
 
     if (arg->mode == NEW_IMAGE_MODE_ABSOLUTE_PATHS) {
@@ -3973,6 +3993,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
         bdrv_img_create(arg->target, format,
                         NULL, NULL, NULL, size, flags, false, &local_err);
     } else {
+        /* Implicit filters should not appear in the filename */
+        BlockDriverState *explicit_backing = bdrv_skip_implicit_filters(source);
+
         switch (arg->mode) {
         case NEW_IMAGE_MODE_EXISTING:
             break;
@@ -3980,8 +4003,8 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
             /* create new image with backing file */
             bdrv_refresh_filename(source);
             bdrv_img_create(arg->target, format,
-                            source->filename,
-                            source->drv->format_name,
+                            explicit_backing->filename,
+                            explicit_backing->drv->format_name,
                             NULL, size, flags, false, &local_err);
             break;
         default:
@@ -4017,7 +4040,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
     }
 
     blockdev_mirror_common(arg->has_job_id ? arg->job_id : NULL, bs, target_bs,
-                           arg->has_replaces, arg->replaces, arg->sync,
+                           !!replaces_node_name, replaces_node_name, arg->sync,
                            backing_mode, arg->has_speed, arg->speed,
                            arg->has_granularity, arg->granularity,
                            arg->has_buf_size, arg->buf_size,
@@ -4053,7 +4076,7 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
                          bool has_auto_dismiss, bool auto_dismiss,
                          Error **errp)
 {
-    BlockDriverState *bs;
+    BlockDriverState *bs, *unfiltered_bs;
     BlockDriverState *target_bs;
     AioContext *aio_context;
     BlockMirrorBackingMode backing_mode = MIRROR_LEAVE_BACKING_CHAIN;
@@ -4065,6 +4088,16 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
         return;
     }
 
+    /*
+     * Same as in qmp_drive_mirror(): We want to run the job from @bs,
+     * but we want to replace @unfiltered_bs on completion.
+     */
+    unfiltered_bs = bdrv_skip_implicit_filters(bs);
+    if (!has_replaces && unfiltered_bs != bs) {
+        replaces = unfiltered_bs->node_name;
+        has_replaces = true;
+    }
+
     target_bs = bdrv_lookup_bs(target, target, errp);
     if (!target_bs) {
         return;
-- 
2.21.0




* [Qemu-devel] [PATCH v6 26/42] backup: Deal with filters
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (24 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters Max Reitz
@ 2019-08-09 16:13 ` " Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 27/42] commit: " Max Reitz
                   ` (15 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c |  9 +++++----
 blockdev.c     | 19 +++++++++++++++----
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index ecadb61af3..7854d7575b 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -611,6 +611,7 @@ static int64_t backup_calculate_cluster_size(BlockDriverState *target,
 {
     int ret;
     BlockDriverInfo bdi;
+    bool target_does_cow = bdrv_backing_chain_next(target);
 
     /*
      * If there is no backing file on the target, we cannot rely on COW if our
@@ -618,7 +619,7 @@ static int64_t backup_calculate_cluster_size(BlockDriverState *target,
      * targets with a backing file, try to avoid COW if possible.
      */
     ret = bdrv_get_info(target, &bdi);
-    if (ret == -ENOTSUP && !target->backing) {
+    if (ret == -ENOTSUP && !target_does_cow) {
         /* Cluster size is not defined */
         warn_report("The target block device doesn't provide "
                     "information about the block size and it doesn't have a "
@@ -627,14 +628,14 @@ static int64_t backup_calculate_cluster_size(BlockDriverState *target,
                     "this default, the backup may be unusable",
                     BACKUP_CLUSTER_SIZE_DEFAULT);
         return BACKUP_CLUSTER_SIZE_DEFAULT;
-    } else if (ret < 0 && !target->backing) {
+    } else if (ret < 0 && !target_does_cow) {
         error_setg_errno(errp, -ret,
             "Couldn't determine the cluster size of the target image, "
             "which has no backing file");
         error_append_hint(errp,
             "Aborting, since this may create an unusable destination image\n");
         return ret;
-    } else if (ret < 0 && target->backing) {
+    } else if (ret < 0 && target_does_cow) {
         /* Not fatal; just trudge on ahead. */
         return BACKUP_CLUSTER_SIZE_DEFAULT;
     }
@@ -683,7 +684,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
         return NULL;
     }
 
-    if (compress && target->drv->bdrv_co_pwritev_compressed == NULL) {
+    if (compress && !bdrv_supports_compressed_writes(target)) {
         error_setg(errp, "Compression is not supported for this drive %s",
                    bdrv_get_device_name(target));
         return NULL;
diff --git a/blockdev.c b/blockdev.c
index c451f553f7..c6f79b4e0e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3656,7 +3656,13 @@ static BlockJob *do_drive_backup(DriveBackup *backup, JobTxn *txn,
     /* See if we have a backing HD we can use to create our new image
      * on top of. */
     if (backup->sync == MIRROR_SYNC_MODE_TOP) {
-        source = backing_bs(bs);
+        /*
+         * Backup will not replace the source by the target, so none
+         * of the filters skipped here will be removed (in contrast to
+         * mirror).  Therefore, we can skip all of them when looking
+         * for the first COW relationship.
+         */
+        source = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs));
         if (!source) {
             backup->sync = MIRROR_SYNC_MODE_FULL;
         }
@@ -3676,9 +3682,14 @@ static BlockJob *do_drive_backup(DriveBackup *backup, JobTxn *txn,
     if (backup->mode != NEW_IMAGE_MODE_EXISTING) {
         assert(backup->format);
         if (source) {
-            bdrv_refresh_filename(source);
-            bdrv_img_create(backup->target, backup->format, source->filename,
-                            source->drv->format_name, NULL,
+            /* Implicit filters should not appear in the filename */
+            BlockDriverState *explicit_backing =
+                bdrv_skip_implicit_filters(source);
+
+            bdrv_refresh_filename(explicit_backing);
+            bdrv_img_create(backup->target, backup->format,
+                            explicit_backing->filename,
+                            explicit_backing->drv->format_name, NULL,
                             size, flags, false, &local_err);
         } else {
             bdrv_img_create(backup->target, backup->format, NULL, NULL, NULL,
-- 
2.21.0




* [Qemu-devel] [PATCH v6 27/42] commit: Deal with filters
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (25 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 26/42] backup: " Max Reitz
@ 2019-08-09 16:13 ` " Max Reitz
  2019-08-31 10:44   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 28/42] stream: " Max Reitz
                   ` (14 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

This includes some permission limiting (for example, we only need to
take the RESIZE permission if the base is smaller than the top).

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
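For illustration only (not part of the patch): besides the RESIZE
limiting, commit_start() now shares BLK_PERM_CONSISTENT_READ on the
filters that sit directly on top of the base. The sketch below models
that walk. ToyNode, the TOY_PERM_* flags and toy_share_intermediates()
are hypothetical; "shared" merely records what would be passed to
block_job_add_bdrv() for each intermediate node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags; the values do NOT match QEMU's BLK_PERM_*. */
enum {
    TOY_PERM_CONSISTENT_READ = 1 << 0,
    TOY_PERM_WRITE           = 1 << 1,
    TOY_PERM_WRITE_UNCHANGED = 1 << 2
};

/* Hypothetical toy model; NOT QEMU's BlockDriverState. */
typedef struct ToyNode {
    struct ToyNode *filtered_child;   /* next node towards the base */
    uint64_t shared;                  /* shared perms recorded for it */
} ToyNode;

/*
 * Walk from @top down to (but excluding) @base.  @filtered_base is the
 * topmost node that is only a filter on @base; from there on,
 * CONSISTENT_READ is shared as well.
 */
void toy_share_intermediates(ToyNode *top, ToyNode *base,
                             ToyNode *filtered_base)
{
    uint64_t shared = TOY_PERM_WRITE_UNCHANGED | TOY_PERM_WRITE;
    ToyNode *iter;

    for (iter = top; iter != base; iter = iter->filtered_child) {
        assert(iter != NULL);   /* @base must be reachable from @top */

        if (iter == filtered_base) {
            shared |= TOY_PERM_CONSISTENT_READ;
        }
        iter->shared = shared;
    }
}
```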
 block/block-backend.c | 16 +++++---
 block/commit.c        | 96 +++++++++++++++++++++++++++++++------------
 blockdev.c            |  6 ++-
 3 files changed, 85 insertions(+), 33 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index c13c5c83b0..0bc592d023 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2180,11 +2180,17 @@ int blk_commit_all(void)
         AioContext *aio_context = blk_get_aio_context(blk);
 
         aio_context_acquire(aio_context);
-        if (blk_is_inserted(blk) && blk->root->bs->backing) {
-            int ret = bdrv_commit(blk->root->bs);
-            if (ret < 0) {
-                aio_context_release(aio_context);
-                return ret;
+        if (blk_is_inserted(blk)) {
+            BlockDriverState *non_filter;
+
+            /* Legacy function, so skip implicit filters */
+            non_filter = bdrv_skip_implicit_filters(blk->root->bs);
+            if (bdrv_filtered_cow_child(non_filter)) {
+                int ret = bdrv_commit(non_filter);
+                if (ret < 0) {
+                    aio_context_release(aio_context);
+                    return ret;
+                }
             }
         }
         aio_context_release(aio_context);
diff --git a/block/commit.c b/block/commit.c
index 5a7672c7c7..40d1c8eeac 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -37,6 +37,7 @@ typedef struct CommitBlockJob {
     BlockBackend *top;
     BlockBackend *base;
     BlockDriverState *base_bs;
+    BlockDriverState *above_base;
     BlockdevOnError on_error;
     bool base_read_only;
     bool chain_frozen;
@@ -110,7 +111,7 @@ static void commit_abort(Job *job)
      * XXX Can (or should) we somehow keep 'consistent read' blocked even
      * after the failed/cancelled commit job is gone? If we already wrote
      * something to base, the intermediate images aren't valid any more. */
-    bdrv_replace_node(s->commit_top_bs, backing_bs(s->commit_top_bs),
+    bdrv_replace_node(s->commit_top_bs, s->commit_top_bs->backing->bs,
                       &error_abort);
 
     bdrv_unref(s->commit_top_bs);
@@ -174,7 +175,7 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
             break;
         }
         /* Copy if allocated above the base */
-        ret = bdrv_is_allocated_above(blk_bs(s->top), blk_bs(s->base), false,
+        ret = bdrv_is_allocated_above(blk_bs(s->top), s->above_base, true,
                                       offset, COMMIT_BUFFER_SIZE, &n);
         copy = (ret == 1);
         trace_commit_one_iteration(s, offset, n, ret);
@@ -267,15 +268,35 @@ void commit_start(const char *job_id, BlockDriverState *bs,
     CommitBlockJob *s;
     BlockDriverState *iter;
     BlockDriverState *commit_top_bs = NULL;
+    BlockDriverState *filtered_base;
     Error *local_err = NULL;
+    int64_t base_size, top_size;
+    uint64_t perms, iter_shared_perms;
     int ret;
 
     assert(top != bs);
-    if (top == base) {
+    if (bdrv_skip_rw_filters(top) == bdrv_skip_rw_filters(base)) {
         error_setg(errp, "Invalid files for merge: top and base are the same");
         return;
     }
 
+    base_size = bdrv_getlength(base);
+    if (base_size < 0) {
+        error_setg_errno(errp, -base_size, "Could not inquire base image size");
+        return;
+    }
+
+    top_size = bdrv_getlength(top);
+    if (top_size < 0) {
+        error_setg_errno(errp, -top_size, "Could not inquire top image size");
+        return;
+    }
+
+    perms = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
+    if (base_size < top_size) {
+        perms |= BLK_PERM_RESIZE;
+    }
+
     s = block_job_create(job_id, &commit_job_driver, NULL, bs, 0, BLK_PERM_ALL,
                          speed, creation_flags, NULL, NULL, errp);
     if (!s) {
@@ -315,17 +336,43 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 
     s->commit_top_bs = commit_top_bs;
 
-    /* Block all nodes between top and base, because they will
-     * disappear from the chain after this operation. */
-    assert(bdrv_chain_contains(top, base));
-    for (iter = top; iter != base; iter = backing_bs(iter)) {
-        /* XXX BLK_PERM_WRITE needs to be allowed so we don't block ourselves
-         * at s->base (if writes are blocked for a node, they are also blocked
-         * for its backing file). The other options would be a second filter
-         * driver above s->base. */
+    /*
+     * Block all nodes between top and base, because they will
+     * disappear from the chain after this operation.
+     * Note that this assumes that the user is fine with removing all
+     * nodes (including R/W filters) between top and base.  Assuring
+     * this is the responsibility of the interface (i.e. whoever calls
+     * commit_start()).
+     */
+    s->above_base = bdrv_find_overlay(top, base);
+    assert(s->above_base);
+
+    /*
+     * The topmost node with
+     * bdrv_skip_rw_filters(filtered_base) == bdrv_skip_rw_filters(base)
+     */
+    filtered_base = bdrv_filtered_cow_bs(s->above_base);
+    assert(bdrv_skip_rw_filters(filtered_base) == bdrv_skip_rw_filters(base));
+
+    /*
+     * XXX BLK_PERM_WRITE needs to be allowed so we don't block ourselves
+     * at s->base (if writes are blocked for a node, they are also blocked
+     * for its backing file). The other options would be a second filter
+     * driver above s->base.
+     */
+    iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
+
+    for (iter = top; iter != base; iter = bdrv_filtered_bs(iter)) {
+        if (iter == filtered_base) {
+            /*
+             * From here on, all nodes are filters on the base.  This
+             * allows us to share BLK_PERM_CONSISTENT_READ.
+             */
+            iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
+        }
+
         ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
-                                 BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
-                                 errp);
+                                 iter_shared_perms, errp);
         if (ret < 0) {
             goto fail;
         }
@@ -342,9 +389,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
     }
 
     s->base = blk_new(s->common.job.aio_context,
-                      BLK_PERM_CONSISTENT_READ
-                      | BLK_PERM_WRITE
-                      | BLK_PERM_RESIZE,
+                      perms,
                       BLK_PERM_CONSISTENT_READ
                       | BLK_PERM_GRAPH_MOD
                       | BLK_PERM_WRITE_UNCHANGED);
@@ -412,19 +457,22 @@ int bdrv_commit(BlockDriverState *bs)
     if (!drv)
         return -ENOMEDIUM;
 
-    if (!bs->backing) {
+    backing_file_bs = bdrv_filtered_cow_bs(bs);
+
+    if (!backing_file_bs) {
         return -ENOTSUP;
     }
 
     if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_COMMIT_SOURCE, NULL) ||
-        bdrv_op_is_blocked(bs->backing->bs, BLOCK_OP_TYPE_COMMIT_TARGET, NULL)) {
+        bdrv_op_is_blocked(backing_file_bs, BLOCK_OP_TYPE_COMMIT_TARGET, NULL))
+    {
         return -EBUSY;
     }
 
-    ro = bs->backing->bs->read_only;
+    ro = backing_file_bs->read_only;
 
     if (ro) {
-        if (bdrv_reopen_set_read_only(bs->backing->bs, false, NULL)) {
+        if (bdrv_reopen_set_read_only(backing_file_bs, false, NULL)) {
             return -EACCES;
         }
     }
@@ -440,8 +488,6 @@ int bdrv_commit(BlockDriverState *bs)
     }
 
     /* Insert commit_top block node above backing, so we can write to it */
-    backing_file_bs = backing_bs(bs);
-
     commit_top_bs = bdrv_new_open_driver(&bdrv_commit_top, NULL, BDRV_O_RDWR,
                                          &local_err);
     if (commit_top_bs == NULL) {
@@ -526,15 +572,13 @@ ro_cleanup:
     qemu_vfree(buf);
 
     blk_unref(backing);
-    if (backing_file_bs) {
-        bdrv_set_backing_hd(bs, backing_file_bs, &error_abort);
-    }
+    bdrv_set_backing_hd(bs, backing_file_bs, &error_abort);
     bdrv_unref(commit_top_bs);
     blk_unref(src);
 
     if (ro) {
         /* ignoring error return here */
-        bdrv_reopen_set_read_only(bs->backing->bs, true, NULL);
+        bdrv_reopen_set_read_only(backing_file_bs, true, NULL);
     }
 
     return ret;
diff --git a/blockdev.c b/blockdev.c
index c6f79b4e0e..7bef41c0b0 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1094,7 +1094,7 @@ void hmp_commit(Monitor *mon, const QDict *qdict)
             return;
         }
 
-        bs = blk_bs(blk);
+        bs = bdrv_skip_implicit_filters(blk_bs(blk));
         aio_context = bdrv_get_aio_context(bs);
         aio_context_acquire(aio_context);
 
@@ -3454,7 +3454,9 @@ void qmp_block_commit(bool has_job_id, const char *job_id, const char *device,
 
     assert(bdrv_get_aio_context(base_bs) == aio_context);
 
-    for (iter = top_bs; iter != backing_bs(base_bs); iter = backing_bs(iter)) {
+    for (iter = top_bs; iter != bdrv_filtered_bs(base_bs);
+         iter = bdrv_filtered_bs(iter))
+    {
         if (bdrv_op_is_blocked(iter, BLOCK_OP_TYPE_COMMIT_TARGET, errp)) {
             goto out;
         }
-- 
2.21.0




* [Qemu-devel] [PATCH v6 28/42] stream: Deal with filters
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (26 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 27/42] commit: " Max Reitz
@ 2019-08-09 16:13 ` " Max Reitz
  2019-08-12 11:55   ` Vladimir Sementsov-Ogievskiy
  2019-09-13 14:16   ` Kevin Wolf
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 29/42] nbd: Use CAF when looking for dirty bitmap Max Reitz
                   ` (13 subsequent siblings)
  41 siblings, 2 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Because of the recent changes that make the stream job independent of
the base node and instead track the node above it, we have to split that
"bottom" node into two cases: The bottom COW node, and the node directly
above the base node (which may be an R/W filter or the bottom COW node).
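The two cases can be illustrated with a toy model. This is only a sketch, not QEMU code: ToyNode, toy_bottom_cow_node() and toy_above_base() are hypothetical names, and each node is assumed to have at most one child (the filtered child for a filter, the backing child for a COW overlay).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for BlockDriverState: one child per node, no real I/O. */
typedef struct ToyNode {
    const char *name;
    bool is_filter;         /* true: R/W filter, false: COW overlay */
    struct ToyNode *child;  /* filtered child or backing child */
} ToyNode;

/* Lowest non-filter node above @base (loosely what the patch gets
 * from bdrv_find_overlay()). */
static ToyNode *toy_bottom_cow_node(ToyNode *top, ToyNode *base)
{
    ToyNode *found = NULL;

    for (ToyNode *n = top; n && n != base; n = n->child) {
        if (!n->is_filter) {
            found = n;
        }
    }
    return found;
}

/* Node whose child is @base; same shape as the loop that this patch
 * adds to stream_start(). */
static ToyNode *toy_above_base(ToyNode *bottom_cow, ToyNode *base)
{
    ToyNode *n = bottom_cow;

    assert(n);
    while (n->child != base) {
        n = n->child;
        assert(n); /* @base must be reachable from @bottom_cow */
    }
    return n;
}
```

With a chain top -> mid -> (filter) -> base, the bottom COW node is mid while the node directly above base is the filter; without the filter, both are mid.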

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 qapi/block-core.json |  4 ++++
 block/stream.c       | 52 ++++++++++++++++++++++++++++----------------
 blockdev.c           |  2 +-
 3 files changed, 38 insertions(+), 20 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 38c4dbd7c3..3c54717870 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2516,6 +2516,10 @@
 # On successful completion the image file is updated to drop the backing file
 # and the BLOCK_JOB_COMPLETED event is emitted.
 #
+# In case @device is a filter node, block-stream modifies the first non-filter
+# overlay node below it to point to base's backing node (or NULL if @base was
+# not specified) instead of modifying @device itself.
+#
 # @job-id: identifier for the newly-created block job. If
 #          omitted, the device name will be used. (Since 2.7)
 #
diff --git a/block/stream.c b/block/stream.c
index 4c8b89884a..bd4a351dae 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -31,7 +31,8 @@ enum {
 
 typedef struct StreamBlockJob {
     BlockJob common;
-    BlockDriverState *bottom;
+    BlockDriverState *bottom_cow_node;
+    BlockDriverState *above_base;
     BlockdevOnError on_error;
     char *backing_file_str;
     bool bs_read_only;
@@ -54,7 +55,7 @@ static void stream_abort(Job *job)
 
     if (s->chain_frozen) {
         BlockJob *bjob = &s->common;
-        bdrv_unfreeze_chain(blk_bs(bjob->blk), s->bottom);
+        bdrv_unfreeze_chain(blk_bs(bjob->blk), s->above_base);
     }
 }
 
@@ -63,14 +64,15 @@ static int stream_prepare(Job *job)
     StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
     BlockJob *bjob = &s->common;
     BlockDriverState *bs = blk_bs(bjob->blk);
-    BlockDriverState *base = backing_bs(s->bottom);
+    BlockDriverState *unfiltered_bs = bdrv_skip_rw_filters(bs);
+    BlockDriverState *base = bdrv_filtered_bs(s->above_base);
     Error *local_err = NULL;
     int ret = 0;
 
-    bdrv_unfreeze_chain(bs, s->bottom);
+    bdrv_unfreeze_chain(bs, s->above_base);
     s->chain_frozen = false;
 
-    if (bs->backing) {
+    if (bdrv_filtered_cow_child(unfiltered_bs)) {
         const char *base_id = NULL, *base_fmt = NULL;
         if (base) {
             base_id = s->backing_file_str;
@@ -78,8 +80,8 @@ static int stream_prepare(Job *job)
                 base_fmt = base->drv->format_name;
             }
         }
-        bdrv_set_backing_hd(bs, base, &local_err);
-        ret = bdrv_change_backing_file(bs, base_id, base_fmt);
+        bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
+        ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt);
         if (local_err) {
             error_report_err(local_err);
             return -EPERM;
@@ -110,7 +112,8 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
     StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
     BlockBackend *blk = s->common.blk;
     BlockDriverState *bs = blk_bs(blk);
-    bool enable_cor = !backing_bs(s->bottom);
+    BlockDriverState *unfiltered_bs = bdrv_skip_rw_filters(bs);
+    bool enable_cor = !bdrv_filtered_bs(s->above_base);
     int64_t len;
     int64_t offset = 0;
     uint64_t delay_ns = 0;
@@ -119,7 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
     int64_t n = 0; /* bytes */
     void *buf;
 
-    if (bs == s->bottom) {
+    if (unfiltered_bs == s->bottom_cow_node) {
         /* Nothing to stream */
         return 0;
     }
@@ -154,13 +157,14 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 
         copy = false;
 
-        ret = bdrv_is_allocated(bs, offset, STREAM_BUFFER_SIZE, &n);
+        ret = bdrv_is_allocated(unfiltered_bs, offset, STREAM_BUFFER_SIZE, &n);
         if (ret == 1) {
             /* Allocated in the top, no need to copy.  */
         } else if (ret >= 0) {
             /* Copy if allocated in the intermediate images.  Limit to the
              * known-unallocated area [offset, offset+n*BDRV_SECTOR_SIZE).  */
-            ret = bdrv_is_allocated_above(backing_bs(bs), s->bottom, true,
+            ret = bdrv_is_allocated_above(bdrv_filtered_cow_bs(unfiltered_bs),
+                                          s->bottom_cow_node, true,
                                           offset, n, &n);
             /* Finish early if end of backing file has been reached */
             if (ret == 0 && n == 0) {
@@ -231,9 +235,16 @@ void stream_start(const char *job_id, BlockDriverState *bs,
     BlockDriverState *iter;
     bool bs_read_only;
     int basic_flags = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED;
-    BlockDriverState *bottom = bdrv_find_overlay(bs, base);
+    BlockDriverState *bottom_cow_node = bdrv_find_overlay(bs, base);
+    BlockDriverState *above_base;
 
-    if (bdrv_freeze_chain(bs, bottom, errp) < 0) {
+    /* Find the node directly above @base */
+    for (above_base = bottom_cow_node;
+         bdrv_filtered_bs(above_base) != base;
+         above_base = bdrv_filtered_bs(above_base))
+    {}
+
+    if (bdrv_freeze_chain(bs, above_base, errp) < 0) {
         return;
     }
 
@@ -261,16 +272,19 @@ void stream_start(const char *job_id, BlockDriverState *bs,
      * disappear from the chain after this operation. The streaming job reads
      * every block only once, assuming that it doesn't change, so forbid writes
      * and resizes. Reassign the base node pointer because the backing BS of the
-     * bottom node might change after the call to bdrv_reopen_set_read_only()
-     * due to parallel block jobs running.
+     * above_base node might change after the call to
+     * bdrv_reopen_set_read_only() due to parallel block jobs running.
      */
-    base = backing_bs(bottom);
-    for (iter = backing_bs(bs); iter && iter != base; iter = backing_bs(iter)) {
+    base = bdrv_filtered_bs(above_base);
+    for (iter = bdrv_filtered_bs(bs); iter && iter != base;
+         iter = bdrv_filtered_bs(iter))
+    {
         block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
                            basic_flags, &error_abort);
     }
 
-    s->bottom = bottom;
+    s->bottom_cow_node = bottom_cow_node;
+    s->above_base = above_base;
     s->backing_file_str = g_strdup(backing_file_str);
     s->bs_read_only = bs_read_only;
     s->chain_frozen = true;
@@ -284,5 +298,5 @@ fail:
     if (bs_read_only) {
         bdrv_reopen_set_read_only(bs, true, NULL);
     }
-    bdrv_unfreeze_chain(bs, bottom);
+    bdrv_unfreeze_chain(bs, above_base);
 }
diff --git a/blockdev.c b/blockdev.c
index 7bef41c0b0..ee8b951154 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3297,7 +3297,7 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
     }
 
     /* Check for op blockers in the whole chain between bs and base */
-    for (iter = bs; iter && iter != base_bs; iter = backing_bs(iter)) {
+    for (iter = bs; iter && iter != base_bs; iter = bdrv_filtered_bs(iter)) {
         if (bdrv_op_is_blocked(iter, BLOCK_OP_TYPE_STREAM, errp)) {
             goto out;
         }
-- 
2.21.0




* [Qemu-devel] [PATCH v6 29/42] nbd: Use CAF when looking for dirty bitmap
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (27 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 28/42] stream: " Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 30/42] qemu-img: Use child access functions Max Reitz
                   ` (12 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

When looking for a dirty bitmap to share, we should handle filters by
just including them in the search (so they do not break backing chains).
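A minimal sketch of that search, using a toy model rather than the real block layer: ToyNode and toy_find_bitmap_owner() are hypothetical, and the single child pointer stands in for whatever bdrv_filtered_bs() would return (assumed here to be the filtered or the COW child, whichever exists).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy node: owns at most one named bitmap and has at most one child. */
typedef struct ToyNode {
    const char *bitmap_name;  /* NULL if the node owns no bitmap */
    struct ToyNode *child;    /* filtered or backing child, or NULL */
} ToyNode;

/* Walk down the chain, filters included, until a node owns @bitmap. */
static ToyNode *toy_find_bitmap_owner(ToyNode *bs, const char *bitmap)
{
    assert(bitmap);
    while (bs) {
        if (bs->bitmap_name && strcmp(bs->bitmap_name, bitmap) == 0) {
            return bs;
        }
        bs = bs->child;
    }
    return NULL; /* end of chain reached, no such bitmap */
}
```

Because the loop ends when the chain ends rather than when a node lacks a backing child, a filter on top of the bitmap's owner no longer stops the search.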

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 nbd/server.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index fbd51b48a7..a1c73ce957 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1511,13 +1511,13 @@ NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset,
     if (bitmap) {
         BdrvDirtyBitmap *bm = NULL;
 
-        while (true) {
+        while (bs) {
             bm = bdrv_find_dirty_bitmap(bs, bitmap);
-            if (bm != NULL || bs->backing == NULL) {
+            if (bm != NULL) {
                 break;
             }
 
-            bs = bs->backing->bs;
+            bs = bdrv_filtered_bs(bs);
         }
 
         if (bm == NULL) {
-- 
2.21.0




* [Qemu-devel] [PATCH v6 30/42] qemu-img: Use child access functions
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (28 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 29/42] nbd: Use CAF when looking for dirty bitmap Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-12 12:14   ` Vladimir Sementsov-Ogievskiy
  2019-08-14 16:04   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 31/42] block: Drop backing_bs() Max Reitz
                   ` (11 subsequent siblings)
  41 siblings, 2 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

This changes iotest 204's output, because blkdebug on top of a COW node
used to make qemu-img map disregard the rest of the backing chain (the
backing chain was broken by the filter).  With this patch, the
allocation in the base image is reported correctly.
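The effect can be sketched with a toy version of the walk in get_block_status(). Everything here is hypothetical (ToyNode, toy_skip_filters(), toy_data_source()); the has_data flag stands in for "this node reports data or zeroes for the queried range".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyNode {
    const char *name;
    bool is_filter;         /* e.g. a blkdebug node */
    bool has_data;          /* range is allocated in this node */
    struct ToyNode *child;  /* filtered child or backing child */
} ToyNode;

/* Skip any filters at the current position in the chain. */
static ToyNode *toy_skip_filters(ToyNode *n)
{
    while (n && n->is_filter) {
        n = n->child;
    }
    return n;
}

/* Return the first non-filter node that provides the range, or NULL. */
static ToyNode *toy_data_source(ToyNode *bs)
{
    for (;;) {
        bs = toy_skip_filters(bs);
        if (bs == NULL) {
            return NULL;    /* nothing in the whole chain has it */
        }
        assert(!bs->is_filter);
        if (bs->has_data) {
            return bs;
        }
        bs = bs->child;     /* continue with the COW child */
    }
}
```

For a chain blkdebug -> overlay -> base where only base holds the range, the walk now ends at base instead of stopping at the filter.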

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 qemu-img.c                 | 33 ++++++++++++++++++++-------------
 tests/qemu-iotests/204.out |  1 +
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 79983772de..3b30c5ae70 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1012,7 +1012,7 @@ static int img_commit(int argc, char **argv)
         /* This is different from QMP, which by default uses the deepest file in
          * the backing chain (i.e., the very base); however, the traditional
          * behavior of qemu-img commit is using the immediate backing file. */
-        base_bs = backing_bs(bs);
+        base_bs = bdrv_backing_chain_next(bs);
         if (!base_bs) {
             error_setg(&local_err, "Image does not have a backing file");
             goto done;
@@ -1632,18 +1632,20 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
     if (s->sector_next_status <= sector_num) {
         uint64_t offset = (sector_num - src_cur_offset) * BDRV_SECTOR_SIZE;
         int64_t count;
+        BlockDriverState *src_bs = blk_bs(s->src[src_cur]);
+        BlockDriverState *base;
+
+        if (s->target_has_backing) {
+            base = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(src_bs));
+        } else {
+            base = NULL;
+        }
 
         do {
             count = n * BDRV_SECTOR_SIZE;
 
-            if (s->target_has_backing) {
-                ret = bdrv_block_status(blk_bs(s->src[src_cur]), offset,
-                                        count, &count, NULL, NULL);
-            } else {
-                ret = bdrv_block_status_above(blk_bs(s->src[src_cur]), NULL,
-                                              offset, count, &count, NULL,
-                                              NULL);
-            }
+            ret = bdrv_block_status_above(src_bs, base, offset, count, &count,
+                                          NULL, NULL);
 
             if (ret < 0) {
                 if (s->salvage) {
@@ -2490,7 +2492,8 @@ static int img_convert(int argc, char **argv)
          * s.target_backing_sectors has to be negative, which it will
          * be automatically).  The backing file length is used only
          * for optimizations, so such a case is not fatal. */
-        s.target_backing_sectors = bdrv_nb_sectors(out_bs->backing->bs);
+        s.target_backing_sectors =
+            bdrv_nb_sectors(bdrv_filtered_cow_bs(out_bs));
     } else {
         s.target_backing_sectors = -1;
     }
@@ -2853,6 +2856,7 @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
 
     depth = 0;
     for (;;) {
+        bs = bdrv_skip_rw_filters(bs);
         ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
         if (ret < 0) {
             return ret;
@@ -2861,7 +2865,7 @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
         if (ret & (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)) {
             break;
         }
-        bs = backing_bs(bs);
+        bs = bdrv_filtered_cow_bs(bs);
         if (bs == NULL) {
             ret = 0;
             break;
@@ -3216,6 +3220,7 @@ static int img_rebase(int argc, char **argv)
     uint8_t *buf_old = NULL;
     uint8_t *buf_new = NULL;
     BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
+    BlockDriverState *unfiltered_bs;
     char *filename;
     const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
     int c, flags, src_flags, ret;
@@ -3350,6 +3355,8 @@ static int img_rebase(int argc, char **argv)
     }
     bs = blk_bs(blk);
 
+    unfiltered_bs = bdrv_skip_rw_filters(bs);
+
     if (out_basefmt != NULL) {
         if (bdrv_find_format(out_basefmt) == NULL) {
             error_report("Invalid format name: '%s'", out_basefmt);
@@ -3361,7 +3368,7 @@ static int img_rebase(int argc, char **argv)
     /* For safe rebasing we need to compare old and new backing file */
     if (!unsafe) {
         QDict *options = NULL;
-        BlockDriverState *base_bs = backing_bs(bs);
+        BlockDriverState *base_bs = bdrv_filtered_cow_bs(unfiltered_bs);
 
         if (base_bs) {
             blk_old_backing = blk_new(qemu_get_aio_context(),
@@ -3517,7 +3524,7 @@ static int img_rebase(int argc, char **argv)
                  * If cluster wasn't changed since prefix_chain, we don't need
                  * to take action
                  */
-                ret = bdrv_is_allocated_above(backing_bs(bs), prefix_chain_bs,
+                ret = bdrv_is_allocated_above(unfiltered_bs, prefix_chain_bs,
                                               false, offset, n, &n);
                 if (ret < 0) {
                     error_report("error while reading image metadata: %s",
diff --git a/tests/qemu-iotests/204.out b/tests/qemu-iotests/204.out
index f3a10fbe90..684774d763 100644
--- a/tests/qemu-iotests/204.out
+++ b/tests/qemu-iotests/204.out
@@ -59,5 +59,6 @@ Offset          Length          File
 0x900000        0x2400000       TEST_DIR/t.IMGFMT
 0x3c00000       0x1100000       TEST_DIR/t.IMGFMT
 0x6a00000       0x400000        TEST_DIR/t.IMGFMT
+0x6e00000       0x1200000       TEST_DIR/t.IMGFMT.base
 No errors were found on the image.
 *** done
-- 
2.21.0




* [Qemu-devel] [PATCH v6 31/42] block: Drop backing_bs()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (29 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 30/42] qemu-img: Use child access functions Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 32/42] block: Make bdrv_get_cumulative_perm() public Max Reitz
                   ` (10 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

We want to make it explicit where bs->backing is used, and we have done
so.  The old role of backing_bs() is now effectively taken by
bdrv_filtered_cow_bs().

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block_int.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 5bec3361fd..786801c32f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -932,11 +932,6 @@ typedef enum BlockMirrorBackingMode {
     MIRROR_LEAVE_BACKING_CHAIN,
 } BlockMirrorBackingMode;
 
-static inline BlockDriverState *backing_bs(BlockDriverState *bs)
-{
-    return bs->backing ? bs->backing->bs : NULL;
-}
-
 
 /* Essential block drivers which must always be statically linked into qemu, and
  * which therefore can be accessed without using bdrv_find_format() */
-- 
2.21.0




* [Qemu-devel] [PATCH v6 32/42] block: Make bdrv_get_cumulative_perm() public
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (30 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 31/42] block: Drop backing_bs() Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 33/42] blockdev: Fix active commit choice Max Reitz
                   ` (9 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

This is useful in other files like blockdev.c to determine for example
whether a node can be written to or not.
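As a rough illustration of what a caller gets back, the sketch below models the accumulation over a node's parents: permissions taken are ORed together, permissions shared are ANDed. ToyParent, the TOY_PERM_* values and toy_get_cumulative_perm() are hypothetical stand-ins, not the real BLK_PERM_* constants or the real function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission bits, for illustration only. */
enum {
    TOY_PERM_CONSISTENT_READ = 1u << 0,
    TOY_PERM_WRITE           = 1u << 1,
    TOY_PERM_WRITE_UNCHANGED = 1u << 2,
    TOY_PERM_RESIZE          = 1u << 3,
    TOY_PERM_ALL             = 0x0f,
};

/* What one parent has taken on the node and what it lets others take. */
typedef struct ToyParent {
    uint64_t perm;
    uint64_t shared_perm;
} ToyParent;

static void toy_get_cumulative_perm(const ToyParent *parents, size_t n,
                                    uint64_t *perm, uint64_t *shared_perm)
{
    uint64_t cumulative_perms = 0;
    uint64_t cumulative_shared_perms = TOY_PERM_ALL;

    assert(perm && shared_perm);
    for (size_t i = 0; i < n; i++) {
        cumulative_perms |= parents[i].perm;
        cumulative_shared_perms &= parents[i].shared_perm;
    }
    *perm = cumulative_perms;
    *shared_perm = cumulative_shared_perms;
}
```

A caller that only wants to know whether anyone writes to the node would then test the returned perm against the write bit.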

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block_int.h | 3 +++
 block.c                   | 6 ++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 786801c32f..c17df3808a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1205,6 +1205,9 @@ int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
  */
 int bdrv_child_refresh_perms(BlockDriverState *bs, BdrvChild *c, Error **errp);
 
+void bdrv_get_cumulative_perm(BlockDriverState *bs,
+                              uint64_t *perm, uint64_t *shared_perm);
+
 /* Default implementation for BlockDriver.bdrv_child_perm() that can be used by
  * block filters: Forward CONSISTENT_READ, WRITE, WRITE_UNCHANGED and RESIZE to
  * all children */
diff --git a/block.c b/block.c
index 6e1ddab056..915b80153c 100644
--- a/block.c
+++ b/block.c
@@ -1713,8 +1713,6 @@ static int bdrv_child_check_perm(BdrvChild *c, BlockReopenQueue *q,
                                  bool *tighten_restrictions, Error **errp);
 static void bdrv_child_abort_perm_update(BdrvChild *c);
 static void bdrv_child_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared);
-static void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
-                                     uint64_t *shared_perm);
 
 typedef struct BlockReopenQueueEntry {
      bool prepared;
@@ -1938,8 +1936,8 @@ static void bdrv_set_perm(BlockDriverState *bs, uint64_t cumulative_perms,
     }
 }
 
-static void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
-                                     uint64_t *shared_perm)
+void bdrv_get_cumulative_perm(BlockDriverState *bs,
+                              uint64_t *perm, uint64_t *shared_perm)
 {
     BdrvChild *c;
     uint64_t cumulative_perms = 0;
-- 
2.21.0




* [Qemu-devel] [PATCH v6 33/42] blockdev: Fix active commit choice
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (31 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 32/42] block: Make bdrv_get_cumulative_perm() public Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 34/42] block: Inline bdrv_co_block_status_from_*() Max Reitz
                   ` (8 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

We have to perform an active commit whenever the top node has a parent
that has taken the WRITE permission on it.
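The resulting decision can be sketched as a small predicate. This is a toy model under stated assumptions: ToyNode, TOY_PERM_WRITE, toy_skip_filters() and toy_needs_active_commit() are hypothetical, and top_perm is assumed to be the cumulative permission mask that the parents of the top node have taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_PERM_WRITE = 1u << 1 };  /* hypothetical bit value */

typedef struct ToyNode {
    bool is_filter;
    struct ToyNode *child;
} ToyNode;

/* First non-filter node at or below @n. */
static ToyNode *toy_skip_filters(ToyNode *n)
{
    while (n && n->is_filter) {
        n = n->child;
    }
    return n;
}

/*
 * Active commit if some parent writes to @top_bs, or if @top_bs and the
 * active node @bs are the same node once filters are skipped.
 */
static bool toy_needs_active_commit(uint64_t top_perm, ToyNode *top_bs,
                                    ToyNode *bs)
{
    assert(top_bs && bs);
    return (top_perm & TOY_PERM_WRITE) ||
           toy_skip_filters(top_bs) == toy_skip_filters(bs);
}
```

The second condition keeps the historical behaviour for the active layer even when nobody has taken the write permission on it.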

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 blockdev.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index ee8b951154..4e72f6f701 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3356,6 +3356,7 @@ void qmp_block_commit(bool has_job_id, const char *job_id, const char *device,
      */
     BlockdevOnError on_error = BLOCKDEV_ON_ERROR_REPORT;
     int job_flags = JOB_DEFAULT;
+    uint64_t top_perm, top_shared;
 
     if (!has_speed) {
         speed = 0;
@@ -3468,14 +3469,31 @@ void qmp_block_commit(bool has_job_id, const char *job_id, const char *device,
         goto out;
     }
 
-    if (top_bs == bs) {
+    /*
+     * Active commit is required if and only if someone has taken a
+     * WRITE permission on the top node.  Historically, we have always
+     * used active commit for top nodes, so continue that practice.
+     * (Active commit is never really wrong.)
+     */
+    bdrv_get_cumulative_perm(top_bs, &top_perm, &top_shared);
+    if (top_perm & BLK_PERM_WRITE ||
+        bdrv_skip_rw_filters(top_bs) == bdrv_skip_rw_filters(bs))
+    {
         if (has_backing_file) {
             error_setg(errp, "'backing-file' specified,"
                              " but 'top' is the active layer");
             goto out;
         }
-        commit_active_start(has_job_id ? job_id : NULL, bs, base_bs,
-                            job_flags, speed, on_error,
+        if (!has_job_id) {
+            /*
+             * Emulate here what block_job_create() does, because it
+             * is possible that @bs != @top_bs (the block job should
+             * be named after @bs, even if @top_bs is the actual
+             * source)
+             */
+            job_id = bdrv_get_device_name(bs);
+        }
+        commit_active_start(job_id, top_bs, base_bs, job_flags, speed, on_error,
                             filter_node_name, NULL, NULL, false, &local_err);
     } else {
         BlockDriverState *overlay_bs = bdrv_find_overlay(bs, top_bs);
-- 
2.21.0




* [Qemu-devel] [PATCH v6 34/42] block: Inline bdrv_co_block_status_from_*()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (32 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 33/42] blockdev: Fix active commit choice Max Reitz
@ 2019-08-09 16:13 ` Max Reitz
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node() Max Reitz
                   ` (7 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:13 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

With bdrv_filtered_rw_bs(), we can easily handle this default filter
behavior in bdrv_co_block_status().

blkdebug needs an additional assertion, so it keeps its own
implementation; the body of bdrv_co_block_status_from_file() is inlined
there because that helper is removed.
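The default filter behavior amounts to passing the query through unchanged to the filtered child. A toy sketch follows, shaped like the lines inlined into blkdebug in this patch; ToyNode, the TOY_BLOCK_* values and toy_filter_block_status() are hypothetical and do not match the real BDRV_BLOCK_* constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status flags, for illustration only. */
enum {
    TOY_BLOCK_RAW          = 1 << 0,  /* "ask *file at *map instead" */
    TOY_BLOCK_OFFSET_VALID = 1 << 1,  /* *map holds a usable offset */
};

typedef struct ToyNode {
    struct ToyNode *child;  /* the filtered child */
} ToyNode;

/* Report the whole range as living at the same offset in the child. */
static int toy_filter_block_status(ToyNode *bs, int64_t offset,
                                   int64_t bytes, int64_t *pnum,
                                   int64_t *map, ToyNode **file)
{
    assert(bs && bs->child);
    *pnum = bytes;
    *map = offset;
    *file = bs->child;
    return TOY_BLOCK_RAW | TOY_BLOCK_OFFSET_VALID;
}
```

Since every plain filter would answer this way, the generic code can produce the answer itself and the per-driver callbacks can go away.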

Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/block_int.h | 22 -----------------
 block/blkdebug.c          |  7 ++++--
 block/blklogwrites.c      |  1 -
 block/commit.c            |  1 -
 block/copy-on-read.c      |  2 --
 block/io.c                | 51 +++++++++++++--------------------------
 block/mirror.c            |  1 -
 block/throttle.c          |  1 -
 8 files changed, 22 insertions(+), 64 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index c17df3808a..42ee2fcf7f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1227,28 +1227,6 @@ void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c,
                                uint64_t perm, uint64_t shared,
                                uint64_t *nperm, uint64_t *nshared);
 
-/*
- * Default implementation for drivers to pass bdrv_co_block_status() to
- * their file.
- */
-int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs,
-                                                bool want_zero,
-                                                int64_t offset,
-                                                int64_t bytes,
-                                                int64_t *pnum,
-                                                int64_t *map,
-                                                BlockDriverState **file);
-/*
- * Default implementation for drivers to pass bdrv_co_block_status() to
- * their backing file.
- */
-int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
-                                                   bool want_zero,
-                                                   int64_t offset,
-                                                   int64_t bytes,
-                                                   int64_t *pnum,
-                                                   int64_t *map,
-                                                   BlockDriverState **file);
 const char *bdrv_get_parent_name(const BlockDriverState *bs);
 void blk_dev_change_media_cb(BlockBackend *blk, bool load, Error **errp);
 bool blk_dev_has_removable_media(BlockBackend *blk);
diff --git a/block/blkdebug.c b/block/blkdebug.c
index 5ae96c52b0..0011e068ce 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -679,8 +679,11 @@ static int coroutine_fn blkdebug_co_block_status(BlockDriverState *bs,
         return err;
     }
 
-    return bdrv_co_block_status_from_file(bs, want_zero, offset, bytes,
-                                          pnum, map, file);
+    assert(bs->file && bs->file->bs);
+    *pnum = bytes;
+    *map = offset;
+    *file = bs->file->bs;
+    return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
 }
 
 static void blkdebug_close(BlockDriverState *bs)
diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index 04d8b33607..8982fd15c4 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -519,7 +519,6 @@ static BlockDriver bdrv_blk_log_writes = {
     .bdrv_co_pwrite_zeroes  = blk_log_writes_co_pwrite_zeroes,
     .bdrv_co_flush_to_disk  = blk_log_writes_co_flush_to_disk,
     .bdrv_co_pdiscard       = blk_log_writes_co_pdiscard,
-    .bdrv_co_block_status   = bdrv_co_block_status_from_file,
 
     .is_filter              = true,
     .strong_runtime_opts    = blk_log_writes_strong_runtime_opts,
diff --git a/block/commit.c b/block/commit.c
index 40d1c8eeac..c94678bea3 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -252,7 +252,6 @@ static void bdrv_commit_top_child_perm(BlockDriverState *bs, BdrvChild *c,
 static BlockDriver bdrv_commit_top = {
     .format_name                = "commit_top",
     .bdrv_co_preadv             = bdrv_commit_top_preadv,
-    .bdrv_co_block_status       = bdrv_co_block_status_from_backing,
     .bdrv_refresh_filename      = bdrv_commit_top_refresh_filename,
     .bdrv_child_perm            = bdrv_commit_top_child_perm,
 
diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 16bdf630b6..ad6577d078 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -160,8 +160,6 @@ static BlockDriver bdrv_copy_on_read = {
     .bdrv_eject                         = cor_eject,
     .bdrv_lock_medium                   = cor_lock_medium,
 
-    .bdrv_co_block_status               = bdrv_co_block_status_from_file,
-
     .bdrv_recurse_is_first_non_filter   = cor_recurse_is_first_non_filter,
 
     .has_variable_length                = true,
diff --git a/block/io.c b/block/io.c
index e222d91893..d7d9757f46 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2028,36 +2028,6 @@ typedef struct BdrvCoBlockStatusData {
     bool done;
 } BdrvCoBlockStatusData;
 
-int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs,
-                                                bool want_zero,
-                                                int64_t offset,
-                                                int64_t bytes,
-                                                int64_t *pnum,
-                                                int64_t *map,
-                                                BlockDriverState **file)
-{
-    assert(bs->file && bs->file->bs);
-    *pnum = bytes;
-    *map = offset;
-    *file = bs->file->bs;
-    return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
-}
-
-int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
-                                                   bool want_zero,
-                                                   int64_t offset,
-                                                   int64_t bytes,
-                                                   int64_t *pnum,
-                                                   int64_t *map,
-                                                   BlockDriverState **file)
-{
-    assert(bs->backing && bs->backing->bs);
-    *pnum = bytes;
-    *map = offset;
-    *file = bs->backing->bs;
-    return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
-}
-
 /*
  * Returns the allocation status of the specified sectors.
  * Drivers not implementing the functionality are assumed to not support
@@ -2098,6 +2068,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     BlockDriverState *local_file = NULL;
     int64_t aligned_offset, aligned_bytes;
     uint32_t align;
+    bool has_filtered_child;
 
     assert(pnum);
     *pnum = 0;
@@ -2123,7 +2094,8 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
 
     /* Must be non-NULL or bdrv_getlength() would have failed */
     assert(bs->drv);
-    if (!bs->drv->bdrv_co_block_status) {
+    has_filtered_child = bs->drv->is_filter && bdrv_filtered_rw_child(bs);
+    if (!bs->drv->bdrv_co_block_status && !has_filtered_child) {
         *pnum = bytes;
         ret = BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED;
         if (offset + bytes == total_size) {
@@ -2144,9 +2116,20 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     aligned_offset = QEMU_ALIGN_DOWN(offset, align);
     aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
 
-    ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
-                                        aligned_bytes, pnum, &local_map,
-                                        &local_file);
+    if (bs->drv->bdrv_co_block_status) {
+        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+                                            aligned_bytes, pnum, &local_map,
+                                            &local_file);
+    } else {
+        /* Default code for filters */
+
+        local_file = bdrv_filtered_rw_bs(bs);
+        assert(local_file);
+
+        *pnum = aligned_bytes;
+        local_map = aligned_offset;
+        ret = BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
+    }
     if (ret < 0) {
         *pnum = 0;
         goto out;
diff --git a/block/mirror.c b/block/mirror.c
index 6ddbfb9708..88155faaec 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1508,7 +1508,6 @@ static BlockDriver bdrv_mirror_top = {
     .bdrv_co_pwrite_zeroes      = bdrv_mirror_top_pwrite_zeroes,
     .bdrv_co_pdiscard           = bdrv_mirror_top_pdiscard,
     .bdrv_co_flush              = bdrv_mirror_top_flush,
-    .bdrv_co_block_status       = bdrv_co_block_status_from_backing,
     .bdrv_refresh_filename      = bdrv_mirror_top_refresh_filename,
     .bdrv_child_perm            = bdrv_mirror_top_child_perm,
     .bdrv_refresh_limits        = bdrv_mirror_top_refresh_limits,
diff --git a/block/throttle.c b/block/throttle.c
index 958a2bcfa6..d0436f875b 100644
--- a/block/throttle.c
+++ b/block/throttle.c
@@ -270,7 +270,6 @@ static BlockDriver bdrv_throttle = {
     .bdrv_reopen_prepare                =   throttle_reopen_prepare,
     .bdrv_reopen_commit                 =   throttle_reopen_commit,
     .bdrv_reopen_abort                  =   throttle_reopen_abort,
-    .bdrv_co_block_status               =   bdrv_co_block_status_from_file,
 
     .bdrv_co_drain_begin                =   throttle_co_drain_begin,
     .bdrv_co_drain_end                  =   throttle_co_drain_end,
-- 
2.21.0




* [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node()
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (33 preceding siblings ...)
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 34/42] block: Inline bdrv_co_block_status_from_*() Max Reitz
@ 2019-08-09 16:14 ` Max Reitz
  2019-08-15 15:21   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 36/42] iotests: Add tests for mirror @replaces loops Max Reitz
                   ` (6 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:14 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Currently, check_to_replace_node() only allows mirror to replace a node
in the chain of the source node, and only if it is the first non-filter
node below the source.  Well, technically, the idea is that you can
exactly replace a quorum child by mirroring from quorum.

This has (probably) two reasons:
(1) We do not want to create loops.
(2) @replaces and @device should have exactly the same content so
    replacing them does not cause visible data to change.

This has two issues:
(1) It is overly restrictive.  It is completely fine for @replaces to be
    a filter.
(2) It is not restrictive enough.  You can create loops with this as
    follows:

$ qemu-img create -f qcow2 /tmp/source.qcow2 64M
$ qemu-system-x86_64 -qmp stdio
{"execute": "qmp_capabilities"}
{"execute": "object-add",
 "arguments": {"qom-type": "throttle-group", "id": "tg0"}}
{"execute": "blockdev-add",
 "arguments": {
     "node-name": "source",
     "driver": "throttle",
     "throttle-group": "tg0",
     "file": {
         "node-name": "filtered",
         "driver": "qcow2",
         "file": {
             "driver": "file",
             "filename": "/tmp/source.qcow2"
         } } } }
{"execute": "drive-mirror",
 "arguments": {
     "job-id": "mirror",
     "device": "source",
     "target": "/tmp/target.qcow2",
     "format": "qcow2",
     "node-name": "target",
     "sync" :"none",
     "replaces": "filtered"
 } }
{"execute": "block-job-complete", "arguments": {"device": "mirror"}}

qemu then crashes with a stack overflow because of the loop that has
been created (target's backing file is source, so when target replaces
filtered, target points to itself through source).

(blockdev-mirror can be broken similarly.)

So let us make the checks for the two conditions above explicit, which
makes the whole function exactly as restrictive as it needs to be.
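
As a rough illustration of the two explicit checks, here is a minimal
Python model of the graph walk.  It is only a sketch, not the C code from
this patch: the Node class, check_to_replace() and its string return
values are made up for the example, and the other checks the real
function performs are left out.

```python
class Node:
    """Toy stand-in for a BlockDriverState (hypothetical, for illustration)."""
    def __init__(self, name, is_filter=False, children=()):
        self.name = name
        self.is_filter = is_filter
        self.children = list(children)


def is_child_of(child, parent):
    # True if @child is reachable from @parent through any children.
    if parent is None:
        return False
    return any(c is child or is_child_of(child, c) for c in parent.children)


def is_filtered_child(top, base):
    # True if there are only filters in [@top, @base).
    if top is None:
        return False
    if top is base:
        return True
    if not top.is_filter:
        return False
    return any(is_filtered_child(c, base) for c in top.children)


def check_to_replace(parent, backing, to_replace):
    # Condition (1): replacing must not create a loop.
    if to_replace is backing or is_child_of(to_replace, backing):
        return "loop"
    # Condition (2): only filters between source and replaced node.
    if (not is_filtered_child(parent, to_replace) and
            not is_filtered_child(to_replace, parent)):
        return "not-filtered"
    return "ok"


# The graph from the reproducer: throttle (source) -> qcow2 (filtered) -> file
proto = Node("file")
filtered = Node("filtered", children=[proto])
source = Node("source", is_filter=True, children=[filtered])

# sync=none: @source becomes the target's backing node, so this is a loop.
loop_case = check_to_replace(source, source, filtered)
# sync=full: no backing node is attached, so replacing @filtered is fine.
full_case = check_to_replace(source, None, filtered)
# The protocol node sits below a non-filter, so it must not be replaced.
proto_case = check_to_replace(source, None, proto)
```

In this model, the reproducer above is rejected by the first check,
while replacing the filtered node in a sync=full mirror is accepted.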

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block.h |  1 +
 block.c               | 83 +++++++++++++++++++++++++++++++++++++++----
 blockdev.c            | 34 ++++++++++++++++--
 3 files changed, 110 insertions(+), 8 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 6ba853fb90..8da706cd89 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -404,6 +404,7 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate);
 
 /* check if a named node can be replaced when doing drive-mirror */
 BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
+                                        BlockDriverState *backing_bs,
                                         const char *node_name, Error **errp);
 
 /* async block I/O */
diff --git a/block.c b/block.c
index 915b80153c..4858d3e718 100644
--- a/block.c
+++ b/block.c
@@ -6290,7 +6290,59 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate)
     return false;
 }
 
+static bool is_child_of(BlockDriverState *child, BlockDriverState *parent)
+{
+    BdrvChild *c;
+
+    if (!parent) {
+        return false;
+    }
+
+    QLIST_FOREACH(c, &parent->children, next) {
+        if (c->bs == child || is_child_of(child, c->bs)) {
+            return true;
+        }
+    }
+
+    return false;
+}
+
+/*
+ * Return true if there are only filters in [@top, @base).  Note that
+ * this may include quorum (which bdrv_chain_contains() cannot
+ * handle).
+ */
+static bool is_filtered_child(BlockDriverState *top, BlockDriverState *base)
+{
+    BdrvChild *c;
+
+    if (!top) {
+        return false;
+    }
+
+    if (top == base) {
+        return true;
+    }
+
+    if (!top->drv->is_filter) {
+        return false;
+    }
+
+    QLIST_FOREACH(c, &top->children, next) {
+        if (is_filtered_child(c->bs, base)) {
+            return true;
+        }
+    }
+
+    return false;
+}
+
+/*
+ * @parent_bs is mirror's source BDS, @backing_bs is the BDS which
+ * will be attached to the target when mirror completes.
+ */
 BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
+                                        BlockDriverState *backing_bs,
                                         const char *node_name, Error **errp)
 {
     BlockDriverState *to_replace_bs = bdrv_find_node(node_name);
@@ -6309,13 +6361,32 @@ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
         goto out;
     }
 
-    /* We don't want arbitrary node of the BDS chain to be replaced only the top
-     * most non filter in order to prevent data corruption.
-     * Another benefit is that this tests exclude backing files which are
-     * blocked by the backing blockers.
+    /*
+     * If to_replace_bs is (recursively) a child of backing_bs,
+     * replacing it may create a loop.  We cannot allow that.
      */
-    if (!bdrv_recurse_is_first_non_filter(parent_bs, to_replace_bs)) {
-        error_setg(errp, "Only top most non filter can be replaced");
+    if (to_replace_bs == backing_bs || is_child_of(to_replace_bs, backing_bs)) {
+        error_setg(errp, "Replacing this node would result in a loop");
+        to_replace_bs = NULL;
+        goto out;
+    }
+
+    /*
+     * Mirror is designed in such a way that when it completes, the
+     * source BDS is seamlessly replaced.  It is therefore not allowed
+     * to replace a BDS where this condition would be violated, as that
+     * would defeat the purpose of mirror and could lead to data
+     * corruption.
+     * Therefore, between parent_bs and to_replace_bs there may be
+     * only filters (and the one on top must be a filter, too), so
+     * their data always stays in sync and mirror can complete and
+     * replace to_replace_bs without any possible corruptions.
+     */
+    if (!is_filtered_child(parent_bs, to_replace_bs) &&
+        !is_filtered_child(to_replace_bs, parent_bs))
+    {
+        error_setg(errp, "The node to be replaced must be connected to the "
+                   "source through filter nodes only");
         to_replace_bs = NULL;
         goto out;
     }
diff --git a/blockdev.c b/blockdev.c
index 4e72f6f701..758e0b5431 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3887,7 +3887,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
     }
 
     if (has_replaces) {
-        BlockDriverState *to_replace_bs;
+        BlockDriverState *to_replace_bs, *backing_bs;
         AioContext *replace_aio_context;
         int64_t bs_size, replace_size;
 
@@ -3897,7 +3897,37 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
             return;
         }
 
-        to_replace_bs = check_to_replace_node(bs, replaces, errp);
+        if (backing_mode == MIRROR_SOURCE_BACKING_CHAIN ||
+            backing_mode == MIRROR_OPEN_BACKING_CHAIN)
+        {
+            /*
+             * While we do not quite know what OPEN_BACKING_CHAIN
+             * (used for mode=existing) will yield, it is probably
+             * best to restrict it exactly like SOURCE_BACKING_CHAIN,
+             * because that is our best guess.
+             */
+            switch (sync) {
+            case MIRROR_SYNC_MODE_FULL:
+                backing_bs = NULL;
+                break;
+
+            case MIRROR_SYNC_MODE_TOP:
+                backing_bs = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs));
+                break;
+
+            case MIRROR_SYNC_MODE_NONE:
+                backing_bs = bs;
+                break;
+
+            default:
+                abort();
+            }
+        } else {
+            assert(backing_mode == MIRROR_LEAVE_BACKING_CHAIN);
+            backing_bs = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(target));
+        }
+
+        to_replace_bs = check_to_replace_node(bs, backing_bs, replaces, errp);
         if (!to_replace_bs) {
             return;
         }
-- 
2.21.0




* [Qemu-devel] [PATCH v6 36/42] iotests: Add tests for mirror @replaces loops
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (34 preceding siblings ...)
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node() Max Reitz
@ 2019-08-09 16:14 ` Max Reitz
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 37/42] block: Leave BDS.backing_file constant Max Reitz
                   ` (5 subsequent siblings)
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:14 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

This adds two tests for cases where our old check_to_replace_node()
function failed to detect that executing this job with these parameters
would result in a cyclic graph.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/041     | 124 +++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/041.out |   4 +-
 2 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/041 b/tests/qemu-iotests/041
index 26bf1701eb..0c1432f189 100755
--- a/tests/qemu-iotests/041
+++ b/tests/qemu-iotests/041
@@ -1067,5 +1067,129 @@ class TestOrphanedSource(iotests.QMPTestCase):
                              target='dest-ro')
         self.assert_qmp(result, 'error/class', 'GenericError')
 
+# Various tests for the @replaces option (independent of quorum)
+class TestReplaces(iotests.QMPTestCase):
+    def setUp(self):
+        self.vm = iotests.VM()
+        self.vm.launch()
+
+    def tearDown(self):
+        self.vm.shutdown()
+
+    def test_drive_mirror_loop(self):
+        qemu_img('create', '-f', iotests.imgfmt, test_img, '1M')
+
+        result = self.vm.qmp('object-add', qom_type='throttle-group', id='tg')
+        self.assert_qmp(result, 'return', {})
+
+        result = self.vm.qmp('blockdev-add', **{
+                    'node-name': 'source',
+                    'driver': 'throttle',
+                    'throttle-group': 'tg',
+                    'file': {
+                        'node-name': 'filtered',
+                        'driver': iotests.imgfmt,
+                        'file': {
+                            'driver': 'file',
+                            'filename': test_img
+                        }
+                    }
+                })
+        self.assert_qmp(result, 'return', {})
+
+        # Mirror from @source to @target in sync=none, so that @source
+        # will be @target's backing file; but replace @filtered.
+        # Then, @target's backing file will be @source, whose backing
+        # file is now @target instead of @filtered.  That is a loop.
+        # (But apart from the loop, replacing @filtered instead of
+        # @source is fine, because both are just filtered versions of
+        # each other.)
+        result = self.vm.qmp('drive-mirror',
+                             job_id='mirror',
+                             device='source',
+                             target=target_img,
+                             format=iotests.imgfmt,
+                             node_name='target',
+                             sync='none',
+                             replaces='filtered')
+        if 'error' in result:
+            # This is the correct result
+            self.assert_qmp(result, 'error/class', 'GenericError')
+        else:
+            # This is wrong, but let's run it to the bitter conclusion
+            self.complete_and_wait(drive='mirror')
+            # Fail for good measure, although qemu should have crashed
+            # anyway
+            self.fail('Loop creation was successful')
+
+        os.remove(test_img)
+        try:
+            os.remove(target_img)
+        except OSError:
+            pass
+
+    def test_blockdev_mirror_loop(self):
+        qemu_img('create', '-f', iotests.imgfmt, test_img, '1M')
+        qemu_img('create', '-f', iotests.imgfmt, target_img, '1M')
+
+        result = self.vm.qmp('object-add', qom_type='throttle-group', id='tg')
+        self.assert_qmp(result, 'return', {})
+
+        result = self.vm.qmp('blockdev-add', **{
+                    'node-name': 'source',
+                    'driver': 'throttle',
+                    'throttle-group': 'tg',
+                    'file': {
+                        'node-name': 'middle',
+                        'driver': 'throttle',
+                        'throttle-group': 'tg',
+                        'file': {
+                            'node-name': 'bottom',
+                            'driver': iotests.imgfmt,
+                            'file': {
+                                'driver': 'file',
+                                'filename': test_img
+                            }
+                        }
+                    }
+                })
+        self.assert_qmp(result, 'return', {})
+
+        result = self.vm.qmp('blockdev-add', **{
+                    'node-name': 'target',
+                    'driver': iotests.imgfmt,
+                    'file': {
+                        'driver': 'file',
+                        'filename': target_img
+                    },
+                    'backing': 'middle'
+                })
+
+        # Mirror from @source to @target.  With blockdev-mirror, the
+        # current (old) backing file is retained (which is @middle).
+        # By replacing @bottom, @middle's file will be @target, whose
+        # backing file is @middle again.  That is a loop.
+        # (But apart from the loop, replacing @bottom instead of
+        # @source is fine, because both are just filtered versions of
+        # each other.)
+        result = self.vm.qmp('blockdev-mirror',
+                             job_id='mirror',
+                             device='source',
+                             target='target',
+                             sync='full',
+                             replaces='bottom')
+        if 'error' in result:
+            # This is the correct result
+            self.assert_qmp(result, 'error/class', 'GenericError')
+        else:
+            # This is wrong, but let's run it to the bitter conclusion
+            self.complete_and_wait(drive='mirror')
+            # Fail for good measure, although qemu should have crashed
+            # anyway
+            self.fail('Loop creation was successful')
+
+        os.remove(test_img)
+        os.remove(target_img)
+
 if __name__ == '__main__':
     iotests.main(supported_fmts=['qcow2', 'qed'])
diff --git a/tests/qemu-iotests/041.out b/tests/qemu-iotests/041.out
index e071d0b261..2c448b4239 100644
--- a/tests/qemu-iotests/041.out
+++ b/tests/qemu-iotests/041.out
@@ -1,5 +1,5 @@
-........................................................................................
+..........................................................................................
 ----------------------------------------------------------------------
-Ran 88 tests
+Ran 90 tests
 
 OK
-- 
2.21.0




* [Qemu-devel] [PATCH v6 37/42] block: Leave BDS.backing_file constant
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (35 preceding siblings ...)
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 36/42] iotests: Add tests for mirror @replaces loops Max Reitz
@ 2019-08-09 16:14 ` Max Reitz
  2019-08-16 16:16   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 38/42] iotests: Let complete_and_wait() work with commit Max Reitz
                   ` (4 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:14 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Parts of the block layer treat BDS.backing_file as if it were whatever
the image header says (i.e., if it is a relative path, it is relative to
the overlay), other parts treat it like a cache for
bs->backing->bs->filename (relative paths are relative to the CWD).
Considering bs->backing->bs->filename exists, let us make it mean the
former.

Among other things, this now allows the user to specify a base when
using qemu-img to commit an image file in a directory that is not the
CWD (assuming everything uses relative filenames).

Before this patch:

$ ./qemu-img create -f qcow2 foo/bot.qcow2 1M
$ ./qemu-img create -f qcow2 -b bot.qcow2 foo/mid.qcow2
$ ./qemu-img create -f qcow2 -b mid.qcow2 foo/top.qcow2
$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find '[...]/foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'

After this patch:

$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
Image committed.
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
Image committed.
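
The "after" results can be modelled roughly as follows.  This is a
simplified Python sketch under assumptions, not the logic of
bdrv_find_backing_image() itself: the function name is hypothetical, it
covers only the case where the backing file was not overridden and no
protocol prefix is involved, and it uses os.path.abspath() where QEMU's
own canonicalization may differ.

```python
import os.path


def matches_header_backing(overlay_filename, header_backing_file, candidate):
    """Hypothetical helper: does @candidate name the overlay's backing file?

    @header_backing_file is the name stored in the image header; relative
    paths are resolved relative to the overlay's directory, not the CWD.
    """
    # A name identical to the header string matches directly.
    if candidate == header_backing_file:
        return True

    overlay_dir = os.path.dirname(overlay_filename)
    # os.path.join() keeps an absolute second argument unchanged, so
    # absolute candidates are compared as they are.
    resolved_header = os.path.abspath(
        os.path.join(overlay_dir, header_backing_file))
    resolved_candidate = os.path.abspath(
        os.path.join(overlay_dir, candidate))
    return resolved_header == resolved_candidate


overlay = "foo/top.qcow2"
header = "mid.qcow2"

same_name = matches_header_backing(overlay, header, "mid.qcow2")
cwd_relative = matches_header_backing(overlay, header, "foo/mid.qcow2")
absolute = matches_header_backing(overlay, header,
                                  os.path.abspath("foo/mid.qcow2"))
```

Here "foo/mid.qcow2" does not match because it is resolved relative to the
overlay's directory, which yields foo/foo/mid.qcow2.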

With this change, bdrv_find_backing_image() must look at whether the
user has overridden a BDS's backing file.  If so, it can no longer use
bs->backing_file, but must instead compare the given filename against
the backing node's filename directly.

Note that this changes the QAPI output for a node's backing_file.  We
had very inconsistent output there (sometimes what the image header
said, sometimes the actual filename of the backing image).  This
inconsistent output was effectively useless, so we have to decide one
way or the other.  Considering that bs->backing_file usually at runtime
contained the path to the image relative to qemu's CWD (or absolute),
this patch changes QAPI's backing_file to always report the
bs->backing->bs->filename from now on.  If you want to receive the image
header information, you have to refer to full-backing-filename.

This necessitates a change to iotest 228.  The interesting information
it really wanted is the image header, and it can get that now, but it
has to use full-backing-filename instead of backing_file.  Because of
this patch's changes to bs->backing_file's behavior, we also need some
reference output changes.

Along with the changes to bs->backing_file, stop updating
BDS.backing_format in bdrv_backing_attach() as well.  This necessitates
a change to the reference output of iotest 191.

iotest 245 changes in behavior: With the backing node no longer
overriding the parent node's backing_file string, you can now omit the
@backing option when reopening a node with neither a default nor a
current backing file even if it used to have a backing node at some
point.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h  | 19 ++++++++++++++-----
 block.c                    | 35 ++++++++++++++++++++++++++++-------
 block/qapi.c               |  7 ++++---
 tests/qemu-iotests/191.out |  1 -
 tests/qemu-iotests/228     |  6 +++---
 tests/qemu-iotests/228.out |  6 +++---
 tests/qemu-iotests/245     |  4 +++-
 7 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 42ee2fcf7f..993bafc090 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -784,11 +784,20 @@ struct BlockDriverState {
     bool walking_aio_notifiers; /* to make removal during iteration safe */
 
     char filename[PATH_MAX];
-    char backing_file[PATH_MAX]; /* if non zero, the image is a diff of
-                                    this file image */
-    /* The backing filename indicated by the image header; if we ever
-     * open this file, then this is replaced by the resulting BDS's
-     * filename (i.e. after a bdrv_refresh_filename() run). */
+    /*
+     * If not empty, this image is a diff in relation to backing_file.
+     * Note that this is the name given in the image header and
+     * therefore may or may not be equal to .backing->bs->filename.
+     * If this field contains a relative path, it is to be resolved
+     * relatively to the overlay's location.
+     */
+    char backing_file[PATH_MAX];
+    /*
+     * The backing filename indicated by the image header.  Contrary
+     * to backing_file, if we ever open this file, auto_backing_file
+     * is replaced by the resulting BDS's filename (i.e. after a
+     * bdrv_refresh_filename() run).
+     */
     char auto_backing_file[PATH_MAX];
     char backing_format[16]; /* if non-zero and backing_file exists */
 
diff --git a/block.c b/block.c
index 4858d3e718..88533fa0d3 100644
--- a/block.c
+++ b/block.c
@@ -78,6 +78,8 @@ static BlockDriverState *bdrv_open_inherit(const char *filename,
                                            const BdrvChildRole *child_role,
                                            Error **errp);
 
+static bool bdrv_backing_overridden(BlockDriverState *bs);
+
 /* If non-zero, use only whitelisted block drivers */
 static int use_bdrv_whitelist;
 
@@ -1065,10 +1067,6 @@ static void bdrv_backing_attach(BdrvChild *c)
     bdrv_refresh_filename(backing_hd);
 
     parent->open_flags &= ~BDRV_O_NO_BACKING;
-    pstrcpy(parent->backing_file, sizeof(parent->backing_file),
-            backing_hd->filename);
-    pstrcpy(parent->backing_format, sizeof(parent->backing_format),
-            backing_hd->drv ? backing_hd->drv->format_name : "");
 
     bdrv_op_block_all(backing_hd, parent->backing_blocker);
     /* Otherwise we won't be able to commit or stream */
@@ -5294,6 +5292,7 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
     char *backing_file_full = NULL;
     char *filename_tmp = NULL;
     int is_protocol = 0;
+    bool filenames_refreshed = false;
     BlockDriverState *curr_bs = NULL;
     BlockDriverState *retval = NULL;
 
@@ -5318,9 +5317,31 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
     {
         BlockDriverState *bs_below = bdrv_backing_chain_next(curr_bs);
 
-        /* If either of the filename paths is actually a protocol, then
-         * compare unmodified paths; otherwise make paths relative */
-        if (is_protocol || path_has_protocol(curr_bs->backing_file)) {
+        if (bdrv_backing_overridden(curr_bs)) {
+            /*
+             * If the backing file was overridden, we can only compare
+             * directly against the backing node's filename.
+             */
+
+            if (!filenames_refreshed) {
+                /*
+                 * This will automatically refresh all of the
+                 * filenames in the rest of the backing chain, so we
+                 * only need to do this once.
+                 */
+                bdrv_refresh_filename(bs_below);
+                filenames_refreshed = true;
+            }
+
+            if (strcmp(backing_file, bs_below->filename) == 0) {
+                retval = bs_below;
+                break;
+            }
+        } else if (is_protocol || path_has_protocol(curr_bs->backing_file)) {
+            /*
+             * If either of the filename paths is actually a protocol, then
+             * compare unmodified paths; otherwise make paths relative.
+             */
             char *backing_file_full_ret;
 
             if (strcmp(backing_file, curr_bs->backing_file) == 0) {
diff --git a/block/qapi.c b/block/qapi.c
index 4f59ac1c0f..751c3e695a 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -45,7 +45,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
                                         BlockDriverState *bs, Error **errp)
 {
     ImageInfo **p_image_info;
-    BlockDriverState *bs0;
+    BlockDriverState *bs0, *backing;
     BlockDeviceInfo *info;
 
     if (!bs->drv) {
@@ -74,9 +74,10 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
         info->node_name = g_strdup(bs->node_name);
     }
 
-    if (bs->backing_file[0]) {
+    backing = bdrv_filtered_cow_bs(bs);
+    if (backing) {
         info->has_backing_file = true;
-        info->backing_file = g_strdup(bs->backing_file);
+        info->backing_file = g_strdup(backing->filename);
     }
 
     if (!QLIST_EMPTY(&bs->dirty_bitmaps)) {
diff --git a/tests/qemu-iotests/191.out b/tests/qemu-iotests/191.out
index 3fc92bb56e..0b3c216b0c 100644
--- a/tests/qemu-iotests/191.out
+++ b/tests/qemu-iotests/191.out
@@ -605,7 +605,6 @@ wrote 65536/65536 bytes at offset 1048576
                     "backing-filename": "TEST_DIR/t.IMGFMT.base",
                     "dirty-flag": false
                 },
-                "backing-filename-format": "IMGFMT",
                 "virtual-size": 67108864,
                 "filename": "TEST_DIR/t.IMGFMT.ovl3",
                 "cluster-size": 65536,
diff --git a/tests/qemu-iotests/228 b/tests/qemu-iotests/228
index 9a50afd205..a1f3187212 100755
--- a/tests/qemu-iotests/228
+++ b/tests/qemu-iotests/228
@@ -34,7 +34,7 @@ def log_node_info(node):
 
     log('bs->filename: ' + node['image']['filename'],
         filters=[filter_testfiles, filter_imgfmt])
-    log('bs->backing_file: ' + node['backing_file'],
+    log('bs->backing_file: ' + node['image']['full-backing-filename'],
         filters=[filter_testfiles, filter_imgfmt])
 
     if 'backing-image' in node['image']:
@@ -70,8 +70,8 @@ with iotests.FilePath('base.img') as base_img_path, \
                 },
                 filters=[filter_qmp_testfiles, filter_qmp_imgfmt])
 
-    # Filename should be plain, and the backing filename should not
-    # contain the "file:" prefix
+    # Filename should be plain, and the backing node filename should
+    # not contain the "file:" prefix
     log_node_info(vm.node_info('node0'))
 
     vm.qmp_log('blockdev-del', node_name='node0')
diff --git a/tests/qemu-iotests/228.out b/tests/qemu-iotests/228.out
index 4217df24fe..8c82009abe 100644
--- a/tests/qemu-iotests/228.out
+++ b/tests/qemu-iotests/228.out
@@ -4,7 +4,7 @@
 {"return": {}}
 
 bs->filename: TEST_DIR/PID-top.img
-bs->backing_file: TEST_DIR/PID-base.img
+bs->backing_file: file:TEST_DIR/PID-base.img
 bs->backing->bs->filename: TEST_DIR/PID-base.img
 
 {"execute": "blockdev-del", "arguments": {"node-name": "node0"}}
@@ -41,7 +41,7 @@ bs->backing->bs->filename: TEST_DIR/PID-base.img
 {"return": {}}
 
 bs->filename: TEST_DIR/PID-top.img
-bs->backing_file: TEST_DIR/PID-base.img
+bs->backing_file: file:TEST_DIR/PID-base.img
 bs->backing->bs->filename: TEST_DIR/PID-base.img
 
 {"execute": "blockdev-del", "arguments": {"node-name": "node0"}}
@@ -55,7 +55,7 @@ bs->backing->bs->filename: TEST_DIR/PID-base.img
 {"return": {}}
 
 bs->filename: json:{"backing": {"driver": "null-co"}, "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/PID-top.img"}}
-bs->backing_file: null-co://
+bs->backing_file: TEST_DIR/PID-base.img
 bs->backing->bs->filename: null-co://
 
 {"execute": "blockdev-del", "arguments": {"node-name": "node0"}}
diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245
index bc1ceb9792..049ef6a71f 100644
--- a/tests/qemu-iotests/245
+++ b/tests/qemu-iotests/245
@@ -722,7 +722,9 @@ class TestBlockdevReopen(iotests.QMPTestCase):
 
         # Detach hd2 from hd0.
         self.reopen(opts, {'backing': None})
-        self.reopen(opts, {}, "backing is missing for 'hd0'")
+
+        # Without a backing file, we can omit 'backing' again
+        self.reopen(opts)
 
         # Remove both hd0 and hd2
         result = self.vm.qmp('blockdev-del', conv_keys = True, node_name = 'hd0')
-- 
2.21.0




* [Qemu-devel] [PATCH v6 38/42] iotests: Let complete_and_wait() work with commit
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (36 preceding siblings ...)
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 37/42] block: Leave BDS.backing_file constant Max Reitz
@ 2019-08-09 16:14 ` Max Reitz
  2019-08-23  5:59   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 39/42] iotests: Add filter commit test cases Max Reitz
                   ` (3 subsequent siblings)
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:14 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

complete_and_wait() and wait_ready() currently only work for mirror
jobs.  Let them work for active commit jobs, too.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
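Not part of the patch: a minimal sketch of the idea behind the new
events_wait() call in wait_ready(), namely accepting the first event that
satisfies any one of several (name, pattern) pairs.  The helpers matches()
and first_matching() are hypothetical names made up for this note and are
not the iotests API; the event dictionaries are hand-written.

```python
def matches(event, pattern):
    """Return True if every key in pattern appears in event with an equal
    value; nested dicts are compared recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict):
                return False
            if not matches(event[key], expected):
                return False
        elif event[key] != expected:
            return False
    return True


def first_matching(events, wanted):
    """Return the first event matching any (name, pattern) pair, else None."""
    for event in events:
        for name, pattern in wanted:
            if event.get('event') == name and matches(event, pattern):
                return event
    return None


# Hypothetical event stream: a commit job on 'drive0' becomes ready.
events = [
    {'event': 'BLOCK_JOB_READY',
     'data': {'type': 'commit', 'device': 'other'}},
    {'event': 'BLOCK_JOB_READY',
     'data': {'type': 'commit', 'device': 'drive0'}},
]
wanted = [
    ('BLOCK_JOB_READY', {'data': {'type': 'mirror', 'device': 'drive0'}}),
    ('BLOCK_JOB_READY', {'data': {'type': 'commit', 'device': 'drive0'}}),
]
ready = first_matching(events, wanted)
```

With a single mirror-only pattern, the commit event above would never be
accepted, which is why wait_ready() needs the second pair.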
 tests/qemu-iotests/iotests.py | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 84438e837c..3ef846d1dc 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -761,8 +761,12 @@ class QMPTestCase(unittest.TestCase):
 
     def wait_ready(self, drive='drive0'):
         '''Wait until a block job BLOCK_JOB_READY event'''
-        f = {'data': {'type': 'mirror', 'device': drive } }
-        event = self.vm.event_wait(name='BLOCK_JOB_READY', match=f)
+        event = self.vm.events_wait([
+                ('BLOCK_JOB_READY',
+                 {'data': {'type': 'mirror', 'device': drive } }),
+                ('BLOCK_JOB_READY',
+                 {'data': {'type': 'commit', 'device': drive } })
+            ])
 
     def wait_ready_and_cancel(self, drive='drive0'):
         self.wait_ready(drive=drive)
@@ -780,7 +784,7 @@ class QMPTestCase(unittest.TestCase):
         self.assert_qmp(result, 'return', {})
 
         event = self.wait_until_completed(drive=drive)
-        self.assert_qmp(event, 'data/type', 'mirror')
+        self.assertTrue(event['data']['type'] in ['mirror', 'commit'])
 
     def pause_wait(self, job_id='job0'):
         with Timeout(1, "Timeout waiting for job to pause"):
-- 
2.21.0




* [Qemu-devel] [PATCH v6 39/42] iotests: Add filter commit test cases
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (37 preceding siblings ...)
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 38/42] iotests: Let complete_and_wait() work with commit Max Reitz
@ 2019-08-09 16:14 ` Max Reitz
  2019-08-31 11:41   ` Vladimir Sementsov-Ogievskiy
  2019-08-31 12:35   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 40/42] iotests: Add filter mirror " Max Reitz
                   ` (2 subsequent siblings)
  41 siblings, 2 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:14 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

This patch adds some tests on how commit copes with filter nodes.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
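Not part of the patch: a small sketch of the bookkeeping that tearDown()
relies on.  Pattern index + 1 is expected at offset index MiB of
pattern_files[index], so a test case that commits one image into another
only has to repoint the affected entry.  verification_commands() is a
hypothetical helper written for this note, and the image names are
stand-ins for self.img0 through self.img3.

```python
def verification_commands(pattern_files):
    """Build (qemu-io command, image) pairs the same way tearDown() formats
    its read commands."""
    return [('read -P %i %iM 1M' % (index + 1, index), image)
            for index, image in enumerate(pattern_files)]


# Stand-ins for self.img0 .. self.img3
pattern_files = ['0.img', '1.img', '2.img', '3.img']

# After committing cow-2 into cow-1, pattern 3 must be readable from 1.img
pattern_files[2] = '1.img'

commands = verification_commands(pattern_files)
```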
 tests/qemu-iotests/040     | 177 +++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/040.out |   4 +-
 2 files changed, 179 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
index 6db9abf8e6..a0a0db8889 100755
--- a/tests/qemu-iotests/040
+++ b/tests/qemu-iotests/040
@@ -428,5 +428,182 @@ class TestReopenOverlay(ImageCommitTestCase):
     def test_reopen_overlay(self):
         self.run_commit_test(self.img1, self.img0)
 
+class TestCommitWithFilters(iotests.QMPTestCase):
+    img0 = os.path.join(iotests.test_dir, '0.img')
+    img1 = os.path.join(iotests.test_dir, '1.img')
+    img2 = os.path.join(iotests.test_dir, '2.img')
+    img3 = os.path.join(iotests.test_dir, '3.img')
+
+    def setUp(self):
+        qemu_img('create', '-f', iotests.imgfmt, self.img0, '64M')
+        qemu_img('create', '-f', iotests.imgfmt, self.img1, '64M')
+        qemu_img('create', '-f', iotests.imgfmt, self.img2, '64M')
+        qemu_img('create', '-f', iotests.imgfmt, self.img3, '64M')
+
+        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 1 0M 1M', self.img0)
+        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 2 1M 1M', self.img1)
+        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 3 2M 1M', self.img2)
+        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 4 3M 1M', self.img3)
+
+        # Distributions of the patterns in the files; this is checked
+        # by tearDown() and should be changed by the test cases as is
+        # necessary
+        self.pattern_files = [self.img0, self.img1, self.img2, self.img3]
+
+        self.vm = iotests.VM()
+        self.vm.launch()
+        self.has_quit = False
+
+        result = self.vm.qmp('object-add', qom_type='throttle-group', id='tg')
+        self.assert_qmp(result, 'return', {})
+
+        result = self.vm.qmp('blockdev-add', **{
+                'node-name': 'top-filter',
+                'driver': 'throttle',
+                'throttle-group': 'tg',
+                'file': {
+                    'node-name': 'cow-3',
+                    'driver': iotests.imgfmt,
+                    'file': {
+                        'driver': 'file',
+                        'filename': self.img3
+                    },
+                    'backing': {
+                        'node-name': 'cow-2',
+                        'driver': iotests.imgfmt,
+                        'file': {
+                            'driver': 'file',
+                            'filename': self.img2
+                        },
+                        'backing': {
+                            'node-name': 'cow-1',
+                            'driver': iotests.imgfmt,
+                            'file': {
+                                'driver': 'file',
+                                'filename': self.img1
+                            },
+                            'backing': {
+                                'node-name': 'bottom-filter',
+                                'driver': 'throttle',
+                                'throttle-group': 'tg',
+                                'file': {
+                                    'node-name': 'cow-0',
+                                    'driver': iotests.imgfmt,
+                                    'file': {
+                                        'driver': 'file',
+                                        'filename': self.img0
+                                    }
+                                }
+                            }
+                        }
+                    }
+                }
+            })
+        self.assert_qmp(result, 'return', {})
+
+    def tearDown(self):
+        self.vm.shutdown(has_quit=self.has_quit)
+
+        for index in range(len(self.pattern_files)):
+            result = qemu_io('-f', iotests.imgfmt,
+                             '-c', 'read -P %i %iM 1M' % (index + 1, index),
+                             self.pattern_files[index])
+            self.assertFalse('Pattern verification failed' in result)
+
+        os.remove(self.img3)
+        os.remove(self.img2)
+        os.remove(self.img1)
+        os.remove(self.img0)
+
+    # Filters make for funny filenames, so we cannot just use
+    # self.imgX to get them
+    def get_filename(self, node):
+        return self.vm.node_info(node)['image']['filename']
+
+    def test_filterless_commit(self):
+        self.assert_no_active_block_jobs()
+        result = self.vm.qmp('block-commit',
+                             job_id='commit',
+                             device='top-filter',
+                             top_node='cow-2',
+                             base_node='cow-1')
+        self.assert_qmp(result, 'return', {})
+        self.wait_until_completed(drive='commit')
+
+        self.assertIsNotNone(self.vm.node_info('cow-3'))
+        self.assertIsNone(self.vm.node_info('cow-2'))
+        self.assertIsNotNone(self.vm.node_info('cow-1'))
+
+        # 2 has been committed into 1
+        self.pattern_files[2] = self.img1
+
+    def test_commit_through_filter(self):
+        self.assert_no_active_block_jobs()
+        result = self.vm.qmp('block-commit',
+                             job_id='commit',
+                             device='top-filter',
+                             top_node='cow-1',
+                             base_node='cow-0')
+        self.assert_qmp(result, 'return', {})
+        self.wait_until_completed(drive='commit')
+
+        self.assertIsNotNone(self.vm.node_info('cow-2'))
+        self.assertIsNone(self.vm.node_info('cow-1'))
+        self.assertIsNone(self.vm.node_info('bottom-filter'))
+        self.assertIsNotNone(self.vm.node_info('cow-0'))
+
+        # 1 has been committed into 0
+        self.pattern_files[1] = self.img0
+
+    def test_filtered_active_commit_with_filter(self):
+        # Add a device, so the commit job finds a parent it can change
+        # to point to the base node (so we can test that top-filter is
+        # dropped from the graph)
+        result = self.vm.qmp('device_add', id='drv0', driver='virtio-blk',
+                             drive='top-filter')
+        self.assert_qmp(result, 'return', {})
+
+        # Try to release our reference to top-filter; that should not
+        # work because drv0 uses it
+        result = self.vm.qmp('blockdev-del', node_name='top-filter')
+        self.assert_qmp(result, 'error/class', 'GenericError')
+        self.assert_qmp(result, 'error/desc', 'Node top-filter is in use')
+
+        self.assert_no_active_block_jobs()
+        result = self.vm.qmp('block-commit',
+                             job_id='commit',
+                             device='top-filter',
+                             base_node='cow-2')
+        self.assert_qmp(result, 'return', {})
+        self.complete_and_wait(drive='commit')
+
+        # Try to release our reference to top-filter again
+        result = self.vm.qmp('blockdev-del', node_name='top-filter')
+        self.assert_qmp(result, 'return', {})
+
+        self.assertIsNone(self.vm.node_info('top-filter'))
+        self.assertIsNone(self.vm.node_info('cow-3'))
+        self.assertIsNotNone(self.vm.node_info('cow-2'))
+
+        # 3 has been committed into 2
+        self.pattern_files[3] = self.img2
+
+    def test_filtered_active_commit_without_filter(self):
+        self.assert_no_active_block_jobs()
+        result = self.vm.qmp('block-commit',
+                             job_id='commit',
+                             device='top-filter',
+                             top_node='cow-3',
+                             base_node='cow-2')
+        self.assert_qmp(result, 'return', {})
+        self.complete_and_wait(drive='commit')
+
+        self.assertIsNotNone(self.vm.node_info('top-filter'))
+        self.assertIsNone(self.vm.node_info('cow-3'))
+        self.assertIsNotNone(self.vm.node_info('cow-2'))
+
+        # 3 has been comitted into 2
+        self.pattern_files[3] = self.img2
+
 if __name__ == '__main__':
     iotests.main(supported_fmts=['qcow2', 'qed'])
diff --git a/tests/qemu-iotests/040.out b/tests/qemu-iotests/040.out
index 220a5fa82c..fe58934d7a 100644
--- a/tests/qemu-iotests/040.out
+++ b/tests/qemu-iotests/040.out
@@ -1,5 +1,5 @@
-...............................................
+...................................................
 ----------------------------------------------------------------------
-Ran 47 tests
+Ran 51 tests
 
 OK
-- 
2.21.0




* [Qemu-devel] [PATCH v6 40/42] iotests: Add filter mirror test cases
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (38 preceding siblings ...)
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 39/42] iotests: Add filter commit test cases Max Reitz
@ 2019-08-09 16:14 ` " Max Reitz
  2019-08-31 12:35   ` Vladimir Sementsov-Ogievskiy
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 41/42] iotests: Add test for commit in sub directory Max Reitz
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 42/42] iotests: Test committing to overridden backing Max Reitz
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:14 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

This patch adds some test cases for how mirroring interacts with filters.
One of them tests what happens when you mirror off a filtered COW node;
the other two use the mirror filter node, which (besides the commit
filter) is our only example of an implicitly created filter node so far.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
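Not part of the patch: a sketch of the layout check that test_cor()
performs on the map of the target image.  The JSON below is a hand-written
stand-in, not captured qemu-img output, and it carries only the three
fields the test inspects (start, length, depth).  check_layout() is a
hypothetical helper.

```python
import json

# Hand-written stand-in for the JSON map of the mirror target
sample = '''[
  {"start": 0, "length": 524288, "depth": 1},
  {"start": 524288, "length": 524288, "depth": 0}
]'''


def check_layout(extents, expected):
    """Compare extents against a list of (start, length, depth) tuples."""
    actual = [(e['start'], e['length'], e['depth']) for e in extents]
    return actual == expected


half = 512 * 1024

# With sync=top only the overlay's data is copied: the first half still
# comes from the backing file (depth 1), the second half is in the target
# itself (depth 0).
ok = check_layout(json.loads(sample), [(0, half, 1), (half, half, 0)])
```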
 tests/qemu-iotests/041     | 146 ++++++++++++++++++++++++++++++++++++-
 tests/qemu-iotests/041.out |   4 +-
 2 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/041 b/tests/qemu-iotests/041
index 0c1432f189..c2b5299f62 100755
--- a/tests/qemu-iotests/041
+++ b/tests/qemu-iotests/041
@@ -20,8 +20,9 @@
 
 import time
 import os
+import json
 import iotests
-from iotests import qemu_img, qemu_io
+from iotests import qemu_img, qemu_img_pipe, qemu_io
 
 backing_img = os.path.join(iotests.test_dir, 'backing.img')
 target_backing_img = os.path.join(iotests.test_dir, 'target-backing.img')
@@ -1191,5 +1192,148 @@ class TestReplaces(iotests.QMPTestCase):
         os.remove(test_img)
         os.remove(target_img)
 
+# Tests for mirror with filters (and how the mirror filter behaves, as
+# an example for an implicit filter)
+class TestFilters(iotests.QMPTestCase):
+    def setUp(self):
+        qemu_img('create', '-f', iotests.imgfmt, backing_img, '1M')
+        qemu_img('create', '-f', iotests.imgfmt, '-b', backing_img, test_img)
+        qemu_img('create', '-f', iotests.imgfmt, '-b', backing_img, target_img)
+
+        qemu_io('-c', 'write -P 1 0 512k', backing_img)
+        qemu_io('-c', 'write -P 2 512k 512k', test_img)
+
+        self.vm = iotests.VM()
+        self.vm.launch()
+
+        result = self.vm.qmp('blockdev-add', **{
+                                'node-name': 'target',
+                                'driver': iotests.imgfmt,
+                                'file': {
+                                    'driver': 'file',
+                                    'filename': target_img
+                                },
+                                'backing': None
+                            })
+        self.assert_qmp(result, 'return', {})
+
+        self.filterless_chain = {
+                'node-name': 'source',
+                'driver': iotests.imgfmt,
+                'file': {
+                    'driver': 'file',
+                    'filename': test_img
+                },
+                'backing': {
+                    'node-name': 'backing',
+                    'driver': iotests.imgfmt,
+                    'file': {
+                        'driver': 'file',
+                        'filename': backing_img
+                    }
+                }
+            }
+
+    def tearDown(self):
+        self.vm.shutdown()
+
+        os.remove(test_img)
+        os.remove(target_img)
+        os.remove(backing_img)
+
+    def test_cor(self):
+        result = self.vm.qmp('blockdev-add', **{
+                                'node-name': 'filter',
+                                'driver': 'copy-on-read',
+                                'file': self.filterless_chain
+                            })
+        self.assert_qmp(result, 'return', {})
+
+        result = self.vm.qmp('blockdev-mirror',
+                             job_id='mirror',
+                             device='filter',
+                             target='target',
+                             sync='top')
+        self.assert_qmp(result, 'return', {})
+
+        self.complete_and_wait('mirror')
+
+        self.vm.qmp('blockdev-del', node_name='target')
+
+        target_map = qemu_img_pipe('map', '--output=json', target_img)
+        target_map = json.loads(target_map)
+
+        assert target_map[0]['start'] == 0
+        assert target_map[0]['length'] == 512 * 1024
+        assert target_map[0]['depth'] == 1
+
+        assert target_map[1]['start'] == 512 * 1024
+        assert target_map[1]['length'] == 512 * 1024
+        assert target_map[1]['depth'] == 0
+
+    def test_implicit_mirror_filter(self):
+        result = self.vm.qmp('blockdev-add', **self.filterless_chain)
+        self.assert_qmp(result, 'return', {})
+
+        # We need this so we can query from above the mirror node
+        result = self.vm.qmp('device_add',
+                             driver='virtio-blk',
+                             id='virtio',
+                             bus='pci.0',
+                             drive='source')
+        self.assert_qmp(result, 'return', {})
+
+        result = self.vm.qmp('blockdev-mirror',
+                             job_id='mirror',
+                             device='source',
+                             target='target',
+                             sync='top')
+        self.assert_qmp(result, 'return', {})
+
+        # The mirror filter is now an implicit node, so it should be
+        # invisible when querying the backing chain
+        device_info = self.vm.qmp('query-block')['return'][0]
+        assert device_info['qdev'] == '/machine/peripheral/virtio/virtio-backend'
+
+        assert device_info['inserted']['node-name'] == 'source'
+
+        image_info = device_info['inserted']['image']
+        assert image_info['filename'] == test_img
+        assert image_info['backing-image']['filename'] == backing_img
+
+        self.complete_and_wait('mirror')
+
+    def test_explicit_mirror_filter(self):
+        # Same test as above, but this time we give the mirror filter
+        # a node-name so it will not be invisible
+        result = self.vm.qmp('blockdev-add', **self.filterless_chain)
+        self.assert_qmp(result, 'return', {})
+
+        # We need this so we can query from above the mirror node
+        result = self.vm.qmp('device_add',
+                             driver='virtio-blk',
+                             id='virtio',
+                             bus='pci.0',
+                             drive='source')
+        self.assert_qmp(result, 'return', {})
+
+        result = self.vm.qmp('blockdev-mirror',
+                             job_id='mirror',
+                             device='source',
+                             target='target',
+                             sync='top',
+                             filter_node_name='mirror-filter')
+        self.assert_qmp(result, 'return', {})
+
+        # With a node-name given to it, the mirror filter should now
+        # be visible
+        device_info = self.vm.qmp('query-block')['return'][0]
+        assert device_info['qdev'] == '/machine/peripheral/virtio/virtio-backend'
+
+        assert device_info['inserted']['node-name'] == 'mirror-filter'
+
+        self.complete_and_wait('mirror')
+
+
 if __name__ == '__main__':
     iotests.main(supported_fmts=['qcow2', 'qed'])
diff --git a/tests/qemu-iotests/041.out b/tests/qemu-iotests/041.out
index 2c448b4239..ffc779b4d1 100644
--- a/tests/qemu-iotests/041.out
+++ b/tests/qemu-iotests/041.out
@@ -1,5 +1,5 @@
-..........................................................................................
+.............................................................................................
 ----------------------------------------------------------------------
-Ran 90 tests
+Ran 93 tests
 
 OK
-- 
2.21.0




* [Qemu-devel] [PATCH v6 41/42] iotests: Add test for commit in sub directory
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (39 preceding siblings ...)
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 40/42] iotests: Add filter mirror " Max Reitz
@ 2019-08-09 16:14 ` Max Reitz
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 42/42] iotests: Test committing to overridden backing Max Reitz
  41 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:14 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Add a test for committing an overlay in a sub directory to one of the
images in its backing chain, using both relative and absolute filenames.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
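Not part of the patch: a simplified model of why the test expects
"t.IMGFMT.mid" and the absolute path to be accepted as the commit base,
but not "subdir/t.IMGFMT.mid".  It assumes that relative names are
interpreted relative to the overlay's directory, as the test comment
states.  This is only an illustration; the real lookup in QEMU also has to
deal with protocol prefixes, among other things.  resolve() and
same_image() are hypothetical helpers, and the paths are made up.

```python
import posixpath


def resolve(overlay, name):
    """Interpret name relative to the directory containing overlay.
    Absolute names are returned unchanged (apart from normalization)."""
    return posixpath.normpath(
        posixpath.join(posixpath.dirname(overlay), name))


def same_image(overlay, stored_backing, candidate):
    """True if candidate names the same file as the stored backing name."""
    return resolve(overlay, stored_backing) == resolve(overlay, candidate)


overlay = '/test/subdir/t.qcow2'   # made-up location of the top image
stored = 't.qcow2.mid'             # backing name as recorded in the overlay

results = {
    candidate: same_image(overlay, stored, candidate)
    for candidate in ('t.qcow2.mid',
                      'subdir/t.qcow2.mid',
                      '/test/subdir/t.qcow2.mid')
}
```

In this model "subdir/t.qcow2.mid" resolves to
/test/subdir/subdir/t.qcow2.mid, which is a different file, so it is not
found in the backing chain.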
 tests/qemu-iotests/020     | 36 ++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/020.out | 10 ++++++++++
 2 files changed, 46 insertions(+)

diff --git a/tests/qemu-iotests/020 b/tests/qemu-iotests/020
index 6b0ebb37d2..94633c3a32 100755
--- a/tests/qemu-iotests/020
+++ b/tests/qemu-iotests/020
@@ -31,6 +31,11 @@ _cleanup()
 	_cleanup_test_img
     rm -f "$TEST_IMG.base"
     rm -f "$TEST_IMG.orig"
+
+    rm -f "$TEST_DIR/subdir/t.$IMGFMT.base"
+    rm -f "$TEST_DIR/subdir/t.$IMGFMT.mid"
+    rm -f "$TEST_DIR/subdir/t.$IMGFMT"
+    rmdir "$TEST_DIR/subdir" &> /dev/null
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
@@ -133,6 +138,37 @@ $QEMU_IO -c 'writev 0 64k' "$TEST_IMG" | _filter_qemu_io
 $QEMU_IMG commit "$TEST_IMG"
 _cleanup
 
+
+echo
+echo 'Testing commit in sub-directory with relative filenames'
+echo
+
+pushd "$TEST_DIR" > /dev/null
+
+mkdir subdir
+
+TEST_IMG="subdir/t.$IMGFMT.base" _make_test_img 1M
+TEST_IMG="subdir/t.$IMGFMT.mid" _make_test_img -b "t.$IMGFMT.base"
+TEST_IMG="subdir/t.$IMGFMT" _make_test_img -b "t.$IMGFMT.mid"
+
+# Should work
+$QEMU_IMG commit -b "t.$IMGFMT.mid" "subdir/t.$IMGFMT"
+
+# Might theoretically work, but does not in practice (we have to
+# decide between this and the above; and since we always represent
+# backing file names as relative to the overlay, we go for the above)
+$QEMU_IMG commit -b "subdir/t.$IMGFMT.mid" "subdir/t.$IMGFMT" 2>&1 | \
+    _filter_imgfmt
+
+# This should work as well
+$QEMU_IMG commit -b "$TEST_DIR/subdir/t.$IMGFMT.mid" "subdir/t.$IMGFMT"
+
+popd > /dev/null
+
+# Now let's try with just absolute filenames
+$QEMU_IMG commit -b "$TEST_DIR/subdir/t.$IMGFMT.mid" \
+    "$TEST_DIR/subdir/t.$IMGFMT"
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/020.out b/tests/qemu-iotests/020.out
index 4b722b2dd0..228c37dded 100644
--- a/tests/qemu-iotests/020.out
+++ b/tests/qemu-iotests/020.out
@@ -1094,4 +1094,14 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 backing_file=json:{'driv
 wrote 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 qemu-img: Block job failed: No space left on device
+
+Testing commit in sub-directory with relative filenames
+
+Formatting 'subdir/t.IMGFMT.base', fmt=IMGFMT size=1048576
+Formatting 'subdir/t.IMGFMT.mid', fmt=IMGFMT size=1048576 backing_file=t.IMGFMT.base
+Formatting 'subdir/t.IMGFMT', fmt=IMGFMT size=1048576 backing_file=t.IMGFMT.mid
+Image committed.
+qemu-img: Did not find 'subdir/t.IMGFMT.mid' in the backing chain of 'subdir/t.IMGFMT'
+Image committed.
+Image committed.
 *** done
-- 
2.21.0




* [Qemu-devel] [PATCH v6 42/42] iotests: Test committing to overridden backing
  2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
                   ` (40 preceding siblings ...)
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 41/42] iotests: Add test for commit in sub directory Max Reitz
@ 2019-08-09 16:14 ` Max Reitz
  2019-09-03  9:18   ` Vladimir Sementsov-Ogievskiy
  41 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-09 16:14 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
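Not part of the patch: a toy model of the situation these two test cases
cover.  The image header of top names base_a, but the node actually
attached at runtime is base_b, so a lookup by filename should find base_b
and should not find base_a.  Node and find_base() are hypothetical and
much simpler than the real block layer code.

```python
class Node:
    def __init__(self, filename, header_backing=None, backing=None):
        self.filename = filename
        # Backing file name recorded in the image header
        self.header_backing = header_backing
        # Backing node actually attached at runtime (may override the header)
        self.backing = backing


def find_base(top, name):
    """Walk the runtime backing chain below top and return the node whose
    filename equals name, or None if there is no such node."""
    node = top.backing
    while node is not None:
        if node.filename == name:
            return node
        node = node.backing
    return None


base_b = Node('base_b.img')
top = Node('top.img', header_backing='base_a.img', backing=base_b)

found_a = find_base(top, 'base_a.img')   # header name, not in the chain
found_b = find_base(top, 'base_b.img')   # the node really in use
```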
 tests/qemu-iotests/040     | 61 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/040.out |  4 +--
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
index a0a0db8889..558fdb9a09 100755
--- a/tests/qemu-iotests/040
+++ b/tests/qemu-iotests/040
@@ -605,5 +605,66 @@ class TestCommitWithFilters(iotests.QMPTestCase):
         # 3 has been comitted into 2
         self.pattern_files[3] = self.img2
 
+class TestCommitWithOverriddenBacking(iotests.QMPTestCase):
+    img_base_a = os.path.join(iotests.test_dir, 'base_a.img')
+    img_base_b = os.path.join(iotests.test_dir, 'base_b.img')
+    img_top = os.path.join(iotests.test_dir, 'top.img')
+
+    def setUp(self):
+        qemu_img('create', '-f', iotests.imgfmt, self.img_base_a, '1M')
+        qemu_img('create', '-f', iotests.imgfmt, self.img_base_b, '1M')
+        qemu_img('create', '-f', iotests.imgfmt, '-b', self.img_base_a, \
+                 self.img_top)
+
+        self.vm = iotests.VM()
+        self.vm.launch()
+
+        # Use base_b instead of base_a as the backing of top
+        result = self.vm.qmp('blockdev-add', **{
+                                'node-name': 'top',
+                                'driver': iotests.imgfmt,
+                                'file': {
+                                    'driver': 'file',
+                                    'filename': self.img_top
+                                },
+                                'backing': {
+                                    'node-name': 'base',
+                                    'driver': iotests.imgfmt,
+                                    'file': {
+                                        'driver': 'file',
+                                        'filename': self.img_base_b
+                                    }
+                                }
+                            })
+        self.assert_qmp(result, 'return', {})
+
+    def tearDown(self):
+        self.vm.shutdown()
+        os.remove(self.img_top)
+        os.remove(self.img_base_a)
+        os.remove(self.img_base_b)
+
+    def test_commit_to_a(self):
+        # Try committing to base_a (which should fail, as top's
+        # backing image is base_b instead)
+        result = self.vm.qmp('block-commit',
+                             job_id='commit',
+                             device='top',
+                             base=self.img_base_a)
+        self.assert_qmp(result, 'error/class', 'GenericError')
+
+    def test_commit_to_b(self):
+        # Try committing to base_b (which should work, since that is
+        # actually top's backing image)
+        result = self.vm.qmp('block-commit',
+                             job_id='commit',
+                             device='top',
+                             base=self.img_base_b)
+        self.assert_qmp(result, 'return', {})
+
+        self.vm.event_wait('BLOCK_JOB_READY')
+        self.vm.qmp('block-job-complete', device='commit')
+        self.vm.event_wait('BLOCK_JOB_COMPLETED')
+
 if __name__ == '__main__':
     iotests.main(supported_fmts=['qcow2', 'qed'])
diff --git a/tests/qemu-iotests/040.out b/tests/qemu-iotests/040.out
index fe58934d7a..499af0e2ff 100644
--- a/tests/qemu-iotests/040.out
+++ b/tests/qemu-iotests/040.out
@@ -1,5 +1,5 @@
-...................................................
+.....................................................
 ----------------------------------------------------------------------
-Ran 51 tests
+Ran 53 tests
 
 OK
-- 
2.21.0




* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 04/42] block: Add child access functions Max Reitz
@ 2019-08-09 16:56   ` Eric Blake
  2019-09-04 16:16   ` Kevin Wolf
  1 sibling, 0 replies; 132+ messages in thread
From: Eric Blake @ 2019-08-09 16:56 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel

On 8/9/19 11:13 AM, Max Reitz wrote:
> There are BDS children that the general block layer code can access,
> namely bs->file and bs->backing.  Since the introduction of filters and
> external data files, their meaning is not quite clear.  bs->backing can
> be a COW source, or it can be an R/W-filtered child; bs->file can be an
> R/W-filtered child, it can be data and metadata storage, or it can be
> just metadata storage.
> 
> This overloading really is not helpful.  This patch adds functions that
> retrieve the correct child for each exact purpose.  Later patches in
> this series will make use of them.  Doing so will allow us to handle
> filter nodes and external data files in a meaningful way.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/block/block_int.h | 57 ++++++++++++++++++++--
>  block.c                   | 99 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 153 insertions(+), 3 deletions(-)
> 

Reviewed-by: Eric Blake <eblake@redhat.com>
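
As a rough illustration of the overloading the commit message describes,
the sketch below separates the three roles (COW source, R/W-filtered
child, data storage) into one accessor each.  The Node type, its fields,
and the function names are hypothetical stand-ins chosen for this
example; they are not QEMU's BlockDriverState/BdrvChild definitions or
the functions this patch adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a block node (not QEMU's BDS). */
typedef struct Node {
    bool is_filter;          /* node merely forwards reads and writes */
    struct Node *file;       /* overloaded: filtered child or (meta)data storage */
    struct Node *backing;    /* overloaded: filtered child or COW source */
    struct Node *data_file;  /* external data file, if any (assumed field) */
} Node;

/* COW source; in this sketch, filters never have one. */
static Node *cow_child(const Node *n)
{
    return n->is_filter ? NULL : n->backing;
}

/* R/W-filtered child; in this sketch, only filters have one. */
static Node *rw_filtered_child(const Node *n)
{
    if (!n->is_filter) {
        return NULL;
    }
    return n->backing ? n->backing : n->file;
}

/* Child that stores the node's data; prefers an external data file. */
static Node *storage_child(const Node *n)
{
    if (n->is_filter) {
        return NULL;
    }
    return n->data_file ? n->data_file : n->file;
}
```

With accessors like these, a caller asks for the role it needs instead
of guessing what ->file or ->backing means for a given node.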

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH v6 05/42] block: Add chain helper functions
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 05/42] block: Add chain helper functions Max Reitz
@ 2019-08-09 17:01   ` Eric Blake
  0 siblings, 0 replies; 132+ messages in thread
From: Eric Blake @ 2019-08-09 17:01 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel

On 8/9/19 11:13 AM, Max Reitz wrote:
> Add some helper functions for skipping filters in a chain of block
> nodes.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/block/block_int.h |  3 +++
>  block.c                   | 55 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 58 insertions(+)

Reviewed-by: Eric Blake <eblake@redhat.com>
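
For readers unfamiliar with the idea, skipping filters in a chain can be
sketched as below.  This is a minimal model under stated assumptions:
the Node type with a single "child" link and the names skip_filters()
and chain_next() are hypothetical, not the helpers added by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node with one link that is followed when walking the chain. */
typedef struct Node {
    bool is_filter;
    struct Node *child;
} Node;

/* Return the first non-filter node at or below @n (NULL-tolerant). */
static Node *skip_filters(Node *n)
{
    while (n && n->is_filter) {
        n = n->child;
    }
    return n;
}

/*
 * Return the next non-filter node strictly below the first non-filter
 * node at or below @n, or NULL if the chain ends there.
 */
static Node *chain_next(Node *n)
{
    n = skip_filters(n);
    return n ? skip_filters(n->child) : NULL;
}
```

Because skip_filters() accepts NULL, callers can chain such helpers
without checking for the end of the chain at every step.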

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH v6 06/42] qcow2: Implement .bdrv_storage_child()
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 06/42] qcow2: Implement .bdrv_storage_child() Max Reitz
@ 2019-08-09 17:07   ` Eric Blake
  0 siblings, 0 replies; 132+ messages in thread
From: Eric Blake @ 2019-08-09 17:07 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel

On 8/9/19 11:13 AM, Max Reitz wrote:
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/qcow2.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain Max Reitz
@ 2019-08-10 13:32   ` Vladimir Sementsov-Ogievskiy
  2019-08-12 12:56     ` Max Reitz
  2019-09-05 13:05   ` Kevin Wolf
  1 sibling, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-10 13:32 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> In order to make filters work in backing chains, the associated
> functions must be able to deal with them and freeze all filter links, be
> they COW or R/W filter links.
> 
> In the process, rename these functions to reflect that they now act on
> generalized chains of filter nodes instead of backing chains alone.
> 
> While at it, add some comments that note which functions require their
> caller to ensure that a given child link is not frozen, and how the
> callers do so.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   include/block/block.h | 10 +++---
>   block.c               | 81 +++++++++++++++++++++++++------------------
>   block/commit.c        |  8 ++---
>   block/mirror.c        |  4 +--
>   block/stream.c        |  8 ++---
>   5 files changed, 62 insertions(+), 49 deletions(-)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index 50a07c1c33..f6f09b95cd 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -364,11 +364,11 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
>   BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
>                                       BlockDriverState *bs);
>   BlockDriverState *bdrv_find_base(BlockDriverState *bs);
> -bool bdrv_is_backing_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
> -                                  Error **errp);
> -int bdrv_freeze_backing_chain(BlockDriverState *bs, BlockDriverState *base,
> -                              Error **errp);
> -void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState *base);
> +bool bdrv_is_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
> +                          Error **errp);
> +int bdrv_freeze_chain(BlockDriverState *bs, BlockDriverState *base,
> +                      Error **errp);
> +void bdrv_unfreeze_chain(BlockDriverState *bs, BlockDriverState *base);
>   
>   
>   typedef struct BdrvCheckResult {
> diff --git a/block.c b/block.c
> index adf82efb0e..650c00d182 100644
> --- a/block.c
> +++ b/block.c
> @@ -2303,12 +2303,15 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>    * If @new_bs is not NULL, bdrv_check_perm() must be called beforehand, as this
>    * function uses bdrv_set_perm() to update the permissions according to the new
>    * reference that @new_bs gets.
> + *
> + * Callers must ensure that child->frozen is false.
>    */
>   static void bdrv_replace_child(BdrvChild *child, BlockDriverState *new_bs)
>   {
>       BlockDriverState *old_bs = child->bs;
>       uint64_t perm, shared_perm;
>   
> +    /* Asserts that child->frozen == false */
>       bdrv_replace_child_noperm(child, new_bs);
>   
>       /*
> @@ -2468,6 +2471,7 @@ static void bdrv_detach_child(BdrvChild *child)
>       g_free(child);
>   }
>   
> +/* Callers must ensure that child->frozen is false. */
>   void bdrv_root_unref_child(BdrvChild *child)
>   {
>       BlockDriverState *child_bs;
> @@ -2477,10 +2481,6 @@ void bdrv_root_unref_child(BdrvChild *child)
>       bdrv_unref(child_bs);
>   }
>   
> -/**
> - * Clear all inherits_from pointers from children and grandchildren of
> - * @root that point to @root, where necessary.
> - */

Hmm, unrelated chunk? Without it:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

>   static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child)
>   {
>       BdrvChild *c;
> @@ -2505,6 +2505,7 @@ static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child)

[..]

-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 10/42] block: Drop bdrv_is_encrypted()
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 10/42] block: Drop bdrv_is_encrypted() Max Reitz
@ 2019-08-10 13:42   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-10 13:42 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> The original purpose of bdrv_is_encrypted() was to inquire whether a BDS
> can be used without the user entering a password or not.  It has not
> been used for that purpose for quite some time.
> 
> Actually, it is not even fit for that purpose, because to answer that
> question, it would have to recursively query all of the given node's
> children.
> 
> So now we have to decide in which direction we want to fix
> bdrv_is_encrypted(): Recursively query all children, or drop it and just
> use bs->encrypted to get the current node's status?
> 
> Nowadays, its only purpose is to report through bdrv_query_image_info()
> whether the given image is encrypted or not.  For this purpose, it is
> probably more interesting to see whether a given node itself is
> encrypted or not (otherwise, a management application cannot discern for
> certain which nodes are really encrypted and which just have encrypted
> children).
> 
> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
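
The two directions discussed in the commit message can be contrasted
with a small sketch.  The Node type and both function names are
hypothetical and only model the idea; they are not QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node with an explicit child array. */
typedef struct Node {
    bool encrypted;            /* this node itself is encrypted */
    struct Node **children;
    size_t n_children;
} Node;

/* Per-node status: what the commit message proposes to report. */
static bool node_is_encrypted(const Node *n)
{
    return n->encrypted;
}

/* Recursive variant: true if @n or anything below it is encrypted. */
static bool subtree_is_encrypted(const Node *n)
{
    if (n->encrypted) {
        return true;
    }
    for (size_t i = 0; i < n->n_children; i++) {
        if (subtree_is_encrypted(n->children[i])) {
            return true;
        }
    }
    return false;
}
```

With the recursive variant, an unencrypted node on top of an encrypted
child would itself be reported as encrypted, which is the ambiguity for
management applications that the commit message points out.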


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains Max Reitz
@ 2019-08-10 15:19   ` Vladimir Sementsov-Ogievskiy
  2019-09-05 14:05   ` Kevin Wolf
  1 sibling, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-10 15:19 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> Use child access functions when iterating through backing chains so
> filters do not break the chain.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code Max Reitz
@ 2019-08-10 15:36   ` Vladimir Sementsov-Ogievskiy
  2019-08-12 12:58     ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-10 15:36 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> If the driver does not support .bdrv_co_flush() so bdrv_co_flush()
> itself has to flush the children of the given node, it should not flush
> just bs->file->bs, but in fact all children.
> 
> In any case, the BLKDBG_EVENT() should be emitted on the primary child,
> because that is where a blkdebug node would be if there is any.
> 
> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block/io.c | 23 +++++++++++++++++------
>   1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index c5a8e3e6a3..bcc770d336 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2572,6 +2572,8 @@ static void coroutine_fn bdrv_flush_co_entry(void *opaque)
>   
>   int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>   {
> +    BdrvChild *primary_child = bdrv_primary_child(bs);
> +    BdrvChild *child;
>       int current_gen;
>       int ret = 0;
>   
> @@ -2601,7 +2603,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>       }
>   
>       /* Write back cached data to the OS even with cache=unsafe */
> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_OS);
> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_OS);
>       if (bs->drv->bdrv_co_flush_to_os) {
>           ret = bs->drv->bdrv_co_flush_to_os(bs);
>           if (ret < 0) {
> @@ -2611,15 +2613,15 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>   
>       /* But don't actually force it to the disk with cache=unsafe */
>       if (bs->open_flags & BDRV_O_NO_FLUSH) {
> -        goto flush_parent;
> +        goto flush_children;
>       }
>   
>       /* Check if we really need to flush anything */
>       if (bs->flushed_gen == current_gen) {
> -        goto flush_parent;
> +        goto flush_children;
>       }
>   
> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_DISK);
>       if (!bs->drv) {
>           /* bs->drv->bdrv_co_flush() might have ejected the BDS
>            * (even in case of apparent success) */
> @@ -2663,8 +2665,17 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>       /* Now flush the underlying protocol.  It will also have BDRV_O_NO_FLUSH
>        * in the case of cache=unsafe, so there are no useless flushes.
>        */
> -flush_parent:
> -    ret = bs->file ? bdrv_co_flush(bs->file->bs) : 0;
> +flush_children:
> +    ret = 0;
> +    QLIST_FOREACH(child, &bs->children, next) {
> +        int this_child_ret;
> +
> +        this_child_ret = bdrv_co_flush(child->bs);
> +        if (!ret) {
> +            ret = this_child_ret;
> +        }
> +    }

Hmm, you said that we want to flush only those children to which the parent
has write access.  Shouldn't we check for that here?  Or do we assume that it
is always safe to call bdrv_co_flush() on a node?

> +
>   out:
>       /* Notify any pending flushes that we have completed */
>       if (ret == 0) {
> 
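
The error handling of the new flush_children loop (every child is
flushed, and the first error encountered is the one reported) can be
modeled as follows.  The Child type with a simulated flush result and
the function names are hypothetical; the sketch does not model the
write-permission check asked about above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical child whose flush outcome is simulated by a stored value. */
typedef struct Child {
    int flush_result;      /* 0 or a negative errno value */
    int flushed;           /* set to 1 once a flush was attempted */
    struct Child *next;
} Child;

static int flush_child(Child *c)
{
    c->flushed = 1;
    return c->flush_result;
}

/* Flush every child; keep going after a failure, report the first error. */
static int flush_all_children(Child *head)
{
    int ret = 0;

    for (Child *c = head; c; c = c->next) {
        int this_child_ret = flush_child(c);
        if (!ret) {
            ret = this_child_ret;
        }
    }
    return ret;
}
```

The point of not returning early is that a failure on one child does
not leave the remaining children unflushed.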


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 15/42] block: Re-evaluate backing file handling in reopen
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 15/42] block: Re-evaluate backing file handling in reopen Max Reitz
@ 2019-08-10 16:05   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-10 16:05 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> Reopening a node's backing child needs a bit of special handling because
> the "backing" child has different defaults than all other children
> (among other things).  Adding filter support here is a bit more
> difficult than just using the child access functions.  In fact, we often
> have to directly use bs->backing because these functions are about the
> "backing" child (which may or may not be the COW backing file).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 18/42] block: Use CAFs in bdrv_refresh_filename()
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 18/42] block: Use CAFs in bdrv_refresh_filename() Max Reitz
@ 2019-08-10 16:22   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-10 16:22 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> bdrv_refresh_filename() and the kind of related bdrv_dirname() should
> look to the primary child when they wish to copy the underlying file's
> filename.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>


Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback Max Reitz
@ 2019-08-10 16:34   ` Vladimir Sementsov-Ogievskiy
  2019-08-12 13:06     ` Max Reitz
  2019-09-10 11:56   ` Kevin Wolf
  1 sibling, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-10 16:34 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> If the top node's driver does not provide snapshot functionality and we
> want to fall back to a node down the chain, we need to snapshot all
> non-COW children.  For simplicity's sake, just do not fall back if there
> is more than one such child.
> 
> bdrv_snapshot_goto() becomes a bit weird because we may have to redirect
> the actual child pointer, so it only works if the fallback child is
> bs->file or bs->backing (and then we have to find out which it is).
> 
> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block/snapshot.c | 100 +++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 79 insertions(+), 21 deletions(-)
> 
> diff --git a/block/snapshot.c b/block/snapshot.c
> index f2f48f926a..35403c167f 100644
> --- a/block/snapshot.c
> +++ b/block/snapshot.c
> @@ -146,6 +146,32 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
>       return ret;
>   }
>   
> +/**
> + * Return the child BDS to which we can fall back if the given BDS
> + * does not support snapshots.
> + * Return NULL if there is no BDS to (safely) fall back to.
> + */
> +static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs)
> +{
> +    BlockDriverState *child_bs = NULL;
> +    BdrvChild *child;
> +
> +    QLIST_FOREACH(child, &bs->children, next) {
> +        if (child == bdrv_filtered_cow_child(bs)) {
> +            /* Ignore: COW children need not be included in snapshots */
> +            continue;
> +        }
> +
> +        if (child_bs) {
> +            /* Cannot fall back to a single child if there are multiple */
> +            return NULL;
> +        }
> +        child_bs = child->bs;
> +    }
> +
> +    return child_bs;
> +}
> +
>   int bdrv_can_snapshot(BlockDriverState *bs)
>   {
>       BlockDriver *drv = bs->drv;
> @@ -154,8 +180,9 @@ int bdrv_can_snapshot(BlockDriverState *bs)
>       }
>   
>       if (!drv->bdrv_snapshot_create) {
> -        if (bs->file != NULL) {
> -            return bdrv_can_snapshot(bs->file->bs);
> +        BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
> +        if (fallback_bs) {
> +            return bdrv_can_snapshot(fallback_bs);
>           }
>           return 0;
>       }
> @@ -167,14 +194,15 @@ int bdrv_snapshot_create(BlockDriverState *bs,
>                            QEMUSnapshotInfo *sn_info)
>   {
>       BlockDriver *drv = bs->drv;
> +    BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
>       if (!drv) {
>           return -ENOMEDIUM;
>       }
>       if (drv->bdrv_snapshot_create) {
>           return drv->bdrv_snapshot_create(bs, sn_info);
>       }
> -    if (bs->file) {
> -        return bdrv_snapshot_create(bs->file->bs, sn_info);
> +    if (fallback_bs) {
> +        return bdrv_snapshot_create(fallback_bs, sn_info);
>       }
>       return -ENOTSUP;
>   }
> @@ -184,6 +212,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
>                          Error **errp)
>   {
>       BlockDriver *drv = bs->drv;
> +    BlockDriverState *fallback_bs;
>       int ret, open_ret;
>   
>       if (!drv) {
> @@ -204,39 +233,66 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
>           return ret;
>       }
>   
> -    if (bs->file) {
> -        BlockDriverState *file;
> -        QDict *options = qdict_clone_shallow(bs->options);
> +    fallback_bs = bdrv_snapshot_fallback(bs);
> +    if (fallback_bs) {
> +        QDict *options;
>           QDict *file_options;
>           Error *local_err = NULL;
> +        bool is_backing_child;
> +        BdrvChild **child_pointer;
> +
> +        /*
> +         * We need a pointer to the fallback child pointer, so let us
> +         * see whether the child is referenced by a field in the BDS
> +         * object.
> +         */
> +        if (fallback_bs == bs->file->bs) {
> +            is_backing_child = false;
> +            child_pointer = &bs->file;
> +        } else if (fallback_bs == bs->backing->bs) {
> +            is_backing_child = true;
> +            child_pointer = &bs->backing;
> +        } else {
> +            /*
> +             * The fallback child is not referenced by a field in the
> +             * BDS object.  We cannot go on then.
> +             */
> +            error_setg(errp, "Block driver does not support snapshots");
> +            return -ENOTSUP;
> +        }
> +

Hmm. Shouldn't this check be part of bdrv_snapshot_fallback(), so that it
only ever falls back to the file or backing child?

And could we allow the fallback only for filters? Is there a real use case
other than filters? Or maybe drop the fallback altogether?
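
One possible shape for moving the file/backing check into the fallback
helper is sketched here.  It is only a model: the Node and Child types,
the cow_child field, and snapshot_fallback_slot() are hypothetical names
for this example, not code from the patch.  The helper returns the
address of the ->file or ->backing slot, so the caller gets the child
pointer it needs to redirect and does not have to find it again.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node Node;

/* Hypothetical child link in a node's children list. */
typedef struct Child {
    Node *bs;
    struct Child *next;
} Child;

/* Hypothetical node; cow_child marks the COW backing link, if any. */
struct Node {
    Child *children;
    Child *file;
    Child *backing;
    Child *cow_child;
};

/*
 * Return the address of the ->file or ->backing slot that holds the
 * sole non-COW child, or NULL if there is no safe fallback.
 */
static Child **snapshot_fallback_slot(Node *n)
{
    Child *only = NULL;

    for (Child *c = n->children; c; c = c->next) {
        if (c == n->cow_child) {
            continue;           /* COW children are not snapshotted */
        }
        if (only) {
            return NULL;        /* more than one candidate */
        }
        only = c;
    }
    if (!only) {
        return NULL;
    }
    if (only == n->file) {
        return &n->file;
    }
    if (only == n->backing) {
        return &n->backing;
    }
    return NULL;                /* not referenced by a known field */
}
```

Comparing child links rather than child->bs also means the sketch never
dereferences a NULL ->file or ->backing pointer.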


>   
> -        file = bs->file->bs;
>           /* Prevent it from getting deleted when detached from bs */
> -        bdrv_ref(file);
> +        bdrv_ref(fallback_bs);
>   
> -        qdict_extract_subqdict(options, &file_options, "file.");
> +        qdict_extract_subqdict(options, &file_options,
> +                               is_backing_child ? "backing." : "file.");
>           qobject_unref(file_options);
> -        qdict_put_str(options, "file", bdrv_get_node_name(file));
> +        qdict_put_str(options, is_backing_child ? "backing" : "file",
> +                      bdrv_get_node_name(fallback_bs));
>   
>           if (drv->bdrv_close) {
>               drv->bdrv_close(bs);
>           }
> -        bdrv_unref_child(bs, bs->file);
> -        bs->file = NULL;
>   
> -        ret = bdrv_snapshot_goto(file, snapshot_id, errp);
> +        assert(fallback_bs == (*child_pointer)->bs);
> +        bdrv_unref_child(bs, *child_pointer);
> +        *child_pointer = NULL;
> +
> +        ret = bdrv_snapshot_goto(fallback_bs, snapshot_id, errp);
>           open_ret = drv->bdrv_open(bs, options, bs->open_flags, &local_err);
>           qobject_unref(options);
>           if (open_ret < 0) {
> -            bdrv_unref(file);
> +            bdrv_unref(fallback_bs);
>               bs->drv = NULL;
>               /* A bdrv_snapshot_goto() error takes precedence */
>               error_propagate(errp, local_err);
>               return ret < 0 ? ret : open_ret;
>           }
>   
> -        assert(bs->file->bs == file);
> -        bdrv_unref(file);
> +        assert(fallback_bs == (*child_pointer)->bs);
> +        bdrv_unref(fallback_bs);
>           return ret;
>       }
>   
> @@ -272,6 +328,7 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
>                            Error **errp)
>   {
>       BlockDriver *drv = bs->drv;
> +    BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
>       int ret;
>   
>       if (!drv) {
> @@ -288,8 +345,8 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
>   
>       if (drv->bdrv_snapshot_delete) {
>           ret = drv->bdrv_snapshot_delete(bs, snapshot_id, name, errp);
> -    } else if (bs->file) {
> -        ret = bdrv_snapshot_delete(bs->file->bs, snapshot_id, name, errp);
> +    } else if (fallback_bs) {
> +        ret = bdrv_snapshot_delete(fallback_bs, snapshot_id, name, errp);
>       } else {
>           error_setg(errp, "Block format '%s' used by device '%s' "
>                      "does not support internal snapshot deletion",
> @@ -305,14 +362,15 @@ int bdrv_snapshot_list(BlockDriverState *bs,
>                          QEMUSnapshotInfo **psn_info)
>   {
>       BlockDriver *drv = bs->drv;
> +    BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
>       if (!drv) {
>           return -ENOMEDIUM;
>       }
>       if (drv->bdrv_snapshot_list) {
>           return drv->bdrv_snapshot_list(bs, psn_info);
>       }
> -    if (bs->file) {
> -        return bdrv_snapshot_list(bs->file->bs, psn_info);
> +    if (fallback_bs) {
> +        return bdrv_snapshot_list(fallback_bs, psn_info);
>       }
>       return -ENOTSUP;
>   }
> 


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback Max Reitz
@ 2019-08-10 16:41   ` Vladimir Sementsov-Ogievskiy
  2019-08-12 13:09     ` Max Reitz
  2019-09-10 14:52   ` Kevin Wolf
  1 sibling, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-10 16:41 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> If the driver does not implement bdrv_get_allocated_file_size(), we
> should fall back to cumulating the allocated size of all non-COW
> children instead of just bs->file.
> 
> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block.c | 22 ++++++++++++++++++++--
>   1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 1070aa1ba9..6e1ddab056 100644
> --- a/block.c
> +++ b/block.c
> @@ -4650,9 +4650,27 @@ int64_t bdrv_get_allocated_file_size(BlockDriverState *bs)
>       if (drv->bdrv_get_allocated_file_size) {
>           return drv->bdrv_get_allocated_file_size(bs);
>       }
> -    if (bs->file) {
> -        return bdrv_get_allocated_file_size(bs->file->bs);
> +
> +    if (!QLIST_EMPTY(&bs->children)) {
> +        BdrvChild *child;
> +        int64_t child_size, total_size = 0;
> +
> +        QLIST_FOREACH(child, &bs->children, next) {
> +            if (child == bdrv_filtered_cow_child(bs)) {
> +                /* Ignore COW backing files */
> +                continue;
> +            }
> +
> +            child_size = bdrv_get_allocated_file_size(child->bs);
> +            if (child_size < 0) {
> +                return child_size;
> +            }
> +            total_size += child_size;
> +        }
> +
> +        return total_size;
>       }
> +
>       return -ENOTSUP;
>   }
>   
> 

Hmm..

1. No children -> -ENOTSUP
2. Only cow child -> 0
3. Some non-cow children -> SUM

It's all arguable (the strictest way is -ENOTSUP in either case),
but if we want to fall back to the sum of the non-COW children, cases 1 and 2
should return the same value.
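
A sketch of one way to make cases 1 and 2 agree, assuming -ENOTSUP is
the chosen result whenever no non-COW child contributes to the sum.
The Child type with a precomputed "allocated" value and the function
name are hypothetical simplifications, not the patch's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical child with a precomputed allocated size (or -errno). */
typedef struct Child {
    int64_t allocated;
    bool is_cow;           /* COW backing link: excluded from the sum */
    struct Child *next;
} Child;

static int64_t allocated_size_fallback(const Child *children)
{
    int64_t total = 0;
    bool counted = false;

    for (const Child *c = children; c; c = c->next) {
        if (c->is_cow) {
            continue;
        }
        if (c->allocated < 0) {
            return c->allocated;    /* propagate the child's error */
        }
        total += c->allocated;
        counted = true;
    }

    /* No children and only-COW children are treated the same way. */
    return counted ? total : -ENOTSUP;
}
```

Tracking whether any child was counted, instead of testing whether the
children list is empty, is what removes the difference between the two
cases.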

-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 24/42] block: Use child access functions for QAPI queries
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 24/42] block: Use child access functions for QAPI queries Max Reitz
@ 2019-08-10 16:57   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-10 16:57 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> query-block, query-named-block-nodes, and query-blockstats now return
> any filtered child under "backing", not just bs->backing or COW
> children.  This is so that filters do not interrupt the reported backing
> chain.  This changes the output for iotest 184, as the throttled node
> now appears as a backing child.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block/qapi.c               | 39 +++++++++++++++++++++++---------------
>   tests/qemu-iotests/184.out |  7 ++++++-
>   2 files changed, 30 insertions(+), 16 deletions(-)
> 
> diff --git a/block/qapi.c b/block/qapi.c
> index 9a185cba48..4f59ac1c0f 100644
> --- a/block/qapi.c
> +++ b/block/qapi.c

[..]

> @@ -354,9 +357,9 @@ static void bdrv_query_info(BlockBackend *blk, BlockInfo **p_info,
>       BlockDriverState *bs = blk_bs(blk);
>       char *qdev;
>   
> -    /* Skip automatically inserted nodes that the user isn't aware of */
> -    while (bs && bs->drv && bs->implicit) {
> -        bs = backing_bs(bs);
> +    if (bs) {
> +        /* Skip automatically inserted nodes that the user isn't aware of */
> +        bs = bdrv_skip_implicit_filters(bs);
>       }

bdrv_skip_implicit_filters() handles NULL, so this could be written without the "if".

Anyway:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

[..]

-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters Max Reitz
@ 2019-08-12 11:09   ` Vladimir Sementsov-Ogievskiy
  2019-08-12 13:26     ` Max Reitz
  2019-08-31  9:57   ` Vladimir Sementsov-Ogievskiy
  2019-09-13 12:55   ` Kevin Wolf
  2 siblings, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-12 11:09 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> This includes some permission limiting (for example, we only need to
> take the RESIZE permission for active commits where the base is smaller
> than the top).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block/mirror.c | 117 ++++++++++++++++++++++++++++++++++++++-----------
>   blockdev.c     |  47 +++++++++++++++++---
>   2 files changed, 131 insertions(+), 33 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 54bafdf176..6ddbfb9708 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c


[..]

> @@ -1693,15 +1734,39 @@ static BlockJob *mirror_start_job(
>       /* In commit_active_start() all intermediate nodes disappear, so
>        * any jobs in them must be blocked */
>       if (target_is_backing) {
> -        BlockDriverState *iter;
> -        for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) {
> -            /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
> -             * ourselves at s->base (if writes are blocked for a node, they are
> -             * also blocked for its backing file). The other options would be a
> -             * second filter driver above s->base (== target). */
> +        BlockDriverState *iter, *filtered_target;
> +        uint64_t iter_shared_perms;
> +
> +        /*
> +         * The topmost node with
> +         * bdrv_skip_rw_filters(filtered_target) == bdrv_skip_rw_filters(target)
> +         */
> +        filtered_target = bdrv_filtered_cow_bs(bdrv_find_overlay(bs, target));
> +
> +        assert(bdrv_skip_rw_filters(filtered_target) ==
> +               bdrv_skip_rw_filters(target));
> +
> +        /*
> +         * XXX BLK_PERM_WRITE needs to be allowed so we don't block
> +         * ourselves at s->base (if writes are blocked for a node, they are
> +         * also blocked for its backing file). The other options would be a
> +         * second filter driver above s->base (== target).
> +         */
> +        iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
> +
> +        for (iter = bdrv_filtered_bs(bs); iter != target;
> +             iter = bdrv_filtered_bs(iter))
> +        {
> +            if (iter == filtered_target) {
> +                /*
> +                 * From here on, all nodes are filters on the base.
> +                 * This allows us to share BLK_PERM_CONSISTENT_READ.
> +                 */
> +                iter_shared_perms |= BLK_PERM_CONSISTENT_READ;


Hmm, I don't understand: why is reading from the upper nodes not shared?

> +            }
> +
>               ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
> -                                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
> -                                     errp);
> +                                     iter_shared_perms, errp);
>               if (ret < 0) {
>                   goto fail;
>               }

[..]
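
To make the mechanics of the quoted loop easier to follow, here is a
reduced model of how the shared permissions accumulate while walking
from the top node down to the target.  The Node type, the PERM_*
constants, and set_shared_perms() are hypothetical; the bit values are
illustrative and are not taken from QEMU's headers.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative permission bits (values assumed for this sketch only). */
enum {
    PERM_CONSISTENT_READ = 1,
    PERM_WRITE           = 2,
    PERM_WRITE_UNCHANGED = 4
};

/* Hypothetical node: "filtered" is the next node down the chain. */
typedef struct Node {
    struct Node *filtered;
    unsigned shared;       /* permissions shared on this node */
} Node;

/*
 * Walk the nodes strictly between @top and @target.  Starting at
 * @filtered_target, consistent reads are shared in addition to writes.
 */
static void set_shared_perms(Node *top, Node *target, Node *filtered_target)
{
    unsigned shared = PERM_WRITE_UNCHANGED | PERM_WRITE;

    for (Node *it = top->filtered; it && it != target; it = it->filtered) {
        if (it == filtered_target) {
            shared |= PERM_CONSISTENT_READ;
        }
        it->shared = shared;
    }
}
```

Nodes above filtered_target get only the two write permissions, while
filtered_target and every node below it (up to, but excluding, the
target) additionally share consistent reads.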

-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 28/42] stream: Deal with filters
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 28/42] stream: " Max Reitz
@ 2019-08-12 11:55   ` Vladimir Sementsov-Ogievskiy
  2019-09-13 14:16   ` Kevin Wolf
  1 sibling, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-12 11:55 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> Because of the recent changes that make the stream job independent of
> the base node and instead track the node above it, we have to split that
> "bottom" node into two cases: The bottom COW node, and the node directly
> above the base node (which may be an R/W filter or the bottom COW node).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>



-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 30/42] qemu-img: Use child access functions
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 30/42] qemu-img: Use child access functions Max Reitz
@ 2019-08-12 12:14   ` Vladimir Sementsov-Ogievskiy
  2019-08-12 13:28     ` Max Reitz
  2019-08-14 16:04   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-12 12:14 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> This changes iotest 204's output, because blkdebug on top of a COW node
> used to make qemu-img map disregard the rest of the backing chain (the
> backing chain was broken by the filter).  With this patch, the
> allocation in the base image is reported correctly.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   qemu-img.c                 | 33 ++++++++++++++++++++-------------
>   tests/qemu-iotests/204.out |  1 +
>   2 files changed, 21 insertions(+), 13 deletions(-)
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index 79983772de..3b30c5ae70 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1012,7 +1012,7 @@ static int img_commit(int argc, char **argv)
>           /* This is different from QMP, which by default uses the deepest file in
>            * the backing chain (i.e., the very base); however, the traditional
>            * behavior of qemu-img commit is using the immediate backing file. */
> -        base_bs = backing_bs(bs);
> +        base_bs = bdrv_backing_chain_next(bs);
>           if (!base_bs) {
>               error_setg(&local_err, "Image does not have a backing file");
>               goto done;
> @@ -1632,18 +1632,20 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
>       if (s->sector_next_status <= sector_num) {
>           uint64_t offset = (sector_num - src_cur_offset) * BDRV_SECTOR_SIZE;
>           int64_t count;
> +        BlockDriverState *src_bs = blk_bs(s->src[src_cur]);
> +        BlockDriverState *base;
> +
> +        if (s->target_has_backing) {
> +            base = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(src_bs));
> +        } else {
> +            base = NULL;
> +        }
>   
>           do {
>               count = n * BDRV_SECTOR_SIZE;
>   
> -            if (s->target_has_backing) {
> -                ret = bdrv_block_status(blk_bs(s->src[src_cur]), offset,
> -                                        count, &count, NULL, NULL);
> -            } else {
> -                ret = bdrv_block_status_above(blk_bs(s->src[src_cur]), NULL,
> -                                              offset, count, &count, NULL,
> -                                              NULL);
> -            }
> +            ret = bdrv_block_status_above(src_bs, base, offset, count, &count,
> +                                          NULL, NULL);
>   
>               if (ret < 0) {
>                   if (s->salvage) {
> @@ -2490,7 +2492,8 @@ static int img_convert(int argc, char **argv)
>            * s.target_backing_sectors has to be negative, which it will
>            * be automatically).  The backing file length is used only
>            * for optimizations, so such a case is not fatal. */
> -        s.target_backing_sectors = bdrv_nb_sectors(out_bs->backing->bs);
> +        s.target_backing_sectors =
> +            bdrv_nb_sectors(bdrv_filtered_cow_bs(out_bs));

Why not skip_rw_filters? It will fail if out_bs is a filter.

>       } else {
>           s.target_backing_sectors = -1;
>       }
> @@ -2853,6 +2856,7 @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
>   
>       depth = 0;
>       for (;;) {
> +        bs = bdrv_skip_rw_filters(bs);
>           ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
>           if (ret < 0) {
>               return ret;
> @@ -2861,7 +2865,7 @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
>           if (ret & (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)) {
>               break;
>           }
> -        bs = backing_bs(bs);
> +        bs = bdrv_filtered_cow_bs(bs);
>           if (bs == NULL) {
>               ret = 0;
>               break;
> @@ -3216,6 +3220,7 @@ static int img_rebase(int argc, char **argv)
>       uint8_t *buf_old = NULL;
>       uint8_t *buf_new = NULL;
>       BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
> +    BlockDriverState *unfiltered_bs;
>       char *filename;
>       const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
>       int c, flags, src_flags, ret;
> @@ -3350,6 +3355,8 @@ static int img_rebase(int argc, char **argv)
>       }
>       bs = blk_bs(blk);
>   
> +    unfiltered_bs = bdrv_skip_rw_filters(bs);
> +
>       if (out_basefmt != NULL) {
>           if (bdrv_find_format(out_basefmt) == NULL) {
>               error_report("Invalid format name: '%s'", out_basefmt);
> @@ -3361,7 +3368,7 @@ static int img_rebase(int argc, char **argv)
>       /* For safe rebasing we need to compare old and new backing file */
>       if (!unsafe) {
>           QDict *options = NULL;
> -        BlockDriverState *base_bs = backing_bs(bs);
> +        BlockDriverState *base_bs = bdrv_filtered_cow_bs(unfiltered_bs);
>   
>           if (base_bs) {
>               blk_old_backing = blk_new(qemu_get_aio_context(),
> @@ -3517,7 +3524,7 @@ static int img_rebase(int argc, char **argv)
>                    * If cluster wasn't changed since prefix_chain, we don't need
>                    * to take action
>                    */
> -                ret = bdrv_is_allocated_above(backing_bs(bs), prefix_chain_bs,
> +                ret = bdrv_is_allocated_above(unfiltered_bs, prefix_chain_bs,
>                                                 false, offset, n, &n);
>                   if (ret < 0) {
>                       error_report("error while reading image metadata: %s",
> diff --git a/tests/qemu-iotests/204.out b/tests/qemu-iotests/204.out
> index f3a10fbe90..684774d763 100644
> --- a/tests/qemu-iotests/204.out
> +++ b/tests/qemu-iotests/204.out
> @@ -59,5 +59,6 @@ Offset          Length          File
>   0x900000        0x2400000       TEST_DIR/t.IMGFMT
>   0x3c00000       0x1100000       TEST_DIR/t.IMGFMT
>   0x6a00000       0x400000        TEST_DIR/t.IMGFMT
> +0x6e00000       0x1200000       TEST_DIR/t.IMGFMT.base
>   No errors were found on the image.
>   *** done
> 


-- 
Best regards,
Vladimir

* Re: [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain
  2019-08-10 13:32   ` Vladimir Sementsov-Ogievskiy
@ 2019-08-12 12:56     ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-12 12:56 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 10.08.19 15:32, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:13, Max Reitz wrote:
>> In order to make filters work in backing chains, the associated
>> functions must be able to deal with them and freeze all filter links, be
>> they COW or R/W filter links.
>>
>> In the process, rename these functions to reflect that they now act on
>> generalized chains of filter nodes instead of backing chains alone.
>>
>> While at it, add some comments that note which functions require their
>> caller to ensure that a given child link is not frozen, and how the
>> callers do so.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   include/block/block.h | 10 +++---
>>   block.c               | 81 +++++++++++++++++++++++++------------------
>>   block/commit.c        |  8 ++---
>>   block/mirror.c        |  4 +--
>>   block/stream.c        |  8 ++---
>>   5 files changed, 62 insertions(+), 49 deletions(-)

[...]

>> @@ -2477,10 +2481,6 @@ void bdrv_root_unref_child(BdrvChild *child)
>>       bdrv_unref(child_bs);
>>   }
>>   
>> -/**
>> - * Clear all inherits_from pointers from children and grandchildren of
>> - * @root that point to @root, where necessary.
>> - */
> 
> Hmm, unrelated chunk? Without it:
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

I don’t know how that slipped in, sorry...

Once again, thanks for reviewing!

Max


* Re: [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code
  2019-08-10 15:36   ` Vladimir Sementsov-Ogievskiy
@ 2019-08-12 12:58     ` Max Reitz
  2019-09-05 16:24       ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-12 12:58 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 10.08.19 17:36, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:13, Max Reitz wrote:
>> If the driver does not support .bdrv_co_flush() so bdrv_co_flush()
>> itself has to flush the children of the given node, it should not flush
>> just bs->file->bs, but in fact all children.
>>
>> In any case, the BLKDBG_EVENT() should be emitted on the primary child,
>> because that is where a blkdebug node would be if there is any.
>>
>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/io.c | 23 +++++++++++++++++------
>>   1 file changed, 17 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/io.c b/block/io.c
>> index c5a8e3e6a3..bcc770d336 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -2572,6 +2572,8 @@ static void coroutine_fn bdrv_flush_co_entry(void *opaque)
>>   
>>   int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>   {
>> +    BdrvChild *primary_child = bdrv_primary_child(bs);
>> +    BdrvChild *child;
>>       int current_gen;
>>       int ret = 0;
>>   
>> @@ -2601,7 +2603,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>       }
>>   
>>       /* Write back cached data to the OS even with cache=unsafe */
>> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_OS);
>> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_OS);
>>       if (bs->drv->bdrv_co_flush_to_os) {
>>           ret = bs->drv->bdrv_co_flush_to_os(bs);
>>           if (ret < 0) {
>> @@ -2611,15 +2613,15 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>   
>>       /* But don't actually force it to the disk with cache=unsafe */
>>       if (bs->open_flags & BDRV_O_NO_FLUSH) {
>> -        goto flush_parent;
>> +        goto flush_children;
>>       }
>>   
>>       /* Check if we really need to flush anything */
>>       if (bs->flushed_gen == current_gen) {
>> -        goto flush_parent;
>> +        goto flush_children;
>>       }
>>   
>> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
>> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_DISK);
>>       if (!bs->drv) {
>>           /* bs->drv->bdrv_co_flush() might have ejected the BDS
>>            * (even in case of apparent success) */
>> @@ -2663,8 +2665,17 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>       /* Now flush the underlying protocol.  It will also have BDRV_O_NO_FLUSH
>>        * in the case of cache=unsafe, so there are no useless flushes.
>>        */
>> -flush_parent:
>> -    ret = bs->file ? bdrv_co_flush(bs->file->bs) : 0;
>> +flush_children:
>> +    ret = 0;
>> +    QLIST_FOREACH(child, &bs->children, next) {
>> +        int this_child_ret;
>> +
>> +        this_child_ret = bdrv_co_flush(child->bs);
>> +        if (!ret) {
>> +            ret = this_child_ret;
>> +        }
>> +    }
> 
> Hmm, you said that we want to flush only children that the parent has write access to.

Good that you remember it, I must have overlooked it (when reading the
replies to the previous version). :-)

> Shouldn't we check it? Or do we assume that it's always safe to call bdrv_co_flush on
> a node?

I think it’s always safe.  But checking it seems like a nice touch, yes.

Max

>> +
>>   out:
>>       /* Notify any pending flushes that we have completed */
>>       if (ret == 0) {
>>
> 
> 
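
A minimal sketch of the behavior discussed above, using a toy node type
rather than QEMU's BlockDriverState. All names here (ToyNode, ToyChild,
parent_writes, toy_flush) are hypothetical. It flushes every child, keeps
the first error, and applies the optional write-access check that is only
being proposed in this thread and is not part of the patch:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4

typedef struct ToyNode ToyNode;

/* Edge from a parent to a child; parent_writes is a hypothetical flag. */
typedef struct ToyChild {
    ToyNode *node;
    int parent_writes;      /* nonzero if the parent may write to this child */
} ToyChild;

struct ToyNode {
    int own_flush_result;   /* 0 or a negative errno value */
    int flush_count;        /* how often this node has been flushed */
    ToyChild children[TOY_MAX_CHILDREN];
    size_t nb_children;
};

/* Flush the node itself, then all writable children; keep the first error. */
static int toy_flush(ToyNode *node)
{
    int ret;
    size_t i;

    assert(node != NULL);
    node->flush_count++;

    ret = node->own_flush_result;
    if (ret < 0) {
        return ret;         /* the node's own flush failed; stop here */
    }

    for (i = 0; i < node->nb_children; i++) {
        int this_child_ret;

        if (!node->children[i].parent_writes) {
            continue;       /* optional check proposed in the review */
        }

        /* Every remaining child is flushed even after an earlier failure */
        this_child_ret = toy_flush(node->children[i].node);
        if (!ret) {
            ret = this_child_ret;
        }
    }

    return ret;
}
```

As in the quoted hunk, a failing child does not stop the loop; only the
first error code is reported to the caller.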



* Re: [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback
  2019-08-10 16:34   ` Vladimir Sementsov-Ogievskiy
@ 2019-08-12 13:06     ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-12 13:06 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 10.08.19 18:34, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:13, Max Reitz wrote:
>> If the top node's driver does not provide snapshot functionality and we
>> want to fall back to a node down the chain, we need to snapshot all
>> non-COW children.  For simplicity's sake, just do not fall back if there
>> is more than one such child.
>>
>> bdrv_snapshot_goto() becomes a bit weird because we may have to redirect
>> the actual child pointer, so it only works if the fallback child is
>> bs->file or bs->backing (and then we have to find out which it is).
>>
>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/snapshot.c | 100 +++++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 79 insertions(+), 21 deletions(-)
>>
>> diff --git a/block/snapshot.c b/block/snapshot.c
>> index f2f48f926a..35403c167f 100644
>> --- a/block/snapshot.c
>> +++ b/block/snapshot.c
>> @@ -146,6 +146,32 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
>>       return ret;
>>   }
>>   
>> +/**
>> + * Return the child BDS to which we can fall back if the given BDS
>> + * does not support snapshots.
>> + * Return NULL if there is no BDS to (safely) fall back to.
>> + */
>> +static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs)
>> +{
>> +    BlockDriverState *child_bs = NULL;
>> +    BdrvChild *child;
>> +
>> +    QLIST_FOREACH(child, &bs->children, next) {
>> +        if (child == bdrv_filtered_cow_child(bs)) {
>> +            /* Ignore: COW children need not be included in snapshots */
>> +            continue;
>> +        }
>> +
>> +        if (child_bs) {
>> +            /* Cannot fall back to a single child if there are multiple */
>> +            return NULL;
>> +        }
>> +        child_bs = child->bs;
>> +    }
>> +
>> +    return child_bs;
>> +}
>> +
>>   int bdrv_can_snapshot(BlockDriverState *bs)
>>   {
>>       BlockDriver *drv = bs->drv;
>> @@ -154,8 +180,9 @@ int bdrv_can_snapshot(BlockDriverState *bs)
>>       }
>>   
>>       if (!drv->bdrv_snapshot_create) {
>> -        if (bs->file != NULL) {
>> -            return bdrv_can_snapshot(bs->file->bs);
>> +        BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
>> +        if (fallback_bs) {
>> +            return bdrv_can_snapshot(fallback_bs);
>>           }
>>           return 0;
>>       }
>> @@ -167,14 +194,15 @@ int bdrv_snapshot_create(BlockDriverState *bs,
>>                            QEMUSnapshotInfo *sn_info)
>>   {
>>       BlockDriver *drv = bs->drv;
>> +    BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
>>       if (!drv) {
>>           return -ENOMEDIUM;
>>       }
>>       if (drv->bdrv_snapshot_create) {
>>           return drv->bdrv_snapshot_create(bs, sn_info);
>>       }
>> -    if (bs->file) {
>> -        return bdrv_snapshot_create(bs->file->bs, sn_info);
>> +    if (fallback_bs) {
>> +        return bdrv_snapshot_create(fallback_bs, sn_info);
>>       }
>>       return -ENOTSUP;
>>   }
>> @@ -184,6 +212,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
>>                          Error **errp)
>>   {
>>       BlockDriver *drv = bs->drv;
>> +    BlockDriverState *fallback_bs;
>>       int ret, open_ret;
>>   
>>       if (!drv) {
>> @@ -204,39 +233,66 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
>>           return ret;
>>       }
>>   
>> -    if (bs->file) {
>> -        BlockDriverState *file;
>> -        QDict *options = qdict_clone_shallow(bs->options);
>> +    fallback_bs = bdrv_snapshot_fallback(bs);
>> +    if (fallback_bs) {
>> +        QDict *options;
>>           QDict *file_options;
>>           Error *local_err = NULL;
>> +        bool is_backing_child;
>> +        BdrvChild **child_pointer;
>> +
>> +        /*
>> +         * We need a pointer to the fallback child pointer, so let us
>> +         * see whether the child is referenced by a field in the BDS
>> +         * object.
>> +         */
>> +        if (fallback_bs == bs->file->bs) {
>> +            is_backing_child = false;
>> +            child_pointer = &bs->file;
>> +        } else if (fallback_bs == bs->backing->bs) {
>> +            is_backing_child = true;
>> +            child_pointer = &bs->backing;
>> +        } else {
>> +            /*
>> +             * The fallback child is not referenced by a field in the
>> +             * BDS object.  We cannot go on then.
>> +             */
>> +            error_setg(errp, "Block driver does not support snapshots");
>> +            return -ENOTSUP;
>> +        }
>> +
> 
> Hmm, shouldn't this check be included in bdrv_snapshot_fallback(), so that it
> works only with file and backing?

I was under the impression that this was just special code for what
turned out to be bdrv_snapshot_load_tmp() now, because it seems so
weird.  (So I thought just making the restriction here wouldn’t really
be a limit.)

I was wrong.  This is used when applying snapshots, so it is important.
If we make a restriction here, we should have it in all fallback code, yes.

> And could we allow fallback only for filters? Is there a real use case other than filters?
> Or maybe drop the fallback altogether?

raw isn’t a filter driver.  And rbd as a protocol supports snapshotting.
Hence the fallback code, I presume.

Max
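
A minimal sketch of the fallback rule with the restriction agreed on above
folded in: fall back only if there is exactly one non-COW child and that
child is referenced by the file or backing field. It uses a toy node type;
ToyNode, backing_is_cow, extra and toy_snapshot_fallback are hypothetical
names, not QEMU's:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_EXTRA 2

typedef struct ToyNode ToyNode;
struct ToyNode {
    ToyNode *file;                  /* plays the role of bs->file */
    ToyNode *backing;               /* plays the role of bs->backing */
    int backing_is_cow;             /* nonzero: backing is a COW child */
    ToyNode *extra[TOY_MAX_EXTRA];  /* children not referenced by a field */
    size_t nb_extra;
};

/*
 * Return the single non-COW child if it is file or backing,
 * or NULL if there is no safe fallback.
 */
static ToyNode *toy_snapshot_fallback(const ToyNode *node)
{
    ToyNode *candidate = NULL;
    size_t non_cow = 0;

    assert(node != NULL);

    if (node->file) {
        candidate = node->file;
        non_cow++;
    }
    if (node->backing && !node->backing_is_cow) {
        /* e.g. a filter that uses the backing link for its filtered child */
        candidate = node->backing;
        non_cow++;
    }
    non_cow += node->nb_extra;

    /* Need exactly one non-COW child, and it must be file or backing */
    if (node->nb_extra != 0 || non_cow != 1) {
        return NULL;
    }
    return candidate;
}
```

With this shape, a raw-like node over a protocol node falls back to its
file child, while a quorum-like node with several children does not fall
back at all.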


* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-08-10 16:41   ` Vladimir Sementsov-Ogievskiy
@ 2019-08-12 13:09     ` Max Reitz
  2019-08-12 17:14       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-12 13:09 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 10.08.19 18:41, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:13, Max Reitz wrote:
>> If the driver does not implement bdrv_get_allocated_file_size(), we
>> should fall back to cumulating the allocated size of all non-COW
>> children instead of just bs->file.
>>
>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block.c | 22 ++++++++++++++++++++--
>>   1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 1070aa1ba9..6e1ddab056 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -4650,9 +4650,27 @@ int64_t bdrv_get_allocated_file_size(BlockDriverState *bs)
>>       if (drv->bdrv_get_allocated_file_size) {
>>           return drv->bdrv_get_allocated_file_size(bs);
>>       }
>> -    if (bs->file) {
>> -        return bdrv_get_allocated_file_size(bs->file->bs);
>> +
>> +    if (!QLIST_EMPTY(&bs->children)) {
>> +        BdrvChild *child;
>> +        int64_t child_size, total_size = 0;
>> +
>> +        QLIST_FOREACH(child, &bs->children, next) {
>> +            if (child == bdrv_filtered_cow_child(bs)) {
>> +                /* Ignore COW backing files */
>> +                continue;
>> +            }
>> +
>> +            child_size = bdrv_get_allocated_file_size(child->bs);
>> +            if (child_size < 0) {
>> +                return child_size;
>> +            }
>> +            total_size += child_size;
>> +        }
>> +
>> +        return total_size;
>>       }
>> +
>>       return -ENOTSUP;
>>   }
>>   
>>
> 
> Hmm..
> 
> 1. No children -> -ENOTSUP
> 2. Only cow child -> 0
> 3. Some non-cow children -> SUM
> 
> It's all arguable (the strictest way is -ENOTSUP in either case),
> but if we want to fall back to the SUM of non-cow children, 1. and 2. should return
> the same.

I don’t think 2 is possible at all.  If you have a COW child, you need
some other child to COW to.

And in the weird (and probably impossible) case where a node really only
has a COW child, I’d say it’s correct that it has a disk size of 0 –
because it hasn’t COWed anything yet.  (Just like a new qcow2 image with
a backing file only has its metadata as its disk size.)

Max
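
A minimal sketch of the three cases listed above, using a toy node type
instead of QEMU's BlockDriverState. ToyNode, has_driver_size, cow_child
and toy_allocated_size are hypothetical names chosen for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CHILDREN 4

typedef struct ToyNode ToyNode;
struct ToyNode {
    int has_driver_size;    /* nonzero: the driver reports its own size */
    int64_t driver_size;    /* value reported when has_driver_size is set */
    ToyNode *children[TOY_MAX_CHILDREN];
    size_t nb_children;
    ToyNode *cow_child;     /* COW backing child, or NULL */
};

/* Fall back to the sum of all non-COW children if the driver has no answer. */
static int64_t toy_allocated_size(const ToyNode *node)
{
    int64_t total = 0;
    size_t i;

    assert(node != NULL);

    if (node->has_driver_size) {
        return node->driver_size;
    }

    if (node->nb_children == 0) {
        return -ENOTSUP;    /* case 1: size is unknown, not zero */
    }

    for (i = 0; i < node->nb_children; i++) {
        int64_t child_size;

        if (node->children[i] == node->cow_child) {
            continue;       /* ignore COW backing files */
        }

        child_size = toy_allocated_size(node->children[i]);
        if (child_size < 0) {
            return child_size;
        }
        total += child_size;
    }

    return total;           /* case 2 yields 0, case 3 yields the sum */
}
```

A node whose only child is its COW child ends up with a total of 0, which
matches the reading that it has not COWed anything yet.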


* Re: [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters
  2019-08-12 11:09   ` Vladimir Sementsov-Ogievskiy
@ 2019-08-12 13:26     ` Max Reitz
  2019-08-14 15:17       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-12 13:26 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 12.08.19 13:09, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:13, Max Reitz wrote:
>> This includes some permission limiting (for example, we only need to
>> take the RESIZE permission for active commits where the base is smaller
>> than the top).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/mirror.c | 117 ++++++++++++++++++++++++++++++++++++++-----------
>>   blockdev.c     |  47 +++++++++++++++++---
>>   2 files changed, 131 insertions(+), 33 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 54bafdf176..6ddbfb9708 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
> 
> 
> [..]
> 
>> @@ -1693,15 +1734,39 @@ static BlockJob *mirror_start_job(
>>       /* In commit_active_start() all intermediate nodes disappear, so
>>        * any jobs in them must be blocked */
>>       if (target_is_backing) {
>> -        BlockDriverState *iter;
>> -        for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) {
>> -            /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
>> -             * ourselves at s->base (if writes are blocked for a node, they are
>> -             * also blocked for its backing file). The other options would be a
>> -             * second filter driver above s->base (== target). */
>> +        BlockDriverState *iter, *filtered_target;
>> +        uint64_t iter_shared_perms;
>> +
>> +        /*
>> +         * The topmost node with
>> +         * bdrv_skip_rw_filters(filtered_target) == bdrv_skip_rw_filters(target)
>> +         */
>> +        filtered_target = bdrv_filtered_cow_bs(bdrv_find_overlay(bs, target));
>> +
>> +        assert(bdrv_skip_rw_filters(filtered_target) ==
>> +               bdrv_skip_rw_filters(target));
>> +
>> +        /*
>> +         * XXX BLK_PERM_WRITE needs to be allowed so we don't block
>> +         * ourselves at s->base (if writes are blocked for a node, they are
>> +         * also blocked for its backing file). The other options would be a
>> +         * second filter driver above s->base (== target).
>> +         */
>> +        iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
>> +
>> +        for (iter = bdrv_filtered_bs(bs); iter != target;
>> +             iter = bdrv_filtered_bs(iter))
>> +        {
>> +            if (iter == filtered_target) {
>> +                /*
>> +                 * From here on, all nodes are filters on the base.
>> +                 * This allows us to share BLK_PERM_CONSISTENT_READ.
>> +                 */
>> +                iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
> 
> 
> Hmm, I don't understand: why is read from the upper nodes not shared?

Because they don’t represent a consistent disk state during the commit.

Please don’t ask me details about CONSISTENT_READ, because I always
pretend I understand it, but I never really do, actually.

(My problem is that I do understand why the intermediate nodes shouldn’t
share CONSISTENT_READ: It’s because they only read garbage, effectively.
But I don’t understand how any block job target (like our base here)
can have CONSISTENT_READ.  Block job targets are mostly written front to
back (except with sync=none), so they too don’t “[represent] the
contents of a disk at a specific point.”
But that is how it was, so that is how it should be kept.)

If it makes you any happier, BLK_PERM_CONSISTENT_READ’s description
explicitly notes that it will not be shared on intermediate nodes of a
commit job.

Max

>> +            }
>> +
>>               ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
>> -                                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
>> -                                     errp);
>> +                                     iter_shared_perms, errp);
>>               if (ret < 0) {
>>                   goto fail;
>>               }
> 
> [..]
> 
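
A minimal sketch of how the shared permissions accumulate while walking
from the top node down to the target, using a toy chain. ToyNode, the
TOY_PERM_* values and toy_assign_shared_perms are hypothetical and do not
reflect QEMU's actual permission bit values:

```c
#include <assert.h>
#include <stddef.h>

/* Toy permission bits; the numeric values are arbitrary. */
enum {
    TOY_PERM_CONSISTENT_READ = 1 << 0,
    TOY_PERM_WRITE           = 1 << 1,
    TOY_PERM_WRITE_UNCHANGED = 1 << 2,
};

typedef struct ToyNode ToyNode;
struct ToyNode {
    ToyNode *filtered;       /* next node down the chain, or NULL */
    unsigned shared_perms;   /* permissions the job shares on this node */
};

/*
 * Walk the intermediate nodes between top and target.  Nodes above
 * filtered_target do not share CONSISTENT_READ; filtered_target and
 * everything below it (filters on the base) do.
 */
static void toy_assign_shared_perms(ToyNode *top, ToyNode *filtered_target,
                                    ToyNode *target)
{
    unsigned shared = TOY_PERM_WRITE_UNCHANGED | TOY_PERM_WRITE;
    ToyNode *iter;

    assert(top != NULL);

    for (iter = top->filtered; iter != target; iter = iter->filtered) {
        assert(iter != NULL);   /* target must be reachable from top */

        if (iter == filtered_target) {
            shared |= TOY_PERM_CONSISTENT_READ;
        }
        iter->shared_perms = shared;
    }
}
```

Because the flag is OR-ed into a running value, it stays set for every
node visited after filtered_target, which is the "from here on" in the
quoted comment.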



* Re: [Qemu-devel] [PATCH v6 30/42] qemu-img: Use child access functions
  2019-08-12 12:14   ` Vladimir Sementsov-Ogievskiy
@ 2019-08-12 13:28     ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-12 13:28 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 12.08.19 14:14, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:13, Max Reitz wrote:
>> This changes iotest 204's output, because blkdebug on top of a COW node
>> used to make qemu-img map disregard the rest of the backing chain (the
>> backing chain was broken by the filter).  With this patch, the
>> allocation in the base image is reported correctly.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   qemu-img.c                 | 33 ++++++++++++++++++++-------------
>>   tests/qemu-iotests/204.out |  1 +
>>   2 files changed, 21 insertions(+), 13 deletions(-)
>>
>> diff --git a/qemu-img.c b/qemu-img.c
>> index 79983772de..3b30c5ae70 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c

[...]

>> @@ -2490,7 +2492,8 @@ static int img_convert(int argc, char **argv)
>>            * s.target_backing_sectors has to be negative, which it will
>>            * be automatically).  The backing file length is used only
>>            * for optimizations, so such a case is not fatal. */
>> -        s.target_backing_sectors = bdrv_nb_sectors(out_bs->backing->bs);
>> +        s.target_backing_sectors =
>> +            bdrv_nb_sectors(bdrv_filtered_cow_bs(out_bs));
> 
> Why not skip_rw_filters? It will fail if out_bs is a filter.

Because I forgot this place. :-)  Although backing_chain_next() would be
simpler here and do the same, effectively.

Max
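
A minimal sketch of why taking the COW child directly fails when out_bs is
a filter, on a toy chain. It assumes that backing_chain_next() skips
filters, takes the COW child, and skips filters again; that composition is
an assumption here, and all toy_* names and ToyNode are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyNode ToyNode;
struct ToyNode {
    int is_filter;    /* nonzero: child is an R/W filtered child */
    ToyNode *child;   /* filtered child for filters, COW backing otherwise */
};

/* Skip any filters on top and return the first non-filter node. */
static ToyNode *toy_skip_rw_filters(ToyNode *node)
{
    while (node && node->is_filter) {
        assert(node->child != NULL);  /* a filter always has a child */
        node = node->child;
    }
    return node;
}

/* Return the COW child; a filter has none, so the result is NULL. */
static ToyNode *toy_filtered_cow_bs(ToyNode *node)
{
    if (!node || node->is_filter) {
        return NULL;
    }
    return node->child;
}

/* Skip filters, step over one COW link, then skip filters again. */
static ToyNode *toy_backing_chain_next(ToyNode *node)
{
    return toy_skip_rw_filters(toy_filtered_cow_bs(toy_skip_rw_filters(node)));
}
```

In this model, calling toy_filtered_cow_bs() on a filter yields NULL,
whereas toy_backing_chain_next() reaches the base even with filters on
both sides of the COW link.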


* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-08-12 13:09     ` Max Reitz
@ 2019-08-12 17:14       ` Vladimir Sementsov-Ogievskiy
  2019-08-12 19:15         ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-12 17:14 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

12.08.2019 16:09, Max Reitz wrote:
> On 10.08.19 18:41, Vladimir Sementsov-Ogievskiy wrote:
>> 09.08.2019 19:13, Max Reitz wrote:
>>> If the driver does not implement bdrv_get_allocated_file_size(), we
>>> should fall back to cumulating the allocated size of all non-COW
>>> children instead of just bs->file.
>>>
>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>    block.c | 22 ++++++++++++++++++++--
>>>    1 file changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index 1070aa1ba9..6e1ddab056 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -4650,9 +4650,27 @@ int64_t bdrv_get_allocated_file_size(BlockDriverState *bs)
>>>        if (drv->bdrv_get_allocated_file_size) {
>>>            return drv->bdrv_get_allocated_file_size(bs);
>>>        }
>>> -    if (bs->file) {
>>> -        return bdrv_get_allocated_file_size(bs->file->bs);
>>> +
>>> +    if (!QLIST_EMPTY(&bs->children)) {
>>> +        BdrvChild *child;
>>> +        int64_t child_size, total_size = 0;
>>> +
>>> +        QLIST_FOREACH(child, &bs->children, next) {
>>> +            if (child == bdrv_filtered_cow_child(bs)) {
>>> +                /* Ignore COW backing files */
>>> +                continue;
>>> +            }
>>> +
>>> +            child_size = bdrv_get_allocated_file_size(child->bs);
>>> +            if (child_size < 0) {
>>> +                return child_size;
>>> +            }
>>> +            total_size += child_size;
>>> +        }
>>> +
>>> +        return total_size;
>>>        }
>>> +
>>>        return -ENOTSUP;
>>>    }
>>>    
>>>
>>
>> Hmm..
>>
>> 1. No children -> -ENOTSUP
>> 2. Only cow child -> 0
>> 3. Some non-cow children -> SUM
>>
>> It's all arguable (the strictest way is -ENOTSUP in either case),
>> but if we want to fall back to the SUM of non-cow children, 1. and 2. should return
>> the same.
> 
> I don’t think 2 is possible at all.  If you have a COW child, you need
> some other child to COW to.
> 
> And in the weird (and probably impossible) case where a node really only
> has a COW child, I’d say it’s correct that it has a disk size of 0 –
> because it hasn’t COWed anything yet.  (Just like a new qcow2 image with
> a backing file only has its metadata as its disk size.)
> 

Agreed. Then why not return 0 in case [1]?

Also, another idea: shouldn't we return 0 for filters, i.e. skip filtered_rw_child too?
[As the filtered child is more like a backing child than a file child, it's "less owned" by its parent.]

With or without any of these suggestions:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir

* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-08-12 17:14       ` Vladimir Sementsov-Ogievskiy
@ 2019-08-12 19:15         ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-08-12 19:15 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 12.08.19 19:14, Vladimir Sementsov-Ogievskiy wrote:
> 12.08.2019 16:09, Max Reitz wrote:
>> On 10.08.19 18:41, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.08.2019 19:13, Max Reitz wrote:
>>>> If the driver does not implement bdrv_get_allocated_file_size(), we
>>>> should fall back to cumulating the allocated size of all non-COW
>>>> children instead of just bs->file.
>>>>
>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>    block.c | 22 ++++++++++++++++++++--
>>>>    1 file changed, 20 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index 1070aa1ba9..6e1ddab056 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -4650,9 +4650,27 @@ int64_t bdrv_get_allocated_file_size(BlockDriverState *bs)
>>>>        if (drv->bdrv_get_allocated_file_size) {
>>>>            return drv->bdrv_get_allocated_file_size(bs);
>>>>        }
>>>> -    if (bs->file) {
>>>> -        return bdrv_get_allocated_file_size(bs->file->bs);
>>>> +
>>>> +    if (!QLIST_EMPTY(&bs->children)) {
>>>> +        BdrvChild *child;
>>>> +        int64_t child_size, total_size = 0;
>>>> +
>>>> +        QLIST_FOREACH(child, &bs->children, next) {
>>>> +            if (child == bdrv_filtered_cow_child(bs)) {
>>>> +                /* Ignore COW backing files */
>>>> +                continue;
>>>> +            }
>>>> +
>>>> +            child_size = bdrv_get_allocated_file_size(child->bs);
>>>> +            if (child_size < 0) {
>>>> +                return child_size;
>>>> +            }
>>>> +            total_size += child_size;
>>>> +        }
>>>> +
>>>> +        return total_size;
>>>>        }
>>>> +
>>>>        return -ENOTSUP;
>>>>    }
>>>>    
>>>>
>>>
>>> Hmm..
>>>
>>> 1. No children -> -ENOTSUP
>>> 2. Only cow child -> 0
>>> 3. Some non-cow children -> SUM
>>>
>>> It's all arguable (the strictest way is -ENOTSUP in either case),
>>> but if we want to fallback to SUM of non-cow children, 1. and 2. should return
>>> the same.
>>
>> I don’t think 2 is possible at all.  If you have a COW child, you need
>> some other child to COW to.
>>
>> And in the weird (and probably impossible) case where a node really only
>> has a COW child, I’d say it’s correct that it has a disk size of 0 –
>> because it hasn’t COWed anything yet.  (Just like a new qcow2 image with
>> a backing file only has its metadata as its disk size.)
>>
> 
> Agreed. Then, why not return 0 on [1] ?

(1) Because that’s the current behavior. :-)

(2) Nodes that have no children are protocol nodes.  Protocol nodes
(apart from null) still have to store their data somewhere.  Therefore,
they must implement .bdrv_get_allocated_file_size() to report that.  If
they don’t, that doesn’t mean they don’t store any data, but only that
we don’t know how much data they store.

> Also, another idea: shouldn't we return 0 for filters, i.e. skip filtered_rw_child too?
> [as filtered-child is more like backing child than file one, it's "less owned" by its parent]

Why would we do that?  If I have a block device with a throttle node
attached to it and request how much space it uses, of course I will want
to know how much space the whole tree below it uses.

(Otherwise, bdrv_get_allocated_file_size() should only report anything
for protocol nodes, and 0 for everything else.)
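
To make the three cases from above concrete, here is a minimal sketch of
the fallback on a toy node type.  ToyNode and toy_allocated_size() are
made up for illustration and are not QEMU's BlockDriverState or block.c
code; only the control flow follows the hunk quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a block node; NOT QEMU's BlockDriverState. */
typedef struct ToyNode {
    int has_driver_size;          /* driver implements the size callback */
    int64_t driver_size;          /* value (or -errno) the driver reports */
    struct ToyNode *children[4];  /* children, NULL-terminated if fewer than 4 */
    struct ToyNode *cow_child;    /* the COW backing child, or NULL */
} ToyNode;

static int64_t toy_allocated_size(const ToyNode *n)
{
    int64_t total = 0;
    size_t i;

    /* A driver that can answer the question always wins. */
    if (n->has_driver_size) {
        return n->driver_size;
    }

    /* No children and no driver callback: the size is unknown, not 0. */
    if (n->children[0] == NULL) {
        return -ENOTSUP;
    }

    /* Otherwise sum up all children except the COW backing child. */
    for (i = 0; i < 4 && n->children[i]; i++) {
        int64_t child_size;

        if (n->children[i] == n->cow_child) {
            continue;
        }

        child_size = toy_allocated_size(n->children[i]);
        if (child_size < 0) {
            return child_size;   /* propagate the first error */
        }
        total += child_size;
    }

    return total;
}
```

With this model, a throttle-like node on top of a format node reports
the size of the whole tree below it (minus COW backing files), a node
with only a COW child reports 0, and a childless node without a driver
callback reports -ENOTSUP.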

Max

> with or without any of these suggestions:
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>





^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters
  2019-08-12 13:26     ` Max Reitz
@ 2019-08-14 15:17       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-14 15:17 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

12.08.2019 16:26, Max Reitz wrote:
> On 12.08.19 13:09, Vladimir Sementsov-Ogievskiy wrote:
>> 09.08.2019 19:13, Max Reitz wrote:
>>> This includes some permission limiting (for example, we only need to
>>> take the RESIZE permission for active commits where the base is smaller
>>> than the top).
>>>
>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>    block/mirror.c | 117 ++++++++++++++++++++++++++++++++++++++-----------
>>>    blockdev.c     |  47 +++++++++++++++++---
>>>    2 files changed, 131 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/block/mirror.c b/block/mirror.c
>>> index 54bafdf176..6ddbfb9708 100644
>>> --- a/block/mirror.c
>>> +++ b/block/mirror.c
>>
>>
>> [..]
>>
>>> @@ -1693,15 +1734,39 @@ static BlockJob *mirror_start_job(
>>>        /* In commit_active_start() all intermediate nodes disappear, so
>>>         * any jobs in them must be blocked */
>>>        if (target_is_backing) {
>>> -        BlockDriverState *iter;
>>> -        for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) {
>>> -            /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
>>> -             * ourselves at s->base (if writes are blocked for a node, they are
>>> -             * also blocked for its backing file). The other options would be a
>>> -             * second filter driver above s->base (== target). */
>>> +        BlockDriverState *iter, *filtered_target;
>>> +        uint64_t iter_shared_perms;
>>> +
>>> +        /*
>>> +         * The topmost node with
>>> +         * bdrv_skip_rw_filters(filtered_target) == bdrv_skip_rw_filters(target)
>>> +         */
>>> +        filtered_target = bdrv_filtered_cow_bs(bdrv_find_overlay(bs, target));
>>> +
>>> +        assert(bdrv_skip_rw_filters(filtered_target) ==
>>> +               bdrv_skip_rw_filters(target));
>>> +
>>> +        /*
>>> +         * XXX BLK_PERM_WRITE needs to be allowed so we don't block
>>> +         * ourselves at s->base (if writes are blocked for a node, they are
>>> +         * also blocked for its backing file). The other options would be a
>>> +         * second filter driver above s->base (== target).
>>> +         */
>>> +        iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
>>> +
>>> +        for (iter = bdrv_filtered_bs(bs); iter != target;
>>> +             iter = bdrv_filtered_bs(iter))
>>> +        {
>>> +            if (iter == filtered_target) {
>>> +                /*
>>> +                 * From here on, all nodes are filters on the base.
>>> +                 * This allows us to share BLK_PERM_CONSISTENT_READ.
>>> +                 */
>>> +                iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
>>
>>
>> Hmm, I don't understand, why read from upper nodes is not shared?
> 
> Because they don’t represent a consistent disk state during the commit.
> 
> Please don’t ask me details about CONSISTENT_READ, because I always
> pretend I understand it, but I never really do, actually.
> 
> (My problem is that I do understand why the intermediate nodes shouldn’t
> share CONSISTENT_READ: It’s because they only read garbage, effectively.
>   But I don’t understand how any block job target (like our base here)
> can have CONSISTENT_READ.

I know one such example: the image fleecing scheme, where the backup job's
source is the backing file of the target. If serialization of requests works
well, the target represents a consistent state of the disk at the point in
time when the backup started.

But yes, it's not about mirror or commit.
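
For the commit case itself, the way the shared permissions widen along the
chain can be sketched roughly as follows. This is only a toy model: the
TOY_PERM_* values are arbitrary and nodes are plain indices instead of
BlockDriverState pointers; only the loop shape follows the hunk quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy permission bits; names mirror the patch, values are arbitrary. */
enum {
    TOY_PERM_CONSISTENT_READ = 1 << 0,
    TOY_PERM_WRITE           = 1 << 1,
    TOY_PERM_WRITE_UNCHANGED = 1 << 2,
};

/*
 * Nodes below the source are numbered 0, 1, 2, ... from top to bottom.
 * 'target' is the index of the commit base; 'filtered_target' is the
 * index of the topmost node that is separated from the base by filters
 * only (so filtered_target <= target).  shared[i] receives the
 * permissions shared on intermediate node i; the return value is the
 * number of intermediate nodes.
 */
static size_t toy_intermediate_shared_perms(size_t target,
                                            size_t filtered_target,
                                            unsigned *shared)
{
    unsigned perms = TOY_PERM_WRITE_UNCHANGED | TOY_PERM_WRITE;
    size_t i;

    assert(filtered_target <= target);

    for (i = 0; i < target; i++) {
        if (i == filtered_target) {
            /* From here on, every node is just a filter on the base. */
            perms |= TOY_PERM_CONSISTENT_READ;
        }
        shared[i] = perms;
    }

    return target;
}
```

So nodes above filtered_target never share CONSISTENT_READ, while every
node from filtered_target down to (but excluding) the base does.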

>  Block job targets are mostly written front to
> back (except with sync=none), so they too don’t “[represent] the
> contents of a disk at a specific point.”
> But that is how it was, so that is how it should be kept.)
> 
> If it makes you any happier, BLK_PERM_CONSISTENT_READ’s description
> explicitly notes that it will not be shared on intermediate nodes of a
> commit job.
> 
> Max
> 
>>> +            }
>>> +
>>>                ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
>>> -                                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
>>> -                                     errp);
>>> +                                     iter_shared_perms, errp);
>>>                if (ret < 0) {
>>>                    goto fail;
>>>                }
>>
>> [..]
>>
> 
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Qemu-devel] [PATCH v6 30/42] qemu-img: Use child access functions
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 30/42] qemu-img: Use child access functions Max Reitz
  2019-08-12 12:14   ` Vladimir Sementsov-Ogievskiy
@ 2019-08-14 16:04   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-14 16:04 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> This changes iotest 204's output, because blkdebug on top of a COW node
> used to make qemu-img map disregard the rest of the backing chain (the
> backing chain was broken by the filter).  With this patch, the
> allocation in the base image is reported correctly.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   qemu-img.c                 | 33 ++++++++++++++++++++-------------
>   tests/qemu-iotests/204.out |  1 +
>   2 files changed, 21 insertions(+), 13 deletions(-)
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index 79983772de..3b30c5ae70 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1012,7 +1012,7 @@ static int img_commit(int argc, char **argv)
>           /* This is different from QMP, which by default uses the deepest file in
>            * the backing chain (i.e., the very base); however, the traditional
>            * behavior of qemu-img commit is using the immediate backing file. */
> -        base_bs = backing_bs(bs);
> +        base_bs = bdrv_backing_chain_next(bs);
>           if (!base_bs) {
>               error_setg(&local_err, "Image does not have a backing file");
>               goto done;
> @@ -1632,18 +1632,20 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
>       if (s->sector_next_status <= sector_num) {
>           uint64_t offset = (sector_num - src_cur_offset) * BDRV_SECTOR_SIZE;
>           int64_t count;
> +        BlockDriverState *src_bs = blk_bs(s->src[src_cur]);
> +        BlockDriverState *base;
> +
> +        if (s->target_has_backing) {
> +            base = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(src_bs));
> +        } else {
> +            base = NULL;
> +        }
>   
>           do {
>               count = n * BDRV_SECTOR_SIZE;
>   
> -            if (s->target_has_backing) {
> -                ret = bdrv_block_status(blk_bs(s->src[src_cur]), offset,
> -                                        count, &count, NULL, NULL);
> -            } else {
> -                ret = bdrv_block_status_above(blk_bs(s->src[src_cur]), NULL,
> -                                              offset, count, &count, NULL,
> -                                              NULL);
> -            }
> +            ret = bdrv_block_status_above(src_bs, base, offset, count, &count,
> +                                          NULL, NULL);
>   
>               if (ret < 0) {
>                   if (s->salvage) {
> @@ -2490,7 +2492,8 @@ static int img_convert(int argc, char **argv)
>            * s.target_backing_sectors has to be negative, which it will
>            * be automatically).  The backing file length is used only
>            * for optimizations, so such a case is not fatal. */
> -        s.target_backing_sectors = bdrv_nb_sectors(out_bs->backing->bs);
> +        s.target_backing_sectors =
> +            bdrv_nb_sectors(bdrv_filtered_cow_bs(out_bs));

bdrv_nb_sectors(bdrv_backing_chain_next(out_bs))

>       } else {
>           s.target_backing_sectors = -1;
>       }
> @@ -2853,6 +2856,7 @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
>   
>       depth = 0;
>       for (;;) {
> +        bs = bdrv_skip_rw_filters(bs);
>           ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
>           if (ret < 0) {
>               return ret;
> @@ -2861,7 +2865,7 @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
>           if (ret & (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)) {
>               break;
>           }
> -        bs = backing_bs(bs);
> +        bs = bdrv_filtered_cow_bs(bs);
>           if (bs == NULL) {
>               ret = 0;
>               break;
> @@ -3216,6 +3220,7 @@ static int img_rebase(int argc, char **argv)
>       uint8_t *buf_old = NULL;
>       uint8_t *buf_new = NULL;
>       BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
> +    BlockDriverState *unfiltered_bs;
>       char *filename;
>       const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
>       int c, flags, src_flags, ret;
> @@ -3350,6 +3355,8 @@ static int img_rebase(int argc, char **argv)
>       }
>       bs = blk_bs(blk);
>   
> +    unfiltered_bs = bdrv_skip_rw_filters(bs);
> +
>       if (out_basefmt != NULL) {
>           if (bdrv_find_format(out_basefmt) == NULL) {
>               error_report("Invalid format name: '%s'", out_basefmt);
> @@ -3361,7 +3368,7 @@ static int img_rebase(int argc, char **argv)
>       /* For safe rebasing we need to compare old and new backing file */
>       if (!unsafe) {
>           QDict *options = NULL;
> -        BlockDriverState *base_bs = backing_bs(bs);
> +        BlockDriverState *base_bs = bdrv_filtered_cow_bs(unfiltered_bs);
>   
>           if (base_bs) {
>               blk_old_backing = blk_new(qemu_get_aio_context(),
> @@ -3517,7 +3524,7 @@ static int img_rebase(int argc, char **argv)
>                    * If cluster wasn't changed since prefix_chain, we don't need
>                    * to take action
>                    */
> -                ret = bdrv_is_allocated_above(backing_bs(bs), prefix_chain_bs,
> +                ret = bdrv_is_allocated_above(unfiltered_bs, prefix_chain_bs,

s/unfiltered_bs/bdrv_filtered_cow_bs(unfiltered_bs)/

>                                                 false, offset, n, &n);
>                   if (ret < 0) {
>                       error_report("error while reading image metadata: %s",
> diff --git a/tests/qemu-iotests/204.out b/tests/qemu-iotests/204.out
> index f3a10fbe90..684774d763 100644
> --- a/tests/qemu-iotests/204.out
> +++ b/tests/qemu-iotests/204.out
> @@ -59,5 +59,6 @@ Offset          Length          File
>   0x900000        0x2400000       TEST_DIR/t.IMGFMT
>   0x3c00000       0x1100000       TEST_DIR/t.IMGFMT
>   0x6a00000       0x400000        TEST_DIR/t.IMGFMT
> +0x6e00000       0x1200000       TEST_DIR/t.IMGFMT.base
>   No errors were found on the image.
>   *** done
> 

With two fixes:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node()
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node() Max Reitz
@ 2019-08-15 15:21   ` Vladimir Sementsov-Ogievskiy
  2019-08-15 17:01     ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-15 15:21 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:14, Max Reitz wrote:
> Currently, check_to_replace_node() only allows mirror to replace a node
> in the chain of the source node, and only if it is the first non-filter
> node below the source.  Well, technically, the idea is that you can
> exactly replace a quorum child by mirroring from quorum.
> 
> This has (probably) two reasons:
> (1) We do not want to create loops.
> (2) @replaces and @device should have exactly the same content so
>      replacing them does not cause visible data to change.
> 
> This has two issues:
> (1) It is overly restrictive.  It is completely fine for @replaces to be
>      a filter.
> (2) It is not restrictive enough.  You can create loops with this as
>      follows:
> 
> $ qemu-img create -f qcow2 /tmp/source.qcow2 64M
> $ qemu-system-x86_64 -qmp stdio
> {"execute": "qmp_capabilities"}
> {"execute": "object-add",
>   "arguments": {"qom-type": "throttle-group", "id": "tg0"}}
> {"execute": "blockdev-add",
>   "arguments": {
>       "node-name": "source",
>       "driver": "throttle",
>       "throttle-group": "tg0",
>       "file": {
>           "node-name": "filtered",
>           "driver": "qcow2",
>           "file": {
>               "driver": "file",
>               "filename": "/tmp/source.qcow2"
>           } } } }
> {"execute": "drive-mirror",
>   "arguments": {
>       "job-id": "mirror",
>       "device": "source",
>       "target": "/tmp/target.qcow2",
>       "format": "qcow2",
>       "node-name": "target",
>       "sync" :"none",
>       "replaces": "filtered"
>   } }
> {"execute": "block-job-complete", "arguments": {"device": "mirror"}}
> 
> And qemu crashes because of a stack overflow due to the loop being
> created (target's backing file is source, so when it replaces filtered,
> it points to itself through source).
> 
> (blockdev-mirror can be broken similarly.)
> 
> So let us make the checks for the two conditions above explicit, which
> makes the whole function exactly as restrictive as it needs to be.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   include/block/block.h |  1 +
>   block.c               | 83 +++++++++++++++++++++++++++++++++++++++----
>   blockdev.c            | 34 ++++++++++++++++--
>   3 files changed, 110 insertions(+), 8 deletions(-)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index 6ba853fb90..8da706cd89 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -404,6 +404,7 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate);
>   
>   /* check if a named node can be replaced when doing drive-mirror */
>   BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
> +                                        BlockDriverState *backing_bs,
>                                           const char *node_name, Error **errp);
>   
>   /* async block I/O */
> diff --git a/block.c b/block.c
> index 915b80153c..4858d3e718 100644
> --- a/block.c
> +++ b/block.c
> @@ -6290,7 +6290,59 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate)
>       return false;
>   }
>   
> +static bool is_child_of(BlockDriverState *child, BlockDriverState *parent)
> +{
> +    BdrvChild *c;
> +
> +    if (!parent) {
> +        return false;
> +    }
> +
> +    QLIST_FOREACH(c, &parent->children, next) {
> +        if (c->bs == child || is_child_of(child, c->bs)) {
> +            return true;
> +        }
> +    }
> +
> +    return false;
> +}
> +
> +/*
> + * Return true if there are only filters in [@top, @base).  Note that
> + * this may include quorum (which bdrv_chain_contains() cannot
> + * handle).

More precisely: return true if there exists a chain of filters from top to
base, or if top == base.

I have the backup-top filter in mind:

[backup-top]
|          \target
|backing    -------->[target]
V                    /
[source]  <---------/backing

> + */
> +static bool is_filtered_child(BlockDriverState *top, BlockDriverState *base)
> +{
> +    BdrvChild *c;
> +
> +    if (!top) {
> +        return false;
> +    }
> +
> +    if (top == base) {
> +        return true;
> +    }
> +
> +    if (!top->drv->is_filter) {
> +        return false;
> +    }
> +
> +    QLIST_FOREACH(c, &top->children, next) {
> +        if (is_filtered_child(c->bs, base)) {
> +            return true;
> +        }
> +    }

It would be interesting to see whether it is better to somehow reuse the DFS written in should_update_child()..
[just a note, please don't do it in this series]

> +
> +    return false;
> +}
> +
> +/*
> + * @parent_bs is mirror's source BDS, @backing_bs is the BDS which
> + * will be attached to the target when mirror completes.
> + */
>   BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
> +                                        BlockDriverState *backing_bs,
>                                           const char *node_name, Error **errp)
>   {
>       BlockDriverState *to_replace_bs = bdrv_find_node(node_name);
> @@ -6309,13 +6361,32 @@ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>           goto out;
>       }
>   
> -    /* We don't want arbitrary node of the BDS chain to be replaced only the top
> -     * most non filter in order to prevent data corruption.
> -     * Another benefit is that this tests exclude backing files which are
> -     * blocked by the backing blockers.
> +    /*
> +     * If to_replace_bs is (recursively) a child of backing_bs,
> +     * replacing it may create a loop.  We cannot allow that.
>        */
> -    if (!bdrv_recurse_is_first_non_filter(parent_bs, to_replace_bs)) {
> -        error_setg(errp, "Only top most non filter can be replaced");
> +    if (to_replace_bs == backing_bs || is_child_of(to_replace_bs, backing_bs)) {

first condition is covered by second, so first may be omitted.

> +        error_setg(errp, "Replacing this node would result in a loop");
> +        to_replace_bs = NULL;
> +        goto out;
> +    }
> +
> +    /*
> +     * Mirror is designed in such a way that when it completes, the
> +     * source BDS is seamlessly replaced.  

Not source but to_replace_bs is replaced?

> It is therefore not allowed
> +     * to replace a BDS where this condition would be violated, as that
> +     * would defeat the purpose of mirror and could lead to data
> +     * corruption.
> +     * Therefore, between parent_bs and to_replace_bs there may be
> +     * only filters (and the one on top must be a filter, too), so
> +     * their data always stays in sync and mirror can complete and
> +     * replace to_replace_bs without any possible corruptions.
> +     */
> +    if (!is_filtered_child(parent_bs, to_replace_bs) &&
> +        !is_filtered_child(to_replace_bs, parent_bs))
> +    {
> +        error_setg(errp, "The node to be replaced must be connected to the "
> +                   "source through filter nodes only");

"and the one on top must be a filter, too" not mentioned in the error..

>           to_replace_bs = NULL;
>           goto out;
>       }
> diff --git a/blockdev.c b/blockdev.c
> index 4e72f6f701..758e0b5431 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -3887,7 +3887,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>       }
>   
>       if (has_replaces) {
> -        BlockDriverState *to_replace_bs;
> +        BlockDriverState *to_replace_bs, *backing_bs;
>           AioContext *replace_aio_context;
>           int64_t bs_size, replace_size;
>   
> @@ -3897,7 +3897,37 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>               return;
>           }
>   
> -        to_replace_bs = check_to_replace_node(bs, replaces, errp);
> +        if (backing_mode == MIRROR_SOURCE_BACKING_CHAIN ||
> +            backing_mode == MIRROR_OPEN_BACKING_CHAIN)
> +        {
> +            /*
> +             * While we do not quite know what OPEN_BACKING_CHAIN
> +             * (used for mode=existing) will yield, it is probably
> +             * best to restrict it exactly like SOURCE_BACKING_CHAIN,
> +             * because that is our best guess.
> +             */
> +            switch (sync) {
> +            case MIRROR_SYNC_MODE_FULL:
> +                backing_bs = NULL;
> +                break;
> +
> +            case MIRROR_SYNC_MODE_TOP:
> +                backing_bs = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs));

Why not bdrv_backing_chain_next(bs), like in mirror_start()?

> +                break;
> +
> +            case MIRROR_SYNC_MODE_NONE:
> +                backing_bs = bs;
> +                break;
> +
> +            default:
> +                abort();
> +            }
> +        } else {
> +            assert(backing_mode == MIRROR_LEAVE_BACKING_CHAIN);
> +            backing_bs = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(target));
> +        }
> +
> +        to_replace_bs = check_to_replace_node(bs, backing_bs, replaces, errp);
>           if (!to_replace_bs) {
>               return;
>           }
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node()
  2019-08-15 15:21   ` Vladimir Sementsov-Ogievskiy
@ 2019-08-15 17:01     ` Max Reitz
  2019-08-16 11:01       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-15 17:01 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 15.08.19 17:21, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:14, Max Reitz wrote:
>> Currently, check_to_replace_node() only allows mirror to replace a node
>> in the chain of the source node, and only if it is the first non-filter
>> node below the source.  Well, technically, the idea is that you can
>> exactly replace a quorum child by mirroring from quorum.
>>
>> This has (probably) two reasons:
>> (1) We do not want to create loops.
>> (2) @replaces and @device should have exactly the same content so
>>      replacing them does not cause visible data to change.
>>
>> This has two issues:
>> (1) It is overly restrictive.  It is completely fine for @replaces to be
>>      a filter.
>> (2) It is not restrictive enough.  You can create loops with this as
>>      follows:
>>
>> $ qemu-img create -f qcow2 /tmp/source.qcow2 64M
>> $ qemu-system-x86_64 -qmp stdio
>> {"execute": "qmp_capabilities"}
>> {"execute": "object-add",
>>   "arguments": {"qom-type": "throttle-group", "id": "tg0"}}
>> {"execute": "blockdev-add",
>>   "arguments": {
>>       "node-name": "source",
>>       "driver": "throttle",
>>       "throttle-group": "tg0",
>>       "file": {
>>           "node-name": "filtered",
>>           "driver": "qcow2",
>>           "file": {
>>               "driver": "file",
>>               "filename": "/tmp/source.qcow2"
>>           } } } }
>> {"execute": "drive-mirror",
>>   "arguments": {
>>       "job-id": "mirror",
>>       "device": "source",
>>       "target": "/tmp/target.qcow2",
>>       "format": "qcow2",
>>       "node-name": "target",
>>       "sync" :"none",
>>       "replaces": "filtered"
>>   } }
>> {"execute": "block-job-complete", "arguments": {"device": "mirror"}}
>>
>> And qemu crashes because of a stack overflow due to the loop being
>> created (target's backing file is source, so when it replaces filtered,
>> it points to itself through source).
>>
>> (blockdev-mirror can be broken similarly.)
>>
>> So let us make the checks for the two conditions above explicit, which
>> makes the whole function exactly as restrictive as it needs to be.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   include/block/block.h |  1 +
>>   block.c               | 83 +++++++++++++++++++++++++++++++++++++++----
>>   blockdev.c            | 34 ++++++++++++++++--
>>   3 files changed, 110 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 6ba853fb90..8da706cd89 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -404,6 +404,7 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate);
>>   
>>   /* check if a named node can be replaced when doing drive-mirror */
>>   BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>> +                                        BlockDriverState *backing_bs,
>>                                           const char *node_name, Error **errp);
>>   
>>   /* async block I/O */
>> diff --git a/block.c b/block.c
>> index 915b80153c..4858d3e718 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -6290,7 +6290,59 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate)
>>       return false;
>>   }
>>   
>> +static bool is_child_of(BlockDriverState *child, BlockDriverState *parent)
>> +{
>> +    BdrvChild *c;
>> +
>> +    if (!parent) {
>> +        return false;
>> +    }
>> +
>> +    QLIST_FOREACH(c, &parent->children, next) {
>> +        if (c->bs == child || is_child_of(child, c->bs)) {
>> +            return true;
>> +        }
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +/*
>> + * Return true if there are only filters in [@top, @base).  Note that
>> + * this may include quorum (which bdrv_chain_contains() cannot
>> + * handle).
> 
> More precisely: return true if there exists a chain of filters from top to
> base, or if top == base.
> 
> I have the backup-top filter in mind:
> 
> [backup-top]
> |          \target

backup-top can’t be a filter if it has two children with different
contents, though.

(commit-top and mirror-top aren’t filters either.)

That’s why there must be a unique chain [@top, @base).

I should probably note that it will return true if top == base, though, yes.
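
As a rough illustration of those semantics, here is a toy version of the
check.  FilterNode and toy_is_filtered_child() are invented for this
sketch and only mirror the shape of is_filtered_child() from the patch;
the real function works on BlockDriverState and BdrvChild.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy graph node; NOT QEMU's BlockDriverState. */
typedef struct FilterNode {
    bool is_filter;                  /* stands in for drv->is_filter */
    struct FilterNode *children[4];  /* children, NULL-terminated if fewer than 4 */
} FilterNode;

/*
 * Return true if top == base, or if base can be reached from top by
 * passing through filter nodes only (top itself included).
 */
static bool toy_is_filtered_child(const FilterNode *top,
                                  const FilterNode *base)
{
    size_t i;

    if (!top) {
        return false;
    }
    if (top == base) {
        return true;
    }
    if (!top->is_filter) {
        return false;   /* a non-filter in between breaks the chain */
    }

    for (i = 0; i < 4 && top->children[i]; i++) {
        if (toy_is_filtered_child(top->children[i], base)) {
            return true;
        }
    }

    return false;
}
```

In this model, two stacked filters above a format node pass the check,
a non-filter overlay above the same node does not, and any node passes
against itself.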

> |backing    -------->[target]
> V                    /
> [source]  <---------/backing
> 
>> + */
>> +static bool is_filtered_child(BlockDriverState *top, BlockDriverState *base)
>> +{
>> +    BdrvChild *c;
>> +
>> +    if (!top) {
>> +        return false;
>> +    }
>> +
>> +    if (top == base) {
>> +        return true;
>> +    }
>> +
>> +    if (!top->drv->is_filter) {
>> +        return false;
>> +    }
>> +
>> +    QLIST_FOREACH(c, &top->children, next) {
>> +        if (is_filtered_child(c->bs, base)) {
>> +            return true;
>> +        }
>> +    }
> 
> It would be interesting to see whether it is better to somehow reuse the DFS written in should_update_child()..
> [just a note, please don't do it in this series]
> 
>> +
>> +    return false;
>> +}
>> +
>> +/*
>> + * @parent_bs is mirror's source BDS, @backing_bs is the BDS which
>> + * will be attached to the target when mirror completes.
>> + */
>>   BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>> +                                        BlockDriverState *backing_bs,
>>                                           const char *node_name, Error **errp)
>>   {
>>       BlockDriverState *to_replace_bs = bdrv_find_node(node_name);
>> @@ -6309,13 +6361,32 @@ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>           goto out;
>>       }
>>   
>> -    /* We don't want arbitrary node of the BDS chain to be replaced only the top
>> -     * most non filter in order to prevent data corruption.
>> -     * Another benefit is that this tests exclude backing files which are
>> -     * blocked by the backing blockers.
>> +    /*
>> +     * If to_replace_bs is (recursively) a child of backing_bs,
>> +     * replacing it may create a loop.  We cannot allow that.
>>        */
>> -    if (!bdrv_recurse_is_first_non_filter(parent_bs, to_replace_bs)) {
>> -        error_setg(errp, "Only top most non filter can be replaced");
>> +    if (to_replace_bs == backing_bs || is_child_of(to_replace_bs, backing_bs)) {
> 
> first condition is covered by second, so first may be omitted.

It is not.  is_child_of() does not return true if child == parent.
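
A small stand-alone sketch of why: the recursion only ever compares
@child against the children of @parent, never against @parent itself.
ChildNode and toy_is_child_of() are toy names for this example, not the
code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy graph node; NOT QEMU's BlockDriverState. */
typedef struct ChildNode {
    struct ChildNode *children[4];  /* children, NULL-terminated if fewer than 4 */
} ChildNode;

/* True if @child is a direct or indirect child of @parent. */
static bool toy_is_child_of(const ChildNode *child, const ChildNode *parent)
{
    size_t i;

    if (!parent) {
        return false;
    }

    for (i = 0; i < 4 && parent->children[i]; i++) {
        const ChildNode *c = parent->children[i];

        /* Only children of @parent are compared, never @parent itself. */
        if (c == child || toy_is_child_of(child, c)) {
            return true;
        }
    }

    return false;
}
```

In a loop-free graph, toy_is_child_of(x, x) is therefore always false,
which is why the explicit to_replace_bs == backing_bs comparison is
still needed.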

>> +        error_setg(errp, "Replacing this node would result in a loop");
>> +        to_replace_bs = NULL;
>> +        goto out;
>> +    }
>> +
>> +    /*
>> +     * Mirror is designed in such a way that when it completes, the
>> +     * source BDS is seamlessly replaced.  
> 
> Not source but to_replace_bs is replaced?

It has originally been designed to replace the source.  If it could
replace any arbitrary BDS, all of this would be moot.

>> It is therefore not allowed
>> +     * to replace a BDS where this condition would be violated, as that
>> +     * would defeat the purpose of mirror and could lead to data
>> +     * corruption.
>> +     * Therefore, between parent_bs and to_replace_bs there may be
>> +     * only filters (and the one on top must be a filter, too), so
>> +     * their data always stays in sync and mirror can complete and
>> +     * replace to_replace_bs without any possible corruptions.
>> +     */
>> +    if (!is_filtered_child(parent_bs, to_replace_bs) &&
>> +        !is_filtered_child(to_replace_bs, parent_bs))
>> +    {
>> +        error_setg(errp, "The node to be replaced must be connected to the "
>> +                   "source through filter nodes only");
> 
> "and the one on top must be a filter, too" not mentioned in the error..

Well, unless the source node is the node to be replaced.  Hm...  This
gets very hard to express.  I think I’d prefer to keep this as it is,
even though it is not quite correct, unless you have a better suggestion
of what to report. :-/

>>           to_replace_bs = NULL;
>>           goto out;
>>       }
>> diff --git a/blockdev.c b/blockdev.c
>> index 4e72f6f701..758e0b5431 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -3887,7 +3887,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>       }
>>   
>>       if (has_replaces) {
>> -        BlockDriverState *to_replace_bs;
>> +        BlockDriverState *to_replace_bs, *backing_bs;
>>           AioContext *replace_aio_context;
>>           int64_t bs_size, replace_size;
>>   
>> @@ -3897,7 +3897,37 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>               return;
>>           }
>>   
>> -        to_replace_bs = check_to_replace_node(bs, replaces, errp);
>> +        if (backing_mode == MIRROR_SOURCE_BACKING_CHAIN ||
>> +            backing_mode == MIRROR_OPEN_BACKING_CHAIN)
>> +        {
>> +            /*
>> +             * While we do not quite know what OPEN_BACKING_CHAIN
>> +             * (used for mode=existing) will yield, it is probably
>> +             * best to restrict it exactly like SOURCE_BACKING_CHAIN,
>> +             * because that is our best guess.
>> +             */
>> +            switch (sync) {
>> +            case MIRROR_SYNC_MODE_FULL:
>> +                backing_bs = NULL;
>> +                break;
>> +
>> +            case MIRROR_SYNC_MODE_TOP:
>> +                backing_bs = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs));
> 
> why not  bdrv_backing_chain_next(bs) like in mirror_start?

Good question.  I suppose it should be
bdrv_filtered_cow_bs(bdrv_backing_chain_next(bs)) in mirror_start()?
Because with sync=top, we just want to remove the topmost COW node (and
filters on top), but keep filters behind it.

Max

>> +                break;
>> +
>> +            case MIRROR_SYNC_MODE_NONE:
>> +                backing_bs = bs;
>> +                break;
>> +
>> +            default:
>> +                abort();
>> +            }
>> +        } else {
>> +            assert(backing_mode == MIRROR_LEAVE_BACKING_CHAIN);
>> +            backing_bs = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(target));
>> +        }
>> +
>> +        to_replace_bs = check_to_replace_node(bs, backing_bs, replaces, errp);
>>           if (!to_replace_bs) {
>>               return;
>>           }
>>
> 
> 




* Re: [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node()
  2019-08-15 17:01     ` Max Reitz
@ 2019-08-16 11:01       ` Vladimir Sementsov-Ogievskiy
  2019-08-16 13:30         ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-16 11:01 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

15.08.2019 20:01, Max Reitz wrote:
> On 15.08.19 17:21, Vladimir Sementsov-Ogievskiy wrote:
>> 09.08.2019 19:14, Max Reitz wrote:
>>> Currently, check_to_replace_node() only allows mirror to replace a node
>>> in the chain of the source node, and only if it is the first non-filter
>>> node below the source.  Well, technically, the idea is that you can
>>> exactly replace a quorum child by mirroring from quorum.
>>>
>>> This has (probably) two reasons:
>>> (1) We do not want to create loops.
>>> (2) @replaces and @device should have exactly the same content so
>>>       replacing them does not cause visible data to change.
>>>
>>> This has two issues:
>>> (1) It is overly restrictive.  It is completely fine for @replaces to be
>>>       a filter.
>>> (2) It is not restrictive enough.  You can create loops with this as
>>>       follows:
>>>
>>> $ qemu-img create -f qcow2 /tmp/source.qcow2 64M
>>> $ qemu-system-x86_64 -qmp stdio
>>> {"execute": "qmp_capabilities"}
>>> {"execute": "object-add",
>>>    "arguments": {"qom-type": "throttle-group", "id": "tg0"}}
>>> {"execute": "blockdev-add",
>>>    "arguments": {
>>>        "node-name": "source",
>>>        "driver": "throttle",
>>>        "throttle-group": "tg0",
>>>        "file": {
>>>            "node-name": "filtered",
>>>            "driver": "qcow2",
>>>            "file": {
>>>                "driver": "file",
>>>                "filename": "/tmp/source.qcow2"
>>>            } } } }
>>> {"execute": "drive-mirror",
>>>    "arguments": {
>>>        "job-id": "mirror",
>>>        "device": "source",
>>>        "target": "/tmp/target.qcow2",
>>>        "format": "qcow2",
>>>        "node-name": "target",
>>>        "sync" :"none",
>>>        "replaces": "filtered"
>>>    } }
>>> {"execute": "block-job-complete", "arguments": {"device": "mirror"}}
>>>
>>> And qemu crashes because of a stack overflow due to the loop being
>>> created (target's backing file is source, so when it replaces filtered,
>>> it points to itself through source).
>>>
>>> (blockdev-mirror can be broken similarly.)
>>>
>>> So let us make the checks for the two conditions above explicit, which
>>> makes the whole function exactly as restrictive as it needs to be.
>>>
>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>    include/block/block.h |  1 +
>>>    block.c               | 83 +++++++++++++++++++++++++++++++++++++++----
>>>    blockdev.c            | 34 ++++++++++++++++--
>>>    3 files changed, 110 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/include/block/block.h b/include/block/block.h
>>> index 6ba853fb90..8da706cd89 100644
>>> --- a/include/block/block.h
>>> +++ b/include/block/block.h
>>> @@ -404,6 +404,7 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate);
>>>    
>>>    /* check if a named node can be replaced when doing drive-mirror */
>>>    BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>> +                                        BlockDriverState *backing_bs,
>>>                                            const char *node_name, Error **errp);
>>>    
>>>    /* async block I/O */
>>> diff --git a/block.c b/block.c
>>> index 915b80153c..4858d3e718 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -6290,7 +6290,59 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate)
>>>        return false;
>>>    }
>>>    
>>> +static bool is_child_of(BlockDriverState *child, BlockDriverState *parent)
>>> +{
>>> +    BdrvChild *c;
>>> +
>>> +    if (!parent) {
>>> +        return false;
>>> +    }
>>> +
>>> +    QLIST_FOREACH(c, &parent->children, next) {
>>> +        if (c->bs == child || is_child_of(child, c->bs)) {
>>> +            return true;
>>> +        }
>>> +    }
>>> +
>>> +    return false;
>>> +}
>>> +
>>> +/*
>>> + * Return true if there are only filters in [@top, @base).  Note that
>>> + * this may include quorum (which bdrv_chain_contains() cannot
>>> + * handle).
>>
>> More precisely: return true if a chain of filters exists from top to base, or if
>> top == base.
>>
>> I keep in mind backup-top filter:
>>
>> [backup-top]
>> |          \target
> 
> backup-top can’t be a filter if it has two children with different
> contents, though.

Why?  The target is a special child, unrelated to what is read or written through backup-top.
It is backup-top's own business.

> 
> (commit-top and mirror-top aren’t filters either.)

Hm, I missed something.  They have is_filter = true, and their children are considered
to be filtered-rw children in your series?  And then, what are they?  Format nodes?
And how do they appear in backing chains then?

> 
> That’s why there must be a unique chain [@top, @base).
> 
> I should probably note that it will return true if top == base, though, yes.
> 
>> |backing    -------->[target]
>> V                    /
>> [source]  <---------/backing
>>
>>> + */
>>> +static bool is_filtered_child(BlockDriverState *top, BlockDriverState *base)
>>> +{
>>> +    BdrvChild *c;
>>> +
>>> +    if (!top) {
>>> +        return false;
>>> +    }
>>> +
>>> +    if (top == base) {
>>> +        return true;
>>> +    }
>>> +
>>> +    if (!top->drv->is_filter) {
>>> +        return false;
>>> +    }
>>> +
>>> +    QLIST_FOREACH(c, &top->children, next) {
>>> +        if (is_filtered_child(c->bs, base)) {
>>> +            return true;
>>> +        }
>>> +    }
>>
>> Interesting: how much better would it be to somehow reuse the DFS written in should_update_child()?
>> [just a note, please don't do it in this series]
>>
>>> +
>>> +    return false;
>>> +}
>>> +
>>> +/*
>>> + * @parent_bs is mirror's source BDS, @backing_bs is the BDS which
>>> + * will be attached to the target when mirror completes.
>>> + */
>>>    BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>> +                                        BlockDriverState *backing_bs,
>>>                                            const char *node_name, Error **errp)
>>>    {
>>>        BlockDriverState *to_replace_bs = bdrv_find_node(node_name);
>>> @@ -6309,13 +6361,32 @@ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>>            goto out;
>>>        }
>>>    
>>> -    /* We don't want arbitrary node of the BDS chain to be replaced only the top
>>> -     * most non filter in order to prevent data corruption.
>>> -     * Another benefit is that this tests exclude backing files which are
>>> -     * blocked by the backing blockers.
>>> +    /*
>>> +     * If to_replace_bs is (recursively) a child of backing_bs,
>>> +     * replacing it may create a loop.  We cannot allow that.
>>>         */
>>> -    if (!bdrv_recurse_is_first_non_filter(parent_bs, to_replace_bs)) {
>>> -        error_setg(errp, "Only top most non filter can be replaced");
>>> +    if (to_replace_bs == backing_bs || is_child_of(to_replace_bs, backing_bs)) {
>>
>> first condition is covered by second, so first may be omitted.
> 
> It is not.  is_child_of() does not return true if child == parent.
> 
>>> +        error_setg(errp, "Replacing this node would result in a loop");
>>> +        to_replace_bs = NULL;
>>> +        goto out;
>>> +    }
>>> +
>>> +    /*
>>> +     * Mirror is designed in such a way that when it completes, the
>>> +     * source BDS is seamlessly replaced.
>>
>> Not source but to_replace_bs is replaced?
> 
> It has originally been designed to replace the source.  If it could
> replace any arbitrary BDS, all of this would be moot.

And the quorum child you mention in the commit message?

> 
>>> It is therefore not allowed
>>> +     * to replace a BDS where this condition would be violated, as that
>>> +     * would defeat the purpose of mirror and could lead to data
>>> +     * corruption.
>>> +     * Therefore, between parent_bs and to_replace_bs there may be
>>> +     * only filters (and the one on top must be a filter, too), so
>>> +     * their data always stays in sync and mirror can complete and
>>> +     * replace to_replace_bs without any possible corruptions.
>>> +     */
>>> +    if (!is_filtered_child(parent_bs, to_replace_bs) &&
>>> +        !is_filtered_child(to_replace_bs, parent_bs))
>>> +    {
>>> +        error_setg(errp, "The node to be replaced must be connected to the "
>>> +                   "source through filter nodes only");
>>
>> "and the one on top must be a filter, too" not mentioned in the error..
> 
> Well, unless the source node is the node to be replaced.  Hm...  This
> gets very hard to express.  I think I’d prefer to keep this as it is,
> even though it is not quite correct, unless you have a better suggestion
> of what to report. :-/

I can't imagine anything better than just adding "(and the one on top must be a filter, too)".

> 
>>>            to_replace_bs = NULL;
>>>            goto out;
>>>        }
>>> diff --git a/blockdev.c b/blockdev.c
>>> index 4e72f6f701..758e0b5431 100644
>>> --- a/blockdev.c
>>> +++ b/blockdev.c
>>> @@ -3887,7 +3887,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>>        }
>>>    
>>>        if (has_replaces) {
>>> -        BlockDriverState *to_replace_bs;
>>> +        BlockDriverState *to_replace_bs, *backing_bs;
>>>            AioContext *replace_aio_context;
>>>            int64_t bs_size, replace_size;
>>>    
>>> @@ -3897,7 +3897,37 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>>                return;
>>>            }
>>>    
>>> -        to_replace_bs = check_to_replace_node(bs, replaces, errp);
>>> +        if (backing_mode == MIRROR_SOURCE_BACKING_CHAIN ||
>>> +            backing_mode == MIRROR_OPEN_BACKING_CHAIN)
>>> +        {
>>> +            /*
>>> +             * While we do not quite know what OPEN_BACKING_CHAIN
>>> +             * (used for mode=existing) will yield, it is probably
>>> +             * best to restrict it exactly like SOURCE_BACKING_CHAIN,
>>> +             * because that is our best guess.
>>> +             */
>>> +            switch (sync) {
>>> +            case MIRROR_SYNC_MODE_FULL:
>>> +                backing_bs = NULL;
>>> +                break;
>>> +
>>> +            case MIRROR_SYNC_MODE_TOP:
>>> +                backing_bs = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs));
>>
>> why not  bdrv_backing_chain_next(bs) like in mirror_start?
> 
> Good question.  I suppose it should be
> bdrv_filtered_cow_bs(bdrv_backing_chain_next(bs)) in mirror_start()?

You mean bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs)), I hope. :)

> Because with sync=top, we just want to remove the topmost COW node (and
> filters on top), but keep filters behind it.
> 

Agreed.

> 
>>> +                break;
>>> +
>>> +            case MIRROR_SYNC_MODE_NONE:
>>> +                backing_bs = bs;
>>> +                break;
>>> +
>>> +            default:
>>> +                abort();
>>> +            }
>>> +        } else {
>>> +            assert(backing_mode == MIRROR_LEAVE_BACKING_CHAIN);
>>> +            backing_bs = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(target));
>>> +        }
>>> +
>>> +        to_replace_bs = check_to_replace_node(bs, backing_bs, replaces, errp);
>>>            if (!to_replace_bs) {
>>>                return;
>>>            }
>>>
>>
>>
> 
> 


-- 
Best regards,
Vladimir

* Re: [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node()
  2019-08-16 11:01       ` Vladimir Sementsov-Ogievskiy
@ 2019-08-16 13:30         ` Max Reitz
  2019-08-16 14:24           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-08-16 13:30 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 16.08.19 13:01, Vladimir Sementsov-Ogievskiy wrote:
> 15.08.2019 20:01, Max Reitz wrote:
>> On 15.08.19 17:21, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.08.2019 19:14, Max Reitz wrote:
>>>> Currently, check_to_replace_node() only allows mirror to replace a node
>>>> in the chain of the source node, and only if it is the first non-filter
>>>> node below the source.  Well, technically, the idea is that you can
>>>> exactly replace a quorum child by mirroring from quorum.
>>>>
>>>> This has (probably) two reasons:
>>>> (1) We do not want to create loops.
>>>> (2) @replaces and @device should have exactly the same content so
>>>>       replacing them does not cause visible data to change.
>>>>
>>>> This has two issues:
>>>> (1) It is overly restrictive.  It is completely fine for @replaces to be
>>>>       a filter.
>>>> (2) It is not restrictive enough.  You can create loops with this as
>>>>       follows:
>>>>
>>>> $ qemu-img create -f qcow2 /tmp/source.qcow2 64M
>>>> $ qemu-system-x86_64 -qmp stdio
>>>> {"execute": "qmp_capabilities"}
>>>> {"execute": "object-add",
>>>>    "arguments": {"qom-type": "throttle-group", "id": "tg0"}}
>>>> {"execute": "blockdev-add",
>>>>    "arguments": {
>>>>        "node-name": "source",
>>>>        "driver": "throttle",
>>>>        "throttle-group": "tg0",
>>>>        "file": {
>>>>            "node-name": "filtered",
>>>>            "driver": "qcow2",
>>>>            "file": {
>>>>                "driver": "file",
>>>>                "filename": "/tmp/source.qcow2"
>>>>            } } } }
>>>> {"execute": "drive-mirror",
>>>>    "arguments": {
>>>>        "job-id": "mirror",
>>>>        "device": "source",
>>>>        "target": "/tmp/target.qcow2",
>>>>        "format": "qcow2",
>>>>        "node-name": "target",
>>>>        "sync" :"none",
>>>>        "replaces": "filtered"
>>>>    } }
>>>> {"execute": "block-job-complete", "arguments": {"device": "mirror"}}
>>>>
>>>> And qemu crashes because of a stack overflow due to the loop being
>>>> created (target's backing file is source, so when it replaces filtered,
>>>> it points to itself through source).
>>>>
>>>> (blockdev-mirror can be broken similarly.)
>>>>
>>>> So let us make the checks for the two conditions above explicit, which
>>>> makes the whole function exactly as restrictive as it needs to be.
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>    include/block/block.h |  1 +
>>>>    block.c               | 83 +++++++++++++++++++++++++++++++++++++++----
>>>>    blockdev.c            | 34 ++++++++++++++++--
>>>>    3 files changed, 110 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/include/block/block.h b/include/block/block.h
>>>> index 6ba853fb90..8da706cd89 100644
>>>> --- a/include/block/block.h
>>>> +++ b/include/block/block.h
>>>> @@ -404,6 +404,7 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate);
>>>>    
>>>>    /* check if a named node can be replaced when doing drive-mirror */
>>>>    BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>>> +                                        BlockDriverState *backing_bs,
>>>>                                            const char *node_name, Error **errp);
>>>>    
>>>>    /* async block I/O */
>>>> diff --git a/block.c b/block.c
>>>> index 915b80153c..4858d3e718 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -6290,7 +6290,59 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate)
>>>>        return false;
>>>>    }
>>>>    
>>>> +static bool is_child_of(BlockDriverState *child, BlockDriverState *parent)
>>>> +{
>>>> +    BdrvChild *c;
>>>> +
>>>> +    if (!parent) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    QLIST_FOREACH(c, &parent->children, next) {
>>>> +        if (c->bs == child || is_child_of(child, c->bs)) {
>>>> +            return true;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return false;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Return true if there are only filters in [@top, @base).  Note that
>>>> + * this may include quorum (which bdrv_chain_contains() cannot
>>>> + * handle).
>>>
>>> More precisely: return true if a chain of filters exists from top to base, or if
>>> top == base.
>>>
>>> I keep in mind backup-top filter:
>>>
>>> [backup-top]
>>> |          \target
>>
>> backup-top can’t be a filter if it has two children with different
>> contents, though.
> 
> Why?  The target is a special child, unrelated to what is read or written through backup-top.
> It is backup-top's own business.
> 
>>
>> (commit-top and mirror-top aren’t filters either.)
> 
> Hm, I missed something.  They have is_filter = true, and their children are considered
> to be filtered-rw children in your series?  And then, what are they?  Format nodes?
> And how do they appear in backing chains then?

Er, right, I remember, I made them filters in patch 1 of this series. m( :-)

But the chain would still be unique, in a sense, because backup-top only
has one filtered child, so you could go down the chain with
bdrv_filtered_rw_child().

This function doesn’t do that because of Quorum, which is actually a
better example.  All of its children are filtered, so we must consider
all of them.

But backup-top is actually a reason why this function is wrong as it is;
the target is not a filtered child, so it shouldn’t return true there.

Hmmmm.

Actually, bdrv_recurse_is_first_non_filter() does nearly what we want.
(Which is why it was used here.)  The only problem is that it expects
@candidate to be a non-filter (as the name implies).  But we don’t care
about that, actually.

I suppose I can just turn bdrv_recurse_is_first_non_filter() into
bdrv_is_child_of(); it has only two callers, one is here, the other is
bdrv_is_first_non_filter().  In the latter, we can just check whether
@candidate is a filter and return false if it isn’t.
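
(To make the backup-top problem concrete, here is a minimal sketch.
It is not the bdrv_is_child_of() rework proposed above; MockNode,
MockChild and the "filtered" flag on each edge are hypothetical
stand-ins, assumed only for illustration.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_MAX_CHILDREN 4

typedef struct MockNode MockNode;

/*
 * Hypothetical edge type.  @filtered marks a child whose data the
 * parent presents unchanged (an assumption standing in for the
 * series' notion of a filtered child).
 */
typedef struct MockChild {
    MockNode *bs;
    bool filtered;
} MockChild;

struct MockNode {
    const char *name;
    bool is_filter;
    MockChild children[MOCK_MAX_CHILDREN];
    size_t n_children;
};

static void mock_add_child(MockNode *parent, MockNode *child, bool filtered)
{
    assert(parent->n_children < MOCK_MAX_CHILDREN);
    parent->children[parent->n_children].bs = child;
    parent->children[parent->n_children].filtered = filtered;
    parent->n_children++;
}

/* As is_filtered_child() in the patch: follows every child of a filter. */
static bool mock_filtered_child_any(const MockNode *top, const MockNode *base)
{
    size_t i;

    if (!top) {
        return false;
    }
    if (top == base) {
        return true;
    }
    if (!top->is_filter) {
        return false;
    }

    for (i = 0; i < top->n_children; i++) {
        if (mock_filtered_child_any(top->children[i].bs, base)) {
            return true;
        }
    }

    return false;
}

/* Variant: only follows edges that are marked as filtered. */
static bool mock_filtered_child_strict(const MockNode *top,
                                       const MockNode *base)
{
    size_t i;

    if (!top) {
        return false;
    }
    if (top == base) {
        return true;
    }
    if (!top->is_filter) {
        return false;
    }

    for (i = 0; i < top->n_children; i++) {
        if (top->children[i].filtered &&
            mock_filtered_child_strict(top->children[i].bs, base)) {
            return true;
        }
    }

    return false;
}
```

For a backup-top node with a filtered source child and a non-filtered
target child, the first function accepts (backup-top, target) while
the second rejects it.  A quorum-like node whose children are all
filtered is still handled by both, because every filtered edge is
considered.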

>> That’s why there must be a unique chain [@top, @base).
>>
>> I should probably note that it will return true if top == base, though, yes.
>>
>>> |backing    -------->[target]
>>> V                    /
>>> [source]  <---------/backing
>>>
>>>> + */
>>>> +static bool is_filtered_child(BlockDriverState *top, BlockDriverState *base)
>>>> +{
>>>> +    BdrvChild *c;
>>>> +
>>>> +    if (!top) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    if (top == base) {
>>>> +        return true;
>>>> +    }
>>>> +
>>>> +    if (!top->drv->is_filter) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    QLIST_FOREACH(c, &top->children, next) {
>>>> +        if (is_filtered_child(c->bs, base)) {
>>>> +            return true;
>>>> +        }
>>>> +    }
>>>
>>> Interesting: how much better would it be to somehow reuse the DFS written in should_update_child()?
>>> [just a note, please don't do it in this series]
>>>
>>>> +
>>>> +    return false;
>>>> +}
>>>> +
>>>> +/*
>>>> + * @parent_bs is mirror's source BDS, @backing_bs is the BDS which
>>>> + * will be attached to the target when mirror completes.
>>>> + */
>>>>    BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>>> +                                        BlockDriverState *backing_bs,
>>>>                                            const char *node_name, Error **errp)
>>>>    {
>>>>        BlockDriverState *to_replace_bs = bdrv_find_node(node_name);
>>>> @@ -6309,13 +6361,32 @@ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>>>            goto out;
>>>>        }
>>>>    
>>>> -    /* We don't want arbitrary node of the BDS chain to be replaced only the top
>>>> -     * most non filter in order to prevent data corruption.
>>>> -     * Another benefit is that this tests exclude backing files which are
>>>> -     * blocked by the backing blockers.
>>>> +    /*
>>>> +     * If to_replace_bs is (recursively) a child of backing_bs,
>>>> +     * replacing it may create a loop.  We cannot allow that.
>>>>         */
>>>> -    if (!bdrv_recurse_is_first_non_filter(parent_bs, to_replace_bs)) {
>>>> -        error_setg(errp, "Only top most non filter can be replaced");
>>>> +    if (to_replace_bs == backing_bs || is_child_of(to_replace_bs, backing_bs)) {
>>>
>>> first condition is covered by second, so first may be omitted.
>>
>> It is not.  is_child_of() does not return true if child == parent.
>>
>>>> +        error_setg(errp, "Replacing this node would result in a loop");
>>>> +        to_replace_bs = NULL;
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Mirror is designed in such a way that when it completes, the
>>>> +     * source BDS is seamlessly replaced.
>>>
>>> Not source but to_replace_bs is replaced?
>>
>> It has originally been designed to replace the source.  If it could
>> replace any arbitrary BDS, all of this would be moot.
> 
> And the quorum child you mention in the commit message?

Which is not any arbitrary BDS, but one that looks exactly like the source.

My point is, mirror has been *designed* to replace the source
seamlessly.  It can do more things today, but that was its original point.

That means that the target must be exactly the same as the source.  And
then we come to this:

>>>> It is therefore not allowed
>>>> +     * to replace a BDS where this condition would be violated, as that
>>>> +     * would defeat the purpose of mirror and could lead to data
>>>> +     * corruption.
>>>> +     * Therefore, between parent_bs and to_replace_bs there may be
>>>> +     * only filters (and the one on top must be a filter, too), so
>>>> +     * their data always stays in sync and mirror can complete and
>>>> +     * replace to_replace_bs without any possible corruptions.

So replacing a node that’s connected to the source only through filters
is fine because that means the replaced node will also have the same
content as the source.


How about I replace the first paragraph with:

At the end of the mirror job, the target exhibits exactly the same
content as the source, so it can replace the source node seamlessly.  It
cannot replace a BDS that differs in content, as that could lead to data
corruption.

?

>>>> +     */
>>>> +    if (!is_filtered_child(parent_bs, to_replace_bs) &&
>>>> +        !is_filtered_child(to_replace_bs, parent_bs))
>>>> +    {
>>>> +        error_setg(errp, "The node to be replaced must be connected to the "
>>>> +                   "source through filter nodes only");
>>>
>>> "and the one on top must be a filter, too" not mentioned in the error..
>>
>> Well, unless the source node is the node to be replaced.  Hm...  This
>> gets very hard to express.  I think I’d prefer to keep this as it is,
>> even though it is not quite correct, unless you have a better suggestion
>> of what to report. :-/
> 
> I can't imagine anything better than just adding "(and the one on top must be a filter, too)".

The problem is that “the one on top” wouldn’t sound very clear to me as
a user.

Maybe include the explanation à la “The node to be replaced must be
connected to the source through filter nodes only, so its data is the
exact same at all times”?  Maybe then users can guess what this
“connected” means exactly.

>>>>            to_replace_bs = NULL;
>>>>            goto out;
>>>>        }
>>>> diff --git a/blockdev.c b/blockdev.c
>>>> index 4e72f6f701..758e0b5431 100644
>>>> --- a/blockdev.c
>>>> +++ b/blockdev.c
>>>> @@ -3887,7 +3887,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>>>        }
>>>>    
>>>>        if (has_replaces) {
>>>> -        BlockDriverState *to_replace_bs;
>>>> +        BlockDriverState *to_replace_bs, *backing_bs;
>>>>            AioContext *replace_aio_context;
>>>>            int64_t bs_size, replace_size;
>>>>    
>>>> @@ -3897,7 +3897,37 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>>>                return;
>>>>            }
>>>>    
>>>> -        to_replace_bs = check_to_replace_node(bs, replaces, errp);
>>>> +        if (backing_mode == MIRROR_SOURCE_BACKING_CHAIN ||
>>>> +            backing_mode == MIRROR_OPEN_BACKING_CHAIN)
>>>> +        {
>>>> +            /*
>>>> +             * While we do not quite know what OPEN_BACKING_CHAIN
>>>> +             * (used for mode=existing) will yield, it is probably
>>>> +             * best to restrict it exactly like SOURCE_BACKING_CHAIN,
>>>> +             * because that is our best guess.
>>>> +             */
>>>> +            switch (sync) {
>>>> +            case MIRROR_SYNC_MODE_FULL:
>>>> +                backing_bs = NULL;
>>>> +                break;
>>>> +
>>>> +            case MIRROR_SYNC_MODE_TOP:
>>>> +                backing_bs = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs));
>>>
>>> why not  bdrv_backing_chain_next(bs) like in mirror_start?
>>
>> Good question.  I suppose it should be
>> bdrv_filtered_cow_bs(bdrv_backing_chain_next(bs)) in mirror_start()?
> 
> You mean bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs)), I hope. :)

Er, yes, sure.

>> Because with sync=top, we just want to remove the topmost COW node (and
>> filters on top), but keep filters behind it.
>>
> 
> Agreed.

OK.
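
(For reference, the sync=top choice agreed on above can be sketched on
a linear mock chain.  Everything with a mock_ prefix is hypothetical;
in particular, the second helper assumes that a backing-chain-next
style lookup also skips filters below the COW node, which is what the
discussion implies but is not shown in this patch.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical linear chain node; not BlockDriverState. */
typedef struct MockNode {
    const char *name;
    bool is_filter;
    /* Filtered child of a filter, or COW backing node of a format node */
    struct MockNode *child;
} MockNode;

/* Skip filters from @bs downwards; return the first non-filter or NULL. */
static MockNode *mock_skip_filters(MockNode *bs)
{
    while (bs && bs->is_filter) {
        bs = bs->child;
    }
    return bs;
}

/* COW backing node of a non-filter node, or NULL. */
static MockNode *mock_cow_bs(MockNode *bs)
{
    if (!bs || bs->is_filter) {
        return NULL;
    }
    return bs->child;
}

/*
 * sync=top as discussed: drop the topmost COW node and the filters on
 * top of it, but keep filters behind it.
 */
static MockNode *mock_top_backing_keep_filters(MockNode *bs)
{
    return mock_cow_bs(mock_skip_filters(bs));
}

/* Assumed alternative: additionally skips filters below the COW node. */
static MockNode *mock_top_backing_skip_filters(MockNode *bs)
{
    return mock_skip_filters(mock_cow_bs(mock_skip_filters(bs)));
}
```

On a chain throttle -> top -> cor -> base (throttle and cor being
filters), the first helper yields cor and the second yields base, so
only the first keeps the filter behind the topmost COW node.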

Max



* Re: [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node()
  2019-08-16 13:30         ` Max Reitz
@ 2019-08-16 14:24           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-16 14:24 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

16.08.2019 16:30, Max Reitz wrote:
> On 16.08.19 13:01, Vladimir Sementsov-Ogievskiy wrote:
>> 15.08.2019 20:01, Max Reitz wrote:
>>> On 15.08.19 17:21, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.08.2019 19:14, Max Reitz wrote:
>>>>> Currently, check_to_replace_node() only allows mirror to replace a node
>>>>> in the chain of the source node, and only if it is the first non-filter
>>>>> node below the source.  Well, technically, the idea is that you can
>>>>> exactly replace a quorum child by mirroring from quorum.
>>>>>
>>>>> This has (probably) two reasons:
>>>>> (1) We do not want to create loops.
>>>>> (2) @replaces and @device should have exactly the same content so
>>>>>        replacing them does not cause visible data to change.
>>>>>
>>>>> This has two issues:
>>>>> (1) It is overly restrictive.  It is completely fine for @replaces to be
>>>>>        a filter.
>>>>> (2) It is not restrictive enough.  You can create loops with this as
>>>>>        follows:
>>>>>
>>>>> $ qemu-img create -f qcow2 /tmp/source.qcow2 64M
>>>>> $ qemu-system-x86_64 -qmp stdio
>>>>> {"execute": "qmp_capabilities"}
>>>>> {"execute": "object-add",
>>>>>     "arguments": {"qom-type": "throttle-group", "id": "tg0"}}
>>>>> {"execute": "blockdev-add",
>>>>>     "arguments": {
>>>>>         "node-name": "source",
>>>>>         "driver": "throttle",
>>>>>         "throttle-group": "tg0",
>>>>>         "file": {
>>>>>             "node-name": "filtered",
>>>>>             "driver": "qcow2",
>>>>>             "file": {
>>>>>                 "driver": "file",
>>>>>                 "filename": "/tmp/source.qcow2"
>>>>>             } } } }
>>>>> {"execute": "drive-mirror",
>>>>>     "arguments": {
>>>>>         "job-id": "mirror",
>>>>>         "device": "source",
>>>>>         "target": "/tmp/target.qcow2",
>>>>>         "format": "qcow2",
>>>>>         "node-name": "target",
>>>>>         "sync" :"none",
>>>>>         "replaces": "filtered"
>>>>>     } }
>>>>> {"execute": "block-job-complete", "arguments": {"device": "mirror"}}
>>>>>
>>>>> And qemu crashes because of a stack overflow due to the loop being
>>>>> created (target's backing file is source, so when it replaces filtered,
>>>>> it points to itself through source).
>>>>>
>>>>> (blockdev-mirror can be broken similarly.)
>>>>>
>>>>> So let us make the checks for the two conditions above explicit, which
>>>>> makes the whole function exactly as restrictive as it needs to be.
>>>>>
>>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>>> ---
>>>>>     include/block/block.h |  1 +
>>>>>     block.c               | 83 +++++++++++++++++++++++++++++++++++++++----
>>>>>     blockdev.c            | 34 ++++++++++++++++--
>>>>>     3 files changed, 110 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/include/block/block.h b/include/block/block.h
>>>>> index 6ba853fb90..8da706cd89 100644
>>>>> --- a/include/block/block.h
>>>>> +++ b/include/block/block.h
>>>>> @@ -404,6 +404,7 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate);
>>>>>     
>>>>>     /* check if a named node can be replaced when doing drive-mirror */
>>>>>     BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>>>> +                                        BlockDriverState *backing_bs,
>>>>>                                             const char *node_name, Error **errp);
>>>>>     
>>>>>     /* async block I/O */
>>>>> diff --git a/block.c b/block.c
>>>>> index 915b80153c..4858d3e718 100644
>>>>> --- a/block.c
>>>>> +++ b/block.c
>>>>> @@ -6290,7 +6290,59 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate)
>>>>>         return false;
>>>>>     }
>>>>>     
>>>>> +static bool is_child_of(BlockDriverState *child, BlockDriverState *parent)
>>>>> +{
>>>>> +    BdrvChild *c;
>>>>> +
>>>>> +    if (!parent) {
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    QLIST_FOREACH(c, &parent->children, next) {
>>>>> +        if (c->bs == child || is_child_of(child, c->bs)) {
>>>>> +            return true;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    return false;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Return true if there are only filters in [@top, @base).  Note that
>>>>> + * this may include quorum (which bdrv_chain_contains() cannot
>>>>> + * handle).
>>>>
>>>> More precisely: return true if a chain of filters exists from top to base, or if
>>>> top == base.
>>>>
>>>> I have the backup-top filter in mind:
>>>>
>>>> [backup-top]
>>>> |          \target
>>>
>>> backup-top can’t be a filter if it has two children with different
>>> contents, though.
>>
>> Why? The target is a special child, unrelated to what is read/written through
>> backup-top. It is backup-top's own business.
>>
>>>
>>> (commit-top and mirror-top aren’t filters either.)
>>
>> Hm, did I miss something? They have is_filter = true, and their children are
>> considered filtered-rw children in your series, aren't they? If they are not
>> filters, then what are they? Format nodes? And how do they appear in backing
>> chains then?
> 
> Er, right, I remember, I made them filters in patch 1 of this series. m( :-)
> 
> But the chain would still be unique, in a sense, because backup-top only
> has one filtered child, so you could go down the chain with
> bdrv_filtered_rw_child().
> 
> This function doesn’t do that because of Quorum, which is actually a
> better example.  All of its children are filtered, so we must consider
> all of them.
> 
> But backup-top is actually a reason why this function is wrong as it is;
> the target is not a filtered child, so it shouldn’t return true there.
> 
> Hmmmm.
> 
> Actually, bdrv_recurse_is_first_non_filter() does nearly what we want.
> (Which is why it was used here.)  The only problem is that it expects
> @candidate to be a non-filter (as the name implies).  But we don’t care
> about that, actually.
> 
> I suppose I can just turn bdrv_recurse_is_first_non_filter() into
> bdrv_is_child_of(); it has only two callers, one is here, the other is
> bdrv_is_first_non_filter().  In the latter, we can just check whether
> @candidate is a filter and return false if it isn’t.

Wow, I went and looked: bdrv_recurse_is_first_non_filter() is something complicated,
backed by a driver callback that differs between filters 0_O. It seems to generate
some kind of chain which is not the backing chain. Ohh.

Hmm, but what you are proposing will break the "_first_" part of the interface: the
node may be a child, and we may check whether it is a filter or not, but how do we
check that it is exactly the "first" one? OK, anyway, if you are going to rewrite it
somehow, I can wait and then look at the new patch to understand.
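
To make the backup-top concern concrete, here is a toy Python model (not QEMU
code; the Node class and both function names are made up for illustration).
It contrasts the recursion as written in the patch, which descends into every
child, with a variant that only follows children marked as filtered:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(eq=False)
class Node:
    """Toy stand-in for a BlockDriverState (hypothetical, not QEMU's struct)."""
    name: str
    is_filter: bool = False
    # Each entry is (child node, True if it is a filtered child of this node).
    children: List[Tuple["Node", bool]] = field(default_factory=list)


def reaches_via_any_child(top, base):
    """Same shape as the patch's is_filtered_child(): follows every child."""
    if top is None:
        return False
    if top is base:
        return True
    if not top.is_filter:
        return False
    return any(reaches_via_any_child(c, base) for c, _ in top.children)


def reaches_via_filtered_children(top, base):
    """Variant that only descends into children marked as filtered."""
    if top is None:
        return False
    if top is base:
        return True
    if not top.is_filter:
        return False
    return any(reaches_via_filtered_children(c, base)
               for c, filtered in top.children if filtered)


# backup-top: one filtered child (source) and one special child (target).
source = Node("source")
target = Node("target")
backup_top = Node("backup-top", is_filter=True,
                  children=[(source, True), (target, False)])

# Quorum-like node: all of its children are filtered.
q_a, q_b = Node("q-a"), Node("q-b")
quorum = Node("quorum", is_filter=True, children=[(q_a, True), (q_b, True)])
```

In this model the first function reports the target as reachable from
backup-top, while the second does not; both agree for the Quorum-like node.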

> 
>>> That’s why there must be a unique chain [@top, @base).
>>>
>>> I should probably note that it will return true if top == base, though, yes.
>>>
>>>> |backing    -------->[target]
>>>> V                    /
>>>> [source]  <---------/backing
>>>>
>>>>> + */
>>>>> +static bool is_filtered_child(BlockDriverState *top, BlockDriverState *base)
>>>>> +{
>>>>> +    BdrvChild *c;
>>>>> +
>>>>> +    if (!top) {
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    if (top == base) {
>>>>> +        return true;
>>>>> +    }
>>>>> +
>>>>> +    if (!top->drv->is_filter) {
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    QLIST_FOREACH(c, &top->children, next) {
>>>>> +        if (is_filtered_child(c->bs, base)) {
>>>>> +            return true;
>>>>> +        }
>>>>> +    }
>>>>
>>>> Interesting: I wonder whether it would be better to somehow reuse the DFS written in should_update_child()...
>>>> [Just a note, please don't do it in this series.]
>>>>
>>>>> +
>>>>> +    return false;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * @parent_bs is mirror's source BDS, @backing_bs is the BDS which
>>>>> + * will be attached to the target when mirror completes.
>>>>> + */
>>>>>     BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>>>> +                                        BlockDriverState *backing_bs,
>>>>>                                             const char *node_name, Error **errp)
>>>>>     {
>>>>>         BlockDriverState *to_replace_bs = bdrv_find_node(node_name);
>>>>> @@ -6309,13 +6361,32 @@ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>>>>>             goto out;
>>>>>         }
>>>>>     
>>>>> -    /* We don't want arbitrary node of the BDS chain to be replaced only the top
>>>>> -     * most non filter in order to prevent data corruption.
>>>>> -     * Another benefit is that this tests exclude backing files which are
>>>>> -     * blocked by the backing blockers.
>>>>> +    /*
>>>>> +     * If to_replace_bs is (recursively) a child of backing_bs,
>>>>> +     * replacing it may create a loop.  We cannot allow that.
>>>>>          */
>>>>> -    if (!bdrv_recurse_is_first_non_filter(parent_bs, to_replace_bs)) {
>>>>> -        error_setg(errp, "Only top most non filter can be replaced");
>>>>> +    if (to_replace_bs == backing_bs || is_child_of(to_replace_bs, backing_bs)) {
>>>>
>>>> The first condition is covered by the second, so the first may be omitted.
>>>
>>> It is not.  is_child_of() does not return true if child == parent.
>>>
>>>>> +        error_setg(errp, "Replacing this node would result in a loop");
>>>>> +        to_replace_bs = NULL;
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    /*
>>>>> +     * Mirror is designed in such a way that when it completes, the
>>>>> +     * source BDS is seamlessly replaced.
>>>>
>>>> Not source but to_replace_bs is replaced?
>>>
>>> It has originally been designed to replace the source.  If it could
>>> replace any arbitrary BDS, all of this would be moot.
>>
>> What about the quorum child you mention in the commit message?
> 
> Which is not any arbitrary BDS, but one that looks exactly like the source.
> 
> My point is, mirror has been *designed* to replace the source
> seamlessly.  It can do more things today, but that was its original point.
> 
> That means that the target must be exactly the same as the source.  And
> then we come to this:
> 
>>>>> It is therefore not allowed
>>>>> +     * to replace a BDS where this condition would be violated, as that
>>>>> +     * would defeat the purpose of mirror and could lead to data
>>>>> +     * corruption.
>>>>> +     * Therefore, between parent_bs and to_replace_bs there may be
>>>>> +     * only filters (and the one on top must be a filter, too), so
>>>>> +     * their data always stays in sync and mirror can complete and
>>>>> +     * replace to_replace_bs without any possible corruptions.
> 
> So replacing a node that’s connected to the source only through filters
> is fine because that means the replaced node will also have the same
> content as the source.
> 
> 
> How about I replace the first paragraph with:
> 
> At the end of the mirror job, the target exhibits exactly the same
> content as the source, so it can replace the source node seamlessly.  It
> cannot replace a BDS that differs in content, as that could lead to data
> corruption.
> 

OK for me, thanks

> 
>>>>> +     */
>>>>> +    if (!is_filtered_child(parent_bs, to_replace_bs) &&
>>>>> +        !is_filtered_child(to_replace_bs, parent_bs))
>>>>> +    {
>>>>> +        error_setg(errp, "The node to be replaced must be connected to the "
>>>>> +                   "source through filter nodes only");
>>>>
>>>> "and the one on top must be a filter, too" is not mentioned in the error.
>>>
>>> Well, unless the source node is the node to be replaced.  Hm...  This
>>> gets very hard to express.  I think I’d prefer to keep this as it is,
>>> even though it is not quite correct, unless you have a better suggestion
>>> of what to report. :-/
>>
>> I can't imagine anything better than just adding "(and the one on top must be a filter, too)"
> 
> The problem is that “the one on top” wouldn’t sound very clear to me as
> a user.
> 
> Maybe include the explanation à la “The node to be replaced must be
> connected to the source through filter nodes only, so its data is the
> exact same at all times”?  Maybe then users can guess what this
> “connected” means exactly.

Hmm, what about: "There may be no more than one data node in the whole chain
from the node to be replaced to the source node, and all others must be filters, to
be sure that the data of the source node and the node to be replaced are the same at
all times"?

For sure, Eric is the best at such exercises.

But, anyway, I'm OK with your variant, it's all a kind of nit-picking.
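
For what the wording has to cover, here is a toy Python model of the two
checks in this hunk (not QEMU code; Node and check_to_replace are made-up
names, and the recursion follows every child exactly as the patch does):

```python
class Node:
    """Toy graph node (hypothetical; not QEMU's BlockDriverState)."""

    def __init__(self, name, is_filter=False, children=()):
        self.name = name
        self.is_filter = is_filter
        self.children = list(children)


def is_child_of(child, parent):
    """True only for strict descendants; child == parent yields False."""
    if parent is None:
        return False
    return any(c is child or is_child_of(child, c) for c in parent.children)


def is_filtered_child(top, base):
    """True if top == base, or only filters lie in [top, base)."""
    if top is None:
        return False
    if top is base:
        return True
    if not top.is_filter:
        return False
    return any(is_filtered_child(c, base) for c in top.children)


def check_to_replace(source, backing, to_replace):
    """Return an error string, or None if the replacement is acceptable."""
    # is_child_of() is strict, so equality has to be tested separately.
    if to_replace is backing or is_child_of(to_replace, backing):
        return "Replacing this node would result in a loop"
    if (not is_filtered_child(source, to_replace)
            and not is_filtered_child(to_replace, source)):
        return ("The node to be replaced must be connected to the "
                "source through filter nodes only")
    return None


# throttle (filter) -> fmt (format node) -> base (format node)
base = Node("base")
fmt = Node("fmt", children=[base])
throttle = Node("throttle", is_filter=True, children=[fmt])
```

With this chain, replacing fmt while mirroring from throttle passes, and so
does replacing throttle while mirroring from fmt. Replacing base while
mirroring from fmt is rejected, because fmt sits on top and is not a filter.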

> 
>>>>>             to_replace_bs = NULL;
>>>>>             goto out;
>>>>>         }
>>>>> diff --git a/blockdev.c b/blockdev.c
>>>>> index 4e72f6f701..758e0b5431 100644
>>>>> --- a/blockdev.c
>>>>> +++ b/blockdev.c
>>>>> @@ -3887,7 +3887,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>>>>         }
>>>>>     
>>>>>         if (has_replaces) {
>>>>> -        BlockDriverState *to_replace_bs;
>>>>> +        BlockDriverState *to_replace_bs, *backing_bs;
>>>>>             AioContext *replace_aio_context;
>>>>>             int64_t bs_size, replace_size;
>>>>>     
>>>>> @@ -3897,7 +3897,37 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>>>>                 return;
>>>>>             }
>>>>>     
>>>>> -        to_replace_bs = check_to_replace_node(bs, replaces, errp);
>>>>> +        if (backing_mode == MIRROR_SOURCE_BACKING_CHAIN ||
>>>>> +            backing_mode == MIRROR_OPEN_BACKING_CHAIN)
>>>>> +        {
>>>>> +            /*
>>>>> +             * While we do not quite know what OPEN_BACKING_CHAIN
>>>>> +             * (used for mode=existing) will yield, it is probably
>>>>> +             * best to restrict it exactly like SOURCE_BACKING_CHAIN,
>>>>> +             * because that is our best guess.
>>>>> +             */
>>>>> +            switch (sync) {
>>>>> +            case MIRROR_SYNC_MODE_FULL:
>>>>> +                backing_bs = NULL;
>>>>> +                break;
>>>>> +
>>>>> +            case MIRROR_SYNC_MODE_TOP:
>>>>> +                backing_bs = bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs));
>>>>
>>>> Why not bdrv_backing_chain_next(bs), like in mirror_start()?
>>>
>>> Good question.  I suppose it should be
>>> bdrv_filtered_cow_bs(bdrv_backing_chain_next(bs)) in mirror_start()?
>>
>> You mean bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs)), I hope)
> 
> Er, yes, sure.
> 
>>> Because with sync=top, we just want to remove the topmost COW node (and
>>> filters on top), but keep filters behind it.
>>>
>>
>> Agreed.
> 
> OK.
> 
> Max
> 
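
To spell out the difference for sync=top, here is a toy Python model (not
QEMU code). The three helpers only imitate the names used above; in
particular, composing backing_chain_next() as skip, COW child, skip is my
assumption about how the series defines it:

```python
class Node:
    """Toy node: a filter has 'filtered' set, a COW overlay has 'backing'."""

    def __init__(self, name, filtered=None, backing=None):
        self.name = name
        self.filtered = filtered  # filtered child (filter nodes only)
        self.backing = backing    # COW backing child (format nodes only)


def skip_rw_filters(bs):
    """Walk down through filter nodes until a non-filter is reached."""
    while bs is not None and bs.filtered is not None:
        bs = bs.filtered
    return bs


def filtered_cow_bs(bs):
    """Return the COW backing child, or None if there is none."""
    return bs.backing if bs is not None else None


def backing_chain_next(bs):
    """Assumed composition: skip filters, take the COW child, skip again."""
    return skip_rw_filters(filtered_cow_bs(skip_rw_filters(bs)))


# filter-a -> top -> filter-b -> mid -> base
base = Node("base")
mid = Node("mid", backing=base)
filter_b = Node("filter-b", filtered=mid)
top = Node("top", backing=filter_b)
filter_a = Node("filter-a", filtered=top)
```

In this model filtered_cow_bs(skip_rw_filters(filter_a)) yields filter-b, so
the filter behind the topmost COW node is kept, whereas
backing_chain_next(filter_a) yields mid and drops it.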


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 37/42] block: Leave BDS.backing_file constant
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 37/42] block: Leave BDS.backing_file constant Max Reitz
@ 2019-08-16 16:16   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-16 16:16 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:14, Max Reitz wrote:
> Parts of the block layer treat BDS.backing_file as if it were whatever
> the image header says (i.e., if it is a relative path, it is relative to
> the overlay), other parts treat it like a cache for
> bs->backing->bs->filename (relative paths are relative to the CWD).
> Considering bs->backing->bs->filename exists, let us make it mean the
> former.
> 
> Among other things, this now allows the user to specify a base when
> using qemu-img to commit an image file in a directory that is not the
> CWD (assuming everything uses relative filenames).
> 
> Before this patch:
> 
> $ ./qemu-img create -f qcow2 foo/bot.qcow2 1M
> $ ./qemu-img create -f qcow2 -b bot.qcow2 foo/mid.qcow2
> $ ./qemu-img create -f qcow2 -b mid.qcow2 foo/top.qcow2
> $ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
> qemu-img: Did not find 'mid.qcow2' in the backing chain of 'foo/top.qcow2'
> $ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
> qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
> $ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
> qemu-img: Did not find '[...]/foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'

Nothing works.

> 
> After this patch:
> 
> $ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
> Image committed.
> $ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
> qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
> $ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
> Image committed.

Something works. However, it seems that the variant that still fails
(-b foo/mid.qcow2) is actually the one a user is most likely to run. Anyway,
something is better than nothing.
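
My reading of why the middle variant still fails, as a toy Python sketch (the
helper names are made up, and this is only a guess at the behaviour, not
qemu-img's actual lookup): the name given with -b is interpreted the way the
image header would be, i.e. relative to the overlay's directory.

```python
import posixpath


def resolve_against_overlay(overlay_path, backing_name):
    """Resolve a backing name relative to the overlay's directory."""
    if posixpath.isabs(backing_name):
        return posixpath.normpath(backing_name)
    overlay_dir = posixpath.dirname(overlay_path)
    return posixpath.normpath(posixpath.join(overlay_dir, backing_name))


def resolve_against_cwd(cwd, name):
    """Resolve a name relative to the current working directory."""
    return posixpath.normpath(posixpath.join(cwd, name))


overlay = "foo/top.qcow2"
header_style = resolve_against_overlay(overlay, "mid.qcow2")
doubled = resolve_against_overlay(overlay, "foo/mid.qcow2")
```

Under that assumption "mid.qcow2" resolves to foo/mid.qcow2 and matches, while
"foo/mid.qcow2" resolves to foo/foo/mid.qcow2 and does not.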

> 
> With this change, bdrv_find_backing_image() must look at whether the
> user has overridden a BDS's backing file.  If so, it can no longer use
> bs->backing_file, but must instead compare the given filename against
> the backing node's filename directly.
> 
> Note that this changes the QAPI output for a node's backing_file.  We
> had very inconsistent output there (sometimes what the image header
> said, sometimes the actual filename of the backing image).  This
> inconsistent output was effectively useless, so we have to decide one
> way or the other.  Considering that bs->backing_file usually at runtime
> contained the path to the image relative to qemu's CWD (or absolute),
> this patch changes QAPI's backing_file to always report the
> bs->backing->bs->filename from now on.  If you want to receive the image
> header information, you have to refer to full-backing-filename.
> 
> This necessitates a change to iotest 228.  The interesting information
> it really wanted is the image header, and it can get that now, but it
> has to use full-backing-filename instead of backing_file.  Because of
> this patch's changes to bs->backing_file's behavior, we also need some
> reference output changes.
> 
> Along with the changes to bs->backing_file, stop updating
> BDS.backing_format in bdrv_backing_attach() as well.  This necessitates
> a change to the reference output of iotest 191.
> 
> iotest 245 changes in behavior: With the backing node no longer
> overriding the parent node's backing_file string, you can now omit the
> @backing option when reopening a node with neither a default nor a
> current backing file even if it used to have a backing node at some
> point.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   include/block/block_int.h  | 19 ++++++++++++++-----
>   block.c                    | 35 ++++++++++++++++++++++++++++-------
>   block/qapi.c               |  7 ++++---
>   tests/qemu-iotests/191.out |  1 -
>   tests/qemu-iotests/228     |  6 +++---
>   tests/qemu-iotests/228.out |  6 +++---
>   tests/qemu-iotests/245     |  4 +++-
>   7 files changed, 55 insertions(+), 23 deletions(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 42ee2fcf7f..993bafc090 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -784,11 +784,20 @@ struct BlockDriverState {
>       bool walking_aio_notifiers; /* to make removal during iteration safe */
>   
>       char filename[PATH_MAX];
> -    char backing_file[PATH_MAX]; /* if non zero, the image is a diff of
> -                                    this file image */
> -    /* The backing filename indicated by the image header; if we ever
> -     * open this file, then this is replaced by the resulting BDS's
> -     * filename (i.e. after a bdrv_refresh_filename() run). */
> +    /*
> +     * If not empty, this image is a diff in relation to backing_file.
> +     * Note that this is the name given in the image header

Is it synced when the image header is updated? If yes, it is not constant; if not, it is just wrong.

> and
> +     * therefore may or may not be equal to .backing->bs->filename.
> +     * If this field contains a relative path, it is to be resolved
> +     * relatively to the overlay's location.
> +     */
> +    char backing_file[PATH_MAX];
> +    /*
> +     * The backing filename indicated by the image header.  Contrary
> +     * to backing_file, if we ever open this file, auto_backing_file

Preexisting, but from a fresh look (I don't know the logic around backing_file and
auto_backing_file, and assume these are strange fields that would better not exist),
this is hard to understand.

"open this file": which one? This BDS, or its backing?

> +     * is replaced by the resulting BDS's filename (i.e. after a
> +     * bdrv_refresh_filename() run).
> +     */
>       char auto_backing_file[PATH_MAX];
>       char backing_format[16]; /* if non-zero and backing_file exists */
>   
> diff --git a/block.c b/block.c
> index 4858d3e718..88533fa0d3 100644
> --- a/block.c
> +++ b/block.c
> @@ -78,6 +78,8 @@ static BlockDriverState *bdrv_open_inherit(const char *filename,
>                                              const BdrvChildRole *child_role,
>                                              Error **errp);
>   
> +static bool bdrv_backing_overridden(BlockDriverState *bs);
> +
>   /* If non-zero, use only whitelisted block drivers */
>   static int use_bdrv_whitelist;
>   
> @@ -1065,10 +1067,6 @@ static void bdrv_backing_attach(BdrvChild *c)
>       bdrv_refresh_filename(backing_hd);
>   
>       parent->open_flags &= ~BDRV_O_NO_BACKING;
> -    pstrcpy(parent->backing_file, sizeof(parent->backing_file),
> -            backing_hd->filename);
> -    pstrcpy(parent->backing_format, sizeof(parent->backing_format),
> -            backing_hd->drv ? backing_hd->drv->format_name : "");
>   
>       bdrv_op_block_all(backing_hd, parent->backing_blocker);
>       /* Otherwise we won't be able to commit or stream */
> @@ -5294,6 +5292,7 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
>       char *backing_file_full = NULL;
>       char *filename_tmp = NULL;
>       int is_protocol = 0;
> +    bool filenames_refreshed = false;
>       BlockDriverState *curr_bs = NULL;
>       BlockDriverState *retval = NULL;
>   
> @@ -5318,9 +5317,31 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
>       {
>           BlockDriverState *bs_below = bdrv_backing_chain_next(curr_bs);
>   
> -        /* If either of the filename paths is actually a protocol, then
> -         * compare unmodified paths; otherwise make paths relative */
> -        if (is_protocol || path_has_protocol(curr_bs->backing_file)) {
> +        if (bdrv_backing_overridden(curr_bs)) {
> +            /*
> +             * If the backing file was overridden, we can only compare
> +             * directly against the backing node's filename.
> +             */
> +
> +            if (!filenames_refreshed) {
> +                /*
> +                 * This will automatically refresh all of the
> +                 * filenames in the rest of the backing chain, so we
> +                 * only need to do this once.
> +                 */
> +                bdrv_refresh_filename(bs_below);
> +                filenames_refreshed = true;
> +            }
> +
> +            if (strcmp(backing_file, bs_below->filename) == 0) {

Don't you want to try realpath() here too?

And I wonder: why can't we always assume bdrv_backing_overridden() = true (keeping in
mind that it may give false positives, as the comment says)?

Just go through the backing chain and check

searched_file == bs->filename || realpath(searched_file) == realpath(bs->filename) ?

Do we really need to cover the case where something is relative to the parent? I'm
not sure, but it seems that libvirt always uses absolute paths. And if we do need it,
just add to this condition:

|| realpath(joinpath(prev_bs->filename, searched_file)) == realpath(bs->filename))
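
Roughly what I mean, as a toy Python sketch over plain filename lists (not
QEMU code; same_file and find_in_chain are made-up names). One assumption:
the join uses the directory of the overlay's filename, which is what I
intended with joinpath() above:

```python
import os


def same_file(a, b):
    """Compare two names literally, then after resolving them with realpath."""
    return a == b or os.path.realpath(a) == os.path.realpath(b)


def find_in_chain(chain, searched):
    """chain lists filenames from top to base; return the matching index.

    The top image itself (index 0) is never a candidate. Returns None if
    nothing in the backing chain matches.
    """
    for i in range(1, len(chain)):
        candidate = chain[i]
        overlay_dir = os.path.dirname(chain[i - 1])
        relative_to_overlay = os.path.join(overlay_dir, searched)
        if (same_file(searched, candidate)
                or same_file(relative_to_overlay, candidate)):
            return i
    return None


chain = ["foo/top.qcow2", "foo/mid.qcow2", "foo/bot.qcow2"]
```

With such a loop all three spellings from the commit message (mid.qcow2,
foo/mid.qcow2 and the absolute path) would find the same node. Whether
accepting both interpretations is desirable is a separate question.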


> +                retval = bs_below;
> +                break;
> +            }
> +        } else if (is_protocol || path_has_protocol(curr_bs->backing_file)) {
> +            /*
> +             * If either of the filename paths is actually a protocol, then
> +             * compare unmodified paths; otherwise make paths relative.
> +             */
>               char *backing_file_full_ret;
>   
>               if (strcmp(backing_file, curr_bs->backing_file) == 0) {
> diff --git a/block/qapi.c b/block/qapi.c
> index 4f59ac1c0f..751c3e695a 100644
> --- a/block/qapi.c
> +++ b/block/qapi.c
> @@ -45,7 +45,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
>                                           BlockDriverState *bs, Error **errp)
>   {
>       ImageInfo **p_image_info;
> -    BlockDriverState *bs0;
> +    BlockDriverState *bs0, *backing;
>       BlockDeviceInfo *info;
>   
>       if (!bs->drv) {
> @@ -74,9 +74,10 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
>           info->node_name = g_strdup(bs->node_name);
>       }
>   
> -    if (bs->backing_file[0]) {
> +    backing = bdrv_filtered_cow_bs(bs);
> +    if (backing) {
>           info->has_backing_file = true;
> -        info->backing_file = g_strdup(bs->backing_file);
> +        info->backing_file = g_strdup(backing->filename);
>       }
>   
>       if (!QLIST_EMPTY(&bs->dirty_bitmaps)) {
> diff --git a/tests/qemu-iotests/191.out b/tests/qemu-iotests/191.out
> index 3fc92bb56e..0b3c216b0c 100644
> --- a/tests/qemu-iotests/191.out
> +++ b/tests/qemu-iotests/191.out
> @@ -605,7 +605,6 @@ wrote 65536/65536 bytes at offset 1048576
>                       "backing-filename": "TEST_DIR/t.IMGFMT.base",
>                       "dirty-flag": false
>                   },
> -                "backing-filename-format": "IMGFMT",
>                   "virtual-size": 67108864,
>                   "filename": "TEST_DIR/t.IMGFMT.ovl3",
>                   "cluster-size": 65536,
> diff --git a/tests/qemu-iotests/228 b/tests/qemu-iotests/228
> index 9a50afd205..a1f3187212 100755
> --- a/tests/qemu-iotests/228
> +++ b/tests/qemu-iotests/228
> @@ -34,7 +34,7 @@ def log_node_info(node):
>   
>       log('bs->filename: ' + node['image']['filename'],
>           filters=[filter_testfiles, filter_imgfmt])
> -    log('bs->backing_file: ' + node['backing_file'],
> +    log('bs->backing_file: ' + node['image']['full-backing-filename'],
>           filters=[filter_testfiles, filter_imgfmt])
>   
>       if 'backing-image' in node['image']:
> @@ -70,8 +70,8 @@ with iotests.FilePath('base.img') as base_img_path, \
>                   },
>                   filters=[filter_qmp_testfiles, filter_qmp_imgfmt])
>   
> -    # Filename should be plain, and the backing filename should not
> -    # contain the "file:" prefix
> +    # Filename should be plain, and the backing node filename should
> +    # not contain the "file:" prefix
>       log_node_info(vm.node_info('node0'))
>   
>       vm.qmp_log('blockdev-del', node_name='node0')
> diff --git a/tests/qemu-iotests/228.out b/tests/qemu-iotests/228.out
> index 4217df24fe..8c82009abe 100644
> --- a/tests/qemu-iotests/228.out
> +++ b/tests/qemu-iotests/228.out
> @@ -4,7 +4,7 @@
>   {"return": {}}
>   
>   bs->filename: TEST_DIR/PID-top.img
> -bs->backing_file: TEST_DIR/PID-base.img
> +bs->backing_file: file:TEST_DIR/PID-base.img
>   bs->backing->bs->filename: TEST_DIR/PID-base.img
>   
>   {"execute": "blockdev-del", "arguments": {"node-name": "node0"}}
> @@ -41,7 +41,7 @@ bs->backing->bs->filename: TEST_DIR/PID-base.img
>   {"return": {}}
>   
>   bs->filename: TEST_DIR/PID-top.img
> -bs->backing_file: TEST_DIR/PID-base.img
> +bs->backing_file: file:TEST_DIR/PID-base.img
>   bs->backing->bs->filename: TEST_DIR/PID-base.img
>   
>   {"execute": "blockdev-del", "arguments": {"node-name": "node0"}}
> @@ -55,7 +55,7 @@ bs->backing->bs->filename: TEST_DIR/PID-base.img
>   {"return": {}}
>   
>   bs->filename: json:{"backing": {"driver": "null-co"}, "driver": "IMGFMT", "file": {"driver": "file", "filename": "TEST_DIR/PID-top.img"}}
> -bs->backing_file: null-co://
> +bs->backing_file: TEST_DIR/PID-base.img
>   bs->backing->bs->filename: null-co://
>   
>   {"execute": "blockdev-del", "arguments": {"node-name": "node0"}}
> diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245
> index bc1ceb9792..049ef6a71f 100644
> --- a/tests/qemu-iotests/245
> +++ b/tests/qemu-iotests/245
> @@ -722,7 +722,9 @@ class TestBlockdevReopen(iotests.QMPTestCase):
>   
>           # Detach hd2 from hd0.
>           self.reopen(opts, {'backing': None})
> -        self.reopen(opts, {}, "backing is missing for 'hd0'")
> +
> +        # Without a backing file, we can omit 'backing' again
> +        self.reopen(opts)
>   
>           # Remove both hd0 and hd2
>           result = self.vm.qmp('blockdev-del', conv_keys = True, node_name = 'hd0')
> 


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 38/42] iotests: Let complete_and_wait() work with commit
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 38/42] iotests: Let complete_and_wait() work with commit Max Reitz
@ 2019-08-23  5:59   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-23  5:59 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:14, Max Reitz wrote:
> complete_and_wait() and wait_ready() currently only work for mirror
> jobs.  Let them work for active commit jobs, too.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   tests/qemu-iotests/iotests.py | 10 +++++++---
>   1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
> index 84438e837c..3ef846d1dc 100644
> --- a/tests/qemu-iotests/iotests.py
> +++ b/tests/qemu-iotests/iotests.py
> @@ -761,8 +761,12 @@ class QMPTestCase(unittest.TestCase):
>   
>       def wait_ready(self, drive='drive0'):
>           '''Wait until a block job BLOCK_JOB_READY event'''
> -        f = {'data': {'type': 'mirror', 'device': drive } }
> -        event = self.vm.event_wait(name='BLOCK_JOB_READY', match=f)
> +        event = self.vm.events_wait([
> +                ('BLOCK_JOB_READY',
> +                 {'data': {'type': 'mirror', 'device': drive } }),
> +                ('BLOCK_JOB_READY',
> +                 {'data': {'type': 'commit', 'device': drive } })
> +            ])
>   
>       def wait_ready_and_cancel(self, drive='drive0'):
>           self.wait_ready(drive=drive)
> @@ -780,7 +784,7 @@ class QMPTestCase(unittest.TestCase):
>           self.assert_qmp(result, 'return', {})
>   
>           event = self.wait_until_completed(drive=drive)
> -        self.assert_qmp(event, 'data/type', 'mirror')
> +        self.assertTrue(event['data']['type'] in ['mirror', 'commit'])
>   
>       def pause_wait(self, job_id='job0'):
>           with Timeout(1, "Timeout waiting for job to pause"):
> 
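
For readers who don't know the helper, here is the idea behind waiting for
either job type, as a toy Python sketch. It is not the iotests
implementation; event_matches, first_matching and ready_patterns are made-up
names, and events are plain dicts:

```python
def contains(actual, expected):
    """True if every key in expected appears in actual with a matching value."""
    if isinstance(expected, dict):
        return (isinstance(actual, dict)
                and all(k in actual and contains(actual[k], v)
                        for k, v in expected.items()))
    return actual == expected


def event_matches(event, name, match):
    """True if the event has the given name and matches the pattern."""
    return event.get('event') == name and contains(event, match)


def first_matching(events, patterns):
    """Return the first event matching any (name, match) pair, else None."""
    for event in events:
        if any(event_matches(event, name, match) for name, match in patterns):
            return event
    return None


def ready_patterns(drive='drive0'):
    """Patterns accepting BLOCK_JOB_READY from a mirror or a commit job."""
    return [('BLOCK_JOB_READY', {'data': {'type': job_type, 'device': drive}})
            for job_type in ('mirror', 'commit')]


events = [
    {'event': 'JOB_STATUS_CHANGE', 'data': {'status': 'ready', 'id': 'drive0'}},
    {'event': 'BLOCK_JOB_READY',
     'data': {'type': 'commit', 'device': 'drive0', 'len': 0}},
]
```

A mirror-only pattern would skip the commit job's READY event in this list;
the two-pattern list picks it up.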


Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters Max Reitz
  2019-08-12 11:09   ` Vladimir Sementsov-Ogievskiy
@ 2019-08-31  9:57   ` Vladimir Sementsov-Ogievskiy
  2019-09-02 14:35     ` Max Reitz
  2019-09-13 12:55   ` Kevin Wolf
  2 siblings, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-31  9:57 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> This includes some permission limiting (for example, we only need to
> take the RESIZE permission for active commits where the base is smaller
> than the top).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block/mirror.c | 117 ++++++++++++++++++++++++++++++++++++++-----------
>   blockdev.c     |  47 +++++++++++++++++---
>   2 files changed, 131 insertions(+), 33 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 54bafdf176..6ddbfb9708 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -42,6 +42,7 @@ typedef struct MirrorBlockJob {
>       BlockBackend *target;
>       BlockDriverState *mirror_top_bs;
>       BlockDriverState *base;
> +    BlockDriverState *base_overlay;
>   
>       /* The name of the graph node to replace */
>       char *replaces;
> @@ -665,8 +666,10 @@ static int mirror_exit_common(Job *job)
>                                &error_abort);
>       if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
>           BlockDriverState *backing = s->is_none_mode ? src : s->base;
> -        if (backing_bs(target_bs) != backing) {
> -            bdrv_set_backing_hd(target_bs, backing, &local_err);
> +        BlockDriverState *unfiltered_target = bdrv_skip_rw_filters(target_bs);
> +
> +        if (bdrv_filtered_cow_bs(unfiltered_target) != backing) {
> +            bdrv_set_backing_hd(unfiltered_target, backing, &local_err);
>               if (local_err) {
>                   error_report_err(local_err);
>                   ret = -EPERM;
> @@ -715,7 +718,7 @@ static int mirror_exit_common(Job *job)
>        * valid.
>        */
>       block_job_remove_all_bdrv(bjob);
> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>   
>       /* We just changed the BDS the job BB refers to (with either or both of the
>        * bdrv_replace_node() calls), so switch the BB back so the cleanup does
> @@ -812,7 +815,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
>               return 0;
>           }
>   
> -        ret = bdrv_is_allocated_above(bs, base, false, offset, bytes, &count);
> +        ret = bdrv_is_allocated_above(bs, s->base_overlay, true, offset, bytes,
> +                                      &count);
>           if (ret < 0) {
>               return ret;
>           }
> @@ -908,7 +912,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
>       } else {
>           s->target_cluster_size = BDRV_SECTOR_SIZE;
>       }
> -    if (backing_filename[0] && !target_bs->backing &&
> +    if (backing_filename[0] && !bdrv_backing_chain_next(target_bs) &&
>           s->granularity < s->target_cluster_size) {
>           s->buf_size = MAX(s->buf_size, s->target_cluster_size);
>           s->cow_bitmap = bitmap_new(length);
> @@ -1088,8 +1092,9 @@ static void mirror_complete(Job *job, Error **errp)
>       if (s->backing_mode == MIRROR_OPEN_BACKING_CHAIN) {
>           int ret;
>   
> -        assert(!target->backing);
> -        ret = bdrv_open_backing_file(target, NULL, "backing", errp);
> +        assert(!bdrv_backing_chain_next(target));

Preexisting, but it seems we may crash here: I don't see where this is checked
beforehand so that we return an error if the target has a backing node. And even if
we did check, nothing prevents a backing node from being attached to the target while
the mirror job is running.

> +        ret = bdrv_open_backing_file(bdrv_skip_rw_filters(target), NULL,
> +                                     "backing", errp);
>           if (ret < 0) {
>               return;
>           }
> @@ -1531,8 +1536,8 @@ static BlockJob *mirror_start_job(
>       MirrorBlockJob *s;
>       MirrorBDSOpaque *bs_opaque;
>       BlockDriverState *mirror_top_bs;
> -    bool target_graph_mod;
>       bool target_is_backing;
> +    uint64_t target_perms, target_shared_perms;
>       Error *local_err = NULL;
>       int ret;
>   
> @@ -1551,7 +1556,7 @@ static BlockJob *mirror_start_job(
>           buf_size = DEFAULT_MIRROR_BUF_SIZE;
>       }
>   
> -    if (bs == target) {
> +    if (bdrv_skip_rw_filters(bs) == bdrv_skip_rw_filters(target)) {
>           error_setg(errp, "Can't mirror node into itself");
>           return NULL;
>       }
> @@ -1615,15 +1620,50 @@ static BlockJob *mirror_start_job(
>        * In the case of active commit, things look a bit different, though,
>        * because the target is an already populated backing file in active use.
>        * We can allow anything except resize there.*/
> +
> +    target_perms = BLK_PERM_WRITE;
> +    target_shared_perms = BLK_PERM_WRITE_UNCHANGED;
> +
>       target_is_backing = bdrv_chain_contains(bs, target);
> -    target_graph_mod = (backing_mode != MIRROR_LEAVE_BACKING_CHAIN);
> +    if (target_is_backing) {
> +        int64_t bs_size, target_size;

Please add an empty line after the definitions.

> +        bs_size = bdrv_getlength(bs);
> +        if (bs_size < 0) {
> +            error_setg_errno(errp, -bs_size,
> +                             "Could not inquire top image size");
> +            goto fail;
> +        }
> +
> +        target_size = bdrv_getlength(target);
> +        if (target_size < 0) {
> +            error_setg_errno(errp, -target_size,
> +                             "Could not inquire base image size");
> +            goto fail;
> +        }
> +
> +        if (target_size < bs_size) {
> +            target_perms |= BLK_PERM_RESIZE;
> +        }
> +
> +        target_shared_perms |= BLK_PERM_CONSISTENT_READ
> +                            |  BLK_PERM_WRITE
> +                            |  BLK_PERM_GRAPH_MOD;
> +    } else if (bdrv_chain_contains(bs, bdrv_skip_rw_filters(target))) {
> +        /*
> +         * We may want to allow this in the future, but it would
> +         * require taking some extra care.
> +         */
> +        error_setg(errp, "Cannot mirror to a filter on top of a node in the "
> +                   "source's backing chain");
> +        goto fail;
> +    }
> +
> +    if (backing_mode != MIRROR_LEAVE_BACKING_CHAIN) {
> +        target_perms |= BLK_PERM_GRAPH_MOD;
> +    }
> +
>       s->target = blk_new(s->common.job.aio_context,
> -                        BLK_PERM_WRITE | BLK_PERM_RESIZE |
> -                        (target_graph_mod ? BLK_PERM_GRAPH_MOD : 0),
> -                        BLK_PERM_WRITE_UNCHANGED |
> -                        (target_is_backing ? BLK_PERM_CONSISTENT_READ |
> -                                             BLK_PERM_WRITE |
> -                                             BLK_PERM_GRAPH_MOD : 0));
> +                        target_perms, target_shared_perms);
>       ret = blk_insert_bs(s->target, target, errp);
>       if (ret < 0) {
>           goto fail;
> @@ -1647,6 +1687,7 @@ static BlockJob *mirror_start_job(
>       s->backing_mode = backing_mode;
>       s->copy_mode = copy_mode;
>       s->base = base;
> +    s->base_overlay = bdrv_find_overlay(bs, base);
>       s->granularity = granularity;
>       s->buf_size = ROUND_UP(buf_size, granularity);
>       s->unmap = unmap;
> @@ -1693,15 +1734,39 @@ static BlockJob *mirror_start_job(
>       /* In commit_active_start() all intermediate nodes disappear, so
>        * any jobs in them must be blocked */
>       if (target_is_backing) {
> -        BlockDriverState *iter;
> -        for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) {
> -            /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
> -             * ourselves at s->base (if writes are blocked for a node, they are
> -             * also blocked for its backing file). The other options would be a
> -             * second filter driver above s->base (== target). */
> +        BlockDriverState *iter, *filtered_target;
> +        uint64_t iter_shared_perms;
> +
> +        /*
> +         * The topmost node with
> +         * bdrv_skip_rw_filters(filtered_target) == bdrv_skip_rw_filters(target)
> +         */
> +        filtered_target = bdrv_filtered_cow_bs(bdrv_find_overlay(bs, target));
> +
> +        assert(bdrv_skip_rw_filters(filtered_target) ==
> +               bdrv_skip_rw_filters(target));
> +
> +        /*
> +         * XXX BLK_PERM_WRITE needs to be allowed so we don't block
> +         * ourselves at s->base (if writes are blocked for a node, they are
> +         * also blocked for its backing file). The other options would be a
> +         * second filter driver above s->base (== target).
> +         */
> +        iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
> +
> +        for (iter = bdrv_filtered_bs(bs); iter != target;
> +             iter = bdrv_filtered_bs(iter))
> +        {
> +            if (iter == filtered_target) {
> +                /*
> +                 * From here on, all nodes are filters on the base.
> +                 * This allows us to share BLK_PERM_CONSISTENT_READ.

I'd prefer to add something like: "because we share it on target (see target BlockBackend creation
and corresponding comment above)".

> +                 */
> +                iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
> +            }
> +
>               ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
> -                                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
> -                                     errp);
> +                                     iter_shared_perms, errp);
>               if (ret < 0) {
>                   goto fail;
>               }
> @@ -1737,7 +1802,7 @@ fail:
>       bs_opaque->stop = true;
>       bdrv_child_refresh_perms(mirror_top_bs, mirror_top_bs->backing,
>                                &error_abort);
> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>   
>       bdrv_unref(mirror_top_bs);
>   
> @@ -1764,7 +1829,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
>           return;
>       }
>       is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
> -    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
> +    base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
>       mirror_start_job(job_id, bs, creation_flags, target, replaces,
>                        speed, granularity, buf_size, backing_mode,
>                        on_source_error, on_target_error, unmap, NULL, NULL,
> diff --git a/blockdev.c b/blockdev.c
> index c540802127..c451f553f7 100644


block/mirror.c is OK for me. Continuing with blockdev.c...

> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -3851,7 +3851,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>           return;
>       }
>   
> -    if (!bs->backing && sync == MIRROR_SYNC_MODE_TOP) {
> +    if (!bdrv_backing_chain_next(bs) && sync == MIRROR_SYNC_MODE_TOP) {
>           sync = MIRROR_SYNC_MODE_FULL;
>       }
>   
> @@ -3900,7 +3900,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>   
>   void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>   {
> -    BlockDriverState *bs;
> +    BlockDriverState *bs, *unfiltered_bs;
>       BlockDriverState *source, *target_bs;
>       AioContext *aio_context;
>       BlockMirrorBackingMode backing_mode;
> @@ -3909,6 +3909,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>       int flags;
>       int64_t size;
>       const char *format = arg->format;
> +    const char *replaces_node_name = NULL;
>       int ret;
>   
>       bs = qmp_get_root_bs(arg->device, errp);
> @@ -3921,6 +3922,16 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>           return;
>       }
>   
> +    /*
> +     * If the user has not instructed us otherwise, we should let the
> +     * block job run from @bs (thus taking into account all filters on
> +     * it) but replace @unfiltered_bs when it finishes (thus not
> +     * removing those filters).
> +     * (And if there are any explicit filters, we should assume the
> +     *  user knows how to use the @replaces option.)
> +     */
> +    unfiltered_bs = bdrv_skip_implicit_filters(bs);
> +
>       aio_context = bdrv_get_aio_context(bs);
>       aio_context_acquire(aio_context);
>   
> @@ -3934,8 +3945,14 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>       }
>   
>       flags = bs->open_flags | BDRV_O_RDWR;
> -    source = backing_bs(bs);
> +    source = bdrv_filtered_cow_bs(unfiltered_bs);
>       if (!source && arg->sync == MIRROR_SYNC_MODE_TOP) {


Hmm, you handle this case a bit differently here and in blockdev_mirror_common().
Can we handle it only in blockdev_mirror_common(), to be consistent with qmp_blockdev_mirror()?

> +        if (bdrv_filtered_bs(unfiltered_bs)) {
> +            /* @unfiltered_bs is an explicit filter */
> +            error_setg(errp, "Cannot perform sync=top mirror through an "
> +                       "explicitly added filter node on the source");
> +            goto out;
> +        }
>           arg->sync = MIRROR_SYNC_MODE_FULL;
>       }
>       if (arg->sync == MIRROR_SYNC_MODE_NONE) {
> @@ -3954,6 +3971,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>                                " named node of the graph");
>               goto out;
>           }
> +        replaces_node_name = arg->replaces;
> +    } else if (unfiltered_bs != bs) {
> +        replaces_node_name = unfiltered_bs->node_name;
>       }
>   
>       if (arg->mode == NEW_IMAGE_MODE_ABSOLUTE_PATHS) {
> @@ -3973,6 +3993,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>           bdrv_img_create(arg->target, format,
>                           NULL, NULL, NULL, size, flags, false, &local_err);
>       } else {
> +        /* Implicit filters should not appear in the filename */
> +        BlockDriverState *explicit_backing = bdrv_skip_implicit_filters(source);
> +
>           switch (arg->mode) {
>           case NEW_IMAGE_MODE_EXISTING:
>               break;
> @@ -3980,8 +4003,8 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>               /* create new image with backing file */
>               bdrv_refresh_filename(source);
>               bdrv_img_create(arg->target, format,
> -                            source->filename,
> -                            source->drv->format_name,
> +                            explicit_backing->filename,
> +                            explicit_backing->drv->format_name,
>                               NULL, size, flags, false, &local_err);
>               break;
>           default:
> @@ -4017,7 +4040,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>       }
>   
>       blockdev_mirror_common(arg->has_job_id ? arg->job_id : NULL, bs, target_bs,
> -                           arg->has_replaces, arg->replaces, arg->sync,
> +                           !!replaces_node_name, replaces_node_name, arg->sync,
>                              backing_mode, arg->has_speed, arg->speed,
>                              arg->has_granularity, arg->granularity,
>                              arg->has_buf_size, arg->buf_size,
> @@ -4053,7 +4076,7 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
>                            bool has_auto_dismiss, bool auto_dismiss,
>                            Error **errp)
>   {
> -    BlockDriverState *bs;
> +    BlockDriverState *bs, *unfiltered_bs;
>       BlockDriverState *target_bs;
>       AioContext *aio_context;
>       BlockMirrorBackingMode backing_mode = MIRROR_LEAVE_BACKING_CHAIN;
> @@ -4065,6 +4088,16 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
>           return;
>       }
>   
> +    /*
> +     * Same as in qmp_drive_mirror():

Then maybe it would be better to do it in blockdev_mirror_common()?
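
To make the suggestion concrete, here is a toy Python model (not QEMU code) of
what a single shared helper could compute if the default were derived once in
the common function. Node, skip_implicit_filters() and default_replaces() are
hypothetical stand-ins for BlockDriverState, bdrv_skip_implicit_filters() and
the logic in this hunk; only the decision rule is taken from the patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a BlockDriverState."""
    node_name: str
    implicit: bool = False             # True for implicitly added filters
    filtered: Optional["Node"] = None  # the node this filter sits on, if any


def skip_implicit_filters(node: Node) -> Node:
    """Walk down through implicit filters only (assumed model of
    bdrv_skip_implicit_filters())."""
    while node.implicit and node.filtered is not None:
        node = node.filtered
    return node


def default_replaces(bs: Node, replaces: Optional[str] = None) -> Optional[str]:
    """Return the node name to replace on completion, or None to keep
    the job's own default."""
    if replaces is not None:
        return replaces                # the user's explicit choice always wins
    unfiltered = skip_implicit_filters(bs)
    if unfiltered is not bs:
        return unfiltered.node_name    # leave the implicit filters in place
    return None


fmt = Node("fmt0")
implicit_top = Node("mirror-top", implicit=True, filtered=fmt)
print(default_replaces(implicit_top))          # fmt0
print(default_replaces(fmt))                   # None
print(default_replaces(implicit_top, "user"))  # user
```

With something like this in one place, both QMP entry points would only have to
pass their (possibly absent) @replaces argument through.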

> +     * We want to run the job from @bs,
> +     * but we want to replace @unfiltered_bs on completion.
> +     */
> +    unfiltered_bs = bdrv_skip_implicit_filters(bs);
> +    if (!has_replaces && unfiltered_bs != bs) {
> +        replaces = unfiltered_bs->node_name;
> +        has_replaces = true;
> +    }
> +
>       target_bs = bdrv_lookup_bs(target, target, errp);
>       if (!target_bs) {
>           return;
> 


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 27/42] commit: Deal with filters
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 27/42] commit: " Max Reitz
@ 2019-08-31 10:44   ` Vladimir Sementsov-Ogievskiy
  2019-09-02 14:55     ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-31 10:44 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:13, Max Reitz wrote:
> This includes some permission limiting (for example, we only need to
> take the RESIZE permission if the base is smaller than the top).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block/block-backend.c | 16 +++++---
>   block/commit.c        | 96 +++++++++++++++++++++++++++++++------------
>   blockdev.c            |  6 ++-
>   3 files changed, 85 insertions(+), 33 deletions(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index c13c5c83b0..0bc592d023 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -2180,11 +2180,17 @@ int blk_commit_all(void)
>           AioContext *aio_context = blk_get_aio_context(blk);
>   
>           aio_context_acquire(aio_context);
> -        if (blk_is_inserted(blk) && blk->root->bs->backing) {
> -            int ret = bdrv_commit(blk->root->bs);
> -            if (ret < 0) {
> -                aio_context_release(aio_context);
> -                return ret;
> +        if (blk_is_inserted(blk)) {
> +            BlockDriverState *non_filter;
> +
> +            /* Legacy function, so skip implicit filters */
> +            non_filter = bdrv_skip_implicit_filters(blk->root->bs);
> +            if (bdrv_filtered_cow_child(non_filter)) {
> +                int ret = bdrv_commit(non_filter);
> +                if (ret < 0) {
> +                    aio_context_release(aio_context);
> +                    return ret;
> +                }
>               }

And if non_filter is an explicit filter, we just skip it. I think we'd better return
an error in this case. For example, just drop the if (bdrv_filtered_cow_child(non_filter))
check and get -ENOTSUP from bdrv_commit() in this case.

And with at least this fixed I'm OK with this patch:

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

However, some more comments below:

>           }
>           aio_context_release(aio_context);
> diff --git a/block/commit.c b/block/commit.c
> index 5a7672c7c7..40d1c8eeac 100644
> --- a/block/commit.c
> +++ b/block/commit.c
> @@ -37,6 +37,7 @@ typedef struct CommitBlockJob {
>       BlockBackend *top;
>       BlockBackend *base;
>       BlockDriverState *base_bs;
> +    BlockDriverState *above_base;

You called it base_overlay in mirror; it seems better to keep the same naming.

>       BlockdevOnError on_error;
>       bool base_read_only;
>       bool chain_frozen;
> @@ -110,7 +111,7 @@ static void commit_abort(Job *job)
>        * XXX Can (or should) we somehow keep 'consistent read' blocked even
>        * after the failed/cancelled commit job is gone? If we already wrote
>        * something to base, the intermediate images aren't valid any more. */
> -    bdrv_replace_node(s->commit_top_bs, backing_bs(s->commit_top_bs),
> +    bdrv_replace_node(s->commit_top_bs, s->commit_top_bs->backing->bs,
>                         &error_abort);
>   
>       bdrv_unref(s->commit_top_bs);
> @@ -174,7 +175,7 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
>               break;
>           }
>           /* Copy if allocated above the base */
> -        ret = bdrv_is_allocated_above(blk_bs(s->top), blk_bs(s->base), false,
> +        ret = bdrv_is_allocated_above(blk_bs(s->top), s->above_base, true,
>                                         offset, COMMIT_BUFFER_SIZE, &n);
>           copy = (ret == 1);
>           trace_commit_one_iteration(s, offset, n, ret);
> @@ -267,15 +268,35 @@ void commit_start(const char *job_id, BlockDriverState *bs,
>       CommitBlockJob *s;
>       BlockDriverState *iter;
>       BlockDriverState *commit_top_bs = NULL;
> +    BlockDriverState *filtered_base;
>       Error *local_err = NULL;
> +    int64_t base_size, top_size;
> +    uint64_t perms, iter_shared_perms;
>       int ret;
>   
>       assert(top != bs);
> -    if (top == base) {
> +    if (bdrv_skip_rw_filters(top) == bdrv_skip_rw_filters(base)) {
>           error_setg(errp, "Invalid files for merge: top and base are the same");
>           return;
>       }
>   
> +    base_size = bdrv_getlength(base);
> +    if (base_size < 0) {
> +        error_setg_errno(errp, -base_size, "Could not inquire base image size");
> +        return;
> +    }
> +
> +    top_size = bdrv_getlength(top);
> +    if (top_size < 0) {
> +        error_setg_errno(errp, -top_size, "Could not inquire top image size");
> +        return;
> +    }
> +
> +    perms = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
> +    if (base_size < top_size) {
> +        perms |= BLK_PERM_RESIZE;
> +    }
> +
>       s = block_job_create(job_id, &commit_job_driver, NULL, bs, 0, BLK_PERM_ALL,
>                            speed, creation_flags, NULL, NULL, errp);
>       if (!s) {
> @@ -315,17 +336,43 @@ void commit_start(const char *job_id, BlockDriverState *bs,
>   
>       s->commit_top_bs = commit_top_bs;
>   
> -    /* Block all nodes between top and base, because they will
> -     * disappear from the chain after this operation. */
> -    assert(bdrv_chain_contains(top, base));
> -    for (iter = top; iter != base; iter = backing_bs(iter)) {
> -        /* XXX BLK_PERM_WRITE needs to be allowed so we don't block ourselves
> -         * at s->base (if writes are blocked for a node, they are also blocked
> -         * for its backing file). The other options would be a second filter
> -         * driver above s->base. */

This part of the code is practically identical to the corresponding part in block/mirror.c.
It would be great to put it into a function and reuse it. However, that is outside the scope
of this series.
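
As a rough idea of what such a shared function would have to compute, here is a
toy Python model (not QEMU code) of the loop. Node, Perm and
intermediate_shared_perms() are hypothetical names; the bit values are only for
illustration. As far as I can see, the two loops differ only in where they
start: commit starts at top, mirror at the node below bs.

```python
from enum import IntFlag
from typing import List, Optional, Tuple


class Perm(IntFlag):
    """Illustrative permission bits; names follow the BLK_PERM_* constants."""
    CONSISTENT_READ = 0x01
    WRITE = 0x02
    WRITE_UNCHANGED = 0x04


class Node:
    """Hypothetical stand-in for a BlockDriverState in a linear chain."""
    def __init__(self, name: str, child: Optional["Node"] = None):
        self.name = name
        self.child = child  # models bdrv_filtered_bs(): backing or filtered child


def intermediate_shared_perms(start: Node, end: Node,
                              filtered_end: Node) -> List[Tuple[str, Perm]]:
    """Return (node name, shared perms) for every node in [start, end)."""
    shared = Perm.WRITE_UNCHANGED | Perm.WRITE
    result = []
    node = start
    while node is not end:
        assert node is not None, "end must be reachable from start"
        if node is filtered_end:
            # From here on, all nodes are filters on top of 'end', so
            # CONSISTENT_READ can be shared as well
            shared |= Perm.CONSISTENT_READ
        result.append((node.name, shared))
        node = node.child
    return result


# top -> mid -> filter -> base, where 'filter' is an R/W filter on base
base = Node("base")
flt = Node("filter", base)
mid = Node("mid", flt)
top = Node("top", mid)
for name, perms in intermediate_shared_perms(top, base, flt):
    print(name, int(perms))   # top 6, mid 6, filter 7
```

In the real code the body of the loop would be the block_job_add_bdrv() call
with the computed shared permissions.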

> +    /*
> +     * Block all nodes between top and base, because they will
> +     * disappear from the chain after this operation.
> +     * Note that this assumes that the user is fine with removing all
> +     * nodes (including R/W filters) between top and base.  Assuring
> +     * this is the responsibility of the interface (i.e. whoever calls
> +     * commit_start()).
> +     */
> +    s->above_base = bdrv_find_overlay(top, base);
> +    assert(s->above_base);
> +
> +    /*
> +     * The topmost node with
> +     * bdrv_skip_rw_filters(filtered_base) == bdrv_skip_rw_filters(base)
> +     */
> +    filtered_base = bdrv_filtered_cow_bs(s->above_base);
> +    assert(bdrv_skip_rw_filters(filtered_base) == bdrv_skip_rw_filters(base));
> +
> +    /*
> +     * XXX BLK_PERM_WRITE needs to be allowed so we don't block ourselves
> +     * at s->base (if writes are blocked for a node, they are also blocked
> +     * for its backing file). The other options would be a second filter
> +     * driver above s->base.
> +     */
> +    iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
> +
> +    for (iter = top; iter != base; iter = bdrv_filtered_bs(iter)) {
> +        if (iter == filtered_base) {
> +            /*
> +             * From here on, all nodes are filters on the base.  This
> +             * allows us to share BLK_PERM_CONSISTENT_READ.
> +             */
> +            iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
> +        }
> +
>           ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
> -                                 BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
> -                                 errp);
> +                                 iter_shared_perms, errp);
>           if (ret < 0) {
>               goto fail;
>           }
> @@ -342,9 +389,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
>       }
>   
>       s->base = blk_new(s->common.job.aio_context,
> -                      BLK_PERM_CONSISTENT_READ
> -                      | BLK_PERM_WRITE
> -                      | BLK_PERM_RESIZE,
> +                      perms,
>                         BLK_PERM_CONSISTENT_READ
>                         | BLK_PERM_GRAPH_MOD
>                         | BLK_PERM_WRITE_UNCHANGED);
> @@ -412,19 +457,22 @@ int bdrv_commit(BlockDriverState *bs)
>       if (!drv)
>           return -ENOMEDIUM;
>   
> -    if (!bs->backing) {
> +    backing_file_bs = bdrv_filtered_cow_bs(bs);

Hmm, just a note: if in the future we have a COW child which is not bs->backing, a lot of code
will fail, as we always assume that the COW child is bs->backing. Maybe this should be noted in
a comment in the bdrv_filtered_cow_child() implementation.

> +
> +    if (!backing_file_bs) {
>           return -ENOTSUP;
>       }
>   
>       if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_COMMIT_SOURCE, NULL) ||
> -        bdrv_op_is_blocked(bs->backing->bs, BLOCK_OP_TYPE_COMMIT_TARGET, NULL)) {
> +        bdrv_op_is_blocked(backing_file_bs, BLOCK_OP_TYPE_COMMIT_TARGET, NULL))
> +    {
>           return -EBUSY;
>       }
>   
> -    ro = bs->backing->bs->read_only;
> +    ro = backing_file_bs->read_only;
>   
>       if (ro) {
> -        if (bdrv_reopen_set_read_only(bs->backing->bs, false, NULL)) {
> +        if (bdrv_reopen_set_read_only(backing_file_bs, false, NULL)) {
>               return -EACCES;
>           }
>       }
> @@ -440,8 +488,6 @@ int bdrv_commit(BlockDriverState *bs)
>       }
>   
>       /* Insert commit_top block node above backing, so we can write to it */
> -    backing_file_bs = backing_bs(bs);
> -
>       commit_top_bs = bdrv_new_open_driver(&bdrv_commit_top, NULL, BDRV_O_RDWR,
>                                            &local_err);
>       if (commit_top_bs == NULL) {
> @@ -526,15 +572,13 @@ ro_cleanup:
>       qemu_vfree(buf);
>   
>       blk_unref(backing);
> -    if (backing_file_bs) {
> -        bdrv_set_backing_hd(bs, backing_file_bs, &error_abort);
> -    }
> +    bdrv_set_backing_hd(bs, backing_file_bs, &error_abort);

Preexisting, but we should not drop the filter if we didn't add it (i.e. if we failed before
the filter insertion above). You increased the number of such cases. Still, no harm done.

>       bdrv_unref(commit_top_bs);
>       blk_unref(src);
>   
>       if (ro) {
>           /* ignoring error return here */
> -        bdrv_reopen_set_read_only(bs->backing->bs, true, NULL);
> +        bdrv_reopen_set_read_only(backing_file_bs, true, NULL);
>       }
>   
>       return ret;
> diff --git a/blockdev.c b/blockdev.c
> index c6f79b4e0e..7bef41c0b0 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -1094,7 +1094,7 @@ void hmp_commit(Monitor *mon, const QDict *qdict)
>               return;
>           }
>   
> -        bs = blk_bs(blk);
> +        bs = bdrv_skip_implicit_filters(blk_bs(blk));
>           aio_context = bdrv_get_aio_context(bs);
>           aio_context_acquire(aio_context);
>   
> @@ -3454,7 +3454,9 @@ void qmp_block_commit(bool has_job_id, const char *job_id, const char *device,
>   
>       assert(bdrv_get_aio_context(base_bs) == aio_context);
>   
> -    for (iter = top_bs; iter != backing_bs(base_bs); iter = backing_bs(iter)) {
> +    for (iter = top_bs; iter != bdrv_filtered_bs(base_bs);
> +         iter = bdrv_filtered_bs(iter))
> +    {
>           if (bdrv_op_is_blocked(iter, BLOCK_OP_TYPE_COMMIT_TARGET, errp)) {
>               goto out;
>           }
> 

-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v6 39/42] iotests: Add filter commit test cases
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 39/42] iotests: Add filter commit test cases Max Reitz
@ 2019-08-31 11:41   ` Vladimir Sementsov-Ogievskiy
  2019-09-02 15:06     ` Max Reitz
  2019-08-31 12:35   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-31 11:41 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:14, Max Reitz wrote:
> This patch adds some tests on how commit copes with filter nodes.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   tests/qemu-iotests/040     | 177 +++++++++++++++++++++++++++++++++++++
>   tests/qemu-iotests/040.out |   4 +-
>   2 files changed, 179 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
> index 6db9abf8e6..a0a0db8889 100755
> --- a/tests/qemu-iotests/040
> +++ b/tests/qemu-iotests/040
> @@ -428,5 +428,182 @@ class TestReopenOverlay(ImageCommitTestCase):
>       def test_reopen_overlay(self):
>           self.run_commit_test(self.img1, self.img0)
>   
> +class TestCommitWithFilters(iotests.QMPTestCase):
> +    img0 = os.path.join(iotests.test_dir, '0.img')
> +    img1 = os.path.join(iotests.test_dir, '1.img')
> +    img2 = os.path.join(iotests.test_dir, '2.img')
> +    img3 = os.path.join(iotests.test_dir, '3.img')
> +
> +    def setUp(self):
> +        qemu_img('create', '-f', iotests.imgfmt, self.img0, '64M')
> +        qemu_img('create', '-f', iotests.imgfmt, self.img1, '64M')
> +        qemu_img('create', '-f', iotests.imgfmt, self.img2, '64M')
> +        qemu_img('create', '-f', iotests.imgfmt, self.img3, '64M')
> +
> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 1 0M 1M', self.img0)
> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 2 1M 1M', self.img1)
> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 3 2M 1M', self.img2)
> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 4 3M 1M', self.img3)
> +
> +        # Distributions of the patterns in the files; this is checked
> +        # by tearDown() and should be changed by the test cases as is
> +        # necessary
> +        self.pattern_files = [self.img0, self.img1, self.img2, self.img3]
> +
> +        self.vm = iotests.VM()
> +        self.vm.launch()
> +        self.has_quit = False
> +
> +        result = self.vm.qmp('object-add', qom_type='throttle-group', id='tg')
> +        self.assert_qmp(result, 'return', {})
> +
> +        result = self.vm.qmp('blockdev-add', **{
> +                'node-name': 'top-filter',
> +                'driver': 'throttle',
> +                'throttle-group': 'tg',
> +                'file': {
> +                    'node-name': 'cow-3',
> +                    'driver': iotests.imgfmt,
> +                    'file': {
> +                        'driver': 'file',
> +                        'filename': self.img3
> +                    },
> +                    'backing': {
> +                        'node-name': 'cow-2',
> +                        'driver': iotests.imgfmt,
> +                        'file': {
> +                            'driver': 'file',
> +                            'filename': self.img2
> +                        },
> +                        'backing': {
> +                            'node-name': 'cow-1',
> +                            'driver': iotests.imgfmt,
> +                            'file': {
> +                                'driver': 'file',
> +                                'filename': self.img1
> +                            },
> +                            'backing': {
> +                                'node-name': 'bottom-filter',
> +                                'driver': 'throttle',
> +                                'throttle-group': 'tg',
> +                                'file': {
> +                                    'node-name': 'cow-0',
> +                                    'driver': iotests.imgfmt,
> +                                    'file': {
> +                                        'driver': 'file',
> +                                        'filename': self.img0
> +                                    }
> +                                }
> +                            }
> +                        }
> +                    }
> +                }
> +            })
> +        self.assert_qmp(result, 'return', {})
> +
> +    def tearDown(self):
> +        self.vm.shutdown(has_quit=self.has_quit)
> +
> +        for index in range(len(self.pattern_files)):

You may use enumerate for such cases:
for ind, file in enumerate(self.pattern_files):
    ...

> +            result = qemu_io('-f', iotests.imgfmt,
> +                             '-c', 'read -P %i %iM 1M' % (index + 1, index),
> +                             self.pattern_files[index])
> +            self.assertFalse('Pattern verification failed' in result)

It would be a bit better to keep this loop in a function and do the "writes" through it too,
to make it more obvious that they are the same. But I'm OK with it as is.
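
Something along these lines is what I have in mind. This is only a sketch:
pattern_cmd() and pattern_cmds() are hypothetical helper names, and the sketch
only builds the qemu-io command strings instead of running them, so that it is
self-contained.

```python
def pattern_cmd(op: str, index: int) -> str:
    """Build the qemu-io command for the 1M pattern block number 'index'."""
    assert op in ("read", "write")
    return "%s -P %i %iM 1M" % (op, index + 1, index)


def pattern_cmds(op, pattern_files):
    """Pair each image file with the command that touches its pattern."""
    return [(path, pattern_cmd(op, index))
            for index, path in enumerate(pattern_files)]


# Hypothetical file names; the test uses self.img0 .. self.img3
files = ["0.img", "1.img", "2.img", "3.img"]
for path, cmd in pattern_cmds("write", files):
    print(path, cmd)   # first line: 0.img write -P 1 0M 1M
```

setUp() would then pass each (path, cmd) pair for "write" to qemu_io(), and
tearDown() would do the same for "read" over self.pattern_files.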

> +
> +        os.remove(self.img3)
> +        os.remove(self.img2)
> +        os.remove(self.img1)
> +        os.remove(self.img0)
> +
> +    # Filters make for funny filenames, so we cannot just use
> +    # self.imgX to get them
> +    def get_filename(self, node):
> +        return self.vm.node_info(node)['image']['filename']
> +

Maybe:
def assertHasNode(self, node_name):
    self.assertIsNotNone(self.vm.node_info(node_name))

and similarly for assertNoNode...
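
A slightly fuller sketch of both helpers, written as a mixin so it can be tried
without a running VM. NodeAssertions, FakeVM and NodeAssertionsDemo are
hypothetical names for this illustration; FakeVM only mimics the way
vm.node_info() is used in this patch (some info for an existing node, None
otherwise).

```python
import unittest


class NodeAssertions:
    """Mixin adding node-existence assertions; expects self.vm.node_info()."""

    def assertHasNode(self, node_name):
        self.assertIsNotNone(self.vm.node_info(node_name),
                             "node %s should exist" % node_name)

    def assertNoNode(self, node_name):
        self.assertIsNone(self.vm.node_info(node_name),
                          "node %s should be gone" % node_name)


class FakeVM:
    """Hypothetical stand-in for iotests.VM, only for this sketch."""
    def __init__(self, nodes):
        self.nodes = set(nodes)

    def node_info(self, node_name):
        # Some info if the node is present, None otherwise
        return {"node-name": node_name} if node_name in self.nodes else None


class NodeAssertionsDemo(NodeAssertions, unittest.TestCase):
    def setUp(self):
        # Graph state after committing cow-2 into cow-1
        self.vm = FakeVM(["cow-3", "cow-1"])

    def test_filterless_commit_graph(self):
        self.assertHasNode("cow-3")
        self.assertNoNode("cow-2")
        self.assertHasNode("cow-1")
```

In the real test the mixin methods could simply live in TestCommitWithFilters.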

> +    def test_filterless_commit(self):
> +        self.assert_no_active_block_jobs()

Why not just include this call in setUp()? Or even just drop it?
We create and start a new VM in setUp(), so it certainly doesn't have any block jobs.

> +        result = self.vm.qmp('block-commit',
> +                             job_id='commit',
> +                             device='top-filter',
> +                             top_node='cow-2',
> +                             base_node='cow-1')
> +        self.assert_qmp(result, 'return', {})
> +        self.wait_until_completed(drive='commit')
> +
> +        self.assertIsNotNone(self.vm.node_info('cow-3'))
> +        self.assertIsNone(self.vm.node_info('cow-2'))
> +        self.assertIsNotNone(self.vm.node_info('cow-1'))
> +
> +        # 2 has been comitted into 1
> +        self.pattern_files[2] = self.img1
> +
> +    def test_commit_through_filter(self):
> +        self.assert_no_active_block_jobs()
> +        result = self.vm.qmp('block-commit',
> +                             job_id='commit',
> +                             device='top-filter',
> +                             top_node='cow-1',
> +                             base_node='cow-0')
> +        self.assert_qmp(result, 'return', {})
> +        self.wait_until_completed(drive='commit')
> +
> +        self.assertIsNotNone(self.vm.node_info('cow-2'))
> +        self.assertIsNone(self.vm.node_info('cow-1'))
> +        self.assertIsNone(self.vm.node_info('bottom-filter'))
> +        self.assertIsNotNone(self.vm.node_info('cow-0'))
> +
> +        # 1 has been comitted into 0
> +        self.pattern_files[1] = self.img0
> +
> +    def test_filtered_active_commit_with_filter(self):
> +        # Add a device, so the commit job finds a parent it can change
> +        # to point to the base node (so we can test that top-filter is
> +        # dropped from the graph)
> +        result = self.vm.qmp('device_add', id='drv0', driver='virtio-blk',
> +                             drive='top-filter')
> +        self.assert_qmp(result, 'return', {})
> +
> +        # Try to release our reference to top-filter; that should not
> +        # work because drv0 uses it
> +        result = self.vm.qmp('blockdev-del', node_name='top-filter')
> +        self.assert_qmp(result, 'error/class', 'GenericError')
> +        self.assert_qmp(result, 'error/desc', 'Node top-filter is in use')
> +
> +        self.assert_no_active_block_jobs()
> +        result = self.vm.qmp('block-commit',
> +                             job_id='commit',
> +                             device='top-filter',
> +                             base_node='cow-2')
> +        self.assert_qmp(result, 'return', {})
> +        self.complete_and_wait(drive='commit')
> +
> +        # Try to release our reference to top-filter again
> +        result = self.vm.qmp('blockdev-del', node_name='top-filter')
> +        self.assert_qmp(result, 'return', {})
> +
> +        self.assertIsNone(self.vm.node_info('top-filter'))
> +        self.assertIsNone(self.vm.node_info('cow-3'))
> +        self.assertIsNotNone(self.vm.node_info('cow-2'))

It would be good to assert here that cow-2 became a child of drv0. However, otherwise
it would be dropped automatically, so it's not necessary.

> +
> +        # 3 has been comitted into 2
> +        self.pattern_files[3] = self.img2
> +
> +    def test_filtered_active_commit_without_filter(self):
> +        self.assert_no_active_block_jobs()
> +        result = self.vm.qmp('block-commit',
> +                             job_id='commit',
> +                             device='top-filter',
> +                             top_node='cow-3',
> +                             base_node='cow-2')
> +        self.assert_qmp(result, 'return', {})

Can we check that an "active" commit is really started, i.e. a mirror block job?

> +        self.complete_and_wait(drive='commit')
> +
> +        self.assertIsNotNone(self.vm.node_info('top-filter'))
> +        self.assertIsNone(self.vm.node_info('cow-3'))
> +        self.assertIsNotNone(self.vm.node_info('cow-2'))
> +
> +        # 3 has been committed into 2
> +        self.pattern_files[3] = self.img2
> +
>   if __name__ == '__main__':
>       iotests.main(supported_fmts=['qcow2', 'qed'])
> diff --git a/tests/qemu-iotests/040.out b/tests/qemu-iotests/040.out
> index 220a5fa82c..fe58934d7a 100644
> --- a/tests/qemu-iotests/040.out
> +++ b/tests/qemu-iotests/040.out
> @@ -1,5 +1,5 @@
> -...............................................
> +...................................................
>   ----------------------------------------------------------------------
> -Ran 47 tests
> +Ran 51 tests
>   
>   OK
> 

With or without any of my suggestions:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir

* Re: [Qemu-devel] [PATCH v6 40/42] iotests: Add filter mirror test cases
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 40/42] iotests: Add filter mirror " Max Reitz
@ 2019-08-31 12:35   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-31 12:35 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:14, Max Reitz wrote:
> This patch adds some test cases for how mirroring relates to filters.  One
> of them tests what happens when you mirror off a filtered COW node, two
> others use the mirror filter node as basically our only example of an
> implicitly created filter node so far (besides the commit filter).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir

* Re: [Qemu-devel] [PATCH v6 39/42] iotests: Add filter commit test cases
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 39/42] iotests: Add filter commit test cases Max Reitz
  2019-08-31 11:41   ` Vladimir Sementsov-Ogievskiy
@ 2019-08-31 12:35   ` Vladimir Sementsov-Ogievskiy
  2019-09-02 15:09     ` Max Reitz
  1 sibling, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-08-31 12:35 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:14, Max Reitz wrote:
> This patch adds some tests on how commit copes with filter nodes.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   tests/qemu-iotests/040     | 177 +++++++++++++++++++++++++++++++++++++
>   tests/qemu-iotests/040.out |   4 +-
>   2 files changed, 179 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
> index 6db9abf8e6..a0a0db8889 100755
> --- a/tests/qemu-iotests/040
> +++ b/tests/qemu-iotests/040
> @@ -428,5 +428,182 @@ class TestReopenOverlay(ImageCommitTestCase):
>       def test_reopen_overlay(self):
>           self.run_commit_test(self.img1, self.img0)
>   
> +class TestCommitWithFilters(iotests.QMPTestCase):
> +    img0 = os.path.join(iotests.test_dir, '0.img')
> +    img1 = os.path.join(iotests.test_dir, '1.img')
> +    img2 = os.path.join(iotests.test_dir, '2.img')
> +    img3 = os.path.join(iotests.test_dir, '3.img')
> +
> +    def setUp(self):
> +        qemu_img('create', '-f', iotests.imgfmt, self.img0, '64M')
> +        qemu_img('create', '-f', iotests.imgfmt, self.img1, '64M')
> +        qemu_img('create', '-f', iotests.imgfmt, self.img2, '64M')
> +        qemu_img('create', '-f', iotests.imgfmt, self.img3, '64M')
> +
> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 1 0M 1M', self.img0)
> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 2 1M 1M', self.img1)
> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 3 2M 1M', self.img2)
> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 4 3M 1M', self.img3)
> +
> +        # Distributions of the patterns in the files; this is checked
> +        # by tearDown() and should be changed by the test cases as is
> +        # necessary
> +        self.pattern_files = [self.img0, self.img1, self.img2, self.img3]
> +
> +        self.vm = iotests.VM()
> +        self.vm.launch()
> +        self.has_quit = False

has_quit is actually unused; it is always False.



-- 
Best regards,
Vladimir

* Re: [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters
  2019-08-31  9:57   ` Vladimir Sementsov-Ogievskiy
@ 2019-09-02 14:35     ` Max Reitz
  2019-09-03  8:32       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-02 14:35 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 31.08.19 11:57, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:13, Max Reitz wrote:
>> This includes some permission limiting (for example, we only need to
>> take the RESIZE permission for active commits where the base is smaller
>> than the top).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/mirror.c | 117 ++++++++++++++++++++++++++++++++++++++-----------
>>   blockdev.c     |  47 +++++++++++++++++---
>>   2 files changed, 131 insertions(+), 33 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 54bafdf176..6ddbfb9708 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -42,6 +42,7 @@ typedef struct MirrorBlockJob {
>>       BlockBackend *target;
>>       BlockDriverState *mirror_top_bs;
>>       BlockDriverState *base;
>> +    BlockDriverState *base_overlay;
>>   
>>       /* The name of the graph node to replace */
>>       char *replaces;
>> @@ -665,8 +666,10 @@ static int mirror_exit_common(Job *job)
>>                                &error_abort);
>>       if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
>>           BlockDriverState *backing = s->is_none_mode ? src : s->base;
>> -        if (backing_bs(target_bs) != backing) {
>> -            bdrv_set_backing_hd(target_bs, backing, &local_err);
>> +        BlockDriverState *unfiltered_target = bdrv_skip_rw_filters(target_bs);
>> +
>> +        if (bdrv_filtered_cow_bs(unfiltered_target) != backing) {
>> +            bdrv_set_backing_hd(unfiltered_target, backing, &local_err);
>>               if (local_err) {
>>                   error_report_err(local_err);
>>                   ret = -EPERM;
>> @@ -715,7 +718,7 @@ static int mirror_exit_common(Job *job)
>>        * valid.
>>        */
>>       block_job_remove_all_bdrv(bjob);
>> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
>> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>>   
>>       /* We just changed the BDS the job BB refers to (with either or both of the
>>        * bdrv_replace_node() calls), so switch the BB back so the cleanup does
>> @@ -812,7 +815,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
>>               return 0;
>>           }
>>   
>> -        ret = bdrv_is_allocated_above(bs, base, false, offset, bytes, &count);
>> +        ret = bdrv_is_allocated_above(bs, s->base_overlay, true, offset, bytes,
>> +                                      &count);
>>           if (ret < 0) {
>>               return ret;
>>           }
>> @@ -908,7 +912,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
>>       } else {
>>           s->target_cluster_size = BDRV_SECTOR_SIZE;
>>       }
>> -    if (backing_filename[0] && !target_bs->backing &&
>> +    if (backing_filename[0] && !bdrv_backing_chain_next(target_bs) &&
>>           s->granularity < s->target_cluster_size) {
>>           s->buf_size = MAX(s->buf_size, s->target_cluster_size);
>>           s->cow_bitmap = bitmap_new(length);
>> @@ -1088,8 +1092,9 @@ static void mirror_complete(Job *job, Error **errp)
>>       if (s->backing_mode == MIRROR_OPEN_BACKING_CHAIN) {
>>           int ret;
>>   
>> -        assert(!target->backing);
>> -        ret = bdrv_open_backing_file(target, NULL, "backing", errp);
>> +        assert(!bdrv_backing_chain_next(target));
> 
> Preexisting, but it seems we may crash here: I don't see where we check this
> beforehand and return an error if there is some backing. And even if we did, we
> would not prevent a backing child from appearing on the target during the
> mirror operation.

The idea is that MIRROR_OPEN_BACKING_CHAIN is set only when using
drive-mirror with mode=existing.  In this case, we also set
BDRV_O_NO_BACKING for the target.

You’re right that a user could add a backing chain to the target during
the operation.  They really have to make an effort to shoot themselves
in the foot for this because the target must have an auto-generated node
name.

I suppose the best would be not to open the backing chain if the target
node already has a backing child?

>> +        ret = bdrv_open_backing_file(bdrv_skip_rw_filters(target), NULL,
>> +                                     "backing", errp);
>>           if (ret < 0) {
>>               return;
>>           }
>> @@ -1531,8 +1536,8 @@ static BlockJob *mirror_start_job(
>>       MirrorBlockJob *s;
>>       MirrorBDSOpaque *bs_opaque;
>>       BlockDriverState *mirror_top_bs;
>> -    bool target_graph_mod;
>>       bool target_is_backing;
>> +    uint64_t target_perms, target_shared_perms;
>>       Error *local_err = NULL;
>>       int ret;
>>   
>> @@ -1551,7 +1556,7 @@ static BlockJob *mirror_start_job(
>>           buf_size = DEFAULT_MIRROR_BUF_SIZE;
>>       }
>>   
>> -    if (bs == target) {
>> +    if (bdrv_skip_rw_filters(bs) == bdrv_skip_rw_filters(target)) {
>>           error_setg(errp, "Can't mirror node into itself");
>>           return NULL;
>>       }
>> @@ -1615,15 +1620,50 @@ static BlockJob *mirror_start_job(
>>        * In the case of active commit, things look a bit different, though,
>>        * because the target is an already populated backing file in active use.
>>        * We can allow anything except resize there.*/
>> +
>> +    target_perms = BLK_PERM_WRITE;
>> +    target_shared_perms = BLK_PERM_WRITE_UNCHANGED;
>> +
>>       target_is_backing = bdrv_chain_contains(bs, target);
>> -    target_graph_mod = (backing_mode != MIRROR_LEAVE_BACKING_CHAIN);
>> +    if (target_is_backing) {
>> +        int64_t bs_size, target_size;
> 
> <empty line after definitions>

Is that part of any of our guidelines? :-)

Sure, will add.

>> +        bs_size = bdrv_getlength(bs);
>> +        if (bs_size < 0) {
>> +            error_setg_errno(errp, -bs_size,
>> +                             "Could not inquire top image size");
>> +            goto fail;
>> +        }
>> +
>> +        target_size = bdrv_getlength(target);
>> +        if (target_size < 0) {
>> +            error_setg_errno(errp, -target_size,
>> +                             "Could not inquire base image size");
>> +            goto fail;
>> +        }
>> +
>> +        if (target_size < bs_size) {
>> +            target_perms |= BLK_PERM_RESIZE;
>> +        }
>> +
>> +        target_shared_perms |= BLK_PERM_CONSISTENT_READ
>> +                            |  BLK_PERM_WRITE
>> +                            |  BLK_PERM_GRAPH_MOD;
>> +    } else if (bdrv_chain_contains(bs, bdrv_skip_rw_filters(target))) {
>> +        /*
>> +         * We may want to allow this in the future, but it would
>> +         * require taking some extra care.
>> +         */
>> +        error_setg(errp, "Cannot mirror to a filter on top of a node in the "
>> +                   "source's backing chain");
>> +        goto fail;
>> +    }
>> +
>> +    if (backing_mode != MIRROR_LEAVE_BACKING_CHAIN) {
>> +        target_perms |= BLK_PERM_GRAPH_MOD;
>> +    }
>> +
>>       s->target = blk_new(s->common.job.aio_context,
>> -                        BLK_PERM_WRITE | BLK_PERM_RESIZE |
>> -                        (target_graph_mod ? BLK_PERM_GRAPH_MOD : 0),
>> -                        BLK_PERM_WRITE_UNCHANGED |
>> -                        (target_is_backing ? BLK_PERM_CONSISTENT_READ |
>> -                                             BLK_PERM_WRITE |
>> -                                             BLK_PERM_GRAPH_MOD : 0));
>> +                        target_perms, target_shared_perms);
>>       ret = blk_insert_bs(s->target, target, errp);
>>       if (ret < 0) {
>>           goto fail;
>> @@ -1647,6 +1687,7 @@ static BlockJob *mirror_start_job(
>>       s->backing_mode = backing_mode;
>>       s->copy_mode = copy_mode;
>>       s->base = base;
>> +    s->base_overlay = bdrv_find_overlay(bs, base);
>>       s->granularity = granularity;
>>       s->buf_size = ROUND_UP(buf_size, granularity);
>>       s->unmap = unmap;
>> @@ -1693,15 +1734,39 @@ static BlockJob *mirror_start_job(
>>       /* In commit_active_start() all intermediate nodes disappear, so
>>        * any jobs in them must be blocked */
>>       if (target_is_backing) {
>> -        BlockDriverState *iter;
>> -        for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) {
>> -            /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
>> -             * ourselves at s->base (if writes are blocked for a node, they are
>> -             * also blocked for its backing file). The other options would be a
>> -             * second filter driver above s->base (== target). */
>> +        BlockDriverState *iter, *filtered_target;
>> +        uint64_t iter_shared_perms;
>> +
>> +        /*
>> +         * The topmost node with
>> +         * bdrv_skip_rw_filters(filtered_target) == bdrv_skip_rw_filters(target)
>> +         */
>> +        filtered_target = bdrv_filtered_cow_bs(bdrv_find_overlay(bs, target));
>> +
>> +        assert(bdrv_skip_rw_filters(filtered_target) ==
>> +               bdrv_skip_rw_filters(target));
>> +
>> +        /*
>> +         * XXX BLK_PERM_WRITE needs to be allowed so we don't block
>> +         * ourselves at s->base (if writes are blocked for a node, they are
>> +         * also blocked for its backing file). The other options would be a
>> +         * second filter driver above s->base (== target).
>> +         */
>> +        iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
>> +
>> +        for (iter = bdrv_filtered_bs(bs); iter != target;
>> +             iter = bdrv_filtered_bs(iter))
>> +        {
>> +            if (iter == filtered_target) {
>> +                /*
>> +                 * From here on, all nodes are filters on the base.
>> +                 * This allows us to share BLK_PERM_CONSISTENT_READ.
> 
> I'd prefer to add something like: "because we share it on target (see target BlockBackend creation
> and corresponding comment above)".

I’d rather not refer to other comments in case they change…  Maybe just
“This allows us to share BLK_PERM_CONSISTENT_READ, as we do on the
target.”?  I think if someone is interested, they will scan the file for
what permissions are shared on the target anyway.

>> +                 */
>> +                iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
>> +            }
>> +
>>               ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
>> -                                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
>> -                                     errp);
>> +                                     iter_shared_perms, errp);
>>               if (ret < 0) {
>>                   goto fail;
>>               }
>> @@ -1737,7 +1802,7 @@ fail:
>>       bs_opaque->stop = true;
>>       bdrv_child_refresh_perms(mirror_top_bs, mirror_top_bs->backing,
>>                                &error_abort);
>> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
>> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>>   
>>       bdrv_unref(mirror_top_bs);
>>   
>> @@ -1764,7 +1829,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
>>           return;
>>       }
>>       is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
>> -    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
>> +    base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
>>       mirror_start_job(job_id, bs, creation_flags, target, replaces,
>>                        speed, granularity, buf_size, backing_mode,
>>                        on_source_error, on_target_error, unmap, NULL, NULL,
>> diff --git a/blockdev.c b/blockdev.c
>> index c540802127..c451f553f7 100644
> 
> 
> block/mirror.c is OK for me. Continuing with blockdev.c...
> 
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -3851,7 +3851,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>           return;
>>       }
>>   
>> -    if (!bs->backing && sync == MIRROR_SYNC_MODE_TOP) {
>> +    if (!bdrv_backing_chain_next(bs) && sync == MIRROR_SYNC_MODE_TOP) {
>>           sync = MIRROR_SYNC_MODE_FULL;
>>       }
>>   
>> @@ -3900,7 +3900,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>   
>>   void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>   {
>> -    BlockDriverState *bs;
>> +    BlockDriverState *bs, *unfiltered_bs;
>>       BlockDriverState *source, *target_bs;
>>       AioContext *aio_context;
>>       BlockMirrorBackingMode backing_mode;
>> @@ -3909,6 +3909,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>       int flags;
>>       int64_t size;
>>       const char *format = arg->format;
>> +    const char *replaces_node_name = NULL;
>>       int ret;
>>   
>>       bs = qmp_get_root_bs(arg->device, errp);
>> @@ -3921,6 +3922,16 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>           return;
>>       }
>>   
>> +    /*
>> +     * If the user has not instructed us otherwise, we should let the
>> +     * block job run from @bs (thus taking into account all filters on
>> +     * it) but replace @unfiltered_bs when it finishes (thus not
>> +     * removing those filters).
>> +     * (And if there are any explicit filters, we should assume the
>> +     *  user knows how to use the @replaces option.)
>> +     */
>> +    unfiltered_bs = bdrv_skip_implicit_filters(bs);
>> +
>>       aio_context = bdrv_get_aio_context(bs);
>>       aio_context_acquire(aio_context);
>>   
>> @@ -3934,8 +3945,14 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>       }
>>   
>>       flags = bs->open_flags | BDRV_O_RDWR;
>> -    source = backing_bs(bs);
>> +    source = bdrv_filtered_cow_bs(unfiltered_bs);
>>       if (!source && arg->sync == MIRROR_SYNC_MODE_TOP) {
> 
> 
> Hmm, you handle this case a bit differently here and in blockdev_mirror_common.
> Can we handle it only in blockdev_mirror_common, to be consistent with qmp_blockdev_mirror?

What exactly do you mean?  The difference between skipping all filters
and just skipping implicit filters?  Hm.

First, the check in blockdev_mirror_common() should actually be
unnecessary.  In qmp_{blockdev,drive}_mirror(), we do nearly the same
check anyway (and then force sync=full if there is no backing file).  So
if all three functions did the same check, we wouldn’t need it in
blockdev_mirror_common().

Second, let’s look at the difference in an example: One where
blockdev_mirror_common() would not decide to enforce sync=full, but
qmp_{blockdev,drive}_mirror() would.
This happens when @bs is an explicit filter over some overlay with a
backing file, e.g.:

throttle --file--> qcow2 --backing--> raw

It’s correct to run the mirror job from the throttle node; but @source
should be bdrv_backing_chain_next() so it will point to the raw node.
Currently, it is NULL (because the throttle node does not have a COW child).
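To make the difference concrete, here is a rough toy model of that graph in Python.  It is only a sketch, not QEMU code: the Node class and the helper functions are hypothetical stand-ins whose behaviour is assumed from the description in this mail, not copied from the block layer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a block node (not QEMU's BlockDriverState)."""
    name: str
    is_filter: bool = False
    implicit: bool = False
    file: Optional["Node"] = None      # child reached through "file"
    backing: Optional["Node"] = None   # child reached through "backing"


def filtered_rw_child(node):
    """Child that a filter passes all I/O through; None for non-filters."""
    if node is None or not node.is_filter:
        return None
    return node.file or node.backing


def filtered_cow_child(node):
    """COW (backing) child of a non-filter node; None otherwise."""
    if node is None or node.is_filter:
        return None
    return node.backing


def skip_implicit_filters(node):
    """Walk down past implicitly created filters only."""
    while node is not None and node.is_filter and node.implicit:
        node = filtered_rw_child(node)
    return node


def skip_rw_filters(node):
    """Walk down past all filters, implicit or explicit."""
    while node is not None and node.is_filter:
        node = filtered_rw_child(node)
    return node


def backing_chain_next(node):
    """Next non-filter node below the first non-filter node's COW child."""
    return skip_rw_filters(filtered_cow_child(skip_rw_filters(node)))


# throttle --file--> qcow2 --backing--> raw
raw = Node("raw")
qcow2 = Node("qcow2", backing=raw)
throttle = Node("throttle", is_filter=True, file=qcow2)

# The throttle node is an explicit filter, so it is not skipped, and it
# has no COW child of its own.
source_v6 = filtered_cow_child(skip_implicit_filters(throttle))

# Skipping all filters first finds the qcow2 node and then its backing file.
source_alt = backing_chain_next(throttle)
```

In this model source_v6 is None while source_alt is the raw node, which is the discrepancy described above.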

But then again, I’ve made qmp_{blockdev,drive}_mirror() throw an error
in such a case:

>> +        if (bdrv_filtered_bs(unfiltered_bs)) {
>> +            /* @unfiltered_bs is an explicit filter */
>> +            error_setg(errp, "Cannot perform sync=top mirror through an "
>> +                       "explicitly added filter node on the source");
>> +            goto out;
>> +        }

So it isn’t really a problem.  Still, does the error make sense?  Should
we just allow that case by letting source be
bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs))?

(BTW, I just noticed that @base seems to be pretty much unused in
block/mirror.c.  It only really uses @base_overlay now.  So I suppose it
makes sense to remove it in v7.)

>>           arg->sync = MIRROR_SYNC_MODE_FULL;
>>       }
>>       if (arg->sync == MIRROR_SYNC_MODE_NONE) {
>> @@ -3954,6 +3971,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>                                " named node of the graph");
>>               goto out;
>>           }
>> +        replaces_node_name = arg->replaces;
>> +    } else if (unfiltered_bs != bs) {
>> +        replaces_node_name = unfiltered_bs->node_name;
>>       }
>>   
>>       if (arg->mode == NEW_IMAGE_MODE_ABSOLUTE_PATHS) {
>> @@ -3973,6 +3993,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>           bdrv_img_create(arg->target, format,
>>                           NULL, NULL, NULL, size, flags, false, &local_err);
>>       } else {
>> +        /* Implicit filters should not appear in the filename */
>> +        BlockDriverState *explicit_backing = bdrv_skip_implicit_filters(source);
>> +
>>           switch (arg->mode) {
>>           case NEW_IMAGE_MODE_EXISTING:
>>               break;
>> @@ -3980,8 +4003,8 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>               /* create new image with backing file */
>>               bdrv_refresh_filename(source);
>>               bdrv_img_create(arg->target, format,
>> -                            source->filename,
>> -                            source->drv->format_name,
>> +                            explicit_backing->filename,
>> +                            explicit_backing->drv->format_name,
>>                               NULL, size, flags, false, &local_err);
>>               break;
>>           default:
>> @@ -4017,7 +4040,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>       }
>>   
>>       blockdev_mirror_common(arg->has_job_id ? arg->job_id : NULL, bs, target_bs,
>> -                           arg->has_replaces, arg->replaces, arg->sync,
>> +                           !!replaces_node_name, replaces_node_name, arg->sync,
>>                              backing_mode, arg->has_speed, arg->speed,
>>                              arg->has_granularity, arg->granularity,
>>                              arg->has_buf_size, arg->buf_size,
>> @@ -4053,7 +4076,7 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
>>                            bool has_auto_dismiss, bool auto_dismiss,
>>                            Error **errp)
>>   {
>> -    BlockDriverState *bs;
>> +    BlockDriverState *bs, *unfiltered_bs;
>>       BlockDriverState *target_bs;
>>       AioContext *aio_context;
>>       BlockMirrorBackingMode backing_mode = MIRROR_LEAVE_BACKING_CHAIN;
>> @@ -4065,6 +4088,16 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
>>           return;
>>       }
>>   
>> +    /*
>> +     * Same as in qmp_drive_mirror():
> 
> Then maybe it would be better to do it in blockdev_mirror_common?

Hm, maybe.  Should we decide to let @source be
bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs)) in qmp_drive_mirror(), I
don’t think we need @unfiltered_bs there to determine @source.

Max

>> We want to run the job from @bs,
>> +     * but we want to replace @unfiltered_bs on completion.
>> +     */
>> +    unfiltered_bs = bdrv_skip_implicit_filters(bs);
>> +    if (!has_replaces && unfiltered_bs != bs) {
>> +        replaces = unfiltered_bs->node_name;
>> +        has_replaces = true;
>> +    }
>> +
>>       target_bs = bdrv_lookup_bs(target, target, errp);
>>       if (!target_bs) {
>>           return;
>>
> 
> 




* Re: [Qemu-devel] [PATCH v6 27/42] commit: Deal with filters
  2019-08-31 10:44   ` Vladimir Sementsov-Ogievskiy
@ 2019-09-02 14:55     ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-02 14:55 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 31.08.19 12:44, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:13, Max Reitz wrote:
>> This includes some permission limiting (for example, we only need to
>> take the RESIZE permission if the base is smaller than the top).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/block-backend.c | 16 +++++---
>>   block/commit.c        | 96 +++++++++++++++++++++++++++++++------------
>>   blockdev.c            |  6 ++-
>>   3 files changed, 85 insertions(+), 33 deletions(-)
>>
>> diff --git a/block/block-backend.c b/block/block-backend.c
>> index c13c5c83b0..0bc592d023 100644
>> --- a/block/block-backend.c
>> +++ b/block/block-backend.c
>> @@ -2180,11 +2180,17 @@ int blk_commit_all(void)
>>           AioContext *aio_context = blk_get_aio_context(blk);
>>   
>>           aio_context_acquire(aio_context);
>> -        if (blk_is_inserted(blk) && blk->root->bs->backing) {
>> -            int ret = bdrv_commit(blk->root->bs);
>> -            if (ret < 0) {
>> -                aio_context_release(aio_context);
>> -                return ret;
>> +        if (blk_is_inserted(blk)) {
>> +            BlockDriverState *non_filter;
>> +
>> +            /* Legacy function, so skip implicit filters */
>> +            non_filter = bdrv_skip_implicit_filters(blk->root->bs);
>> +            if (bdrv_filtered_cow_child(non_filter)) {
>> +                int ret = bdrv_commit(non_filter);
>> +                if (ret < 0) {
>> +                    aio_context_release(aio_context);
>> +                    return ret;
>> +                }
>>               }
> 
> And if non_filter is an explicit filter, we just skip it. I think we'd better return
> an error in this case. For example, just drop the if (bdrv_filtered_cow_child) check
> and get ENOTSUP from bdrv_commit in this case.

Sounds good, yes.

> And with at least this fixed I'm OK with this patch:
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> However some comments below:
> 
>>           }
>>           aio_context_release(aio_context);
>> diff --git a/block/commit.c b/block/commit.c
>> index 5a7672c7c7..40d1c8eeac 100644
>> --- a/block/commit.c
>> +++ b/block/commit.c
>> @@ -37,6 +37,7 @@ typedef struct CommitBlockJob {
>>       BlockBackend *top;
>>       BlockBackend *base;
>>       BlockDriverState *base_bs;
>> +    BlockDriverState *above_base;
> 
> You called it base_overlay in mirror; it seems better to keep the same naming.

Indeed.

[...]

>> @@ -315,17 +336,43 @@ void commit_start(const char *job_id, BlockDriverState *bs,
>>   
>>       s->commit_top_bs = commit_top_bs;
>>   
>> -    /* Block all nodes between top and base, because they will
>> -     * disappear from the chain after this operation. */
>> -    assert(bdrv_chain_contains(top, base));
>> -    for (iter = top; iter != base; iter = backing_bs(iter)) {
>> -        /* XXX BLK_PERM_WRITE needs to be allowed so we don't block ourselves
>> -         * at s->base (if writes are blocked for a node, they are also blocked
>> -         * for its backing file). The other options would be a second filter
>> -         * driver above s->base. */
> 
> This part of the code is identical to the corresponding part in block/mirror.c. It would be
> great to put it into a function and reuse it. However, that is outside the scope of this series.

It would probably be great to just drop block/commit.c altogether and
fully merge it into block/mirror.c at some point.

(I suppose we’d just have to check whether there’s any parent who’s
taken the WRITE permission on the top node, and if so, emit READY (and
if not, skip to COMPLETED).)
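As a very rough sketch of that check (purely hypothetical, nothing like this exists in the series): the function name and the list-of-bitmasks input are made up for illustration, and the PERM_WRITE value is just an illustrative bit.

```python
# Illustrative permission bit; a stand-in for the block layer's WRITE permission.
PERM_WRITE = 0x02


def state_after_copy(parent_perms):
    """Decide what a merged commit/mirror job would do once the copy is done.

    parent_perms: one permission bitmask per parent of the top node.
    If any parent may write to the top node, the job has to stay in sync
    and wait for an explicit completion request (READY); otherwise it can
    finish right away (COMPLETED).
    """
    if any(perms & PERM_WRITE for perms in parent_perms):
        return 'READY'
    return 'COMPLETED'
```

With a writing parent (say a guest device) on the top node this yields 'READY', like an active commit today; with only readers or no parents at all it yields 'COMPLETED'.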

[...]

>> @@ -412,19 +457,22 @@ int bdrv_commit(BlockDriverState *bs)
>>       if (!drv)
>>           return -ENOMEDIUM;
>>   
>> -    if (!bs->backing) {
>> +    backing_file_bs = bdrv_filtered_cow_bs(bs);
> 
> Hmm, just a note: if in the future we have a COW child which is not bs->backing, a lot of code
> will fail, as we always assume that the COW child is bs->backing. Maybe this should be noted in
> a comment in the bdrv_filtered_cow_child implementation.

I couldn’t see why we’d ever do this.  I hope we never do.

(Aside from just removing bs->file and bs->backing altogether.)

Max



* Re: [Qemu-devel] [PATCH v6 39/42] iotests: Add filter commit test cases
  2019-08-31 11:41   ` Vladimir Sementsov-Ogievskiy
@ 2019-09-02 15:06     ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-02 15:06 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 31.08.19 13:41, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:14, Max Reitz wrote:
>> This patch adds some tests on how commit copes with filter nodes.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   tests/qemu-iotests/040     | 177 +++++++++++++++++++++++++++++++++++++
>>   tests/qemu-iotests/040.out |   4 +-
>>   2 files changed, 179 insertions(+), 2 deletions(-)
>>
>> diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
>> index 6db9abf8e6..a0a0db8889 100755
>> --- a/tests/qemu-iotests/040
>> +++ b/tests/qemu-iotests/040

[...]

>> +    def tearDown(self):
>> +        self.vm.shutdown(has_quit=self.has_quit)
>> +
>> +        for index in range(len(self.pattern_files)):
> 
> you may use enumerate for such cases:
> for ind, file in enumerate(self.pattern_files):
>     ...

Ah, nice.
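Something like the following, I suppose.  This is only a sketch: pattern_read_commands() is a hypothetical helper that just builds the qemu-io command strings, so it can be looked at without a running VM or any image files.

```python
def pattern_read_commands(pattern_files):
    """Return (filename, qemu-io command) pairs for verifying the patterns.

    Pattern number index + 1 is expected in the 1M area starting at
    offset index M, in whatever file pattern_files[index] names.
    """
    commands = []
    for index, filename in enumerate(pattern_files):
        command = 'read -P %i %iM 1M' % (index + 1, index)
        commands.append((filename, command))
    return commands
```

tearDown() would then pass each command to qemu_io() together with the corresponding filename, instead of indexing self.pattern_files by hand.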

>> +            result = qemu_io('-f', iotests.imgfmt,
>> +                             '-c', 'read -P %i %iM 1M' % (index + 1, index),
>> +                             self.pattern_files[index])
>> +            self.assertFalse('Pattern verification failed' in result)
> 
> It would be a bit better to keep this loop in a function and do the "writes" through it too,
> to make it more obvious that they are the same. But I'm OK with it as is.

Hm, yes.  I’ll have a look.

>> +
>> +        os.remove(self.img3)
>> +        os.remove(self.img2)
>> +        os.remove(self.img1)
>> +        os.remove(self.img0)
>> +
>> +    # Filters make for funny filenames, so we cannot just use
>> +    # self.imgX to get them
>> +    def get_filename(self, node):
>> +        return self.vm.node_info(node)['image']['filename']
>> +
> 
> maybe:
> def assertHasNode(self, node_name):
>    self.assertIsNotNone(self.vm.node_info(node_name))
> 
> and similar for assertNoNode...

Hm, I don’t know.  It fits on one line either way.

>> +    def test_filterless_commit(self):
>> +        self.assert_no_active_block_jobs()
> 
> Why not just include this call in setUp()? Or even just drop it?
> We create and start a new VM in setUp, so it certainly doesn't have any block jobs.

Other tests do it the same way, e.g. 030, 040, and 041.

[...]

>> +        self.assertIsNone(self.vm.node_info('top-filter'))
>> +        self.assertIsNone(self.vm.node_info('cow-3'))
>> +        self.assertIsNotNone(self.vm.node_info('cow-2'))
> 
> It would be good to assert here that cow-2 became drv0's child. However, if it
> had not, it would have been dropped automatically, so it's not necessary.

Yep, like cow-3.  I’ll look into it anyway.

>> +
>> +        # 3 has been committed into 2
>> +        self.pattern_files[3] = self.img2
>> +
>> +    def test_filtered_active_commit_without_filter(self):
>> +        self.assert_no_active_block_jobs()
>> +        result = self.vm.qmp('block-commit',
>> +                             job_id='commit',
>> +                             device='top-filter',
>> +                             top_node='cow-3',
>> +                             base_node='cow-2')
>> +        self.assert_qmp(result, 'return', {})
> 
> Can we check that an "active" commit is really started, i.e. a mirror block job?

We do:

>> +        self.complete_and_wait(drive='commit')

wait_ready is True by default, so this will first wait for a READY
event.  That only happens for active commit.

Max


* Re: [Qemu-devel] [PATCH v6 39/42] iotests: Add filter commit test cases
  2019-08-31 12:35   ` Vladimir Sementsov-Ogievskiy
@ 2019-09-02 15:09     ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-02 15:09 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel

On 31.08.19 14:35, Vladimir Sementsov-Ogievskiy wrote:
> 09.08.2019 19:14, Max Reitz wrote:
>> This patch adds some tests on how commit copes with filter nodes.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   tests/qemu-iotests/040     | 177 +++++++++++++++++++++++++++++++++++++
>>   tests/qemu-iotests/040.out |   4 +-
>>   2 files changed, 179 insertions(+), 2 deletions(-)
>>
>> diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
>> index 6db9abf8e6..a0a0db8889 100755
>> --- a/tests/qemu-iotests/040
>> +++ b/tests/qemu-iotests/040
>> @@ -428,5 +428,182 @@ class TestReopenOverlay(ImageCommitTestCase):
>>       def test_reopen_overlay(self):
>>           self.run_commit_test(self.img1, self.img0)
>>   
>> +class TestCommitWithFilters(iotests.QMPTestCase):
>> +    img0 = os.path.join(iotests.test_dir, '0.img')
>> +    img1 = os.path.join(iotests.test_dir, '1.img')
>> +    img2 = os.path.join(iotests.test_dir, '2.img')
>> +    img3 = os.path.join(iotests.test_dir, '3.img')
>> +
>> +    def setUp(self):
>> +        qemu_img('create', '-f', iotests.imgfmt, self.img0, '64M')
>> +        qemu_img('create', '-f', iotests.imgfmt, self.img1, '64M')
>> +        qemu_img('create', '-f', iotests.imgfmt, self.img2, '64M')
>> +        qemu_img('create', '-f', iotests.imgfmt, self.img3, '64M')
>> +
>> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 1 0M 1M', self.img0)
>> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 2 1M 1M', self.img1)
>> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 3 2M 1M', self.img2)
>> +        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 4 3M 1M', self.img3)
>> +
>> +        # Distributions of the patterns in the files; this is checked
>> +        # by tearDown() and should be changed by the test cases as is
>> +        # necessary
>> +        self.pattern_files = [self.img0, self.img1, self.img2, self.img3]
>> +
>> +        self.vm = iotests.VM()
>> +        self.vm.launch()
>> +        self.has_quit = False
> 
> has_quit is unused actually. It's always False.

True. (:-))  I wonder why I added it.

Max


* Re: [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters
  2019-09-02 14:35     ` Max Reitz
@ 2019-09-03  8:32       ` Vladimir Sementsov-Ogievskiy
  2019-09-09  7:41         ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-09-03  8:32 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

02.09.2019 17:35, Max Reitz wrote:
> On 31.08.19 11:57, Vladimir Sementsov-Ogievskiy wrote:
>> 09.08.2019 19:13, Max Reitz wrote:
>>> This includes some permission limiting (for example, we only need to
>>> take the RESIZE permission for active commits where the base is smaller
>>> than the top).
>>>
>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>    block/mirror.c | 117 ++++++++++++++++++++++++++++++++++++++-----------
>>>    blockdev.c     |  47 +++++++++++++++++---
>>>    2 files changed, 131 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/block/mirror.c b/block/mirror.c
>>> index 54bafdf176..6ddbfb9708 100644
>>> --- a/block/mirror.c
>>> +++ b/block/mirror.c
>>> @@ -42,6 +42,7 @@ typedef struct MirrorBlockJob {
>>>        BlockBackend *target;
>>>        BlockDriverState *mirror_top_bs;
>>>        BlockDriverState *base;
>>> +    BlockDriverState *base_overlay;
>>>    
>>>        /* The name of the graph node to replace */
>>>        char *replaces;
>>> @@ -665,8 +666,10 @@ static int mirror_exit_common(Job *job)
>>>                                 &error_abort);
>>>        if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
>>>            BlockDriverState *backing = s->is_none_mode ? src : s->base;
>>> -        if (backing_bs(target_bs) != backing) {
>>> -            bdrv_set_backing_hd(target_bs, backing, &local_err);
>>> +        BlockDriverState *unfiltered_target = bdrv_skip_rw_filters(target_bs);
>>> +
>>> +        if (bdrv_filtered_cow_bs(unfiltered_target) != backing) {
>>> +            bdrv_set_backing_hd(unfiltered_target, backing, &local_err);
>>>                if (local_err) {
>>>                    error_report_err(local_err);
>>>                    ret = -EPERM;
>>> @@ -715,7 +718,7 @@ static int mirror_exit_common(Job *job)
>>>         * valid.
>>>         */
>>>        block_job_remove_all_bdrv(bjob);
>>> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
>>> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>>>    
>>>        /* We just changed the BDS the job BB refers to (with either or both of the
>>>         * bdrv_replace_node() calls), so switch the BB back so the cleanup does
>>> @@ -812,7 +815,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
>>>                return 0;
>>>            }
>>>    
>>> -        ret = bdrv_is_allocated_above(bs, base, false, offset, bytes, &count);
>>> +        ret = bdrv_is_allocated_above(bs, s->base_overlay, true, offset, bytes,
>>> +                                      &count);
>>>            if (ret < 0) {
>>>                return ret;
>>>            }
>>> @@ -908,7 +912,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
>>>        } else {
>>>            s->target_cluster_size = BDRV_SECTOR_SIZE;
>>>        }
>>> -    if (backing_filename[0] && !target_bs->backing &&
>>> +    if (backing_filename[0] && !bdrv_backing_chain_next(target_bs) &&
>>>            s->granularity < s->target_cluster_size) {
>>>            s->buf_size = MAX(s->buf_size, s->target_cluster_size);
>>>            s->cow_bitmap = bitmap_new(length);
>>> @@ -1088,8 +1092,9 @@ static void mirror_complete(Job *job, Error **errp)
>>>        if (s->backing_mode == MIRROR_OPEN_BACKING_CHAIN) {
>>>            int ret;
>>>    
>>> -        assert(!target->backing);
>>> -        ret = bdrv_open_backing_file(target, NULL, "backing", errp);
>>> +        assert(!bdrv_backing_chain_next(target));
>>
>> Preexisting, but it seems we may crash here: I don't see where this is checked beforehand
>> so that an error is returned if there is some backing. And even if we did check, we don't
>> prevent a target backing from appearing during the mirror operation.
> 
> The idea is that MIRROR_OPEN_BACKING_CHAIN is set only when using
> drive-mirror with mode=existing.  In this case, we also set
> BDRV_O_NO_BACKING for the target.
> 
> You’re right that a user could add a backing chain to the target during
> the operation.  They really have to make an effort to shoot themselves
> in the foot for this because the target must have an auto-generated node
> name.
> 
> I suppose the best would be not to open the backing chain if the target
> node already has a backing child?

Hmm, but we still should generate an error, as we can't do what was requested.

> 
>>> +        ret = bdrv_open_backing_file(bdrv_skip_rw_filters(target), NULL,
>>> +                                     "backing", errp);
>>>            if (ret < 0) {
>>>                return;
>>>            }
>>> @@ -1531,8 +1536,8 @@ static BlockJob *mirror_start_job(
>>>        MirrorBlockJob *s;
>>>        MirrorBDSOpaque *bs_opaque;
>>>        BlockDriverState *mirror_top_bs;
>>> -    bool target_graph_mod;
>>>        bool target_is_backing;
>>> +    uint64_t target_perms, target_shared_perms;
>>>        Error *local_err = NULL;
>>>        int ret;
>>>    
>>> @@ -1551,7 +1556,7 @@ static BlockJob *mirror_start_job(
>>>            buf_size = DEFAULT_MIRROR_BUF_SIZE;
>>>        }
>>>    
>>> -    if (bs == target) {
>>> +    if (bdrv_skip_rw_filters(bs) == bdrv_skip_rw_filters(target)) {
>>>            error_setg(errp, "Can't mirror node into itself");
>>>            return NULL;
>>>        }
>>> @@ -1615,15 +1620,50 @@ static BlockJob *mirror_start_job(
>>>         * In the case of active commit, things look a bit different, though,
>>>         * because the target is an already populated backing file in active use.
>>>         * We can allow anything except resize there.*/
>>> +
>>> +    target_perms = BLK_PERM_WRITE;
>>> +    target_shared_perms = BLK_PERM_WRITE_UNCHANGED;
>>> +
>>>        target_is_backing = bdrv_chain_contains(bs, target);
>>> -    target_graph_mod = (backing_mode != MIRROR_LEAVE_BACKING_CHAIN);
>>> +    if (target_is_backing) {
>>> +        int64_t bs_size, target_size;
>>
>> <empty after definitions>
> 
> Is that part of any of our guidelines? :-)
> 
> Sure, will add.

Not sure. Someone asked me for it on the list in the past, and I've gotten used to it.

> 
>>> +        bs_size = bdrv_getlength(bs);
>>> +        if (bs_size < 0) {
>>> +            error_setg_errno(errp, -bs_size,
>>> +                             "Could not inquire top image size");
>>> +            goto fail;
>>> +        }
>>> +
>>> +        target_size = bdrv_getlength(target);
>>> +        if (target_size < 0) {
>>> +            error_setg_errno(errp, -target_size,
>>> +                             "Could not inquire base image size");
>>> +            goto fail;
>>> +        }
>>> +
>>> +        if (target_size < bs_size) {
>>> +            target_perms |= BLK_PERM_RESIZE;
>>> +        }
>>> +
>>> +        target_shared_perms |= BLK_PERM_CONSISTENT_READ
>>> +                            |  BLK_PERM_WRITE
>>> +                            |  BLK_PERM_GRAPH_MOD;
>>> +    } else if (bdrv_chain_contains(bs, bdrv_skip_rw_filters(target))) {
>>> +        /*
>>> +         * We may want to allow this in the future, but it would
>>> +         * require taking some extra care.
>>> +         */
>>> +        error_setg(errp, "Cannot mirror to a filter on top of a node in the "
>>> +                   "source's backing chain");
>>> +        goto fail;
>>> +    }
>>> +
>>> +    if (backing_mode != MIRROR_LEAVE_BACKING_CHAIN) {
>>> +        target_perms |= BLK_PERM_GRAPH_MOD;
>>> +    }
>>> +
>>>        s->target = blk_new(s->common.job.aio_context,
>>> -                        BLK_PERM_WRITE | BLK_PERM_RESIZE |
>>> -                        (target_graph_mod ? BLK_PERM_GRAPH_MOD : 0),
>>> -                        BLK_PERM_WRITE_UNCHANGED |
>>> -                        (target_is_backing ? BLK_PERM_CONSISTENT_READ |
>>> -                                             BLK_PERM_WRITE |
>>> -                                             BLK_PERM_GRAPH_MOD : 0));
>>> +                        target_perms, target_shared_perms);
>>>        ret = blk_insert_bs(s->target, target, errp);
>>>        if (ret < 0) {
>>>            goto fail;
>>> @@ -1647,6 +1687,7 @@ static BlockJob *mirror_start_job(
>>>        s->backing_mode = backing_mode;
>>>        s->copy_mode = copy_mode;
>>>        s->base = base;
>>> +    s->base_overlay = bdrv_find_overlay(bs, base);
>>>        s->granularity = granularity;
>>>        s->buf_size = ROUND_UP(buf_size, granularity);
>>>        s->unmap = unmap;
>>> @@ -1693,15 +1734,39 @@ static BlockJob *mirror_start_job(
>>>        /* In commit_active_start() all intermediate nodes disappear, so
>>>         * any jobs in them must be blocked */
>>>        if (target_is_backing) {
>>> -        BlockDriverState *iter;
>>> -        for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) {
>>> -            /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
>>> -             * ourselves at s->base (if writes are blocked for a node, they are
>>> -             * also blocked for its backing file). The other options would be a
>>> -             * second filter driver above s->base (== target). */
>>> +        BlockDriverState *iter, *filtered_target;
>>> +        uint64_t iter_shared_perms;
>>> +
>>> +        /*
>>> +         * The topmost node with
>>> +         * bdrv_skip_rw_filters(filtered_target) == bdrv_skip_rw_filters(target)
>>> +         */
>>> +        filtered_target = bdrv_filtered_cow_bs(bdrv_find_overlay(bs, target));
>>> +
>>> +        assert(bdrv_skip_rw_filters(filtered_target) ==
>>> +               bdrv_skip_rw_filters(target));
>>> +
>>> +        /*
>>> +         * XXX BLK_PERM_WRITE needs to be allowed so we don't block
>>> +         * ourselves at s->base (if writes are blocked for a node, they are
>>> +         * also blocked for its backing file). The other options would be a
>>> +         * second filter driver above s->base (== target).
>>> +         */
>>> +        iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
>>> +
>>> +        for (iter = bdrv_filtered_bs(bs); iter != target;
>>> +             iter = bdrv_filtered_bs(iter))
>>> +        {
>>> +            if (iter == filtered_target) {
>>> +                /*
>>> +                 * From here on, all nodes are filters on the base.
>>> +                 * This allows us to share BLK_PERM_CONSISTENT_READ.
>>
>> I'd prefer to add something like: "because we share it on target (see target BlockBackend creation
>> and corresponding comment above)".
> 
> I’d rather not refer to other comments in case they change…  Maybe just
> “This allows us to share BLK_PERM_CONSISTENT_READ, as we do on the
> target.”?  I think if someone is interested, they will scan the file for
> what permissions are shared on the target anyway.

OK. Yes, I just wanted to stress that we simply duplicate the target's behavior here, as that
helped me to understand.
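
The target permissions referred to here can be restated as a small
model.  This is a sketch only: the flag values are placeholders rather
than QEMU's definitions, mirror_target_perms is a hypothetical name, and
the error paths of the quoted C hunk are omitted.

```python
from enum import IntFlag


class Perm(IntFlag):
    # Placeholder bit values for illustration only
    CONSISTENT_READ = 1
    WRITE = 2
    WRITE_UNCHANGED = 4
    RESIZE = 8
    GRAPH_MOD = 16


def mirror_target_perms(target_is_backing, bs_size, target_size,
                        leave_backing_chain):
    """Return (perms, shared_perms) taken on the mirror target."""
    perms = Perm.WRITE
    shared = Perm.WRITE_UNCHANGED

    if target_is_backing:
        # Active commit: only resize when the base is smaller than the top
        if target_size < bs_size:
            perms |= Perm.RESIZE
        shared |= Perm.CONSISTENT_READ | Perm.WRITE | Perm.GRAPH_MOD

    if not leave_backing_chain:
        perms |= Perm.GRAPH_MOD

    return perms, shared


commit_perms, commit_shared = mirror_target_perms(True, 64, 32, True)
mirror_perms, mirror_shared = mirror_target_perms(False, 64, 64, False)
```

In the active-commit case, CONSISTENT_READ ends up in the shared set on
the target, which is the behaviour the comment under discussion mirrors
for the filters on top of the base.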

> 
>>> +                 */
>>> +                iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
>>> +            }
>>> +
>>>                ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
>>> -                                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
>>> -                                     errp);
>>> +                                     iter_shared_perms, errp);
>>>                if (ret < 0) {
>>>                    goto fail;
>>>                }
>>> @@ -1737,7 +1802,7 @@ fail:
>>>        bs_opaque->stop = true;
>>>        bdrv_child_refresh_perms(mirror_top_bs, mirror_top_bs->backing,
>>>                                 &error_abort);
>>> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
>>> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>>>    
>>>        bdrv_unref(mirror_top_bs);
>>>    
>>> @@ -1764,7 +1829,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
>>>            return;
>>>        }
>>>        is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
>>> -    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
>>> +    base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
>>>        mirror_start_job(job_id, bs, creation_flags, target, replaces,
>>>                         speed, granularity, buf_size, backing_mode,
>>>                         on_source_error, on_target_error, unmap, NULL, NULL,
>>> diff --git a/blockdev.c b/blockdev.c
>>> index c540802127..c451f553f7 100644
>>
>>
>> block/mirror.c is OK for me. Continue with blockdev.c...
>>
>>> --- a/blockdev.c
>>> +++ b/blockdev.c
>>> @@ -3851,7 +3851,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>>            return;
>>>        }
>>>    
>>> -    if (!bs->backing && sync == MIRROR_SYNC_MODE_TOP) {
>>> +    if (!bdrv_backing_chain_next(bs) && sync == MIRROR_SYNC_MODE_TOP) {
>>>            sync = MIRROR_SYNC_MODE_FULL;
>>>        }
>>>    
>>> @@ -3900,7 +3900,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>>    
>>>    void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>>    {
>>> -    BlockDriverState *bs;
>>> +    BlockDriverState *bs, *unfiltered_bs;
>>>        BlockDriverState *source, *target_bs;
>>>        AioContext *aio_context;
>>>        BlockMirrorBackingMode backing_mode;
>>> @@ -3909,6 +3909,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>>        int flags;
>>>        int64_t size;
>>>        const char *format = arg->format;
>>> +    const char *replaces_node_name = NULL;
>>>        int ret;
>>>    
>>>        bs = qmp_get_root_bs(arg->device, errp);
>>> @@ -3921,6 +3922,16 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>>            return;
>>>        }
>>>    
>>> +    /*
>>> +     * If the user has not instructed us otherwise, we should let the
>>> +     * block job run from @bs (thus taking into account all filters on
>>> +     * it) but replace @unfiltered_bs when it finishes (thus not
>>> +     * removing those filters).
>>> +     * (And if there are any explicit filters, we should assume the
>>> +     *  user knows how to use the @replaces option.)
>>> +     */
>>> +    unfiltered_bs = bdrv_skip_implicit_filters(bs);
>>> +
>>>        aio_context = bdrv_get_aio_context(bs);
>>>        aio_context_acquire(aio_context);
>>>    
>>> @@ -3934,8 +3945,14 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>>        }
>>>    
>>>        flags = bs->open_flags | BDRV_O_RDWR;
>>> -    source = backing_bs(bs);
>>> +    source = bdrv_filtered_cow_bs(unfiltered_bs);
>>>        if (!source && arg->sync == MIRROR_SYNC_MODE_TOP) {
>>
>>
>> Hmm, you handle this case a bit differently here and in blockdev_mirror_common..
>> Can we handle it only in blockdev_mirror_common, to be consistent with qmp_blockdev_mirror?
> 
> What exactly do you mean?  The difference between skipping all filters
> and just skipping implicit filters?  Hm.
> 
> First, the check in blockdev_mirror_common() should actually be
> unnecessary.  In qmp_{blockdev,drive}_mirror(), we do nearly the same
> check anyway (and then force sync=full if there is no backing file).

Hmm, I see it only in qmp_drive_mirror, not in the _blockdev_ one.

>  So
> if all three functions did the same check, we wouldn’t need it in
> blockdev_mirror_common().

And if that were so, it would be better to have one check in _common than two duplicated ones.

> 
> Second, let’s look at the difference in an example: One where
> blockdev_mirror_common() would not decide to enforce mode=full, but
> qmp_{blockdev,drive}_mirror() would.
> This happens when @bs is an explicit filter over some overlay with a
> backing file, e.g.:
> 
> throttle --file--> qcow2 --backing--> raw
> 
> It’s correct to run the mirror job from the throttle node; but @source
> should be bdrv_backing_chain_next() so it will point to the raw node.
> Currently, it is NULL (because the throttle node does not have a COW child).
> 
> But then again, I’ve made qmp_{blockdev,drive}_mirror() throw an error
> in such a case:
> 
>>> +        if (bdrv_filtered_bs(unfiltered_bs)) {
>>> +            /* @unfiltered_bs is an explicit filter */
>>> +            error_setg(errp, "Cannot perform sync=top mirror through an "
>>> +                       "explicitly added filter node on the source");
>>> +            goto out;
>>> +        }
> 
> So it isn’t really a problem.  Still, does the error make sense?  Should
> we just allow that case by letting source be
> bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs))?

Looks good. As I understand it, you have the test (40/42) for this case, and it
works for both the _blockdev_ and the _drive_ version of the command. Of course,
they had better behave in the same manner.
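
The throttle example quoted above can be modelled with a toy graph.
This is a hedged sketch: the Python functions borrow the helper names
used in this thread, but their exact semantics here are an assumption,
and source_v6 / source_wanted are illustrative names only.

```python
class Node:
    """Toy stand-in for a BlockDriverState."""

    def __init__(self, name, is_filter=False, implicit=False,
                 backing=None, file=None):
        self.name = name
        self.is_filter = is_filter
        self.implicit = implicit
        self.backing = backing
        self.file = file


def filtered_cow_bs(bs):
    # COW child: the backing node of a non-filter
    if bs is None or bs.is_filter:
        return None
    return bs.backing


def filtered_rw_bs(bs):
    # R/W-filtered child: the single child of a filter node
    if bs is None or not bs.is_filter:
        return None
    return bs.backing or bs.file


def skip_implicit_filters(bs):
    while bs is not None and bs.implicit:
        bs = filtered_rw_bs(bs)
    return bs


def skip_rw_filters(bs):
    while filtered_rw_bs(bs) is not None:
        bs = filtered_rw_bs(bs)
    return bs


def backing_chain_next(bs):
    return skip_rw_filters(filtered_cow_bs(skip_rw_filters(bs)))


# throttle --file--> qcow2 --backing--> raw
raw = Node('raw')
qcow2 = Node('qcow2', backing=raw)
throttle = Node('throttle', is_filter=True, file=qcow2)

unfiltered_bs = skip_implicit_filters(throttle)  # throttle is explicit
source_v6 = filtered_cow_bs(unfiltered_bs)       # None: no COW child
source_wanted = backing_chain_next(throttle)     # the raw node
```

With an explicit filter on top, skipping only implicit filters leaves
the throttle node in place, so the COW lookup finds nothing, whereas
skipping all R/W filters first reaches the raw node.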

> 
> (BTW, I just noticed that @base seems to be pretty much unused in
> block/mirror.c.  It only really uses @base_overlay now.  So I suppose it
> makes sense to remove it in v7.)
> 
>>>            arg->sync = MIRROR_SYNC_MODE_FULL;
>>>        }
>>>        if (arg->sync == MIRROR_SYNC_MODE_NONE) {
>>> @@ -3954,6 +3971,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>>                                 " named node of the graph");
>>>                goto out;
>>>            }
>>> +        replaces_node_name = arg->replaces;
>>> +    } else if (unfiltered_bs != bs) {
>>> +        replaces_node_name = unfiltered_bs->node_name;
>>>        }
>>>    
>>>        if (arg->mode == NEW_IMAGE_MODE_ABSOLUTE_PATHS) {
>>> @@ -3973,6 +3993,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>>            bdrv_img_create(arg->target, format,
>>>                            NULL, NULL, NULL, size, flags, false, &local_err);
>>>        } else {
>>> +        /* Implicit filters should not appear in the filename */
>>> +        BlockDriverState *explicit_backing = bdrv_skip_implicit_filters(source);
>>> +
>>>            switch (arg->mode) {
>>>            case NEW_IMAGE_MODE_EXISTING:
>>>                break;
>>> @@ -3980,8 +4003,8 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>>                /* create new image with backing file */
>>>                bdrv_refresh_filename(source);
>>>                bdrv_img_create(arg->target, format,
>>> -                            source->filename,
>>> -                            source->drv->format_name,
>>> +                            explicit_backing->filename,
>>> +                            explicit_backing->drv->format_name,
>>>                                NULL, size, flags, false, &local_err);
>>>                break;
>>>            default:
>>> @@ -4017,7 +4040,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>>        }
>>>    
>>>        blockdev_mirror_common(arg->has_job_id ? arg->job_id : NULL, bs, target_bs,
>>> -                           arg->has_replaces, arg->replaces, arg->sync,
>>> +                           !!replaces_node_name, replaces_node_name, arg->sync,
>>>                               backing_mode, arg->has_speed, arg->speed,
>>>                               arg->has_granularity, arg->granularity,
>>>                               arg->has_buf_size, arg->buf_size,
>>> @@ -4053,7 +4076,7 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
>>>                             bool has_auto_dismiss, bool auto_dismiss,
>>>                             Error **errp)
>>>    {
>>> -    BlockDriverState *bs;
>>> +    BlockDriverState *bs, *unfiltered_bs;
>>>        BlockDriverState *target_bs;
>>>        AioContext *aio_context;
>>>        BlockMirrorBackingMode backing_mode = MIRROR_LEAVE_BACKING_CHAIN;
>>> @@ -4065,6 +4088,16 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
>>>            return;
>>>        }
>>>    
>>> +    /*
>>> +     * Same as in qmp_drive_mirror():
>>
>> Then maybe it would be better to do it in blockdev_mirror_common?
> 
> Hm, maybe.  Should we decide to let @source be
> bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs)) in qmp_drive_mirror(), I
> don’t think we need @unfiltered_bs there to determine @source.
> 
> Max
> 
>>> We want to run the job from @bs,
>>> +     * but we want to replace @unfiltered_bs on completion.
>>> +     */
>>> +    unfiltered_bs = bdrv_skip_implicit_filters(bs);
>>> +    if (!has_replaces && unfiltered_bs != bs) {
>>> +        replaces = unfiltered_bs->node_name;
>>> +        has_replaces = true;
>>> +    }
>>> +
>>>        target_bs = bdrv_lookup_bs(target, target, errp);
>>>        if (!target_bs) {
>>>            return;
>>>
>>
>>
> 
> 


-- 
Best regards,
Vladimir

* Re: [Qemu-devel] [PATCH v6 42/42] iotests: Test committing to overridden backing
  2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 42/42] iotests: Test committing to overridden backing Max Reitz
@ 2019-09-03  9:18   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 132+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-09-03  9:18 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: Kevin Wolf, qemu-devel

09.08.2019 19:14, Max Reitz wrote:
> Signed-off-by: Max Reitz<mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir

* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 04/42] block: Add child access functions Max Reitz
  2019-08-09 16:56   ` Eric Blake
@ 2019-09-04 16:16   ` Kevin Wolf
  2019-09-09  7:56     ` Max Reitz
  1 sibling, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-04 16:16 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
> There are BDS children that the general block layer code can access,
> namely bs->file and bs->backing.  Since the introduction of filters and
> external data files, their meaning is not quite clear.  bs->backing can
> be a COW source, or it can be an R/W-filtered child; bs->file can be an
> R/W-filtered child, it can be data and metadata storage, or it can be
> just metadata storage.
> 
> This overloading really is not helpful.  This patch adds functions that
> retrieve the correct child for each exact purpose.  Later patches in
> this series will make use of them.  Doing so will allow us to handle
> filter nodes and external data files in a meaningful way.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Each time I look at this patch, I'm confused by the function names.
Maybe I should just ask what the idea there was, or more specifically:
What does the "filtered" in "filtered child" really mean?

Apparently any child of a filter node is "filtered" (which makes sense),
but also bs->backing of a qcow2 image, while bs->file of qcow2 isn't.
raw doesn't have any "filtered" child. What's the system behind this?

It looks like bdrv_filtered_child() is the right function to iterate
along a backing file chain, but I just still fail to connect that and
the name of the function in a meaningful way.

> +/*
> + * Return the child that @bs acts as an overlay for, and from which data may be
> + * copied in COW or COR operations.  Usually this is the backing file.
> + */

Or NULL, if no such child exists.

It's relatively obvious here, but for some of the functions further down
it would be really good to describe in which cases NULL is expected (or
that NULL is even a possible return value).
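
To make the three cases above concrete, here is a toy classification.
It encodes only one possible reading of "filtered" (an assumption, not
the patch's definition): a filter presents its sole child, a COW format
presents its backing child, and bs->file of a non-filter is mere
storage.  None plays the role of NULL.

```python
from collections import namedtuple

# Toy node: driver name, filter flag, and the two generic child links
Node = namedtuple('Node', 'drv is_filter backing file')


def filtered_child(bs):
    """Return the "filtered" child of bs under the assumed reading, or None."""
    if bs.is_filter:
        # R/W-filtered: whichever link the filter driver happens to use
        return bs.backing or bs.file
    # COW-filtered: only the backing link counts, never bs.file
    return bs.backing


raw_storage = Node('file', False, None, None)
raw = Node('raw', False, None, raw_storage)
qcow2_storage = Node('file', False, None, None)
qcow2 = Node('qcow2', False, raw, qcow2_storage)
throttle = Node('throttle', True, None, qcow2)
```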

Kevin


* Re: [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain Max Reitz
  2019-08-10 13:32   ` Vladimir Sementsov-Ogievskiy
@ 2019-09-05 13:05   ` Kevin Wolf
  2019-09-09  8:02     ` Max Reitz
  1 sibling, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-05 13:05 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
> In order to make filters work in backing chains, the associated
> functions must be able to deal with them and freeze all filter links, be
> they COW or R/W filter links.
> 
> In the process, rename these functions to reflect that they now act on
> generalized chains of filter nodes instead of backing chains alone.

I don't think this is a good idea. The functions are still following the
backing chain. A generic "chain" could mean following the bs->file links
or any other children, so the new name is confusing because it doesn't
really tell you any more what the function does. I'd prefer the name to
stay specific.

> While at it, add some comments that note which functions require their
> caller to ensure that a given child link is not frozen, and how the
> callers do so.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Kevin


* Re: [Qemu-devel] [PATCH v6 11/42] block: Add bdrv_supports_compressed_writes()
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 11/42] block: Add bdrv_supports_compressed_writes() Max Reitz
@ 2019-09-05 13:11   ` Kevin Wolf
  2019-09-09  8:09     ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-05 13:11 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
> Filters cannot compress data themselves but they have to implement
> .bdrv_co_pwritev_compressed() still (or they cannot forward compressed
> writes).  Therefore, checking whether
> bs->drv->bdrv_co_pwritev_compressed is non-NULL is not sufficient to
> know whether the node can actually handle compressed writes.  This
> function looks down the filter chain to see whether there is a
> non-filter that can actually convert the compressed writes into
> compressed data (and thus normal writes).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Should patches 2 and 3 that add the .bdrv_co_pwritev_compressed()
callback to filter drivers come only after this one? And should we also
support it in the mirror filter?
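
The check described in the commit message can be sketched on a toy
model.  The Node class and its can_compress flag (standing in for a
non-NULL .bdrv_co_pwritev_compressed callback) are assumptions made for
illustration; this is not the function from the patch.

```python
class Node:
    def __init__(self, name, is_filter, can_compress, child=None):
        self.name = name
        self.is_filter = is_filter
        # Models "drv->bdrv_co_pwritev_compressed is non-NULL"
        self.can_compress = can_compress
        self.child = child  # the filtered child, only used for filters


def supports_compressed_writes(bs):
    """Walk down the filter chain to the first non-filter node."""
    while bs is not None:
        if not bs.can_compress:
            # A node lacking the callback can neither forward nor compress
            return False
        if not bs.is_filter:
            # A non-filter with the callback does the actual compression
            return True
        bs = bs.child
    return False


qcow2 = Node('qcow2', False, True)
raw = Node('raw', False, False)
over_qcow2 = Node('throttle', True, True, qcow2)
over_raw = Node('throttle', True, True, raw)
no_forward = Node('filter-without-callback', True, False, qcow2)
```

The over_raw case is the one the commit message is about: the filter
has the callback, yet nothing below it can compress.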

Kevin


* Re: [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains Max Reitz
  2019-08-10 15:19   ` Vladimir Sementsov-Ogievskiy
@ 2019-09-05 14:05   ` Kevin Wolf
  2019-09-09  8:25     ` Max Reitz
  1 sibling, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-05 14:05 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
> Use child access functions when iterating through backing chains so
> filters do not break the chain.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c | 40 ++++++++++++++++++++++++++++------------
>  1 file changed, 28 insertions(+), 12 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 86b84bea21..42abbaf0ba 100644
> --- a/block.c
> +++ b/block.c
> @@ -4376,7 +4376,8 @@ int bdrv_change_backing_file(BlockDriverState *bs,
>  }
>  
>  /*
> - * Finds the image layer in the chain that has 'bs' as its backing file.
> + * Finds the image layer in the chain that has 'bs' (or a filter on
> + * top of it) as its backing file.
>   *
>   * active is the current topmost image.
>   *
> @@ -4388,11 +4389,18 @@ int bdrv_change_backing_file(BlockDriverState *bs,
>  BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
>                                      BlockDriverState *bs)
>  {
> -    while (active && bs != backing_bs(active)) {
> -        active = backing_bs(active);
> +    bs = bdrv_skip_rw_filters(bs);
> +    active = bdrv_skip_rw_filters(active);

This does more than the commit message says. In addition to iterating
through filters instead of stopping, it also changes the semantics of
the function to return the next non-filter on top of bs instead of the
next node.

The block jobs seem to use it only for bdrv_is_allocated_above(), which
should return the same thing in both cases, so the behaviour stays the
same. qmp_block_commit() will check op blockers on a different node now,
which could be a fix or a bug, I can't tell offhand. Probably the
blocking doesn't really work anyway.

All of this should be mentioned in the commit message at least. Maybe
it's also worth splitting in two patches.

> +    while (active) {
> +        BlockDriverState *next = bdrv_backing_chain_next(active);
> +        if (bs == next) {
> +            return active;
> +        }
> +        active = next;
>      }
>  
> -    return active;
> +    return NULL;
>  }
>  
>  /* Given a BDS, searches for the base layer. */
> @@ -4544,9 +4552,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
>       * other intermediate nodes have been dropped.
>       * If 'top' is an implicit node (e.g. "commit_top") we should skip
>       * it because no one inherits from it. We use explicit_top for that. */
> -    while (explicit_top && explicit_top->implicit) {
> -        explicit_top = backing_bs(explicit_top);
> -    }
> +    explicit_top = bdrv_skip_implicit_filters(explicit_top);
>      update_inherits_from = bdrv_inherits_from_recursive(base, explicit_top);
>  
>      /* success - we can delete the intermediate states, and link top->base */
> @@ -5014,7 +5020,7 @@ BlockDriverState *bdrv_lookup_bs(const char *device,
>  bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base)
>  {
>      while (top && top != base) {
> -        top = backing_bs(top);
> +        top = bdrv_filtered_bs(top);
>      }
>  
>      return top != NULL;
> @@ -5253,7 +5259,17 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
>  
>      is_protocol = path_has_protocol(backing_file);
>  
> -    for (curr_bs = bs; curr_bs->backing; curr_bs = curr_bs->backing->bs) {
> +    /*
> +     * Being largely a legacy function, skip any filters here
> +     * (because filters do not have normal filenames, so they cannot
> +     * match anyway; and allowing json:{} filenames is a bit out of
> +     * scope).
> +     */
> +    for (curr_bs = bdrv_skip_rw_filters(bs);
> +         bdrv_filtered_cow_child(curr_bs) != NULL;
> +         curr_bs = bdrv_backing_chain_next(curr_bs))

This could just use bs_below instead of recalculating the node if you
moved the declaration of bs_below to the function scope.

> +    {
> +        BlockDriverState *bs_below = bdrv_backing_chain_next(curr_bs);
>  
>          /* If either of the filename paths is actually a protocol, then
>           * compare unmodified paths; otherwise make paths relative */
> @@ -5261,7 +5277,7 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
>              char *backing_file_full_ret;
>  
>              if (strcmp(backing_file, curr_bs->backing_file) == 0) {
> -                retval = curr_bs->backing->bs;
> +                retval = bs_below;
>                  break;
>              }
>              /* Also check against the full backing filename for the image */
> @@ -5271,7 +5287,7 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
>                  bool equal = strcmp(backing_file, backing_file_full_ret) == 0;
>                  g_free(backing_file_full_ret);
>                  if (equal) {
> -                    retval = curr_bs->backing->bs;
> +                    retval = bs_below;
>                      break;
>                  }
>              }
> @@ -5297,7 +5313,7 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
>              g_free(filename_tmp);
>  
>              if (strcmp(backing_file_full, filename_full) == 0) {
> -                retval = curr_bs->backing->bs;
> +                retval = bs_below;
>                  break;
>              }
>          }

Kevin



* Re: [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code
  2019-08-12 12:58     ` Max Reitz
@ 2019-09-05 16:24       ` Kevin Wolf
  2019-09-09  8:31         ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-05 16:24 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

Am 12.08.2019 um 14:58 hat Max Reitz geschrieben:
> On 10.08.19 17:36, Vladimir Sementsov-Ogievskiy wrote:
> > 09.08.2019 19:13, Max Reitz wrote:
> >> If the driver does not support .bdrv_co_flush() so bdrv_co_flush()
> >> itself has to flush the children of the given node, it should not flush
> >> just bs->file->bs, but in fact all children.
> >>
> >> In any case, the BLKDBG_EVENT() should be emitted on the primary child,
> >> because that is where a blkdebug node would be if there is any.
> >>
> >> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >> ---
> >>   block/io.c | 23 +++++++++++++++++------
> >>   1 file changed, 17 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/block/io.c b/block/io.c
> >> index c5a8e3e6a3..bcc770d336 100644
> >> --- a/block/io.c
> >> +++ b/block/io.c
> >> @@ -2572,6 +2572,8 @@ static void coroutine_fn bdrv_flush_co_entry(void *opaque)
> >>   
> >>   int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
> >>   {
> >> +    BdrvChild *primary_child = bdrv_primary_child(bs);
> >> +    BdrvChild *child;
> >>       int current_gen;
> >>       int ret = 0;
> >>   
> >> @@ -2601,7 +2603,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
> >>       }
> >>   
> >>       /* Write back cached data to the OS even with cache=unsafe */
> >> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_OS);
> >> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_OS);
> >>       if (bs->drv->bdrv_co_flush_to_os) {
> >>           ret = bs->drv->bdrv_co_flush_to_os(bs);
> >>           if (ret < 0) {
> >> @@ -2611,15 +2613,15 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
> >>   
> >>       /* But don't actually force it to the disk with cache=unsafe */
> >>       if (bs->open_flags & BDRV_O_NO_FLUSH) {
> >> -        goto flush_parent;
> >> +        goto flush_children;
> >>       }
> >>   
> >>       /* Check if we really need to flush anything */
> >>       if (bs->flushed_gen == current_gen) {
> >> -        goto flush_parent;
> >> +        goto flush_children;
> >>       }
> >>   
> >> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
> >> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_DISK);
> >>       if (!bs->drv) {
> >>           /* bs->drv->bdrv_co_flush() might have ejected the BDS
> >>            * (even in case of apparent success) */
> >> @@ -2663,8 +2665,17 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
> >>       /* Now flush the underlying protocol.  It will also have BDRV_O_NO_FLUSH
> >>        * in the case of cache=unsafe, so there are no useless flushes.
> >>        */
> >> -flush_parent:
> >> -    ret = bs->file ? bdrv_co_flush(bs->file->bs) : 0;
> >> +flush_children:
> >> +    ret = 0;
> >> +    QLIST_FOREACH(child, &bs->children, next) {
> >> +        int this_child_ret;
> >> +
> >> +        this_child_ret = bdrv_co_flush(child->bs);
> >> +        if (!ret) {
> >> +            ret = this_child_ret;
> >> +        }
> >> +    }
> > 
> > Hmm, you said that we want to flush only children with write-access from parent..
> 
> Good that you remember it, I must have overlooked it (when reading the
> replies to the previous version). :-)
> 
> > Shouldn't we check it? Or we assume that it's always safe to call bdrv_co_flush on
> > a node?
> 
> I think it’s always safe.  But checking it seems like a nice touch, yes.

I'm not sure why we would unconditionally flush all children anyway. The
only drivers I can think of that really need to flush more than one
child are blkverify and quorum, and both of them already implement this.
blkverify implements .bdrv_co_flush, so it's not affected by the change
anyway, but quorum children will be flushed twice now.

But more than this, I'm worried about the overhead of needlessly
recursing through the whole backing chain and calling flush on every
node there.  Maybe bs->write_gen saves us so that at least this doesn't
result in an fdatasync() call for each, but still... Without a use case,
I'd rather not do this.

Oh, well, after having written all of this, I see that qcow2 with an
external data file is buggy... This could be fixed in the qcow2 driver,
but maybe restricting the recursion to writable children is actually good enough
then. Can you mention this case in the commit message and maybe build a
test for it?
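
A minimal sketch of that restriction, using a toy node model rather than
QEMU's real types: Node, child_writable and flush_node() are names made
up for illustration, and the early return on a failed own flush is a
simplification of what bdrv_co_flush() does. It only shows the idea of
recursing into children the parent has write access to and keeping the
first error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Toy node; the fields are illustrative, not QEMU's BlockDriverState. */
typedef struct Node {
    int flush_result;                   /* 0 or a negative errno-style code */
    int flush_calls;                    /* how often this node was flushed */
    size_t nb_children;
    struct Node *children[MAX_CHILDREN];
    bool child_writable[MAX_CHILDREN];  /* parent holds write access */
} Node;

/*
 * Flush the node itself, then only the children the parent may write to.
 * All writable children are visited even if one fails; the first error
 * is the one that gets returned.
 */
static int flush_node(Node *n)
{
    int ret = 0;

    assert(n->nb_children <= MAX_CHILDREN);

    n->flush_calls++;
    if (n->flush_result < 0) {
        return n->flush_result;
    }

    for (size_t i = 0; i < n->nb_children; i++) {
        int this_child_ret;

        if (!n->child_writable[i]) {
            continue;   /* e.g. a read-only backing file */
        }
        this_child_ret = flush_node(n->children[i]);
        if (!ret) {
            ret = this_child_ret;
        }
    }
    return ret;
}
```

With a qcow2-like node that has a writable metadata file, a writable
external data file and a read-only backing file, this flushes the first
two and never touches the backing chain.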

Kevin



* Re: [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters
  2019-09-03  8:32       ` Vladimir Sementsov-Ogievskiy
@ 2019-09-09  7:41         ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-09  7:41 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: Kevin Wolf, qemu-devel


On 03.09.19 10:32, Vladimir Sementsov-Ogievskiy wrote:
> 02.09.2019 17:35, Max Reitz wrote:
>> On 31.08.19 11:57, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.08.2019 19:13, Max Reitz wrote:
>>>> This includes some permission limiting (for example, we only need to
>>>> take the RESIZE permission for active commits where the base is smaller
>>>> than the top).
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>    block/mirror.c | 117 ++++++++++++++++++++++++++++++++++++++-----------
>>>>    blockdev.c     |  47 +++++++++++++++++---
>>>>    2 files changed, 131 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/block/mirror.c b/block/mirror.c
>>>> index 54bafdf176..6ddbfb9708 100644
>>>> --- a/block/mirror.c
>>>> +++ b/block/mirror.c
>>>> @@ -42,6 +42,7 @@ typedef struct MirrorBlockJob {
>>>>        BlockBackend *target;
>>>>        BlockDriverState *mirror_top_bs;
>>>>        BlockDriverState *base;
>>>> +    BlockDriverState *base_overlay;
>>>>    
>>>>        /* The name of the graph node to replace */
>>>>        char *replaces;
>>>> @@ -665,8 +666,10 @@ static int mirror_exit_common(Job *job)
>>>>                                 &error_abort);
>>>>        if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
>>>>            BlockDriverState *backing = s->is_none_mode ? src : s->base;
>>>> -        if (backing_bs(target_bs) != backing) {
>>>> -            bdrv_set_backing_hd(target_bs, backing, &local_err);
>>>> +        BlockDriverState *unfiltered_target = bdrv_skip_rw_filters(target_bs);
>>>> +
>>>> +        if (bdrv_filtered_cow_bs(unfiltered_target) != backing) {
>>>> +            bdrv_set_backing_hd(unfiltered_target, backing, &local_err);
>>>>                if (local_err) {
>>>>                    error_report_err(local_err);
>>>>                    ret = -EPERM;
>>>> @@ -715,7 +718,7 @@ static int mirror_exit_common(Job *job)
>>>>         * valid.
>>>>         */
>>>>        block_job_remove_all_bdrv(bjob);
>>>> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
>>>> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>>>>    
>>>>        /* We just changed the BDS the job BB refers to (with either or both of the
>>>>         * bdrv_replace_node() calls), so switch the BB back so the cleanup does
>>>> @@ -812,7 +815,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
>>>>                return 0;
>>>>            }
>>>>    
>>>> -        ret = bdrv_is_allocated_above(bs, base, false, offset, bytes, &count);
>>>> +        ret = bdrv_is_allocated_above(bs, s->base_overlay, true, offset, bytes,
>>>> +                                      &count);
>>>>            if (ret < 0) {
>>>>                return ret;
>>>>            }
>>>> @@ -908,7 +912,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
>>>>        } else {
>>>>            s->target_cluster_size = BDRV_SECTOR_SIZE;
>>>>        }
>>>> -    if (backing_filename[0] && !target_bs->backing &&
>>>> +    if (backing_filename[0] && !bdrv_backing_chain_next(target_bs) &&
>>>>            s->granularity < s->target_cluster_size) {
>>>>            s->buf_size = MAX(s->buf_size, s->target_cluster_size);
>>>>            s->cow_bitmap = bitmap_new(length);
>>>> @@ -1088,8 +1092,9 @@ static void mirror_complete(Job *job, Error **errp)
>>>>        if (s->backing_mode == MIRROR_OPEN_BACKING_CHAIN) {
>>>>            int ret;
>>>>    
>>>> -        assert(!target->backing);
>>>> -        ret = bdrv_open_backing_file(target, NULL, "backing", errp);
>>>> +        assert(!bdrv_backing_chain_next(target));
>>>
>>> Preexisting, but seems we may crash here, I don't see where it is checked before, to
>>> return error if there is some backing. And even if we do so, we don't prevent appearing
>>> of target backing during mirror operation.
>>
>> The idea is that MIRROR_OPEN_BACKING_CHAIN is set only when using
>> drive-mirror with mode=existing.  In this case, we also set
>> BDRV_O_NO_BACKING for the target.
>>
>> You’re right that a user could add a backing chain to the target during
>> the operation.  They really have to make an effort to shoot themselves
>> in the foot for this because the target must have an auto-generated node
>> name.
>>
>> I suppose the best would be not to open the backing chain if the target
>> node already has a backing child?
> 
> Hmm, but we still should generate an error, as we can't do what was requested.

But the user didn’t request anything.  They just specified an existing
file as the target with mode=existing, then (for whatever reason) made a
real effort to add a backing node to it (i.e. they had to look up the
target’s auto-generated node name), and then the job finishes.

The user didn’t request us to open the backing chain of the target.
They just requested to mirror to an existing file.  If they manually
override that existing file’s backing chain, I think it makes sense to
keep it that way.

Max

>>>> +        ret = bdrv_open_backing_file(bdrv_skip_rw_filters(target), NULL,
>>>> +                                     "backing", errp);
>>>>            if (ret < 0) {
>>>>                return;
>>>>            }



* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-09-04 16:16   ` Kevin Wolf
@ 2019-09-09  7:56     ` Max Reitz
  2019-09-09  9:36       ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-09  7:56 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block


On 04.09.19 18:16, Kevin Wolf wrote:
> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
>> There are BDS children that the general block layer code can access,
>> namely bs->file and bs->backing.  Since the introduction of filters and
>> external data files, their meaning is not quite clear.  bs->backing can
>> be a COW source, or it can be an R/W-filtered child; bs->file can be an
>> R/W-filtered child, it can be data and metadata storage, or it can be
>> just metadata storage.
>>
>> This overloading really is not helpful.  This patch adds functions that
>> retrieve the correct child for each exact purpose.  Later patches in
>> this series will make use of them.  Doing so will allow us to handle
>> filter nodes and external data files in a meaningful way.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> Each time I look at this patch, I'm confused by the function names.
> Maybe I should just ask what the idea there was, or more specifically:
> What does the "filtered" in "filtered child" really mean?
> 
> Apparently any child of a filter node is "filtered" (which makes sense),

It isn’t, filters can have non-filter children.  For example, backup-top
could have the source as a filtered child and the target as a non-filter
child.

> but also bs->backing of a qcow2 image, while bs->file of qcow2 isn't.
> raw doesn't have any "filtered" child. What's the system behind this?

“filtered” means: If the parent node returns data from this child, it
won’t modify it, neither its content nor its position.  COW and R/W
filters differ in how they handle writes; R/W filters pass them through
to the filtered child, COW filters copy them off to some other child
node (and then the filtered child’s data will no longer be visible at
that location).

The main reason behind the common “filtered” name is for the generic
functions that work on both COW and true filter (R/W filters) chains.
We need such functionality sometimes.  I personally felt like the
concept of true (R/W) filters and COW children was similar enough to
share a common name base.

qcow2 has a COW child.  As such, it acts as a COW filter in the sense of
the function names.

raw has neither a COW child nor acts as an R/W filter.  As such, it has
no filtered child.  My opinion on this hasn’t changed.
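
To make the terminology concrete, here is a rough sketch with a toy
node type.  Node, filtered_child() and chain_bottom() are invented for
this illustration and are not the functions from the patch; the only
point is that COW children and R/W-filtered children are both
"filtered", while plain storage children are not.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a block node; not QEMU's BlockDriverState. */
typedef struct Node {
    const char *name;
    struct Node *cow_child;  /* COW source, e.g. a qcow2 backing file */
    struct Node *rw_child;   /* child an R/W filter passes all I/O to */
    struct Node *storage;    /* data/metadata storage; never "filtered" */
} Node;

/* Child whose data the node presents unmodified, or NULL if none. */
static Node *filtered_child(const Node *n)
{
    /* In this model a node is a COW overlay or an R/W filter, not both. */
    assert(!(n->cow_child && n->rw_child));
    return n->cow_child ? n->cow_child : n->rw_child;
}

/* Follow filtered links (COW and R/W alike) to the bottom of the chain. */
static Node *chain_bottom(Node *n)
{
    while (filtered_child(n)) {
        n = filtered_child(n);
    }
    return n;
}
```

In this model a qcow2 node with a backing file has a filtered (COW)
child, a throttle node has a filtered (R/W) child, and a raw node that
only has a storage child has none.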

(To reiterate, in practice I see no way anyone would ever use raw as an
R/W filter.
Either you use it without offset/size, in which case you simply use it
in lieu of a format node, so you precisely don’t want it to act as a
filter when it comes to allocation information and so on (even though it
can be classified as a filter here).
Or you use it as kind of a filter with offset/size, but then it no
longer is a filter.

Filters are defined by “Every filter must fulfill these conditions: ...”
– not by “Everything that fulfills these conditions is a filter”.
Marking a driver as a filter has consequences, and I don’t see why we
would want those consequences for raw.)

> It looks like bdrv_filtered_child() is the right function to iterate
> along a backing file chain, but I just still fail to connect that and
> the name of the function in a meaningful way.

It’s the right function to iterate along a filter chain.  This includes
COW backing children and R/W filtered children.

>> +/*
>> + * Return the child that @bs acts as an overlay for, and from which data may be
>> + * copied in COW or COR operations.  Usually this is the backing file.
>> + */
> 
> Or NULL, if no such child exists.
> 
> It's relatively obvious here, but for some of the functions further down
> it would be really good to describe in which cases NULL is expected (or
> that NULL is even a possible return value).

I’ll look into it.

Max



* Re: [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain
  2019-09-05 13:05   ` Kevin Wolf
@ 2019-09-09  8:02     ` Max Reitz
  2019-09-09  9:40       ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-09  8:02 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block


On 05.09.19 15:05, Kevin Wolf wrote:
> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
>> In order to make filters work in backing chains, the associated
>> functions must be able to deal with them and freeze all filter links, be
>> they COW or R/W filter links.
>>
>> In the process, rename these functions to reflect that they now act on
>> generalized chains of filter nodes instead of backing chains alone.
> 
> I don't think this is a good idea. The functions are still following the
> backing chain. A generic "chain" could mean following the bs->file links
> or any other children, so the new name is confusing because it doesn't
> really tell you any more what the function does. I'd prefer the name to
> stay specific.

They’re following backing chains, among others.

It would make sense to rename s/backing_chain/filter_chain/, that is, in
case you don’t find lumping COW and R/W filters together under “filter”
too offensive.

(Naming things is hard.  I’m open to suggestions, but I found the
“filter” concept succinct, even if it does not fully align with our
existing parlance.)

Max

>> While at it, add some comments that note which functions require their
>> caller to ensure that a given child link is not frozen, and how the
>> callers do so.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> 
> Kevin
> 




* Re: [Qemu-devel] [PATCH v6 11/42] block: Add bdrv_supports_compressed_writes()
  2019-09-05 13:11   ` Kevin Wolf
@ 2019-09-09  8:09     ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-09  8:09 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block


On 05.09.19 15:11, Kevin Wolf wrote:
> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
>> Filters cannot compress data themselves but they have to implement
>> .bdrv_co_pwritev_compressed() still (or they cannot forward compressed
>> writes).  Therefore, checking whether
>> bs->drv->bdrv_co_pwritev_compressed is non-NULL is not sufficient to
>> know whether the node can actually handle compressed writes.  This
>> function looks down the filter chain to see whether there is a
>> non-filter that can actually convert the compressed writes into
>> compressed data (and thus normal writes).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> Should patches 2 and 3 that add the .bdrv_co_pwritev_compressed()
> callback to filter drivers come only after this one?

Why not.

> And should we also
> support it in the mirror filter?

Hm.  AFAIU, compressed writes have very limited use.  You can basically
only use them when writing to a new image (where you’d never write
anywhere you’ve already written something to), i.e. for qemu-img convert
or the backup target.  It makes sense to blockdev-backup to a throttle
node, so that’s why it should be implemented there.  I don’t really see
how it would make sense for mirror.

OTOH, it doesn’t make sense for COR either.  And it isn’t that hard.
Now I don’t have a strong preference for either dropping the COR patch
or adding it to mirror as well...
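
For reference, the check described in the quoted commit message can be
sketched roughly like this.  It is a toy model under my own assumptions:
Node, has_compressed_cb and supports_compressed_writes() are made-up
names, not the code from the patch.  Every filter on the way down has to
forward compressed writes, and the first non-filter has to be able to
actually compress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy node; not QEMU's BlockDriverState. */
typedef struct Node {
    bool is_filter;          /* R/W filter: forwards I/O to 'filtered' */
    bool has_compressed_cb;  /* driver implements the compressed-write
                              * callback */
    struct Node *filtered;   /* filtered child of a filter, else NULL */
} Node;

/*
 * Walk down the filter chain.  A missing callback anywhere means
 * compressed writes cannot reach a node that compresses them.
 */
static bool supports_compressed_writes(const Node *n)
{
    while (n) {
        /* Only filters have a filtered child in this model. */
        assert(n->is_filter || n->filtered == NULL);

        if (!n->has_compressed_cb) {
            return false;
        }
        if (!n->is_filter) {
            return true;   /* a non-filter that compresses */
        }
        n = n->filtered;
    }
    return false;          /* filter without a child */
}
```

So a throttle-like filter on top of a qcow2-like node passes, while the
same filter on top of a node without the callback does not, even though
the filter itself implements it.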

Max



* Re: [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains
  2019-09-05 14:05   ` Kevin Wolf
@ 2019-09-09  8:25     ` Max Reitz
  2019-09-09  9:55       ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-09  8:25 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block


On 05.09.19 16:05, Kevin Wolf wrote:
> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
>> Use child access functions when iterating through backing chains so
>> filters do not break the chain.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  block.c | 40 ++++++++++++++++++++++++++++------------
>>  1 file changed, 28 insertions(+), 12 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 86b84bea21..42abbaf0ba 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -4376,7 +4376,8 @@ int bdrv_change_backing_file(BlockDriverState *bs,
>>  }
>>  
>>  /*
>> - * Finds the image layer in the chain that has 'bs' as its backing file.
>> + * Finds the image layer in the chain that has 'bs' (or a filter on
>> + * top of it) as its backing file.
>>   *
>>   * active is the current topmost image.
>>   *
>> @@ -4388,11 +4389,18 @@ int bdrv_change_backing_file(BlockDriverState *bs,
>>  BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
>>                                      BlockDriverState *bs)
>>  {
>> -    while (active && bs != backing_bs(active)) {
>> -        active = backing_bs(active);
>> +    bs = bdrv_skip_rw_filters(bs);
>> +    active = bdrv_skip_rw_filters(active);
> 
> This does more than the commit message says. In addition to iterating
> through filters instead of stopping, it also changes the semantics of
> the function to return the next non-filter on top of bs instead of the
> next node.

Which is to say the overlay.

(I think we only ever use “overlay” in the COW sense.)

> The block jobs seem to use it only for bdrv_is_allocated_above(), which
> should return the same thing in both cases, so the behaviour stays the
> same. qmp_block_commit() will check op blockers on a different node now,
> which could be a fix or a bug, I can't tell offhand. Probably the
> blocking doesn't really work anyway.

You mean that the op blocker could have been on a block job filter node
before?  I think that’s pretty much the point of this fix; that that
doesn’t make sense.  (We didn’t have mirror_top_bs and the like at
058223a6e3b.)

> All of this should be mentioned in the commit message at least. Maybe
> it's also worth splitting in two patches.

I don’t know.  The function was written when there were no filters.
This change would have been a no-op then.  The fact that it isn’t to me
just means that introducing filters broke it.

So I don’t know what I would write.  Maybe “bdrv_find_overlay() now
actually finds the overlay, that is, it will not return a filter node.
This is the behavior that all callers expect (because they work on COW
backing chains).”
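
To illustrate the difference in behavior, here is a small sketch on a
toy chain.  The Node type and all four functions are simplified
stand-ins written for this mail (a single child link per node), not the
block layer code itself; find_overlay_old() mimics the pre-patch loop
and find_overlay_new() the post-patch one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy node with a single filtered child link; not QEMU's types. */
typedef struct Node {
    bool is_rw_filter;   /* true for an R/W filter such as throttle */
    struct Node *child;  /* COW backing child or R/W-filtered child */
} Node;

/* Skip R/W filters starting at n; first non-filter, or NULL. */
static Node *skip_rw_filters(Node *n)
{
    while (n && n->is_rw_filter) {
        n = n->child;
    }
    return n;
}

/* Next non-filter node below the non-filter n, or NULL. */
static Node *backing_chain_next(Node *n)
{
    assert(n && !n->is_rw_filter);
    return skip_rw_filters(n->child);
}

/* Old behavior: the node directly above bs, filter or not. */
static Node *find_overlay_old(Node *active, Node *bs)
{
    while (active && active->child != bs) {
        active = active->child;
    }
    return active;
}

/* New behavior: the non-filter node whose backing chain leads to bs. */
static Node *find_overlay_new(Node *active, Node *bs)
{
    bs = skip_rw_filters(bs);
    active = skip_rw_filters(active);

    while (active) {
        Node *next = backing_chain_next(active);
        if (next == bs) {
            return active;
        }
        active = next;
    }
    return NULL;
}
```

For a chain top -> filter -> base, the old loop returns the filter node
for base, whereas the new one returns top, both when asked about base
and when asked about the filter on top of it.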

>> +    while (active) {
>> +        BlockDriverState *next = bdrv_backing_chain_next(active);
>> +        if (bs == next) {
>> +            return active;
>> +        }
>> +        active = next;
>>      }
>>  
>> -    return active;
>> +    return NULL;
>>  }
>>  
>>  /* Given a BDS, searches for the base layer. */
>> @@ -4544,9 +4552,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
>>       * other intermediate nodes have been dropped.
>>       * If 'top' is an implicit node (e.g. "commit_top") we should skip
>>       * it because no one inherits from it. We use explicit_top for that. */
>> -    while (explicit_top && explicit_top->implicit) {
>> -        explicit_top = backing_bs(explicit_top);
>> -    }
>> +    explicit_top = bdrv_skip_implicit_filters(explicit_top);
>>      update_inherits_from = bdrv_inherits_from_recursive(base, explicit_top);
>>  
>>      /* success - we can delete the intermediate states, and link top->base */
>> @@ -5014,7 +5020,7 @@ BlockDriverState *bdrv_lookup_bs(const char *device,
>>  bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base)
>>  {
>>      while (top && top != base) {
>> -        top = backing_bs(top);
>> +        top = bdrv_filtered_bs(top);
>>      }
>>  
>>      return top != NULL;
>> @@ -5253,7 +5259,17 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
>>  
>>      is_protocol = path_has_protocol(backing_file);
>>  
>> -    for (curr_bs = bs; curr_bs->backing; curr_bs = curr_bs->backing->bs) {
>> +    /*
>> +     * Being largely a legacy function, skip any filters here
>> +     * (because filters do not have normal filenames, so they cannot
>> +     * match anyway; and allowing json:{} filenames is a bit out of
>> +     * scope).
>> +     */
>> +    for (curr_bs = bdrv_skip_rw_filters(bs);
>> +         bdrv_filtered_cow_child(curr_bs) != NULL;
>> +         curr_bs = bdrv_backing_chain_next(curr_bs))
> 
> This could just use bs_below instead of recalculating the node if you
> moved the declaration of bs_below to the function scope.

Indeed, thanks.

Max

>> +    {
>> +        BlockDriverState *bs_below = bdrv_backing_chain_next(curr_bs);
>>  
>>          /* If either of the filename paths is actually a protocol, then
>>           * compare unmodified paths; otherwise make paths relative */



* Re: [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code
  2019-09-05 16:24       ` Kevin Wolf
@ 2019-09-09  8:31         ` Max Reitz
  2019-09-09 10:01           ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-09  8:31 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block


On 05.09.19 18:24, Kevin Wolf wrote:
> Am 12.08.2019 um 14:58 hat Max Reitz geschrieben:
>> On 10.08.19 17:36, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.08.2019 19:13, Max Reitz wrote:
>>>> If the driver does not support .bdrv_co_flush() so bdrv_co_flush()
>>>> itself has to flush the children of the given node, it should not flush
>>>> just bs->file->bs, but in fact all children.
>>>>
>>>> In any case, the BLKDBG_EVENT() should be emitted on the primary child,
>>>> because that is where a blkdebug node would be if there is any.
>>>>
>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>   block/io.c | 23 +++++++++++++++++------
>>>>   1 file changed, 17 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/block/io.c b/block/io.c
>>>> index c5a8e3e6a3..bcc770d336 100644
>>>> --- a/block/io.c
>>>> +++ b/block/io.c
>>>> @@ -2572,6 +2572,8 @@ static void coroutine_fn bdrv_flush_co_entry(void *opaque)
>>>>   
>>>>   int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>>>   {
>>>> +    BdrvChild *primary_child = bdrv_primary_child(bs);
>>>> +    BdrvChild *child;
>>>>       int current_gen;
>>>>       int ret = 0;
>>>>   
>>>> @@ -2601,7 +2603,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>>>       }
>>>>   
>>>>       /* Write back cached data to the OS even with cache=unsafe */
>>>> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_OS);
>>>> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_OS);
>>>>       if (bs->drv->bdrv_co_flush_to_os) {
>>>>           ret = bs->drv->bdrv_co_flush_to_os(bs);
>>>>           if (ret < 0) {
>>>> @@ -2611,15 +2613,15 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>>>   
>>>>       /* But don't actually force it to the disk with cache=unsafe */
>>>>       if (bs->open_flags & BDRV_O_NO_FLUSH) {
>>>> -        goto flush_parent;
>>>> +        goto flush_children;
>>>>       }
>>>>   
>>>>       /* Check if we really need to flush anything */
>>>>       if (bs->flushed_gen == current_gen) {
>>>> -        goto flush_parent;
>>>> +        goto flush_children;
>>>>       }
>>>>   
>>>> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
>>>> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_DISK);
>>>>       if (!bs->drv) {
>>>>           /* bs->drv->bdrv_co_flush() might have ejected the BDS
>>>>            * (even in case of apparent success) */
>>>> @@ -2663,8 +2665,17 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>>>       /* Now flush the underlying protocol.  It will also have BDRV_O_NO_FLUSH
>>>>        * in the case of cache=unsafe, so there are no useless flushes.
>>>>        */
>>>> -flush_parent:
>>>> -    ret = bs->file ? bdrv_co_flush(bs->file->bs) : 0;
>>>> +flush_children:
>>>> +    ret = 0;
>>>> +    QLIST_FOREACH(child, &bs->children, next) {
>>>> +        int this_child_ret;
>>>> +
>>>> +        this_child_ret = bdrv_co_flush(child->bs);
>>>> +        if (!ret) {
>>>> +            ret = this_child_ret;
>>>> +        }
>>>> +    }
>>>
>>> Hmm, you said that we want to flush only children with write-access from parent..
>>
>> Good that you remember it, I must have overlooked it (when reading the
>> replies to the previous version). :-)
>>
>>> Shouldn't we check it? Or we assume that it's always safe to call bdrv_co_flush on
>>> a node?
>>
>> I think it’s always safe.  But checking it seems like a nice touch, yes.
> 
> I'm not sure why we would unconditionally flush all children anyway. The
> only drivers I can think of that really need to flush more than one
> child are blkverify and quorum, and both of them already implement this.
> blkverify implements .bdrv_co_flush, so it's not affected by the change
> anyway, but quorum children will be flushed twice now.
> 
> But more than this, I'm worried about the overhead of needlessly
> recursing through the whole backing chain and calling flush on every
> node there.  Maybe bs->write_gen saves us so that at least this doesn't
> result in an fdatasync() call for each, but still... Without a use case,
> I'd rather not do this.
> 
> Oh, well, after having written all of this, I see that qcow2 with an
> external data file is buggy... This could be fixed in the qcow2 driver,
> but maybe restricting the recursion to read-only is actually good enough
> then. Can you mention this case in the commit message and maybe build a
> test for it?

And I should thus probably drop vmdk’s .bdrv_co_flush_to_disk()
implementation.

I will indeed try to write a test, but to be completely honest, I feel
like this series is long enough.
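
For reference, the write-access restriction discussed above could look roughly like the following. This is a minimal, self-contained sketch and not QEMU code: struct model_node, struct model_child, MODEL_PERM_WRITE and model_flush() are hypothetical stand-ins for BlockDriverState, BdrvChild, the write permission bit and bdrv_co_flush(). It assumes the parent's permission on each child edge is enough to decide whether that child needs flushing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for BlockDriverState and BdrvChild; not QEMU code. */
#define MODEL_PERM_WRITE   0x02u  /* assumed flag value, illustration only */
#define MODEL_MAX_CHILDREN 4

struct model_node;

struct model_child {
    struct model_node *node;  /* node this edge points to */
    unsigned perm;            /* permissions the parent holds on this edge */
};

struct model_node {
    struct model_child children[MODEL_MAX_CHILDREN];
    size_t n_children;
    int own_result;   /* what flushing this node itself reports (0 or -errno) */
    int flush_calls;  /* how many times this node was flushed */
};

/*
 * Flush the node itself, then recurse only into children on which the
 * parent holds write permission.  All such children are flushed even
 * after a failure, and the first error seen is the one reported.
 */
static int model_flush(struct model_node *node)
{
    int ret = 0;
    size_t i;

    node->flush_calls++;
    if (node->own_result < 0) {
        return node->own_result;
    }

    for (i = 0; i < node->n_children; i++) {
        struct model_child *child = &node->children[i];
        int this_child_ret;

        if (!(child->perm & MODEL_PERM_WRITE)) {
            /* No write permission: assumed not dirtied via this parent */
            continue;
        }
        this_child_ret = model_flush(child->node);
        if (!ret) {
            ret = this_child_ret;
        }
    }
    return ret;
}
```

With a read-only backing child and a writable data child, only the data child is visited, so a long backing chain would not be recursed into.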

Max



* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-09-09  7:56     ` Max Reitz
@ 2019-09-09  9:36       ` Kevin Wolf
  2019-09-09 14:04         ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-09  9:36 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 5069 bytes --]

Am 09.09.2019 um 09:56 hat Max Reitz geschrieben:
> On 04.09.19 18:16, Kevin Wolf wrote:
> > Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
> >> There are BDS children that the general block layer code can access,
> >> namely bs->file and bs->backing.  Since the introduction of filters and
> >> external data files, their meaning is not quite clear.  bs->backing can
> >> be a COW source, or it can be an R/W-filtered child; bs->file can be an
> >> R/W-filtered child, it can be data and metadata storage, or it can be
> >> just metadata storage.
> >>
> >> This overloading really is not helpful.  This patch adds functions that
> >> retrieve the correct child for each exact purpose.  Later patches in
> >> this series will make use of them.  Doing so will allow us to handle
> >> filter nodes and external data files in a meaningful way.
> >>
> >> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > 
> > Each time I look at this patch, I'm confused by the function names.
> > Maybe I should just ask what the idea there was, or more specifically:
> > What does the "filtered" in "filtered child" really mean?
> > 
> > Apparently any child of a filter node is "filtered" (which makes sense),
> 
> It isn’t, filters can have non-filter children.  For example, backup-top
> could have the source as a filtered child and the target as a non-filter
> child.

Hm, okay, makes sense. I had a definition in mind that says that filter
nodes only have a single child node. Is it that a filter may have only a
single _filtered_ child node?

> > but also bs->backing of a qcow2 image, while bs->file of qcow2 isn't.
> > raw doesn't have any "filtered" child. What's the system behind this?
> 
> “filtered” means: If the parent node returns data from this child, it
> won’t modify it, neither its content nor its position.  COW and R/W
> filters differ in how they handle writes; R/W filters pass them through
> to the filtered child, COW filters copy them off to some other child
> node (and then the filtered child’s data will no longer be visible at
> that location).

But there is no reason why a node couldn't fulfill this condition for
more than one child node. bdrv_filtered_child() isn't well-defined then.
Technically, the description "Return any filtered child" is correct
because "any" can be interpreted as "an arbitrary", but obviously that
makes the function useless.

Specifically, according to your definition, qcow2 filters both the
backing file (COW filter) and the external data file (R/W filter).

> The main reason behind the common “filtered” name is for the generic
> functions that work on both COW and true filter (R/W filters) chains.
> We need such functionality sometimes.  I personally felt like the
> concept of true (R/W) filters and COW children was similar enough to
> share a common name base.

We generally call this concept a "backing chain".

> qcow2 has a COW child.  As such, it acts as a COW filter in the sense of
> the function names.
> 
> raw has neither a COW child nor acts as an R/W filter.  As such, it has
> no filtered child.  My opinion on this hasn’t changed.
> 
> (To reiterate, in practice I see no way anyone would ever use raw as an
> R/W filter.
> Either you use it without offset/size, in which case you simply use it
> in lieu of a format node, so you precisely don’t want it to act as a
> filter when it comes to allocation information and so on (even though it
> can be classified a filter here).
> Or you use it as kind of a filter with offset/size, but then it no
> longer is a filter.

Agreed with offset, but with only size, it matches your definition of a
filter.

> Filters are defined by “Every filter must fulfill these conditions: ...”
> – not by “Everything that fulfills these conditions is a filter”.
> Marking a driver as a filter has consequences, and I don’t see why we
> would want those consequences for raw.)
> 
> > It looks like bdrv_filtered_child() is the right function to iterate
> > along a backing file chain, but I just still fail to connect that and
> > the name of the function in a meaningful way.
> 
> It’s the right function to iterate along a filter chain.  This includes
> COW backing children and R/W filtered children.

qcow2 doesn't fulfill the conditions for being a filter driver. Two of
its possible children fulfill the conditions for being a filtered child.
You can pick either approach, talking about a "filter chain" just
doesn't make sense there. Either the chain is broken by a non-filter
driver like qcow2, or it must become a filter tree.

What we're really interested in is iterating the backing chain even
across filter nodes, so your implementation achieves the right result.
It just feels completely arbitrary, counterintuitive and confusing to
call this a (or actually "the") "filter chain" and to pretend that the
name tells anyone what it really is.
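
To make "iterating the backing chain even across filter nodes" concrete, here is a small standalone sketch. It is not the series' implementation: struct model_node and the model_* helpers are hypothetical, loosely named after bdrv_skip_rw_filters() and bdrv_backing_chain_next() from the series, and the model assumes each filter has exactly one R/W-filtered child.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified node; not QEMU's BlockDriverState. */
struct model_node {
    bool is_rw_filter;            /* true for throttle-like nodes */
    struct model_node *filtered;  /* R/W-filtered child (filters only) */
    struct model_node *cow;       /* COW backing child (non-filters only) */
};

/* Skip any R/W filters and return the first non-filter node, or NULL. */
static struct model_node *model_skip_rw_filters(struct model_node *n)
{
    while (n && n->is_rw_filter) {
        n = n->filtered;
    }
    return n;
}

/* Return the next non-filter node below n in the COW chain, or NULL. */
static struct model_node *model_backing_chain_next(struct model_node *n)
{
    n = model_skip_rw_filters(n);
    return n ? model_skip_rw_filters(n->cow) : NULL;
}

/* Count the non-filter nodes reachable from top along the COW chain. */
static size_t model_backing_chain_length(struct model_node *top)
{
    size_t len = 0;
    struct model_node *n;

    for (n = model_skip_rw_filters(top); n; n = model_backing_chain_next(n)) {
        len++;
    }
    return len;
}
```

In this model a filter sitting between two image layers does not break the walk and does not count as a layer, which is the behavior both sides of the naming discussion agree on.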

Kevin


* Re: [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain
  2019-09-09  8:02     ` Max Reitz
@ 2019-09-09  9:40       ` Kevin Wolf
  0 siblings, 0 replies; 132+ messages in thread
From: Kevin Wolf @ 2019-09-09  9:40 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 1604 bytes --]

Am 09.09.2019 um 10:02 hat Max Reitz geschrieben:
> On 05.09.19 15:05, Kevin Wolf wrote:
> > Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
> >> In order to make filters work in backing chains, the associated
> >> functions must be able to deal with them and freeze all filter links, be
> >> they COW or R/W filter links.
> >>
> >> In the process, rename these functions to reflect that they now act on
> >> generalized chains of filter nodes instead of backing chains alone.
> > 
> > I don't think this is a good idea. The functions are still following the
> > backing chain. A generic "chain" could mean following the bs->file links
> > or any other children, so the new name is confusing because it doesn't
> > really tell you any more what the function does. I'd prefer the name to
> > stay specific.
> They’re following backing chains, among others.
> 
> It would make sense to rename s/backing_chain/filter_chain/, that is, in
> case you don’t find lumping COW and R/W filters together under “filter”
> too offensive.
> 
> (Naming things is hard.  I’m open for suggestions, but I found the
> “filter” concept succinct, even if it does not fully align with our
> existing parlance.)

As you could see in my reply to patch 4, I didn't. :-)

I think it makes a lot more sense to just broaden the meaning of "backing
chain" to be what you call a "filter chain" (following the backing file
links, but accept filter nodes in between), because of how unspecific
"filter chain" is. The primary thing we're interested in is still the
backing files.

Kevin


* Re: [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains
  2019-09-09  8:25     ` Max Reitz
@ 2019-09-09  9:55       ` Kevin Wolf
  2019-09-09 14:08         ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-09  9:55 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 3356 bytes --]

Am 09.09.2019 um 10:25 hat Max Reitz geschrieben:
> On 05.09.19 16:05, Kevin Wolf wrote:
> > Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
> >> Use child access functions when iterating through backing chains so
> >> filters do not break the chain.
> >>
> >> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >> ---
> >>  block.c | 40 ++++++++++++++++++++++++++++------------
> >>  1 file changed, 28 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/block.c b/block.c
> >> index 86b84bea21..42abbaf0ba 100644
> >> --- a/block.c
> >> +++ b/block.c
> >> @@ -4376,7 +4376,8 @@ int bdrv_change_backing_file(BlockDriverState *bs,
> >>  }
> >>  
> >>  /*
> >> - * Finds the image layer in the chain that has 'bs' as its backing file.
> >> + * Finds the image layer in the chain that has 'bs' (or a filter on
> >> + * top of it) as its backing file.
> >>   *
> >>   * active is the current topmost image.
> >>   *
> >> @@ -4388,11 +4389,18 @@ int bdrv_change_backing_file(BlockDriverState *bs,
> >>  BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
> >>                                      BlockDriverState *bs)
> >>  {
> >> -    while (active && bs != backing_bs(active)) {
> >> -        active = backing_bs(active);
> >> +    bs = bdrv_skip_rw_filters(bs);
> >> +    active = bdrv_skip_rw_filters(active);
> > 
> > This does more than the commit message says. In addition to iterating
> > through filters instead of stopping, it also changes the semantics of
> > the function to return the next non-filter on top of bs instead of the
> > next node.
> 
> Which is to say the overlay.
> 
> (I think we only ever use “overlay” in the COW sense.)

I think we do, but so far also only ever for immediate COW children, not
for skipping through intermediate nodes.

> > The block jobs seem to use it only for bdrv_is_allocated_above(), which
> > should return the same thing in both cases, so the behaviour stays the
> > same. qmp_block_commit() will check op blockers on a different node now,
> > which could be a fix or a bug, I can't tell offhand. Probably the
> > blocking doesn't really work anyway.
> 
> You mean that the op blocker could have been on a block job filter node
> before?  I think that’s pretty much the point of this fix; that that
> doesn’t make sense.  (We didn’t have mirror_top_bs and the like at
> 058223a6e3b.)

On the off chance that the op blocker actually works, it can't be a job
filter. I was thinking more of throttling, blkdebug etc.

> > All of this should be mentioned in the commit message at least. Maybe
> > it's also worth splitting in two patches.
> 
> I don’t know.  The function was written when there were no filters.

I doubt it. blkdebug is a really old filter.

> This change would have been a no-op then.  The fact that it isn’t to me
> just means that introducing filters broke it.
> 
> So I don’t know what I would write.  Maybe “bdrv_find_overlay() now
> actually finds the overlay, that is, it will not return a filter node.
> This is the behavior that all callers expect (because they work on COW
> backing chains).”

Maybe just something like "In addition, filter nodes are not returned as
overlays any more. Instead, the first non-filter node on top of bs is
considered the overlay now."?
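
The semantic change under discussion can be illustrated with a standalone model. The hunk quoted above is cut off after the two bdrv_skip_rw_filters() calls, so the loop below is an assumption about the intended behavior, not the actual patch. struct model_node and the model_* functions are hypothetical; "child" stands for the filtered child of a filter or the COW backing child of any other node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified node; not QEMU's BlockDriverState. */
struct model_node {
    bool is_rw_filter;         /* true for throttle- or blkdebug-like nodes */
    struct model_node *child;  /* filtered child (filter) or COW backing */
};

/* Skip any R/W filters and return the first non-filter node, or NULL. */
static struct model_node *model_skip_rw_filters(struct model_node *n)
{
    while (n && n->is_rw_filter) {
        n = n->child;
    }
    return n;
}

/*
 * Return the first non-filter node in active's chain whose backing
 * file is bs, where filters on either side are looked through.
 * Returns NULL if bs is not below active.
 */
static struct model_node *model_find_overlay(struct model_node *active,
                                             struct model_node *bs)
{
    bs = model_skip_rw_filters(bs);
    active = model_skip_rw_filters(active);

    while (active) {
        struct model_node *next = model_skip_rw_filters(active->child);

        if (next == bs) {
            return active;
        }
        active = next;
    }
    return NULL;
}
```

Note that passing a filter as bs gives the same answer as passing the node underneath it, and the returned overlay is never a filter, matching the wording suggested for the commit message.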

Kevin


* Re: [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code
  2019-09-09  8:31         ` Max Reitz
@ 2019-09-09 10:01           ` Kevin Wolf
  2019-09-09 14:15             ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-09 10:01 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 5063 bytes --]

Am 09.09.2019 um 10:31 hat Max Reitz geschrieben:
> On 05.09.19 18:24, Kevin Wolf wrote:
> > Am 12.08.2019 um 14:58 hat Max Reitz geschrieben:
> >> On 10.08.19 17:36, Vladimir Sementsov-Ogievskiy wrote:
> >>> 09.08.2019 19:13, Max Reitz wrote:
> >>>> If the driver does not support .bdrv_co_flush() so bdrv_co_flush()
> >>>> itself has to flush the children of the given node, it should not flush
> >>>> just bs->file->bs, but in fact all children.
> >>>>
> >>>> In any case, the BLKDBG_EVENT() should be emitted on the primary child,
> >>>> because that is where a blkdebug node would be if there is any.
> >>>>
> >>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>>> ---
> >>>>   block/io.c | 23 +++++++++++++++++------
> >>>>   1 file changed, 17 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/block/io.c b/block/io.c
> >>>> index c5a8e3e6a3..bcc770d336 100644
> >>>> --- a/block/io.c
> >>>> +++ b/block/io.c
> >>>> @@ -2572,6 +2572,8 @@ static void coroutine_fn bdrv_flush_co_entry(void *opaque)
> >>>>   
> >>>>   int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
> >>>>   {
> >>>> +    BdrvChild *primary_child = bdrv_primary_child(bs);
> >>>> +    BdrvChild *child;
> >>>>       int current_gen;
> >>>>       int ret = 0;
> >>>>   
> >>>> @@ -2601,7 +2603,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
> >>>>       }
> >>>>   
> >>>>       /* Write back cached data to the OS even with cache=unsafe */
> >>>> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_OS);
> >>>> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_OS);
> >>>>       if (bs->drv->bdrv_co_flush_to_os) {
> >>>>           ret = bs->drv->bdrv_co_flush_to_os(bs);
> >>>>           if (ret < 0) {
> >>>> @@ -2611,15 +2613,15 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
> >>>>   
> >>>>       /* But don't actually force it to the disk with cache=unsafe */
> >>>>       if (bs->open_flags & BDRV_O_NO_FLUSH) {
> >>>> -        goto flush_parent;
> >>>> +        goto flush_children;
> >>>>       }
> >>>>   
> >>>>       /* Check if we really need to flush anything */
> >>>>       if (bs->flushed_gen == current_gen) {
> >>>> -        goto flush_parent;
> >>>> +        goto flush_children;
> >>>>       }
> >>>>   
> >>>> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
> >>>> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_DISK);
> >>>>       if (!bs->drv) {
> >>>>           /* bs->drv->bdrv_co_flush() might have ejected the BDS
> >>>>            * (even in case of apparent success) */
> >>>> @@ -2663,8 +2665,17 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
> >>>>       /* Now flush the underlying protocol.  It will also have BDRV_O_NO_FLUSH
> >>>>        * in the case of cache=unsafe, so there are no useless flushes.
> >>>>        */
> >>>> -flush_parent:
> >>>> -    ret = bs->file ? bdrv_co_flush(bs->file->bs) : 0;
> >>>> +flush_children:
> >>>> +    ret = 0;
> >>>> +    QLIST_FOREACH(child, &bs->children, next) {
> >>>> +        int this_child_ret;
> >>>> +
> >>>> +        this_child_ret = bdrv_co_flush(child->bs);
> >>>> +        if (!ret) {
> >>>> +            ret = this_child_ret;
> >>>> +        }
> >>>> +    }
> >>>
> >>> Hmm, you said that we want to flush only children with write-access from parent..
> >>
> >> Good that you remember it, I must have overlooked it (when reading the
> >> replies to the previous version). :-)
> >>
> >>> Shouldn't we check it? Or we assume that it's always safe to call bdrv_co_flush on
> >>> a node?
> >>
> >> I think it’s always safe.  But checking it seems like a nice touch, yes.
> > 
> > I'm not sure why we would unconditionally flush all children anyway. The
> > only drivers I can think of that really need to flush more than one
> > child are blkverify and quorum, and both of them already implement this.
> > blkverify implements .bdrv_co_flush, so it's not affected by the change
> > anyway, but quorum children will be flushed twice now.
> > 
> > But more than this, I'm worried about the overhead of needlessly
> > recursing through the whole backing chain and calling flush on every
> > node there.  Maybe bs->write_gen saves us so that at least this doesn't
> > result in an fdatasync() call for each, but still... Without a use case,
> > I'd rather not do this.
> > 
> > Oh, well, after having written all of this, I see that qcow2 with an
> > external data file is buggy... This could be fixed in the qcow2 driver,
> > but maybe restricting the recursion to read-only is actually good enough
> > then. Can you mention this case in the commit message and maybe build a
> > test for it?
> 
> And I should thus probably drop vmdk’s .bdrv_co_flush_to_disk()
> implementation.
> 
> I will indeed try to write a test, but to be completely honest, I feel
> like this series is long enough.

I guess I could already merge patch 1 to give you space for another test
patch. ;-)
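
As an aside on the bs->write_gen remark quoted above: the "Check if we really need to flush anything" step in the diff compares a flushed generation with the current write generation. A minimal sketch of that idea follows. The struct and function names are hypothetical and the model ignores concurrency and driver callbacks; it only shows why an unmodified node would not trigger another sync.

```c
#include <assert.h>

/* Hypothetical model of per-node generation counters; not QEMU code. */
struct model_node {
    unsigned write_gen;    /* bumped on every write */
    unsigned flushed_gen;  /* write_gen value covered by the last flush */
    int sync_calls;        /* stands in for fdatasync() invocations */
};

/* Record that the node has new data that is not yet on disk. */
static void model_write(struct model_node *n)
{
    n->write_gen++;
}

/* Sync only if something was written since the last completed flush. */
static int model_flush(struct model_node *n)
{
    unsigned current_gen = n->write_gen;

    if (n->flushed_gen == current_gen) {
        return 0;  /* nothing new to flush */
    }
    n->sync_calls++;
    n->flushed_gen = current_gen;
    return 0;
}
```

Under this model, recursing into a read-only backing node costs a function call and a comparison but no sync, which is the overhead being weighed above.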

Kevin


* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-09-09  9:36       ` Kevin Wolf
@ 2019-09-09 14:04         ` Max Reitz
  2019-09-09 16:13           ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-09 14:04 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1.1: Type: text/plain, Size: 6792 bytes --]

On 09.09.19 11:36, Kevin Wolf wrote:
> Am 09.09.2019 um 09:56 hat Max Reitz geschrieben:
>> On 04.09.19 18:16, Kevin Wolf wrote:
>>> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
>>>> There are BDS children that the general block layer code can access,
>>>> namely bs->file and bs->backing.  Since the introduction of filters and
>>>> external data files, their meaning is not quite clear.  bs->backing can
>>>> be a COW source, or it can be an R/W-filtered child; bs->file can be an
>>>> R/W-filtered child, it can be data and metadata storage, or it can be
>>>> just metadata storage.
>>>>
>>>> This overloading really is not helpful.  This patch adds functions that
>>>> retrieve the correct child for each exact purpose.  Later patches in
>>>> this series will make use of them.  Doing so will allow us to handle
>>>> filter nodes and external data files in a meaningful way.
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>
>>> Each time I look at this patch, I'm confused by the function names.
>>> Maybe I should just ask what the idea there was, or more specifically:
>>> What does the "filtered" in "filtered child" really mean?
>>>
>>> Apparently any child of a filter node is "filtered" (which makes sense),
>>
>> It isn’t, filters can have non-filter children.  For example, backup-top
>> could have the source as a filtered child and the target as a non-filter
>> child.
> 
> Hm, okay, makes sense. I had a definition in mind that says that filter
> nodes only have a single child node. Is it that a filter may have only a
> single _filtered_ child node?

Well, there’s Quorum...

>>> but also bs->backing of a qcow2 image, while bs->file of qcow2 isn't.
>>> raw doesn't have any "filtered" child. What's the system behind this?
>>
>> “filtered” means: If the parent node returns data from this child, it
>> won’t modify it, neither its content nor its position.  COW and R/W
>> filters differ in how they handle writes; R/W filters pass them through
>> to the filtered child, COW filters copy them off to some other child
>> node (and then the filtered child’s data will no longer be visible at
>> that location).
> 
> But there is no reason why a node couldn't fulfill this condition for
> more than one child node. bdrv_filtered_child() isn't well-defined then.
> Technically, the description "Return any filtered child" is correct
> because "any" can be interpreted as "an arbitrary", but obviously that
> makes the function useless.

Which is why it currently returns NULL for Quorum.
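
A rough standalone model of that rule, for illustration only: enum model_kind, struct model_node and model_filtered_child() are hypothetical and merely restate what this thread says (qcow2's backing child is filtered while its file child is not, an R/W filter may use either link, and a Quorum-like node yields NULL because no single child can be picked).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of drivers; not QEMU code. */
enum model_kind {
    MODEL_FORMAT,       /* qcow2-like: COW backing child plus storage child */
    MODEL_RW_FILTER,    /* throttle-like: exactly one R/W-filtered child */
    MODEL_MULTI_FILTER  /* Quorum-like: several equally valid children */
};

struct model_node {
    enum model_kind kind;
    struct model_node *backing;
    struct model_node *file;
};

/* Return the single filtered child of n, or NULL if there is none. */
static struct model_node *model_filtered_child(const struct model_node *n)
{
    switch (n->kind) {
    case MODEL_FORMAT:
        /* Only the COW source counts; file is storage, not filtered */
        return n->backing;
    case MODEL_RW_FILTER:
        /* An R/W filter may attach its child through either link */
        return n->backing ? n->backing : n->file;
    case MODEL_MULTI_FILTER:
        /* No unique answer, so report none */
        return NULL;
    }
    return NULL;
}
```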

> Specifically, according to your definition, qcow2 filters both the
> backing file (COW filter) and the external data file (R/W filter).

Not wrong.  But the same question as for raw arises: Is there any use to
declaring qcow2 an R/W filter driver just because it fits the definition?

>> The main reason behind the common “filtered” name is for the generic
>> functions that work on both COW and true filter (R/W filters) chains.
>> We need such functionality sometimes.  I personally felt like the
>> concept of true (R/W) filters and COW children was similar enough to
>> share a common name base.
> 
> We generally call this concept a "backing chain".

I suppose that’s an exclusive “we”?  Because I use “backing chain” to
refer to COW chains exclusively.

Such a chain may or may not include filters, but they are not really
load-bearing nodes of the chain.  As such, I generally want to skip them
when looking at a backing chain (hence e.g. bdrv_backing_chain_next()).

From what I can tell nobody has ever formalized any terms regarding COW
backing chains or R/W filter chains.  This series is an attempt.

>> qcow2 has a COW child.  As such, it acts as a COW filter in the sense of
>> the function names.
>>
>> raw has neither a COW child nor acts as an R/W filter.  As such, it has
>> no filtered child.  My opinion on this hasn’t changed.
>>
>> (To reiterate, in practice I see no way anyone would ever use raw as an
>> R/W filter.
>> Either you use it without offset/size, in which case you simply use it
>> in lieu of a format node, so you precisely don’t want it to act as a
>> filter when it comes to allocation information and so on (even though it
>> can be classified a filter here).
>> Or you use it as kind of a filter with offset/size, but then it no
>> longer is a filter.
> 
> Agreed with offset, but with only size, it matches your definition of a
> filter.

So?

Should we treat it as a filter when @offset is 0 but otherwise not?
That totally wouldn’t be confusing to users.

>> Filters are defined by “Every filter must fulfill these conditions: ...”
>> – not by “Everything that fulfills these conditions is a filter”.
>> Marking a driver as a filter has consequences, and I don’t see why we
>> would want those consequences for raw.)
>>
>>> It looks like bdrv_filtered_child() is the right function to iterate
>>> along a backing file chain, but I just still fail to connect that and
>>> the name of the function in a meaningful way.
>>
>> It’s the right function to iterate along a filter chain.  This includes
>> COW backing children and R/W filtered children.
> 
> qcow2 doesn't fulfill the conditions for being a filter driver. Two of
> its possible children fulfill the conditions for being a filtered child.
> You can pick either approach, talking about a "filter chain" just
> doesn't make sense there. Either the chain is broken by a non-filter
> driver like qcow2, or it must become a filter tree.

I have no idea what your point is.  There is no point in making it a
filter tree at this point, just as we never had a backing tree.

And the good example is Quorum.  qcow2 is a bad example because there is
no benefit in marking it an R/W filter for its external data file,
exactly as is the case for raw.

> What we're really interested in is iterating the backing chain even
> across filter nodes, so your implementation achieves the right result.
> It just feels completely arbitrary, counterintuitive and confusing to
> call this a (or actually "the") "filter chain" and to pretend that the
> name tells anyone what it really is.

So exactly the same as “bs->backing” or “backing chain” for me.

You disagreeing with me on these terms to me shows that there is a need
to formalize.  This is precisely what I want to do in this series.

The fact that we don’t use the term “filter chain” so far is the reason
why I introduce it.  Because it comes as a clean slate.  “backing chain”
already means something to me, and it means something different.

Max



* Re: [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains
  2019-09-09  9:55       ` Kevin Wolf
@ 2019-09-09 14:08         ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-09 14:08 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1.1: Type: text/plain, Size: 3565 bytes --]

On 09.09.19 11:55, Kevin Wolf wrote:
> Am 09.09.2019 um 10:25 hat Max Reitz geschrieben:
>> On 05.09.19 16:05, Kevin Wolf wrote:
>>> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
>>>> Use child access functions when iterating through backing chains so
>>>> filters do not break the chain.
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>  block.c | 40 ++++++++++++++++++++++++++++------------
>>>>  1 file changed, 28 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index 86b84bea21..42abbaf0ba 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -4376,7 +4376,8 @@ int bdrv_change_backing_file(BlockDriverState *bs,
>>>>  }
>>>>  
>>>>  /*
>>>> - * Finds the image layer in the chain that has 'bs' as its backing file.
>>>> + * Finds the image layer in the chain that has 'bs' (or a filter on
>>>> + * top of it) as its backing file.
>>>>   *
>>>>   * active is the current topmost image.
>>>>   *
>>>> @@ -4388,11 +4389,18 @@ int bdrv_change_backing_file(BlockDriverState *bs,
>>>>  BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
>>>>                                      BlockDriverState *bs)
>>>>  {
>>>> -    while (active && bs != backing_bs(active)) {
>>>> -        active = backing_bs(active);
>>>> +    bs = bdrv_skip_rw_filters(bs);
>>>> +    active = bdrv_skip_rw_filters(active);
>>>
>>> This does more than the commit message says. In addition to iterating
>>> through filters instead of stopping, it also changes the semantics of
>>> the function to return the next non-filter on top of bs instead of the
>>> next node.
>>
>> Which is to say the overlay.
>>
>> (I think we only ever use “overlay” in the COW sense.)
> 
> I think we do, but so far also only ever for immediate COW children, not
> for skipping through intermediate nodes.

Yes, because it’s broken so far.

>>> The block jobs seem to use it only for bdrv_is_allocated_above(), which
>>> should return the same thing in both cases, so the behaviour stays the
>>> same. qmp_block_commit() will check op blockers on a different node now,
>>> which could be a fix or a bug, I can't tell offhand. Probably the
>>> blocking doesn't really work anyway.
>>
>> You mean that the op blocker could have been on a block job filter node
>> before?  I think that’s pretty much the point of this fix; that that
>> doesn’t make sense.  (We didn’t have mirror_top_bs and the like at
>> 058223a6e3b.)
> 
> On the off chance that the op blocker actually works, it can't be a job
> filter. I was thinking more of throttling, blkdebug etc.

Well, that’s just broken altogether right now.

>>> All of this should be mentioned in the commit message at least. Maybe
>>> it's also worth splitting in two patches.
>>
>> I don’t know.  The function was written when there were no filters.
> 
> I doubt it. blkdebug is a really old filter.
> 
>> This change would have been a no-op then.  The fact that it isn’t to me
>> just means that introducing filters broke it.
>>
>> So I don’t know what I would write.  Maybe “bdrv_find_overlay() now
>> actually finds the overlay, that is, it will not return a filter node.
>> This is the behavior that all callers expect (because they work on COW
>> backing chains).”
> 
> Maybe just something like "In addition, filter nodes are not returned as
> overlays any more. Instead, the first non-filter node on top of bs is
> considered the overlay now."?

Isn’t that basically the same? :-)

Max



* Re: [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code
  2019-09-09 10:01           ` Kevin Wolf
@ 2019-09-09 14:15             ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-09 14:15 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1.1: Type: text/plain, Size: 5319 bytes --]

On 09.09.19 12:01, Kevin Wolf wrote:
> Am 09.09.2019 um 10:31 hat Max Reitz geschrieben:
>> On 05.09.19 18:24, Kevin Wolf wrote:
>>> Am 12.08.2019 um 14:58 hat Max Reitz geschrieben:
>>>> On 10.08.19 17:36, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 09.08.2019 19:13, Max Reitz wrote:
>>>>>> If the driver does not support .bdrv_co_flush() so bdrv_co_flush()
>>>>>> itself has to flush the children of the given node, it should not flush
>>>>>> just bs->file->bs, but in fact all children.
>>>>>>
>>>>>> In any case, the BLKDBG_EVENT() should be emitted on the primary child,
>>>>>> because that is where a blkdebug node would be if there is any.
>>>>>>
>>>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>>>> ---
>>>>>>   block/io.c | 23 +++++++++++++++++------
>>>>>>   1 file changed, 17 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/block/io.c b/block/io.c
>>>>>> index c5a8e3e6a3..bcc770d336 100644
>>>>>> --- a/block/io.c
>>>>>> +++ b/block/io.c
>>>>>> @@ -2572,6 +2572,8 @@ static void coroutine_fn bdrv_flush_co_entry(void *opaque)
>>>>>>   
>>>>>>   int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>>>>>   {
>>>>>> +    BdrvChild *primary_child = bdrv_primary_child(bs);
>>>>>> +    BdrvChild *child;
>>>>>>       int current_gen;
>>>>>>       int ret = 0;
>>>>>>   
>>>>>> @@ -2601,7 +2603,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>>>>>       }
>>>>>>   
>>>>>>       /* Write back cached data to the OS even with cache=unsafe */
>>>>>> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_OS);
>>>>>> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_OS);
>>>>>>       if (bs->drv->bdrv_co_flush_to_os) {
>>>>>>           ret = bs->drv->bdrv_co_flush_to_os(bs);
>>>>>>           if (ret < 0) {
>>>>>> @@ -2611,15 +2613,15 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>>>>>   
>>>>>>       /* But don't actually force it to the disk with cache=unsafe */
>>>>>>       if (bs->open_flags & BDRV_O_NO_FLUSH) {
>>>>>> -        goto flush_parent;
>>>>>> +        goto flush_children;
>>>>>>       }
>>>>>>   
>>>>>>       /* Check if we really need to flush anything */
>>>>>>       if (bs->flushed_gen == current_gen) {
>>>>>> -        goto flush_parent;
>>>>>> +        goto flush_children;
>>>>>>       }
>>>>>>   
>>>>>> -    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
>>>>>> +    BLKDBG_EVENT(primary_child, BLKDBG_FLUSH_TO_DISK);
>>>>>>       if (!bs->drv) {
>>>>>>           /* bs->drv->bdrv_co_flush() might have ejected the BDS
>>>>>>            * (even in case of apparent success) */
>>>>>> @@ -2663,8 +2665,17 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>>>>>       /* Now flush the underlying protocol.  It will also have BDRV_O_NO_FLUSH
>>>>>>        * in the case of cache=unsafe, so there are no useless flushes.
>>>>>>        */
>>>>>> -flush_parent:
>>>>>> -    ret = bs->file ? bdrv_co_flush(bs->file->bs) : 0;
>>>>>> +flush_children:
>>>>>> +    ret = 0;
>>>>>> +    QLIST_FOREACH(child, &bs->children, next) {
>>>>>> +        int this_child_ret;
>>>>>> +
>>>>>> +        this_child_ret = bdrv_co_flush(child->bs);
>>>>>> +        if (!ret) {
>>>>>> +            ret = this_child_ret;
>>>>>> +        }
>>>>>> +    }
>>>>>
>>>>> Hmm, you said that we want to flush only children with write-access from parent..
>>>>
>>>> Good that you remember it, I must have overlooked it (when reading the
>>>> replies to the previous version). :-)
>>>>
>>>>> Shouldn't we check it? Or we assume that it's always safe to call bdrv_co_flush on
>>>>> a node?
>>>>
>>>> I think it’s always safe.  But checking it seems like a nice touch, yes.
>>>
>>> I'm not sure why we would unconditionally flush all children anyway. The
>>> only drivers I can think of that really need to flush more than one
>>> child are blkverify and quorum, and both of them already implement this.
>>> blkverify implements .bdrv_co_flush, so it's not affected by the change
>>> anyway, but quorum children will be flushed twice now.
>>>
>>> But more than this, I'm worried about the overhead of needlessly
>>> recursing through the whole backing chain and calling flush on every
>>> node there.  Maybe bs->write_gen saves us so that at least this doesn't
>>> result in an fdatasync() call for each, but still... Without a use case,
>>> I'd rather not do this.
>>>
>>> Oh, well, after having written all of this, I see that qcow2 with an
>>> external data file is buggy... This could be fixed in the qcow2 driver,
>>> but maybe restricting the recursion to read-only is actually good enough
>>> then. Can you mention this case in the commit message and maybe build a
>>> test for it?
>>
>> And I should thus probably drop vmdk’s .bdrv_co_flush_to_disk()
>> implementation.
>>
>> I will indeed try to write a test, but to be completely honest, I feel
>> like this series is long enough.
> 
> I guess I could already merge patch 1 to give you space for another test
> patch. ;-)

Don’t forget the mirror-top patch.  And AFAIR, there was some comment
from Vladimir that also required an additional patch.  So it would need
to be three!

(Or I just start squashing from the back?)

Max
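
(Illustration only, not part of the patch: a minimal sketch of the
write-access check discussed in the quoted thread above.  SketchNode,
SketchChild and parent_may_write are hypothetical stand-ins, not QEMU's
BlockDriverState/BdrvChild or its permission system.  The loop flushes
only children the parent may write to and keeps the first error, as the
quoted hunk does for all children.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real block layer types. */
#define SKETCH_MAX_CHILDREN 4

typedef struct SketchNode SketchNode;

typedef struct SketchChild {
    SketchNode *bs;
    bool parent_may_write;  /* assumed stand-in for a write-permission check */
} SketchChild;

struct SketchNode {
    SketchChild children[SKETCH_MAX_CHILDREN];
    size_t nb_children;
    int flush_ret;    /* result this node's own flush reports */
    int flush_calls;  /* how often this node was flushed */
};

/*
 * Flush the node itself, then recurse only into children the parent may
 * write to.  The first non-zero result is the one that gets returned.
 */
static int sketch_flush(SketchNode *bs)
{
    int ret;
    size_t i;

    assert(bs != NULL);
    bs->flush_calls++;
    ret = bs->flush_ret;

    for (i = 0; i < bs->nb_children; i++) {
        SketchChild *child = &bs->children[i];
        int this_child_ret;

        if (!child->parent_may_write) {
            continue;  /* e.g. a backing file that is only read */
        }

        this_child_ret = sketch_flush(child->bs);
        if (!ret) {
            ret = this_child_ret;
        }
    }

    return ret;
}
```

With a writable data child and a read-only backing child, only the
former is visited, so a long backing chain is not recursed into.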


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-09-09 14:04         ` Max Reitz
@ 2019-09-09 16:13           ` Kevin Wolf
  2019-09-10  9:14             ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-09 16:13 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 10830 bytes --]

Am 09.09.2019 um 16:04 hat Max Reitz geschrieben:
> On 09.09.19 11:36, Kevin Wolf wrote:
> > Am 09.09.2019 um 09:56 hat Max Reitz geschrieben:
> >> On 04.09.19 18:16, Kevin Wolf wrote:
> >>> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
> >>>> There are BDS children that the general block layer code can access,
> >>>> namely bs->file and bs->backing.  Since the introduction of filters and
> >>>> external data files, their meaning is not quite clear.  bs->backing can
> >>>> be a COW source, or it can be an R/W-filtered child; bs->file can be an
> >>>> R/W-filtered child, it can be data and metadata storage, or it can be
> >>>> just metadata storage.
> >>>>
> >>>> This overloading really is not helpful.  This patch adds functions that
> >>>> retrieve the correct child for each exact purpose.  Later patches in
> >>>> this series will make use of them.  Doing so will allow us to handle
> >>>> filter nodes and external data files in a meaningful way.
> >>>>
> >>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>
> >>> Each time I look at this patch, I'm confused by the function names.
> >>> Maybe I should just ask what the idea there was, or more specifically:
> >>> What does the "filtered" in "filtered child" really mean?
> >>>
> >>> Apparently any child of a filter node is "filtered" (which makes sense),
> >>
> >> It isn’t, filters can have non-filter children.  For example, backup-top
> >> could have the source as a filtered child and the target as a non-filter
> >> child.
> > 
> > Hm, okay, makes sense. I had a definition in mind that says that filter
> > nodes only have a single child node. Is it that a filter may have only a
> > single _filtered_ child node?
> 
> Well, there’s Quorum...

Ah, nice, quorum sets is_filter = true even though it neither fulfills
the conditions for it before this series, nor the changed conditions
after this series.

So either quorum lies and isn't actually a filter driver, or our
definition in the documentation of is_filter is wrong.

> >>> but also bs->backing of a qcow2 image, while bs->file of qcow2 isn't.
> >>> raw doesn't have any "filtered" child. What's the system behind this?
> >>
> >> “filtered” means: If the parent node returns data from this child, it
> >> won’t modify it, neither its content nor its position.  COW and R/W
> >> filters differ in how they handle writes; R/W filters pass them through
> >> to the filtered child, COW filters copy them off to some other child
> >> node (and then the filtered child’s data will no longer be visible at
> >> that location).
> > 
> > But there is no reason why a node couldn't fulfill this condition for
> > more than one child node. bdrv_filtered_child() isn't well-defined then.
> > Technically, the description "Return any filtered child" is correct
> > because "any" can be interpreted as "an arbitrary", but obviously that
> > makes the function useless.
> 
> Which is why it currently returns NULL for Quorum.

Which is about the only possible choice that breaks the contract...

 * Return any filtered child, independently of how it reacts to write
 * accesses and whether data is copied onto this BDS through COR.

Maybe the documentation of bdrv_filtered_child() needs to be rephrased?

Going back to qcow2, it's really not much different as it has multiple
(two) filtered children, too. So if quorum returns NULL to mean "no
unambiguous result", why does it return bs->backing instead of NULL for
a qcow2 node?

(Yes, I know, because it's useful. But I'm trying to get some basic
consistency into these interfaces.)

> > Specifically, according to your definition, qcow2 filters both the
> > backing file (COW filter) and the external data file (R/W filter).
> 
> Not wrong.  But the same question as for raw arises: Is there any use to
> declaring qcow2 an R/W filter driver just because it fits the definition?

Wait, where is there even a place where this could be declared?

The one thing I see that a driver can even declare is drv->is_filter,
which is about the whole driver and not about nodes. It is false for
qcow2.

Then you made some criteria above that tell us whether a specific child
of a node is a filtered child or not. As it happens, qcow2 (which is not
a filter driver) can have two children that match the criteria for being
filtered children.

I already think this is a bit inconsistent, because why should a driver
that declares itself a non-filter be considered to filter children?
Okay, you say a broader definition of a filtered child is useful because
you can then include all BdrvChild links in a backing/filter chain. Fair
enough, it's not intuitive, but use a broader definition then.

But the point where you say that even though two of the children
are filtered children under your broader definition, for the purpose of
the API only one of them should be considered because the other one
isn't that useful, that's really one inconsistency too much for me. You
can't use a broad definition and then arbitrarily restrict the
definition again so that it matches the special case you're currently
interested in.

Either use a narrow definition, or use a broad one. But use only one and
use it consistently.

> >> The main reason behind the common “filtered” name is for the generic
> >> functions that work on both COW and true filter (R/W filters) chains.
> >> We need such functionality sometimes.  I personally felt like the
> >> concept of true (R/W) filters and COW children was similar enough to
> >> share a common name base.
> > 
> > We generally call this concept a "backing chain".
> 
> I suppose that’s an exclusive “we”?  Because I use ”backing chain” to
> refer to COW chains exclusively.
> 
> Such a chain may or may not include filters, but they are not really
> load-bearing nodes of the chain.  As such, I generally want to skip them
> when looking at a backing chain (hence e.g. bdrv_backing_chain_next()).
> 
> From what I can tell nobody has ever formalized any terms regarding COW
> backing chains or R/W filter chains.  This series is an attempt.

Well, as you can see, this attempt feels confusing to me.

I agree with your naming of bdrv_backing_chain_next(), it's clear enough
what path it will follow down the graph. I just disagree that "filter
chain" is a good term for something that prefers backing file links when
it has a choice.

> >> qcow2 has a COW child.  As such, it acts as a COW filter in the sense of
> >> the function names.
> >>
> >> raw has neither a COW child nor acts as an R/W filter.  As such, it has
> >> no filtered child.  My opinion on this hasn’t changed.
> >>
> >> (To reiterate, in practice I see no way anyone would ever use raw as an
> >> R/W filter.
> >> Either you use it without offset/size, in which case you simply use it
> >> in lieu of a format node, so you precisely don’t want it to act as a
> >> filter when it comes to allocation information and so on (even though it
> >> can be classified a filter here).
> >> Or you use it as kind of a filter with offset/size, but then it no
> >> longer is a filter.
> > 
> > Agreed with offset, but with only size, it matches your definition of a
> > filter.
> 
> So?
> 
> Should we treat it as a filter when @offset is 0 but otherwise not?
> That totally wouldn’t be confusing to users.

No, I'm just applying your definitions to see if the contradictions
between them and your explanations are of any importance. *shrug*

> >> Filters are defined by “Every filter must fulfill these conditions: ...”
> >> – not by “Everything that fulfills these conditions is a filter”.
> >> Marking a driver as a filter has consequences, and I don’t see why we
> >> would want those consequences for raw.)
> >>
> >>> It looks like bdrv_filtered_child() is the right function to iterate
> >>> along a backing file chain, but I just still fail to connect that and
> >>> the name of the function in a meaningful way.
> >>
> >> It‘s the right function to iterate along a filter chain.  This includes
> >> COW backing children and R/W filtered children.
> > 
> > qcow2 doesn't fulfill the conditions for being a filter driver. Two of
> > its possible children fulfill the conditions for being a filtered child.
> > You can pick either approach, talking about a "filter chain" just
> > doesn't make sense there. Either the chain is broken by a non-filter
> > driver like qcow2, or it must become a filter tree.
> 
> I have no idea what your point is.  There is no point in making it a
> filter tree at this point, just as we never had a backing tree.
> 
> And the good example is Quorum.  qcow2 is a bad example because there is
> no benefit in marking it an R/W filter for its external data file,
> exactly like is the case for raw.

My point is not about changing the logic in the code, but about using
names that actually describe accurately what the logic does.

And as I said above, neither is "not useful" a convincing argument for
ignoring filtered children (as I think we're trying to build something
rather generic, not something that works only for what we consider
useful today) nor do I see how qcow2 could be marked or not marked an
R/W filter (as mentioned above).

> > What we're really interested in is iterating the backing chain even
> > across filter nodes, so your implementation achieves the right result.
> > It just feels completely arbitrary, counterintuitive and confusing to
> > call this a (or actually "the") "filter chain" and to pretend that the
> > name tells anyone what it really is.
> 
> So exactly the same as “bs->backing” or “backing chain” for me.
> 
> You disagreeing with me on these terms to me shows that there is a need
> to formalize.  This is precisely what I want to do in this series.
> 
> The fact that we don’t use the term “filter chain” so far is the reason
> why I introduce it.  Because it comes as a clean slate.  “backing chain”
> already means something to me, and it means something different.

Well, if "backing chain" is too narrow, "filter chain" is both too
unspecific and inconsistent with the various definitions of "filter" and
"filtered" we're using, and we can't think of anything more concise, we
might have to use names that just mention both.

bdrv_cow_child() // don't call COW a filter, because .is_filter = false
bdrv_filter_child() // your R/W filter, only for .is_filter = true nodes
bdrv_filter_or_cow_child()

Or something like that. This would bring some more consistency into the
way we use the words filter/filtered at least.
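
To make the split concrete, here is a rough sketch of how the three
accessors could relate to each other. The Sketch* types are simplified
stand-ins invented for this mail, not the real BlockDriverState, and
the assumption that a filter driver uses exactly one of bs->file and
bs->backing is mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real block layer types. */
typedef struct SketchDriver {
    bool is_filter;
} SketchDriver;

typedef struct SketchNode SketchNode;
struct SketchNode {
    const SketchDriver *drv;
    SketchNode *file;
    SketchNode *backing;
};

/* COW source: bs->backing, but only for nodes with .is_filter = false. */
static SketchNode *sketch_cow_child(SketchNode *bs)
{
    assert(bs != NULL && bs->drv != NULL);
    return bs->drv->is_filter ? NULL : bs->backing;
}

/* R/W-filtered child: only for nodes with .is_filter = true. */
static SketchNode *sketch_filter_child(SketchNode *bs)
{
    assert(bs != NULL && bs->drv != NULL);
    if (!bs->drv->is_filter) {
        return NULL;
    }
    /* Assumption: a filter uses either bs->backing or bs->file, not both. */
    assert(!(bs->file && bs->backing));
    return bs->backing ? bs->backing : bs->file;
}

/* The link to follow when walking a chain across both kinds of nodes. */
static SketchNode *sketch_filter_or_cow_child(SketchNode *bs)
{
    SketchNode *cow = sketch_cow_child(bs);
    SketchNode *filter = sketch_filter_child(bs);

    /* By construction at most one of the two is non-NULL. */
    assert(!(cow && filter));
    return cow ? cow : filter;
}
```

For a throttle-like node on top of a qcow2-like node with a backing
file, the third function steps from the filter to the format node and
from there to the backing node, and never to the format node's
bs->file.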

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-09-09 16:13           ` Kevin Wolf
@ 2019-09-10  9:14             ` Max Reitz
  2019-09-10 10:47               ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-10  9:14 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1.1: Type: text/plain, Size: 9972 bytes --]

On 09.09.19 18:13, Kevin Wolf wrote:
> Am 09.09.2019 um 16:04 hat Max Reitz geschrieben:
>> On 09.09.19 11:36, Kevin Wolf wrote:
>>> Am 09.09.2019 um 09:56 hat Max Reitz geschrieben:
>>>> On 04.09.19 18:16, Kevin Wolf wrote:
>>>>> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
>>>>>> There are BDS children that the general block layer code can access,
>>>>>> namely bs->file and bs->backing.  Since the introduction of filters and
>>>>>> external data files, their meaning is not quite clear.  bs->backing can
>>>>>> be a COW source, or it can be an R/W-filtered child; bs->file can be an
>>>>>> R/W-filtered child, it can be data and metadata storage, or it can be
>>>>>> just metadata storage.
>>>>>>
>>>>>> This overloading really is not helpful.  This patch adds functions that
>>>>>> retrieve the correct child for each exact purpose.  Later patches in
>>>>>> this series will make use of them.  Doing so will allow us to handle
>>>>>> filter nodes and external data files in a meaningful way.
>>>>>>
>>>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>
>>>>> Each time I look at this patch, I'm confused by the function names.
>>>>> Maybe I should just ask what the idea there was, or more specifically:
>>>>> What does the "filtered" in "filtered child" really mean?
>>>>>
>>>>> Apparently any child of a filter node is "filtered" (which makes sense),
>>>>
>>>> It isn’t, filters can have non-filter children.  For example, backup-top
>>>> could have the source as a filtered child and the target as a non-filter
>>>> child.
>>>
>>> Hm, okay, makes sense. I had a definition in mind that says that filter
>>> nodes only have a single child node. Is it that a filter may have only a
>>> single _filtered_ child node?
>>
>> Well, there’s Quorum...
> 
> Ah, nice, quorum sets is_filter = true even though it neither fulfills
> the conditions for it before this series, nor the changed conditions
> after this series.
> 
> So either quorum lies and isn't actually a filter driver, or our
> definition in the documentation of is_filter is wrong.

You could say it lies because in FIFO mode it clearly isn’t a filter for
all of its children.

There is a reason for lying, though, which is
bdrv_recurse_is_first_non_filter(), which is necessary to use the whole
to_replace mirror stuff.

(You mirror from a quorum with a failed child and then replace the
failed child.  mirror needs to ensure that there are only R/W filters
between the child and the mirror source so that replacing it will not
suddenly change any visible data.  Which is actually a lie for quorum,
because the child is clearly broken and thus precisely doesn’t show the
same data...)

Maybe we should stop declaring Quorum a filter and then rename the
bdrv_recurse_is_first_non_filter() to, I don’t know,
bdrv_recurse_can_be_replaced_by_mirror()?

>>>>> but also bs->backing of a qcow2 image, while bs->file of qcow2 isn't.
>>>>> raw doesn't have any "filtered" child. What's the system behind this?
>>>>
>>>> “filtered” means: If the parent node returns data from this child, it
>>>> won’t modify it, neither its content nor its position.  COW and R/W
>>>> filters differ in how they handle writes; R/W filters pass them through
>>>> to the filtered child, COW filters copy them off to some other child
>>>> node (and then the filtered child’s data will no longer be visible at
>>>> that location).
>>>
>>> But there is no reason why a node couldn't fulfill this condition for
>>> more than one child node. bdrv_filtered_child() isn't well-defined then.
>>> Technically, the description "Return any filtered child" is correct
>>> because "any" can be interpreted as "an arbitrary", but obviously that
>>> makes the function useless.
>>
>> Which is why it currently returns NULL for Quorum.
> 
> Which is about the only possible choice that breaks the contract...
> 
>  * Return any filtered child, independently of how it reacts to write

I don’t know if you’re serious about this proposition, because I don’t
know whether that could be useful in any way. :-?

>  * accesses and whether data is copied onto this BDS through COR.

I meant the contract as “Return the single filtered child there is, or NULL”

> Maybe the documentation of bdrv_filtered_child() needs to be rephrased?
> 
> Going back to qcow2, it's really not much different as it has multiple
> (two) filtered children, too.

Well, it doesn’t.  It isn’t an R/W filter.

Maybe what we actually need to rephrase is the definition of .is_filter.
 (Namely something along the lines of “Fulfills these guarantees (same
data, etc. pp.), *and* should be skipped for allocation information
queries etc.”)

> So if quorum returns NULL to mean "no
> unambiguous result", why does it return bs->backing instead of NULL for
> a qcow2 node?
> 
> (Yes, I know, because it's useful. But I'm trying to get some basic
> consistency into these interfaces.)

Not precisely because it’s useful, but because qcow2 does not have
.is_filter set.  :-)
(And it doesn’t have it set because that wouldn’t be useful.)

>>> Specifically, according to your definition, qcow2 filters both the
>>> backing file (COW filter) and the external data file (R/W filter).
>>
>> Not wrong.  But the same question as for raw arises: Is there any use to
>> declaring qcow2 an R/W filter driver just because it fits the definition?
> 
> Wait, where is there even a place where this could be declared?
> 
> The one thing I see that a driver can even declare is drv->is_filter,
> which is about the whole driver and not about nodes. It is false for
> qcow2.

That’s correct.  But that’s not a fundamental problem, of course, we
could make it a per-BDS attribute if that made sense.

> Then you made some criteria above that tell us whether a specific child
> of a node is a filtered child or not. As it happens, qcow2 (which is not
> a filter driver) can have two children that match the criteria for being
> filtered children.

But just arguing that I’m incapable of giving a good definition won’t
bring us along.

> I already think this is a bit inconsistent, because why should a driver
> that declares itself a non-filter be considered to filter children?

.is_filter is for R/W filters.  COW filters have .supports_backing for that.

> Okay, you say a broader definition of a filtered child is useful because
> you can then include all BdrvChild links in a backing/filter chain. Fair
> enough, it's not intuitive, but use a broader definition then.
> 
> But the point where you say that even though two of the children
> are filtered children under your broader definition, for the purpose of
> the API only one of them should be considered because the other one
> isn't that useful, that's really one inconsistency too much for me. You
> can't use a broad definition and then arbitrarily restrict the
> definition again so that it matches the special case you're currently
> interested in.
> 
> Either use a narrow definition, or use a broad one. But use only one and
> use it consistently.

I think the problem appears because you restrict the process to a single
step where there are actually two.

Drivers can be either
(1) R/W filters (e.g. throttle)
(2) COW filters (e.g. qcow2)
(3) None of the above (e.g. vhdx, curl)

This choice is made on the driver level, not on the node level (for good
reason, see below*).

And then we derive the node’s filtered children from what the driver is.
 If it’s an R/W filter, bdrv_filtered_child() will return the R/W
filtered child.  If it’s a COW filter, bdrv_filtered_child() will return
the potentially existing COW backing child (or NULL, if there is no
backing child).
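
As a rough sketch of that two-step process (the Sketch* types and the
enum are invented for illustration and are not the real structures;
deriving the class from .is_filter and .supports_backing is my reading
of the above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real block layer types. */
typedef struct SketchDriver {
    bool is_filter;         /* R/W filter, e.g. throttle */
    bool supports_backing;  /* COW filter, e.g. qcow2 */
} SketchDriver;

typedef struct SketchNode SketchNode;
struct SketchNode {
    const SketchDriver *drv;
    SketchNode *file;
    SketchNode *backing;
};

typedef enum SketchDriverClass {
    SKETCH_RW_FILTER,
    SKETCH_COW_FILTER,
    SKETCH_NEITHER,
} SketchDriverClass;

/* Step 1: the choice is made on the driver level, not per node. */
static SketchDriverClass sketch_classify(const SketchDriver *drv)
{
    if (drv->is_filter) {
        return SKETCH_RW_FILTER;
    }
    if (drv->supports_backing) {
        return SKETCH_COW_FILTER;
    }
    return SKETCH_NEITHER;
}

/* Step 2: derive the node's filtered child from what the driver is. */
static SketchNode *sketch_filtered_child(SketchNode *bs)
{
    assert(bs != NULL && bs->drv != NULL);

    switch (sketch_classify(bs->drv)) {
    case SKETCH_RW_FILTER:
        return bs->file ? bs->file : bs->backing;
    case SKETCH_COW_FILTER:
        return bs->backing;  /* may be NULL: no backing file attached */
    default:
        return NULL;
    }
}
```

Note that in this sketch a raw-like node (neither flag set) on top of a
qcow2-like node always returns NULL, so it consistently stops the
chain.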


*
What is clear to me is that it isn’t useful to treat nodes of a specific
driver sometimes as R/W filter nodes and sometimes not.  R/W filter
nodes are handled differently from other nodes, and it would be
confusing if a certain driver sometimes behaves this and sometimes that
way.  (For example, if you put a raw node on top of a qcow2 node,
sometimes it would stop the backing chain, sometimes it wouldn’t.  That
makes no sense to me.)

OTOH, for COW filters, we do exactly that.  Sometimes they have a
backing file, sometimes they don’t.  That’s completely fine because
their overall behavior doesn’t change.


That makes me agree that there is indeed too much of a difference
between R/W filters and COW filters to lump them together under the
“filter” label.

[...]

> My point is not about changing the logic in the code, but about using
> names that actually describe accurately what the logic does.

Again, naming things is hard.

[...]

>> You disagreeing with me on these terms to me shows that there is a need
>> to formalize.  This is precisely what I want to do in this series.
>>
>> The fact that we don’t use the term “filter chain” so far is the reason
>> why I introduce it.  Because it comes as a clean slate.  “backing chain”
>> already means something to me, and it means something different.
> 
> Well, if "backing chain" is too narrow, "filter chain" is both too
> unspecific and inconsistent with the various definitions of "filter" and
> "filtered" we're using, and we can't think of anything more concise, we
> might have to use names that just mention both.
> 
> bdrv_cow_child() // don't call COW a filter, because .is_filter = false
> bdrv_filter_child() // your R/W filter, only for .is_filter = true nodes
> bdrv_filter_or_cow_child()
> 
> Or something like that. This would bring some more consistency into the
> way we use the words filter/filtered at least.

I’ll see how that looks overall, but why not.  Sounds good to me.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-09-10  9:14             ` Max Reitz
@ 2019-09-10 10:47               ` Kevin Wolf
  2019-09-10 11:36                 ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-10 10:47 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 12625 bytes --]

Am 10.09.2019 um 11:14 hat Max Reitz geschrieben:
> On 09.09.19 18:13, Kevin Wolf wrote:
> > Am 09.09.2019 um 16:04 hat Max Reitz geschrieben:
> >> On 09.09.19 11:36, Kevin Wolf wrote:
> >>> Am 09.09.2019 um 09:56 hat Max Reitz geschrieben:
> >>>> On 04.09.19 18:16, Kevin Wolf wrote:
> >>>>> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
> >>>>>> There are BDS children that the general block layer code can access,
> >>>>>> namely bs->file and bs->backing.  Since the introduction of filters and
> >>>>>> external data files, their meaning is not quite clear.  bs->backing can
> >>>>>> be a COW source, or it can be an R/W-filtered child; bs->file can be an
> >>>>>> R/W-filtered child, it can be data and metadata storage, or it can be
> >>>>>> just metadata storage.
> >>>>>>
> >>>>>> This overloading really is not helpful.  This patch adds functions that
> >>>>>> retrieve the correct child for each exact purpose.  Later patches in
> >>>>>> this series will make use of them.  Doing so will allow us to handle
> >>>>>> filter nodes and external data files in a meaningful way.
> >>>>>>
> >>>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>>>
> >>>>> Each time I look at this patch, I'm confused by the function names.
> >>>>> Maybe I should just ask what the idea there was, or more specifically:
> >>>>> What does the "filtered" in "filtered child" really mean?
> >>>>>
> >>>>> Apparently any child of a filter node is "filtered" (which makes sense),
> >>>>
> >>>> It isn’t, filters can have non-filter children.  For example, backup-top
> >>>> could have the source as a filtered child and the target as a non-filter
> >>>> child.
> >>>
> >>> Hm, okay, makes sense. I had a definition in mind that says that filter
> >>> nodes only have a single child node. Is it that a filter may have only a
> >>> single _filtered_ child node?
> >>
> >> Well, there’s Quorum...
> > 
> > Ah, nice, quorum sets is_filter = true even though it neither fulfills
> > the conditions for it before this series, nor the changed conditions
> > after this series.
> > 
> > So either quorum lies and isn't actually a filter driver, or our
> > definition in the documentation of is_filter is wrong.
> 
> You could say it lies because in FIFO mode it clearly isn’t a filter for
> all of its children.
> 
> There is a reason for lying, though, which is
> bdrv_recurse_is_first_non_filter(), which is necessary to use the whole
> to_replace mirror stuff.

Hm, actually, now that you mention bdrv_recurse_is_first_non_filter(),
quorum was the first driver to declare itself a filter, so strictly
speaking, if there is an inconsistency, it's the other uses that are
abusing the field...

> (You mirror from a quorum with a failed child and then replace the
> failed child.  mirror needs to ensure that there are only R/W filters
> between the child and the mirror source so that replacing it will not
> suddenly change any visible data.  Which is actually a lie for quorum,
> because the child is clearly broken and thus precisely doesn’t show the
> same data...)
> 
> Maybe we should stop declaring Quorum a filter and then rename the
> bdrv_recurse_is_first_non_filter() to, I don’t know,
> bdrv_recurse_can_be_replaced_by_mirror()?

Why not.

> >>>>> but also bs->backing of a qcow2 image, while bs->file of qcow2 isn't.
> >>>>> raw doesn't have any "filtered" child. What's the system behind this?
> >>>>
> >>>> “filtered” means: If the parent node returns data from this child, it
> >>>> won’t modify it, neither its content nor its position.  COW and R/W
> >>>> filters differ in how they handle writes; R/W filters pass them through
> >>>> to the filtered child, COW filters copy them off to some other child
> >>>> node (and then the filtered child’s data will no longer be visible at
> >>>> that location).
> >>>
> >>> But there is no reason why a node couldn't fulfill this condition for
> >>> more than one child node. bdrv_filtered_child() isn't well-defined then.
> >>> Technically, the description "Return any filtered child" is correct
> >>> because "any" can be interpreted as "an arbitrary", but obviously that
> >>> makes the function useless.
> >>
> >> Which is why it currently returns NULL for Quorum.
> > 
> > Which is about the only possible choice that breaks the contract...
> > 
> >  * Return any filtered child, independently of how it reacts to write
> 
> I don’t know if you’re serious about this proposition, because I don’t
> know whether that could be useful in any way. :-?

Huh? This is just quoting the contract from your code?

> >  * accesses and whether data is copied onto this BDS through COR.
> 
> I meant the contract as “Return the single filtered child there is, or NULL”

Then that should probably be spelt out in the contract. Probably even
explicitly "NULL if there is either no filtered child or multiple
filtered children".
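
Roughly what I mean, as a sketch only. SketchChild and its per-child
"filtered" flag are hypothetical; no such flag exists in the code
today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a child link with a per-child marker. */
typedef struct SketchChild {
    const char *name;
    bool filtered;  /* assumed flag: parent passes this child's data through */
} SketchChild;

/*
 * Return the single filtered child.
 * Return NULL if there is either no filtered child or more than one.
 */
static const SketchChild *
sketch_single_filtered_child(const SketchChild *children, size_t n)
{
    const SketchChild *found = NULL;
    size_t i;

    assert(children != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (!children[i].filtered) {
            continue;
        }
        if (found) {
            return NULL;  /* ambiguous, as for quorum */
        }
        found = &children[i];
    }

    return found;
}
```

Spelt out like this, the NULL for quorum no longer contradicts the
comment: three filtered children give NULL, exactly one gives that
child.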

> > Maybe the documentation of bdrv_filtered_child() needs to be rephrased?
> > 
> > Going back to qcow2, it's really not much different as it has multiple
> > (two) filtered children, too.
> 
> Well, it doesn’t.  It isn’t an R/W filter.

What do I have to look at to see whether something is an R/W filter or
not? qcow2 matches your criteria for an R/W filter. You say that it's
not useful, so it's not an R/W filter anyway. But where in the code
could I get this information?

This just doesn't make sense to me. If a driver matches the criteria for
an R/W filter, then it should be one. If qcow2 should not be considered
an R/W filter, then the criteria must be changed so that it isn't.

> Maybe what we actually need to rephrase is the definition of .is_filter.
>  (Namely something along the lines of “Fulfills these guarantees (same
> data, etc. pp.), *and* should be skipped for allocation information
> queries etc.”)

Hm - does this imply that .is_filter == this is an R/W filter? Because
this was never spelt out, neither in code comments nor in commit
messages.

If we called R/W filters just "filters" (which makes it obvious how it
relates to .is_filter) and COW nodes something that doesn't include the
word "filter", things might become a lot clearer.

> > So if quorum returns NULL to mean "no
> > unambiguous result", why does it return bs->backing instead of NULL for
> > a qcow2 node?
> > 
> > (Yes, I know, because it's useful. But I'm trying to get some basic
> > consistency into these interfaces.)
> 
> Not precisely because it’s useful, but because qcow2 does not have
> .is_filter set.  :-)
> (And it doesn’t have it set because that wouldn’t be useful.)
> 
> >>> Specifically, according to your definition, qcow2 filters both the
> >>> backing file (COW filter) and the external data file (R/W filter).
> >>
> >> Not wrong.  But the same question as for raw arises: Is there any use to
> >> declaring qcow2 an R/W filter driver just because it fits the definition?
> > 
> > Wait, where is there even a place where this could be declared?
> > 
> > The one thing I see that a driver even can declare is drv->is_filter,
> > which is about the whole driver and not about nodes. It is false for
> > qcow2.
> 
> That’s correct.  But that’s not a fundamental problem, of course, we
> could make it a per-BDS attribute if that made sense.

I was thinking per-child, actually, because you declare one BdrvChild
filtered and another not filtered.

But by now I think most of the confusion is really just a result of COW
being considered a filter in some respects (mainly just the names of the
child access functions), but not in others (like .is_filter).

> > Then you made some criteria above that tell us whether a specific child
> > of a node is a filtered child or not. As it happens, qcow2 (which is not
> > a filter driver) can have two children that match the criteria for being
> > filtered children.
> 
> But just arguing that I’m incapable of giving a good definition won’t
> bring us along.
> 
> > I already think this is a bit inconsistent, because why should a driver
> > that declares itself a non-filter be considered to filter children?
> 
> .is_filter is for R/W filters.  COW filters have .supports_backing for that.

Okay, so you confirm what I concluded above.

> > Okay, you say a broader definition of a filtered child is useful because
> > you can then include all BdrvChild links in a backing/filter chain. Fair
> > enough, it's not intuitive, but use a broader definition then.
> > 
> > But the point where you say that even though two of the children
> > are filtered children under your broader definition, for the purpose of
> > the API only one of them should be considered because the other one
> > isn't that useful, that's really one inconsistency too much for me. You
> > can't use a broad definition and then arbitrarily restrict the
> > definition again so that it matches the special case you're currently
> > interested in.
> > 
> > Either use a narrow definition, or use a broad one. But use only one and
> > use it consistently.
> 
> I think the problem appears because you restrict the process to a single
> step where there’s actually two.
> 
> Drivers can be either
> (1) R/W filters (e.g. throttle)
> (2) COW filters (e.g. qcow2)
> (3) None of the above (e.g. vhdx, curl)
> 
> This choice is made on the driver level, not on the node level (for good
> reason, see below*).

What prevents a driver from being
(4) COW filter and R/W filter (e.g. qcow2 if it were useful)?

I mean, conceptually, not in the implementation.

> And then we derive the node’s filtered children from what the driver is.
>  If it’s an R/W filter, bdrv_filtered_child() will return the R/W
> filtered child.  If it’s a COW filter, bdrv_filtered_child() will return
> the potentially existing COW backing child (or NULL, if there is no
> backing child).

I guess it boils down to me just not being able to get behind the
concept that COW is some sort of filter (especially when other things
like an external data file aren't).

> *
> What is clear to me is that it isn’t useful to treat nodes of a specific
> driver sometimes as R/W filter nodes and sometimes not.  R/W filter
> nodes are handled differently from other nodes, and it would be
> confusing if a certain driver sometimes behaves this and sometimes that
> way.  (For example, if you put a raw node on top of a qcow2 node,
> sometimes it would stop the backing chain, sometimes it wouldn’t.  That
> makes no sense to me.)
> 
> OTOH, for COW filters, we do exactly that.  Sometimes they have a
> backing file, sometimes they don’t.  That’s completely fine because
> their overall behavior doesn’t change.
> 
> 
> That makes me agree that there is indeed too much of a difference
> between R/W filters and COW filters to lump them together under the
> “filter” label.
> 
> [...]
> 
> > My point is not about changing the logic in the code, but about using
> > names that actually describe accurately what the logic does.
> 
> Again, naming things is hard.
> 
> [...]
> 
> >> You disagreeing with me on these terms to me shows that there is a need
> >> to formalize.  This is precisely what I want to do in this series.
> >>
> >> The fact that we don’t use the term “filter chain” so far is the reason
> >> why I introduce it.  Because it comes as a clean slate.  “backing chain”
> >> already means something to me, and it means something different.
> > 
> > Well, if "backing chain" is too narrow, "filter chain" is both too
> > unspecific and inconsistent with the various definitions of "filter" and
> > "filtered" we're using, and we can't think of anything more concise, we
> > might have to use names that just mention both.
> > 
> > bdrv_cow_child() // don't call COW a filter, because .is_filter = false
> > bdrv_filter_child() // your R/W filter, only for .is_filter = true nodes
> > bdrv_filter_or_cow_child()
> > 
> > Or something like that. This would bring some more consistency into the
> > way we use the words filter/filtered at least.
> 
> I’ll see how that looks overall, but why not.  Sounds good to me.

Good. Or, well, good enough at least. ;-)

bdrv_filter_or_cow_child() is not a pretty name, but as long as we can't
think of anything that accurately covers both in a single word, it will
do the job...
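
To make the relationship between the three proposed names concrete, here
is a minimal sketch. The struct definitions are simplified stand-ins and
not the real QEMU types, and the function bodies are only an assumption
about the intended semantics, not code from this series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the QEMU types; NOT the real definitions. */
typedef struct BlockDriverState BlockDriverState;

typedef struct BdrvChild {
    BlockDriverState *bs;
} BdrvChild;

typedef struct BlockDriver {
    bool is_filter;         /* R/W filter driver (e.g. throttle) */
    bool supports_backing;  /* may have a COW backing child (e.g. qcow2) */
} BlockDriver;

struct BlockDriverState {
    const BlockDriver *drv;
    BdrvChild *file;
    BdrvChild *backing;
};

/* COW child: only for non-filter drivers that support backing files. */
BdrvChild *bdrv_cow_child(BlockDriverState *bs)
{
    if (!bs || !bs->drv || bs->drv->is_filter) {
        return NULL;
    }
    return bs->drv->supports_backing ? bs->backing : NULL;
}

/* Filter child: only for .is_filter = true nodes. */
BdrvChild *bdrv_filter_child(BlockDriverState *bs)
{
    if (!bs || !bs->drv || !bs->drv->is_filter) {
        return NULL;
    }
    /* Assumption: a filter uses either bs->backing or bs->file, not both. */
    assert(!(bs->backing && bs->file));
    return bs->backing ? bs->backing : bs->file;
}

/* For callers that do not care which of the two kinds of link it is. */
BdrvChild *bdrv_filter_or_cow_child(BlockDriverState *bs)
{
    BdrvChild *cow = bdrv_cow_child(bs);
    BdrvChild *filter = bdrv_filter_child(bs);

    /* A driver is either a filter or not, so at most one can be set. */
    assert(!(cow && filter));
    return cow ? cow : filter;
}
```

In this sketch a quorum-like node (is_filter set, but neither bs->file
nor bs->backing in use) yields NULL from all three functions.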

Kevin


* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-09-10 10:47               ` Kevin Wolf
@ 2019-09-10 11:36                 ` Max Reitz
  2019-09-10 12:48                   ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-10 11:36 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 10.09.19 12:47, Kevin Wolf wrote:
> Am 10.09.2019 um 11:14 hat Max Reitz geschrieben:
>> On 09.09.19 18:13, Kevin Wolf wrote:
>>> Am 09.09.2019 um 16:04 hat Max Reitz geschrieben:
>>>> On 09.09.19 11:36, Kevin Wolf wrote:
>>>>> Am 09.09.2019 um 09:56 hat Max Reitz geschrieben:
>>>>>> On 04.09.19 18:16, Kevin Wolf wrote:
>>>>>>> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
>>>>>>>> There are BDS children that the general block layer code can access,
>>>>>>>> namely bs->file and bs->backing.  Since the introduction of filters and
>>>>>>>> external data files, their meaning is not quite clear.  bs->backing can
>>>>>>>> be a COW source, or it can be an R/W-filtered child; bs->file can be an
>>>>>>>> R/W-filtered child, it can be data and metadata storage, or it can be
>>>>>>>> just metadata storage.
>>>>>>>>
>>>>>>>> This overloading really is not helpful.  This patch adds functions that
>>>>>>>> retrieve the correct child for each exact purpose.  Later patches in
>>>>>>>> this series will make use of them.  Doing so will allow us to handle
>>>>>>>> filter nodes and external data files in a meaningful way.
>>>>>>>>
>>>>>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>>>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>>
>>>>>>> Each time I look at this patch, I'm confused by the function names.
>>>>>>> Maybe I should just ask what the idea there was, or more specifically:
>>>>>>> What does the "filtered" in "filtered child" really mean?
>>>>>>>
>>>>>>> Apparently any child of a filter node is "filtered" (which makes sense),
>>>>>>
>>>>>> It isn’t, filters can have non-filter children.  For example, backup-top
>>>>>> could have the source as a filtered child and the target as a non-filter
>>>>>> child.
>>>>>
>>>>> Hm, okay, makes sense. I had a definition in mind that says that filter
>>>>> nodes only have a single child node. Is it that a filter may have only a
>>>>> single _filtered_ child node?
>>>>
>>>> Well, there’s Quorum...
>>>
>>> Ah, nice, quorum sets is_filter = true even though it neither fulfills
>>> the conditions for it before this series, nor the changed conditions
>>> after this series.
>>>
>>> So either quorum lies and isn't actually a filter driver, or our
>>> definition in the documentation of is_filter is wrong.
>>
>> You could say it lies because in FIFO mode it clearly isn’t a filter for
>> all of its children.
>>
>> There is a reason for lying, though, which is
>> bdrv_recurse_is_first_non_filter(), which is necessary to use the whole
>> to_replace mirror stuff.
> 
> Hm, actually, now that you mention bdrv_recurse_is_first_non_filter(),
> quorum was the first driver to declare itself a filter, so strictly
> speaking, if there is an inconsistency, it's the other uses that are
> abusing the field...
> 
>> (You mirror from a quorum with a failed child and then replace the
>> failed child.  mirror needs to ensure that there are only R/W filters
>> between the child and the mirror source so that replacing it will not
>> suddenly change any visible data.  Which is actually a lie for quorum,
>> because the child is clearly broken and thus precisely doesn’t show the
>> same data...)
>>
>> Maybe we should stop declaring Quorum a filter and then rename the
>> bdrv_recurse_is_first_non_filter() to, I don’t know,
>> bdrv_recurse_can_be_replaced_by_mirror()?
> 
> Why not.

It feels difficult to do in this series because this is a whole new can
of worms.

In patch 35, I actually replace the mirror use case by
is_filtered_child().  So it looks to me as if that should not be done,
because I should instead fix bdrv_recurse_is_first_non_filter() (and
rename it), because quorum does allow replacing its children by mirror,
even if it does not act as a filter for them.

OTOH, there are other users of bdrv_is_first_non_filter().  Those are
qmp_block_resize() and external_snapshot_prepare(), which throw an error
if that returns false.

I think that’s just wrong.  First of all, I don’t even know why we have
that restriction anymore (I can imagine why it used to make sense before
the permission system).  qmp_block_resize() should always work as long
as it can get BLK_PERM_RESIZE; and I don’t know why the parents of some
node would care if you take a snapshot of their child.

>>>>>>> but also bs->backing of a qcow2 image, while bs->file of qcow2 isn't.
>>>>>>> raw doesn't have any "filtered" child. What's the system behind this?
>>>>>>
>>>>>> “filtered” means: If the parent node returns data from this child, it
>>>>>> won’t modify it, neither its content nor its position.  COW and R/W
>>>>>> filters differ in how they handle writes; R/W filters pass them through
>>>>>> to the filtered child, COW filters copy them off to some other child
>>>>>> node (and then the filtered child’s data will no longer be visible at
>>>>>> that location).
>>>>>
>>>>> But there is no reason why a node couldn't fulfill this condition for
>>>>> more than one child node. bdrv_filtered_child() isn't well-defined then.
>>>>> Technically, the description "Return any filtered child" is correct
>>>>> because "any" can be interpreted as "an arbitrary", but obviously that
>>>>> makes the function useless.
>>>>
>>>> Which is why it currently returns NULL for Quorum.
>>>
>>> Which is about the only possible choice that breaks the contract...
>>>
>>>  * Return any filtered child, independently of how it reacts to write
>>
>> I don’t know if you’re serious about this proposition, because I don’t
>> know whether that could be useful in any way. :-?
> 
> Huh? This is just quoting the contract from your code?

I see.  I was thinking about “any of COW/RW, of which only one exists”.
 There is an assertion for that (that only one filtered child exists at
a time) in the code.  (And I consider assertions part of the contract.)

>>>  * accesses and whether data is copied onto this BDS through COR.
>>
>> I meant the contract as “Return the single filtered child there is, or NULL”
> 
> Then that should probably be spelt out in the contract. Probably even
> explicitly "NULL if there is either no filtered child or multiple
> filtered children".

Well, it’s spelled out through the assertion, but not in the
documentation, yes.

>>> Maybe the documentation of bdrv_filtered_child() needs to be rephrased?
>>>
>>> Going back to qcow2, it's really not much different as it has multiple
>>> (two) filtered children, too.
>>
>> Well, it doesn’t.  It isn’t an R/W filter.
> 
> What do I have to look at to see whether something is an R/W filter or
> not? qcow2 matches your criteria for an R/W filter.

No.  Some qcow2 nodes match the criteria.  But not all, which makes the
qcow2 driver not a filter driver.

> You say that it's
> not useful, so it's not an R/W filter anyway. But where in the code
> could I get this information?

“Where in the code”?  Do you want to add a comment to every BlockDriver
structure on why it does or doesn’t set .is_filter?

> This just doesn't make sense to me. If a driver matches the criteria for
> an R/W filter, then it should be one. If qcow2 should not be considered
> an R/W filter, then the criteria must be changed so that it isn't.

See below.

>> Maybe what we actually need to rephrase is the definition of .is_filter.
>>  (Namely something along the lines of “Fulfills these guarantees (same
>> data, etc. pp.), *and* should be skipped for allocation information
>> queries etc.”.)
> 
> Hm - does this imply that .is_filter == this is an R/W filter? Because
> this was never spelt out, neither in code comments nor in commit
> messages.

While I’m not a fan of comment-less code, I do think that it’s possible
to read code.  Which clearly stated this.

> If we called R/W filters just "filters" (which makes it obvious how it
> relates to .is_filter) and COW nodes something that doesn't include the
> word "filter", things might become a lot clearer.

Because you apparently wrote this before reading that I agreed to your
renaming proposal, I now feel free to argue that I could just as well
rename .is_filter to .is_rw_filter.

Obviously I won’t because I prefer your proposal.

[...]

>>>>> Specifically, according to your definition, qcow2 filters both the
>>>>> backing file (COW filter) and the external data file (R/W filter).
>>>>
>>>> Not wrong.  But the same question as for raw arises: Is there any use to
>>>> declaring qcow2 an R/W filter driver just because it fits the definition?
>>>
>>> Wait, where is there even a place where this could be declared?
>>>
>>> The one thing I see that a driver even can declare is drv->is_filter,
>>> which is about the whole driver and not about nodes. It is false for
>>> qcow2.
>>
>> That’s correct.  But that’s not a fundamental problem, of course, we
>> could make it a per-BDS attribute if that made sense.
> 
> I was thinking per-child, actually, because you declare one BdrvChild
> filtered and another not filtered.

Why don’t you say so from the start then?

(Sorry, but honestly about 30 % of this discussion to me feels like
you’re playing games with me.  Please don’t take this the wrong way, I
mean it very neutrally.  It’s just that I feel like I’m explaining
things to you that you very much know, but you just want me to say them.
 And that feels unproductive and sometimes indeed frustrating.)

One thing is that this wouldn’t make the quorum case any easier because
it actually doesn’t know for which children it acts as a filter and for
which it doesn’t.

> But by now I think most of the confusion is really just a result of COW
> being considered a filter in some respects (mainly just the names of the
> child access functions), but not in others (like .is_filter).

I don’t quite see how it’s “by now” when in your first mail you already
basically wrote that functionally, everything works (leaving out
quorum), but that you’re confused (or claim to be confused, I have no
idea what’s real and what’s pretended anymore) by the names.


We have come to two results, as far as I can see:

First, naming COW backing nodes “COW filtered children” clashes with our
existing use of “filter”.  There is no point in forcing the “filter”
label on everything.  We can just keep calling (R/W) filters filters and
COW backing children COW children.  The names are succinct enough.

In some cases, we don’t care whether something is a COW or filtered
child, in such a case a caller can be bothered to use the slightly
longer bdrv_cow_or_filtered_child().


Second, most of the time we want a filter node to have a clear and
unique path to go down.  This is the important property of filters: That
you can skip them and go to the node that actually has the data.

Quorum breaks this by having multiple children, and nobody knows which
of them has the data we will see on the next read operation.

All “filters” that could have multiple children would have this problem.
 Hence a filter must always have a single unique data child.  I think.
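
As a rough illustration of that property, skipping filters becomes a
trivial loop once every filter has exactly one data child.  This is only
a sketch: Node, data_child and skip_rw_filters() are made-up names for
illustration, not QEMU API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node type; not a QEMU structure. */
typedef struct Node Node;
struct Node {
    bool is_filter;    /* stand-in for drv->is_filter */
    Node *data_child;  /* the single filtered child; NULL if there is
                        * none or if it is ambiguous (quorum-like) */
};

/*
 * Walk down through R/W filters to the first node that is not a filter,
 * i.e. the node that actually has the data.  Returns NULL if a filter on
 * the way has no unique data child.
 */
Node *skip_rw_filters(Node *node)
{
    while (node && node->is_filter) {
        node = node->data_child;
    }
    return node;
}
```

With a quorum-like node in the chain, the walk ends in NULL because
there is no unique child to continue with.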

[...]

>>> Either use a narrow definition, or use a broad one. But use only one and
>>> use it consistently.
>>
>> I think the problem appears because you restrict the process to a single
>> step where there’s actually two.
>>
>> Drivers can be either
>> (1) R/W filters (e.g. throttle)
>> (2) COW filters (e.g. qcow2)
>> (3) None of the above (e.g. vhdx, curl)
>>
>> This choice is made on the driver level, not on the node level (for good
>> reason, see below*).
> 
> What prevents a driver from being
> (4) COW filter and R/W filter (e.g. qcow2 if it were useful)?
> 
> I mean, conceptually, not in the implementation.

An R/W filter always shows the same data as the filtered child.  So the
COW child’s data can never be visible, and as such you couldn’t have a
COW child at the same time.

Max



* Re: [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback Max Reitz
  2019-08-10 16:34   ` Vladimir Sementsov-Ogievskiy
@ 2019-09-10 11:56   ` Kevin Wolf
  2019-09-10 12:04     ` Max Reitz
  1 sibling, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-10 11:56 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
> If the top node's driver does not provide snapshot functionality and we
> want to fall back to a node down the chain, we need to snapshot all
> non-COW children.  For simplicity's sake, just do not fall back if there
> is more than one such child.
> 
> bdrv_snapshot_goto() becomes a bit weird because we may have to redirect
> the actual child pointer, so it only works if the fallback child is
> bs->file or bs->backing (and then we have to find out which it is).
> 
> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/snapshot.c | 100 +++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 79 insertions(+), 21 deletions(-)
> 
> diff --git a/block/snapshot.c b/block/snapshot.c
> index f2f48f926a..35403c167f 100644
> --- a/block/snapshot.c
> +++ b/block/snapshot.c
> @@ -146,6 +146,32 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
>      return ret;
>  }
>  
> +/**
> + * Return the child BDS to which we can fall back if the given BDS
> + * does not support snapshots.
> + * Return NULL if there is no BDS to (safely) fall back to.
> + */
> +static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs)
> +{
> +    BlockDriverState *child_bs = NULL;
> +    BdrvChild *child;
> +
> +    QLIST_FOREACH(child, &bs->children, next) {
> +        if (child == bdrv_filtered_cow_child(bs)) {
> +            /* Ignore: COW children need not be included in snapshots */
> +            continue;
> +        }
> +
> +        if (child_bs) {
> +            /* Cannot fall back to a single child if there are multiple */
> +            return NULL;
> +        }
> +        child_bs = child->bs;
> +    }
> +
> +    return child_bs;
> +}

Why do we return child->bs here when bdrv_snapshot_goto() then needs to
reconstruct what the associated BdrvChild was? Wouldn't it make more
sense to return BdrvChild** from here and maybe have a small wrapper for
the other functions that only need a BDS?

Kevin


* Re: [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback
  2019-09-10 11:56   ` Kevin Wolf
@ 2019-09-10 12:04     ` Max Reitz
  2019-09-10 12:49       ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-10 12:04 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 10.09.19 13:56, Kevin Wolf wrote:
> Am 09.08.2019 um 18:13 hat Max Reitz geschrieben:
>> If the top node's driver does not provide snapshot functionality and we
>> want to fall back to a node down the chain, we need to snapshot all
>> non-COW children.  For simplicity's sake, just do not fall back if there
>> is more than one such child.
>>
>> bdrv_snapshot_goto() becomes a bit weird because we may have to redirect
>> the actual child pointer, so it only works if the fallback child is
>> bs->file or bs->backing (and then we have to find out which it is).
>>
>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  block/snapshot.c | 100 +++++++++++++++++++++++++++++++++++++----------
>>  1 file changed, 79 insertions(+), 21 deletions(-)
>>
>> diff --git a/block/snapshot.c b/block/snapshot.c
>> index f2f48f926a..35403c167f 100644
>> --- a/block/snapshot.c
>> +++ b/block/snapshot.c
>> @@ -146,6 +146,32 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
>>      return ret;
>>  }
>>  
>> +/**
>> + * Return the child BDS to which we can fall back if the given BDS
>> + * does not support snapshots.
>> + * Return NULL if there is no BDS to (safely) fall back to.
>> + */
>> +static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs)
>> +{
>> +    BlockDriverState *child_bs = NULL;
>> +    BdrvChild *child;
>> +
>> +    QLIST_FOREACH(child, &bs->children, next) {
>> +        if (child == bdrv_filtered_cow_child(bs)) {
>> +            /* Ignore: COW children need not be included in snapshots */
>> +            continue;
>> +        }
>> +
>> +        if (child_bs) {
>> +            /* Cannot fall back to a single child if there are multiple */
>> +            return NULL;
>> +        }
>> +        child_bs = child->bs;
>> +    }
>> +
>> +    return child_bs;
>> +}
> 
> Why do we return child->bs here when bdrv_snapshot_goto() then needs to
> reconstruct what the associated BdrvChild was? Wouldn't it make more
> sense to return BdrvChild** from here and maybe have a small wrapper for
> the other functions that only need a BDS?

What would you return instead?  &child doesn’t work.

We could limit ourselves to bs->file and bs->backing.  It just seemed
like a bit of an artificial limit to me, because we only really have it
for bdrv_snapshot_goto().
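
For reference, the limited variant could look roughly like the sketch
below.  The types are simplified stand-ins (the children array and the
cow_child field replace the bs->children list and the
bdrv_filtered_cow_child() lookup), and snapshot_fallback_ptr() and
snapshot_fallback() are hypothetical names, not functions from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the QEMU types; NOT the real definitions. */
typedef struct BlockDriverState BlockDriverState;

typedef struct BdrvChild {
    BlockDriverState *bs;
} BdrvChild;

struct BlockDriverState {
    BdrvChild *file;
    BdrvChild *backing;
    BdrvChild *cow_child;    /* stand-in for bdrv_filtered_cow_child(bs) */
    BdrvChild *children[4];  /* stand-in for the bs->children list */
    size_t nb_children;
};

/*
 * Return a pointer to the bs->file or bs->backing slot that holds the
 * only non-COW child, or NULL if there is no such unique child or if it
 * is stored in neither of the two slots.
 */
BdrvChild **snapshot_fallback_ptr(BlockDriverState *bs)
{
    BdrvChild **fallback = NULL;
    size_t i;

    for (i = 0; i < bs->nb_children; i++) {
        BdrvChild *child = bs->children[i];

        if (child == bs->cow_child) {
            /* Ignore: COW children need not be included in snapshots */
            continue;
        }
        if (fallback) {
            /* Cannot fall back to a single child if there are multiple */
            return NULL;
        }
        if (child == bs->file) {
            fallback = &bs->file;
        } else if (child == bs->backing) {
            fallback = &bs->backing;
        } else {
            /* This is the artificial limit: no slot we could redirect */
            return NULL;
        }
    }

    return fallback;
}

/* Small wrapper for callers that only need the BDS. */
BlockDriverState *snapshot_fallback(BlockDriverState *bs)
{
    BdrvChild **child_ptr = snapshot_fallback_ptr(bs);

    if (!child_ptr) {
        return NULL;
    }
    assert(*child_ptr != NULL);
    return (*child_ptr)->bs;
}
```

The price is visible in the last else branch: a driver whose only
non-COW child is kept somewhere other than bs->file or bs->backing would
no longer get any fallback at all.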

Max



* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-09-10 11:36                 ` Max Reitz
@ 2019-09-10 12:48                   ` Kevin Wolf
  2019-09-10 12:59                     ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-10 12:48 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

Am 10.09.2019 um 13:36 hat Max Reitz geschrieben:
> On 10.09.19 12:47, Kevin Wolf wrote:
> > Am 10.09.2019 um 11:14 hat Max Reitz geschrieben:
> >> Maybe we should stop declaring Quorum a filter and then rename the
> >> bdrv_recurse_is_first_non_filter() to, I don’t know,
> >> bdrv_recurse_can_be_replaced_by_mirror()?
> > 
> > Why not.
> 
> It feels difficult to do in this series because this is a whole new can
> of worms.
> 
> In patch 35, I actually replace the mirror use case by
> is_filtered_child().  So it looks to me as if that should not be done,
> because I should instead fix bdrv_recurse_is_first_non_filter() (and
> rename it), because quorum does allow replacing its children by mirror,
> even if it does not act as a filter for them.
> 
> OTOH, there are other users of bdrv_is_first_non_filter().  Those are
> qmp_block_resize() and external_snapshot_prepare(), which throw an error
> if that returns false.
> 
> I think that’s just wrong.  First of all, I don’t even know why we have
> that restriction anymore (I can imagine why it used to make sense before
> the permission system).  qmp_block_resize() should always work as long
> as it can get BLK_PERM_RESIZE; and I don’t know why the parents of some
> node would care if you take a snapshot of their child.

Hm, doesn't it make sense in a way for qmp_block_resize() at least? It
means that you can't resize just a filter, but you need to resize the
image that actually provides the data for the filter.

Of course, there is no reason for it to be the _first_ non-filter as
long as BLK_PERM_RESIZE is shared, but just some non-filter node.

Two more random observations:

* quorum uses bdrv_filter_default_perms(), which allows BLK_PERM_RESIZE.
  I think this is wrong and quorum should make sure that all children are
  always the same size because otherwise it can't tell what its own size
  is. (Or vote on size...? :-/) Probably not a problem in practice as
  long as we check bdrv_is_first_non_filter().

* child_file and child_backing don't implement .resize. So if you resize
  a non-top-level image, parents (in particular filters) don't get their
  size adjusted. This is probably a bug, too, but one that isn't
  prevented by bdrv_is_first_non_filter() and should be visible today.

> >>> Maybe the documentation of bdrv_filtered_child() needs to be rephrased?
> >>>
> >>> Going back to qcow2, it's really not much different as it has multiple
> >>> (two) filtered children, too.
> >>
> >> Well, it doesn’t.  It isn’t an R/W filter.
> > 
> > What do I have to look at to see whether something is an R/W filter or
> > not? qcow2 matches your criteria for an R/W filter.
> 
> No.  Some qcow2 nodes match the criteria.  But not all, which makes the
> qcow2 driver not a filter driver.
> 
> > You say that it's not useful, so it's not an R/W filter anyway. But
> > where in the code could I get this information?
> 
> “Where in the code”?  Do you want to add a comment to every BlockDriver
> structure on why it does or doesn’t set .is_filter?

Never mind, I just didn't understand that .is_filter is the thing that
defines an R/W filter. In fact, I didn't really understand what
.is_filter was supposed to mean at all because I was so confused. For
some reason I was sure it had to mean any kind of filter, but that
assumption just didn't match up with its use at all.

> >>>>> Specifically, according to your definition, qcow2 filters both the
> >>>>> backing file (COW filter) and the external data file (R/W filter).
> >>>>
> >>>> Not wrong.  But the same question as for raw arises: Is there any use to
> >>>> declaring qcow2 an R/W filter driver just because it fits the definition?
> >>>
> >>> Wait, where is there even a place where this could be declared?
> >>>
> >>> The one thing I see that a driver even can declare is drv->is_filter,
> >>> which is about the whole driver and not about nodes. It is false for
> >>> qcow2.
> >>
> >> That’s correct.  But that’s not a fundamental problem, of course, we
> >> could make it a per-BDS attribute if that made sense.
> > 
> > I was thinking per-child, actually, because you declare one BdrvChild
> > filtered and another not filtered.
> 
> Why don’t you say so from the start then?

Yes, I wrote "nodes", thought "child nodes" and should have said
"children" because edges are not nodes. My bad, sorry.

> (Sorry, but honestly about 30 % of this discussion to me feels like
> you’re playing games with me.  Please don’t take this the wrong way, I
> mean it very neutrally.  It’s just that I feel like I’m explaining
> things to you that you very much know, but you just want me to say them.
>  And that feels unproductive and sometimes indeed frustrating.)

No, certainly not. If my mails seemed confusing or pointless, it just
shows how thoroughly confused I was.

> One thing is that this wouldn’t make the quorum case any easier because
> it actually doesn’t know for which children it acts as a filter and for
> which it doesn’t.
> 
> > But by now I think most of the confusion is really just a result of COW
> > being considered a filter in some respects (mainly just the names of the
> > child access functions), but not in others (like .is_filter).
> 
> I don’t quite see how it’s “by now” when in your first mail you already
> basically wrote that functionally, everything works (leaving out
> quorum), but that you’re confused (or claim to be confused, I have no
> idea what’s real and what’s pretended anymore) by the names.

Well, I saw that the special cases in the patches that I had reviewed so
far seemed to be converted correctly, but I just didn't understand the
whole concept behind it. It's possible to both understand that a
transformation is correct and to fail to grasp the concept behind it.

And your first answer only confused me more because you gave definitions
for R/W and COW filters that honestly ended up a bit misleading,
possibly as a result of your endeavour to make R/W filters and COW
sound like the same thing. (Which made me lose sight of basic facts like
that R/W filters must forward _every_ request without exception to their
filtered child even though COW doesn't.)

> We have come to two results, as far as I can see:
> 
> First, naming COW backing nodes “COW filtered children” clashes with our
> existing use of “filter”.  There is no point in forcing the “filter”
> label on everything.  We can just keep calling (R/W) filters filters and
> COW backing children COW children.  The names are succinct enough.
> 
> In some cases, we don’t care whether something is a COW or filtered
> child, in such a case a caller can be bothered to use the slightly
> longer bdrv_cow_or_filtered_child().

Aye.

> Second, most of the time we want a filter node to have a clear and
> unique path to go down.  This is the important property of filters: That
> you can skip them and go to the node that actually has the data.
> 
> Quorum breaks this by having multiple children, and nobody knows which
> of them has the data we will see on the next read operation.
> 
> All “filters” that could have multiple children would have this problem.
>  Hence a filter must always have a single unique data child.  I think.

I agree, and this is the condition that I mentioned somewhere above, but
failed to actually find guaranteed somewhere. We should probably make
this explicit.

Of course, quorum and similar things intend all their children to
provide the same data, but the whole point of the driver is that this is
not always guaranteed, so they aren't actually filters.

Kevin


* Re: [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback
  2019-09-10 12:04     ` Max Reitz
@ 2019-09-10 12:49       ` Kevin Wolf
  2019-09-10 13:06         ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-10 12:49 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 10.09.2019 at 14:04, Max Reitz wrote:
> On 10.09.19 13:56, Kevin Wolf wrote:
> > On 09.08.2019 at 18:13, Max Reitz wrote:
> >> If the top node's driver does not provide snapshot functionality and we
> >> want to fall back to a node down the chain, we need to snapshot all
> >> non-COW children.  For simplicity's sake, just do not fall back if there
> >> is more than one such child.
> >>
> >> bdrv_snapshot_goto() becomes a bit weird because we may have to redirect
> >> the actual child pointer, so it only works if the fallback child is
> >> bs->file or bs->backing (and then we have to find out which it is).
> >>
> >> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >> ---
> >>  block/snapshot.c | 100 +++++++++++++++++++++++++++++++++++++----------
> >>  1 file changed, 79 insertions(+), 21 deletions(-)
> >>
> >> diff --git a/block/snapshot.c b/block/snapshot.c
> >> index f2f48f926a..35403c167f 100644
> >> --- a/block/snapshot.c
> >> +++ b/block/snapshot.c
> >> @@ -146,6 +146,32 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
> >>      return ret;
> >>  }
> >>  
> >> +/**
> >> + * Return the child BDS to which we can fall back if the given BDS
> >> + * does not support snapshots.
> >> + * Return NULL if there is no BDS to (safely) fall back to.
> >> + */
> >> +static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs)
> >> +{
> >> +    BlockDriverState *child_bs = NULL;
> >> +    BdrvChild *child;
> >> +
> >> +    QLIST_FOREACH(child, &bs->children, next) {
> >> +        if (child == bdrv_filtered_cow_child(bs)) {
> >> +            /* Ignore: COW children need not be included in snapshots */
> >> +            continue;
> >> +        }
> >> +
> >> +        if (child_bs) {
> >> +            /* Cannot fall back to a single child if there are multiple */
> >> +            return NULL;
> >> +        }
> >> +        child_bs = child->bs;
> >> +    }
> >> +
> >> +    return child_bs;
> >> +}
> > 
> > Why do we return child->bs here when bdrv_snapshot_goto() then needs to
> > reconstruct what the associated BdrvChild was? Wouldn't it make more
> > sense to return BdrvChild** from here and maybe have a small wrapper for
> > the other functions that only need a BDS?
> 
> What would you return instead?  &child doesn’t work.

Oops, brain fart. :-)

> We could limit ourselves to bs->file and bs->backing.  It just seemed
> like a bit of an artificial limit to me, because we only really have it
> for bdrv_snapshot_goto().

Hm, but then, what use is supporting other children for creating a
snapshot when you can't load it any more afterwards?

Kevin


* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-09-10 12:48                   ` Kevin Wolf
@ 2019-09-10 12:59                     ` Max Reitz
  2019-09-10 13:10                       ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-10 12:59 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 10.09.19 14:48, Kevin Wolf wrote:
> On 10.09.2019 at 13:36, Max Reitz wrote:
>> On 10.09.19 12:47, Kevin Wolf wrote:
>>> On 10.09.2019 at 11:14, Max Reitz wrote:
>>>> Maybe we should stop declaring Quorum a filter and then rename the
>>>> bdrv_recurse_is_first_non_filter() to, I don’t know,
>>>> bdrv_recurse_can_be_replaced_by_mirror()?
>>>
>>> Why not.
>>
>> It feels difficult to do in this series because this is a whole new can
>> of worms.
>>
>> In patch 35, I actually replace the mirror use case by
>> is_filtered_child().  So it looks to me as if that should not be done,
>> because I should instead fix bdrv_recurse_is_first_non_filter() (and
>> rename it), because quorum does allow replacing its children by mirror,
>> even if it does not act as a filter for them.
>>
>> OTOH, there are other users of bdrv_is_first_non_filter().  Those are
>> qmp_block_resize() and external_snapshot_prepare(), who throw an error
>> if that returns false.
>>
>> I think that’s just wrong.  First of all, I don’t even know why we have
>> that restriction anymore (I can imagine why it used to make sense before
>> the permission system).  qmp_block_resize() should always work as long
>> as it can get BLK_PERM_RESIZE; and I don’t know why the parents of some
>> node would care if you take a snapshot of their child.
> 
> Hm, doesn't it make sense in a way for qmp_block_resize() at least? It
> means that you can't resize just a filter, but you need to resize the
> image that actually provides the data for the filter.

Filters generally implement .bdrv_truncate() by passing it through, so
it should be fine.

> Of course, there is no reason for it to be the _first_ non-filter as
> long as BLK_PERM_RESIZE is shared, but just some non-filter node.
> 
> Two more random observations:
> 
> * quorum uses bdrv_filter_default_perms(), which allows BLK_PERM_RESIZE.
>   I think this is wrong and quorum should make sure that all children are
>   always the same size because otherwise it can't tell what its own size
>   is. (Or vote on size...? :-/) Probably not a problem in practice as
>   long as we check bdrv_is_first_non_filter().

(“Quorum is broken” seems to be a recurring observation.)

I agree, it shouldn’t share that permission.

> * child_file and child_backing don't implement .resize. So if you resize
>   a non-top-level image, parents (in particular filters) don't get their
>   size adjusted. This is probably a bug, too, but one that isn't
>   prevented by bdrv_is_first_non_filter() and should be visible today.

Hm. :-/

The good news is that I can try to fix this independently of this series.

[...]

>> We have come to two results, as far as I can see:
>>
>> First, naming COW backing nodes “COW filtered children” clashes with our
>> existing use of “filter”.  There is no point in forcing the “filter”
>> label on everything.  We can just keep calling (R/W) filters filters and
>> COW backing children COW children.  The names are succinct enough.
>>
>> In some cases, we don’t care whether something is a COW or filtered
>> child, in such a case a caller can be bothered to use the slightly
>> longer bdrv_cow_or_filtered_child().
> 
> Aye.
> 
>> Second, most of the time we want a filter node to have a clear and
>> unique path to go down.  This is the important property of filters: That
>> you can skip them and go to the node that actually has the data.
>>
>> Quorum breaks this by having multiple children, and nobody knows which
>> of them has the data we will see on the next read operation.
>>
>> All “filters” who could have multiple children would have this problem.
>>  Hence a filter must always have a single unique data child.  I think.
> 
> I agree, and this is the condition that I mentioned somewhere above, but
> failed to actually find guaranteed somewhere. We should probably make
> this explicit.
> 
> Of course, quorum and similar things intend all their children to
> provide the same data, but the whole point of the driver is that this is
> not always guaranteed, so they aren't actually filters.

OK, great, I’ll get cracking then.

Max



* Re: [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback
  2019-09-10 12:49       ` Kevin Wolf
@ 2019-09-10 13:06         ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-10 13:06 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 10.09.19 14:49, Kevin Wolf wrote:
> On 10.09.2019 at 14:04, Max Reitz wrote:
>> On 10.09.19 13:56, Kevin Wolf wrote:
>>> On 09.08.2019 at 18:13, Max Reitz wrote:
>>>> If the top node's driver does not provide snapshot functionality and we
>>>> want to fall back to a node down the chain, we need to snapshot all
>>>> non-COW children.  For simplicity's sake, just do not fall back if there
>>>> is more than one such child.
>>>>
>>>> bdrv_snapshot_goto() becomes a bit weird because we may have to redirect
>>>> the actual child pointer, so it only works if the fallback child is
>>>> bs->file or bs->backing (and then we have to find out which it is).
>>>>
>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>  block/snapshot.c | 100 +++++++++++++++++++++++++++++++++++++----------
>>>>  1 file changed, 79 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/block/snapshot.c b/block/snapshot.c
>>>> index f2f48f926a..35403c167f 100644
>>>> --- a/block/snapshot.c
>>>> +++ b/block/snapshot.c
>>>> @@ -146,6 +146,32 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
>>>>      return ret;
>>>>  }
>>>>  
>>>> +/**
>>>> + * Return the child BDS to which we can fall back if the given BDS
>>>> + * does not support snapshots.
>>>> + * Return NULL if there is no BDS to (safely) fall back to.
>>>> + */
>>>> +static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs)
>>>> +{
>>>> +    BlockDriverState *child_bs = NULL;
>>>> +    BdrvChild *child;
>>>> +
>>>> +    QLIST_FOREACH(child, &bs->children, next) {
>>>> +        if (child == bdrv_filtered_cow_child(bs)) {
>>>> +            /* Ignore: COW children need not be included in snapshots */
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        if (child_bs) {
>>>> +            /* Cannot fall back to a single child if there are multiple */
>>>> +            return NULL;
>>>> +        }
>>>> +        child_bs = child->bs;
>>>> +    }
>>>> +
>>>> +    return child_bs;
>>>> +}
>>>
>>> Why do we return child->bs here when bdrv_snapshot_goto() then needs to
>>> reconstruct what the associated BdrvChild was? Wouldn't it make more
>>> sense to return BdrvChild** from here and maybe have a small wrapper for
>>> the other functions that only need a BDS?
>>
>> What would you return instead?  &child doesn’t work.
> 
> Oops, brain fart. :-)
> 
>> We could limit ourselves to bs->file and bs->backing.  It just seemed
>> like a bit of an artificial limit to me, because we only really have it
>> for bdrv_snapshot_goto().
> 
> Hm, but then, what use is supporting other children for creating a
> snapshot when you can't load it any more afterwards?

Well, the snapshot is still there, it’s just on a different node.  So in
theory, you could take a snapshot in a live VM (where the snapshotting
node is not at the top), and then later revert to it with qemu-img (by
accessing the file with the snapshot directly).

Though in practice this is just a fallback anyway, and I don’t think we
currently have anything where it would make sense to fall through some
node to any child but .file or .backing.  So why not.  It would be
shorter, too.  (Well, plus the short comment why looking at .file and
.backing is sufficient.)
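
A rough sketch of what the shorter variant could look like, written
against toy types so that it compiles on its own.  ToyNode, ToyChild,
snapshot_fallback_ptr() and snapshot_fallback() are hypothetical names,
and treating ->backing as a fallback target only for filter drivers is
an assumption of this sketch, not something settled in the thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyNode ToyNode;

/* Toy edge type, loosely modelled on BdrvChild. */
typedef struct ToyChild {
    ToyNode *bs;
} ToyChild;

struct ToyNode {
    bool is_filter;     /* filter driver: ->backing is not a COW child */
    ToyChild *file;
    ToyChild *backing;
};

/*
 * Return a pointer to the child slot we may fall back to, or NULL.
 * Returning the slot (not the node) lets a goto-style caller redirect
 * the edge without having to search for it again.
 */
static ToyChild **snapshot_fallback_ptr(ToyNode *bs)
{
    if (bs->file) {
        /* A filter with both slots set would have two data children. */
        if (bs->is_filter && bs->backing) {
            return NULL;
        }
        return &bs->file;
    }
    if (bs->is_filter && bs->backing) {
        /* Filter that keeps its filtered child in ->backing. */
        return &bs->backing;
    }
    /* A plain COW backing child is never a fallback target. */
    return NULL;
}

/* Small wrapper for callers that only need the node. */
static ToyNode *snapshot_fallback(ToyNode *bs)
{
    ToyChild **slot = snapshot_fallback_ptr(bs);

    if (!slot) {
        return NULL;
    }
    assert(*slot != NULL);
    return (*slot)->bs;
}
```

Because only two slots are ever inspected, the pointer-to-slot return
type that was asked for earlier becomes possible: &bs->file and
&bs->backing are stable addresses, unlike the address of a loop
variable.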

Max



* Re: [Qemu-devel] [PATCH v6 04/42] block: Add child access functions
  2019-09-10 12:59                     ` Max Reitz
@ 2019-09-10 13:10                       ` Kevin Wolf
  0 siblings, 0 replies; 132+ messages in thread
From: Kevin Wolf @ 2019-09-10 13:10 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 10.09.2019 at 14:59, Max Reitz wrote:
> On 10.09.19 14:48, Kevin Wolf wrote:
> > On 10.09.2019 at 13:36, Max Reitz wrote:
> >> On 10.09.19 12:47, Kevin Wolf wrote:
> >>> On 10.09.2019 at 11:14, Max Reitz wrote:
> >>>> Maybe we should stop declaring Quorum a filter and then rename the
> >>>> bdrv_recurse_is_first_non_filter() to, I don’t know,
> >>>> bdrv_recurse_can_be_replaced_by_mirror()?
> >>>
> >>> Why not.
> >>
> >> It feels difficult to do in this series because this is a whole new can
> >> of worms.
> >>
> >> In patch 35, I actually replace the mirror use case by
> >> is_filtered_child().  So it looks to me as if that should not be done,
> >> because I should instead fix bdrv_recurse_is_first_non_filter() (and
> >> rename it), because quorum does allow replacing its children by mirror,
> >> even if it does not act as a filter for them.
> >>
> >> OTOH, there are other users of bdrv_is_first_non_filter().  Those are
> >> qmp_block_resize() and external_snapshot_prepare(), who throw an error
> >> if that returns false.
> >>
> >> I think that’s just wrong.  First of all, I don’t even know why we have
> >> that restriction anymore (I can imagine why it used to make sense before
> >> the permission system).  qmp_block_resize() should always work as long
> >> as it can get BLK_PERM_RESIZE; and I don’t know why the parents of some
> >> node would care if you take a snapshot of their child.
> > 
> > Hm, doesn't it make sense in a way for qmp_block_resize() at least? It
> > means that you can't resize just a filter, but you need to resize the
> > image that actually provides the data for the filter.
> 
> Filters generally implement .bdrv_truncate() by passing it through, so
> it should be fine.

Good point.

Then checking bdrv_is_first_non_filter() probably just forbids the only
command that would actually work correctly (resizing the top-level
filter).

Kevin


* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback Max Reitz
  2019-08-10 16:41   ` Vladimir Sementsov-Ogievskiy
@ 2019-09-10 14:52   ` Kevin Wolf
  2019-09-11  6:20     ` Max Reitz
  1 sibling, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-10 14:52 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 09.08.2019 at 18:13, Max Reitz wrote:
> If the driver does not implement bdrv_get_allocated_file_size(), we
> should fall back to cumulating the allocated size of all non-COW
> children instead of just bs->file.
> 
> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Signed-off-by: Max Reitz <mreitz@redhat.com>

This smells like an overgeneralisation, but if we want to count all vmdk
extents, the qcow2 external data file, etc. it's an improvement anyway.
A driver that has a child that should not be counted must just remember
to implement the callback.

Let me think of an example... How about quorum, for a change? :-)
Or the second blkverify child.

Or eventually the block job filter nodes.

Ehm... Maybe I should just take back what I said first. It almost feels
like it would be better if qcow2 and vmdk explicitly used a handler that
counts all children (could still be a generic one in block.c) rather
than having to remember to disable the functionality everywhere where we
don't want to have it.

And please adjust the comment for bdrv_get_allocated_file_size(), it
only talks about a single file as if trees didn't exist. Actually, it
doesn't even seem so easy to define. Maybe primary node + storage nodes?
Then vmdk needs to expose its extents as storage nodes (plural!), but
in the long run that might be needed anyway.

Kevin



* Re: [Qemu-devel] [PATCH v6 23/42] blockdev: Use CAF in external_snapshot_prepare()
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 23/42] blockdev: Use CAF in external_snapshot_prepare() Max Reitz
@ 2019-09-10 15:02   ` Kevin Wolf
  2019-09-11  6:21     ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-10 15:02 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 09.08.2019 at 18:13, Max Reitz wrote:
> This allows us to differentiate between filters and nodes with COW
> backing files: Filters cannot be used as overlays at all (for this
> function).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Didn't we occasionally advertise blockdev-snapshot as the way to insert
filters on top at runtime? Though it seems it has always only worked for
filters that use bs->backing, among which I think there aren't any
user-creatable ones. So we're probably good.

Kevin

>  blockdev.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/blockdev.c b/blockdev.c
> index 29c6c6044a..c540802127 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -1664,7 +1664,12 @@ static void external_snapshot_prepare(BlkActionState *common,
>          goto out;
>      }
>  
> -    if (state->new_bs->backing != NULL) {
> +    if (state->new_bs->drv->is_filter) {
> +        error_setg(errp, "Filters cannot be used as overlays");
> +        goto out;
> +    }
> +
> +    if (bdrv_filtered_cow_child(state->new_bs)) {
>          error_setg(errp, "The overlay already has a backing image");
>          goto out;
>      }
> -- 
> 2.21.0
> 



* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-09-10 14:52   ` Kevin Wolf
@ 2019-09-11  6:20     ` Max Reitz
  2019-09-11  6:55       ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-11  6:20 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 10.09.19 16:52, Kevin Wolf wrote:
> On 09.08.2019 at 18:13, Max Reitz wrote:
>> If the driver does not implement bdrv_get_allocated_file_size(), we
>> should fall back to cumulating the allocated size of all non-COW
>> children instead of just bs->file.
>>
>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> 
> This smells like an overgeneralisation, but if we want to count all vmdk
> extents, the qcow2 external data file, etc. it's an improvement anyway.
> A driver that has a child that should not be counted must just remember
> to implement the callback.
> 
> Let me think of an example... How about quorum, for a change? :-)
> Or the second blkverify child.
> 
> Or eventually the block job filter nodes.

I actually think it makes sense for all of these nodes to report the sum
of all of their children’s allocated sizes.

If a quorum node has three children with allocated sizes of 3 MB, 1 MB,
and 2 MB, respectively (totally possible if some have explicit zeroes
and others don’t; it may also depend on the protocol, the filesystem,
etc.), then I think it makes most sense to report indeed 6 MB for the
quorum subtree as a whole.  What would you report?  3 MB?
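
As an illustration of the fallback under discussion, here is a minimal
sketch with toy types (SizeNode, SizeEdge and allocated_size() are
invented for this example and are not the code from the patch).  It
sums the allocated size over all non-COW children when the driver does
not report a value itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct SizeNode SizeNode;

/* Toy edge: the node it points to and whether it is a COW backing edge. */
typedef struct SizeEdge {
    const SizeNode *bs;
    bool is_cow;
} SizeEdge;

struct SizeNode {
    int64_t own_allocated;  /* >= 0: the driver reports a size itself */
    SizeEdge edges[4];
    size_t n_edges;
};

/*
 * Allocated size of @bs.  Without a driver-provided value, fall back
 * to the sum over all non-COW children.  Returns -1 if there is
 * nothing to count or if a child fails.
 */
static int64_t allocated_size(const SizeNode *bs)
{
    int64_t sum = 0;
    bool counted = false;

    assert(bs != NULL);
    if (bs->own_allocated >= 0) {
        return bs->own_allocated;
    }

    for (size_t i = 0; i < bs->n_edges; i++) {
        int64_t child_size;

        if (bs->edges[i].is_cow) {
            continue;  /* the backing file is not part of this image */
        }
        child_size = allocated_size(bs->edges[i].bs);
        if (child_size < 0) {
            return child_size;
        }
        sum += child_size;
        counted = true;
    }

    return counted ? sum : -1;
}
```

For a quorum-like node whose three children report 3 MB, 1 MB and 2 MB,
this returns 6 MB; for an overlay with a 3 MB file child and a COW
backing child, only the 3 MB are counted.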

> Ehm... Maybe I should just take back what I said first. It almost feels
> like it would be better if qcow2 and vmdk explicitly used a handler that
> counts all children (could still be a generic one in block.c) rather
> than having to remember to disable the functionality everywhere where we
> don't want to have it.

I don’t, because everywhere we don’t want this functionality, we still
need to choose a child.  This has to be done by the driver anyway.

Max

> And please adjust the comment for bdrv_get_allocated_file_size(), it
> only talks about a single file as if trees didn't exist. Actually, it
> doesn't even seem so easy to define. Maybe primary node + storage nodes?
> Then vmdk needs to expose its extents as storage nodes (plural!), but
> in the long run that might be needed anyway.



* Re: [Qemu-devel] [PATCH v6 23/42] blockdev: Use CAF in external_snapshot_prepare()
  2019-09-10 15:02   ` Kevin Wolf
@ 2019-09-11  6:21     ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-11  6:21 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 10.09.19 17:02, Kevin Wolf wrote:
> On 09.08.2019 at 18:13, Max Reitz wrote:
>> This allows us to differentiate between filters and nodes with COW
>> backing files: Filters cannot be used as overlays at all (for this
>> function).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> Didn't we occasionally advertise blockdev-snapshot as the way to insert
> filters on top at runtime?

I can only remember advertising for it as the only graph manipulation
tool we had, and maybe saying “We’d want something like
blockdev-snapshot for filters, too”.

Max

> Though it seems it has always only worked for
> filters that use bs->backing, among which I think there aren't any
> user-creatable ones. So we're probably good.
> 
> Kevin
> 
>>  blockdev.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/blockdev.c b/blockdev.c
>> index 29c6c6044a..c540802127 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -1664,7 +1664,12 @@ static void external_snapshot_prepare(BlkActionState *common,
>>          goto out;
>>      }
>>  
>> -    if (state->new_bs->backing != NULL) {
>> +    if (state->new_bs->drv->is_filter) {
>> +        error_setg(errp, "Filters cannot be used as overlays");
>> +        goto out;
>> +    }
>> +
>> +    if (bdrv_filtered_cow_child(state->new_bs)) {
>>          error_setg(errp, "The overlay already has a backing image");
>>          goto out;
>>      }
>> -- 
>> 2.21.0
>>




* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-09-11  6:20     ` Max Reitz
@ 2019-09-11  6:55       ` Kevin Wolf
  2019-09-11  7:37         ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-11  6:55 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 11.09.2019 at 08:20, Max Reitz wrote:
> On 10.09.19 16:52, Kevin Wolf wrote:
> > On 09.08.2019 at 18:13, Max Reitz wrote:
> >> If the driver does not implement bdrv_get_allocated_file_size(), we
> >> should fall back to cumulating the allocated size of all non-COW
> >> children instead of just bs->file.
> >>
> >> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >> Signed-off-by: Max Reitz <mreitz@redhat.com>
> > 
> > This smells like an overgeneralisation, but if we want to count all vmdk
> > extents, the qcow2 external data file, etc. it's an improvement anyway.
> > A driver that has a child that should not be counted must just remember
> > to implement the callback.
> > 
> > Let me think of an example... How about quorum, for a change? :-)
> > Or the second blkverify child.
> > 
> > Or eventually the block job filter nodes.
> 
> I actually think it makes sense for all of these nodes to report the sum
> of all of their children’s allocated sizes.

Hm... Yes, in a way. But not much more than it would make sense to
report the sum of the sizes of all images in the whole backing chain
(this is a useful thing to ask for, just maybe not the right thing to
return for a low-level interface). But I can accept that it's maybe a
bit more expected for quorum and blkverify than for COW images.

If you include the block job filter nodes, I have to disagree, though.
If mirror_top_bs (or any other job filter) sits in the middle of the
source chain, then I certainly don't want to see the target size added
to it.

> If a quorum node has three children with allocated sizes of 3 MB, 1 MB,
> and 2 MB, respectively (totally possible if some have explicit zeroes
> and others don’t; it may also depend on the protocol, the filesystem,
> etc.), then I think it makes most sense to report indeed 6 MB for the
> quorum subtree as a whole.  What would you report?  3 MB?

Do it the quorum way: Just vote!

No, you're right, of course. -ENOTSUP is probably the only other thing
you could do then.

> > Ehm... Maybe I should just take back what I said first. It almost feels
> > like it would be better if qcow2 and vmdk explicitly used a handler that
> > counts all children (could still be a generic one in block.c) rather
> > than having to remember to disable the functionality everywhere where we
> > don't want to have it.
> 
> I don’t, because everywhere we don’t want this functionality, we still
> need to choose a child.  This has to be done by the driver anyway.

Well, by default the primary child, which should cover like 90% of the
drivers?
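
A minimal sketch of this alternative, assuming a toy driver model in
which children[0] is the primary child (DrvNode, AllocSizeFn,
sum_all_children() and node_allocated_size() are hypothetical names;
error handling is left out for brevity):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DrvNode DrvNode;
typedef int64_t (*AllocSizeFn)(const DrvNode *bs);

struct DrvNode {
    AllocSizeFn get_allocated_size;  /* NULL: use the generic default */
    int64_t leaf_size;               /* used when there are no children */
    const DrvNode *children[4];      /* children[0] is the primary child */
    size_t n_children;
};

static int64_t node_allocated_size(const DrvNode *bs);

/* Generic helper that a driver with several storage children opts into. */
static int64_t sum_all_children(const DrvNode *bs)
{
    int64_t sum = 0;

    for (size_t i = 0; i < bs->n_children; i++) {
        sum += node_allocated_size(bs->children[i]);
    }
    return sum;
}

/* Default: ask the driver, else look at the primary child only. */
static int64_t node_allocated_size(const DrvNode *bs)
{
    assert(bs != NULL);
    if (bs->get_allocated_size) {
        return bs->get_allocated_size(bs);
    }
    if (bs->n_children == 0) {
        return bs->leaf_size;
    }
    return node_allocated_size(bs->children[0]);
}
```

With this layout a job filter with a source and a target child reports
only the source without any extra code, while a format driver with an
external data file would set get_allocated_size to sum_all_children.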

Kevin


* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-09-11  6:55       ` Kevin Wolf
@ 2019-09-11  7:37         ` Max Reitz
  2019-09-11  8:27           ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-11  7:37 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 11.09.19 08:55, Kevin Wolf wrote:
> On 11.09.2019 at 08:20, Max Reitz wrote:
>> On 10.09.19 16:52, Kevin Wolf wrote:
>>> On 09.08.2019 at 18:13, Max Reitz wrote:
>>>> If the driver does not implement bdrv_get_allocated_file_size(), we
>>>> should fall back to cumulating the allocated size of all non-COW
>>>> children instead of just bs->file.
>>>>
>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>
>>> This smells like an overgeneralisation, but if we want to count all vmdk
>>> extents, the qcow2 external data file, etc. it's an improvement anyway.
>>> A driver that has a child that should not be counted must just remember
>>> to implement the callback.
>>>
>>> Let me think of an example... How about quorum, for a change? :-)
>>> Or the second blkverify child.
>>>
>>> Or eventually the block job filter nodes.
>>
>> I actually think it makes sense for all of these nodes to report the sum
>> of all of their children’s allocated sizes.
> 
> Hm... Yes, in a way. But not much more than it would make sense to
> report the sum of the sizes of all images in the whole backing chain
> (this is a useful thing to ask for, just maybe not the right thing to
> return for a low-level interface). But I can accept that it's maybe a
> bit more expected for quorum and blkverify than for COW images.
> 
> If you include the block job filter nodes, I have to disagree, though.
> If mirror_top_bs (or any other job filter) sits in the middle of the
> source chain, then I certainly don't want to see the target size added
> to it.

Hm, I don’t care much either way.  I think it makes complete sense to
add the target size there, but OTOH it’s only temporary while the job
runs, so it may be a bit confusing if it suddenly goes up and then down
again.

But I think this is the special case, so this is what should be handled
in a driver callback.

>> If a quorum node has three children with allocated sizes of 3 MB, 1 MB,
>> and 2 MB, respectively (totally possible if some have explicit zeroes
>> and others don’t; it may also depend on the protocol, the filesystem,
>> etc.), then I think it makes most sense to report indeed 6 MB for the
>> quorum subtree as a whole.  What would you report?  3 MB?
> 
> Do it the quorum way: Just vote!

Add an option for it?  Average, maximum, median, majority, sum? :-)

> No, you're right, of course. -ENOTSUP is probably the only other thing
> you could do then.
> 
>>> Ehm... Maybe I should just take back what I said first. It almost feels
>>> like it would be better if qcow2 and vmdk explicitly used a handler that
>>> counts all children (could still be a generic one in block.c) rather
>>> than having to remember to disable the functionality everywhere where we
>>> don't want to have it.
>>
>> I don’t, because everywhere we don’t want this functionality, we still
>> need to choose a child.  This has to be done by the driver anyway.
> 
> Well, by default the primary child, which should cover like 90% of the
> drivers?

Hm, yes.

But I still think that the drivers that do not want to count every
single non-COW child are the exception.

Max



* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-09-11  7:37         ` Max Reitz
@ 2019-09-11  8:27           ` Kevin Wolf
  2019-09-11 10:00             ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-11  8:27 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 11.09.2019 at 09:37, Max Reitz wrote:
> On 11.09.19 08:55, Kevin Wolf wrote:
> > On 11.09.2019 at 08:20, Max Reitz wrote:
> >> On 10.09.19 16:52, Kevin Wolf wrote:
> >>> On 09.08.2019 at 18:13, Max Reitz wrote:
> >>>> If the driver does not implement bdrv_get_allocated_file_size(), we
> >>>> should fall back to cumulating the allocated size of all non-COW
> >>>> children instead of just bs->file.
> >>>>
> >>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>>
> >>> This smells like an overgeneralisation, but if we want to count all vmdk
> >>> extents, the qcow2 external data file, etc. it's an improvement anyway.
> >>> A driver that has a child that should not be counted must just remember
> >>> to implement the callback.
> >>>
> >>> Let me think of an example... How about quorum, for a change? :-)
> >>> Or the second blkverify child.
> >>>
> >>> Or eventually the block job filter nodes.
> >>
> >> I actually think it makes sense for all of these nodes to report the sum
> >> of all of their children’s allocated sizes.
> > 
> > Hm... Yes, in a way. But not much more than it would make sense to
> > report the sum of the sizes of all images in the whole backing chain
> > (this is a useful thing to ask for, just maybe not the right thing to
> > return for a low-level interface). But I can accept that it's maybe a
> > bit more expected for quorum and blkverify than for COW images.
> > 
> > If you include the block job filter nodes, I have to disagree, though.
> > If mirror_top_bs (or any other job filter) sits in the middle of the
> > source chain, then I certainly don't want to see the target size added
> > to it.
> 
> Hm, I don’t care much either way.  I think it makes complete sense to
> add the target size there, but OTOH it’s only temporary while the job
> runs, so it may be a bit confusing if it suddenly goes up and then down
> again.

I think the number that most users are interested in is knowing how much
space the image for their /dev/vda takes up on the host.

I can see how they might be interested in not only that one image file,
but all other image files connected to it, i.e. their /dev/vda with all
of its snapshots. This would mean counting backing files. I think adding
up the numbers for this should be done in the management layer.

I can possibly also imagine users wanting to count everything that's
even loosely connected to their /dev/vda, like copies of it. I doubt,
however, they want to count only copies that are currently being made,
but not snapshots and copies that have been completed earlier. So this
is clearly a management layer thing, too.

> But I think this is the special case, so this is what should be handled
> in a driver callback.

It's a special case, yes. But see below.

> >> If a quorum node has three children with allocated sizes of 3 MB, 1 MB,
> >> and 2 MB, respectively (totally possible if some have explicit zeroes
> >> and others don’t; it may also depend on the protocol, the filesystem,
> >> etc.), then I think it makes most sense to report indeed 6 MB for the
> >> quorum subtree as a whole.  What would you report?  3 MB?
> > 
> > > Do it the quorum way: Just vote!
> 
> Add an option for it?  Average, maximum, median, majority, sum? :-)

We could also introduce a mode with an Electoral College so that
sometimes an image that missed the majority has a chance to win anyway.

> > No, you're right, of course. -ENOTSUP is probably the only other thing
> > you could do then.
> > 
> >>> Ehm... Maybe I should just take back what I said first. It almost feels
> >>> like it would be better if qcow2 and vmdk explicitly used a handler that
> >>> counts all children (could still be a generic one in block.c) rather
> >>> than having to remember to disable the functionality everywhere where we
> >>> don't want to have it.
> >>
> >> I don’t, because everywhere we don’t want this functionality, we still
> >> need to choose a child.  This has to be done by the driver anyway.
> > 
> > Well, by default the primary child, which should cover like 90% of the
> > drivers?
> 
> Hm, yes.
> 
> But I still think that the drivers that do not want to count every
> single non-COW child are the exception.

They are, but drivers that want to count more than their primary node
are exceptions, too. And I think you're more likely to remember adding
the callback when you want to have a certain feature, not when you don't
want to have it.

I really think we're likely to forget adding the callback where we need
to disable the feature.

I can see two options that should address both of our views:

1. Just don't have a fallback at all, make the callback mandatory and
   provide implementations in block.c that can be referred to in
   BlockDriver. Not specifying the callback causes an assertion failure,
   so we'd hopefully notice it quite early (assuming that we run either
   'qemu-img info' or 'query-block' on a configuration with the block
   driver, but I think that's fairly safe to assume).

2. Make the 90% solution a 100% solution: Allow drivers to have multiple
   storage children (for vmdk) and then have the fallback add up the
   primary child plus all storage children. This is what I suggested as
   the documented semantics in my initial reply to this patch (that you
   chose not to answer).

   Adding the size of storage children covers qcow2 and vmdk.

   As the job filter won't declare the target or any other involved
   nodes their storage nodes (I hope), this will do the right thing for
   them, too.

   For quorum and blkverify both ways could be justifiable. I think they
   probably shouldn't declare their children as storage nodes. They are
   more like filters that don't have a single filtered node. So some
   kind of almost-filters.
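
To make option 2 a bit more concrete, here is a minimal, self-contained
sketch of the fallback semantics (primary child plus all storage
children, nothing else). Everything in it is a hypothetical stand-in:
SketchNode, SketchChildRole and sketch_allocated_size() are made up for
illustration and are not QEMU's types or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical roles; these are NOT the real BdrvChildRole objects. */
typedef enum {
    SKETCH_ROLE_PRIMARY,  /* what bs->file is for most drivers */
    SKETCH_ROLE_STORAGE,  /* extra data storage: vmdk extents, qcow2 data file */
    SKETCH_ROLE_COW,      /* backing file: never counted */
    SKETCH_ROLE_OTHER     /* e.g. a job target: never counted */
} SketchChildRole;

typedef struct SketchNode SketchNode;

typedef struct SketchChild {
    SketchChildRole role;
    const SketchNode *node;
} SketchChild;

struct SketchNode {
    int64_t leaf_size;            /* bytes for a leaf, or a negative errno */
    const SketchChild *children;  /* NULL for a leaf (protocol) node */
    size_t nb_children;
};

/* Fallback: primary child plus all storage children, nothing else. */
int64_t sketch_allocated_size(const SketchNode *n)
{
    int64_t sum = 0;

    if (n->nb_children == 0) {
        return n->leaf_size;
    }
    for (size_t i = 0; i < n->nb_children; i++) {
        const SketchChild *c = &n->children[i];
        int64_t size;

        if (c->role != SKETCH_ROLE_PRIMARY && c->role != SKETCH_ROLE_STORAGE) {
            continue;  /* COW children and job targets are skipped */
        }
        assert(c->node != NULL);
        size = sketch_allocated_size(c->node);
        if (size < 0) {
            return size;  /* propagate the first error */
        }
        sum += size;
    }
    return sum;
}
```

With this, a qcow2-like node with a metadata file, an external data
file and a backing file reports the first two only, and a job filter
that does not mark its target as storage never counts it.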

Kevin


* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-09-11  8:27           ` Kevin Wolf
@ 2019-09-11 10:00             ` Max Reitz
  2019-09-11 10:31               ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-11 10:00 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 11.09.19 10:27, Kevin Wolf wrote:
> On 11.09.2019 at 09:37, Max Reitz wrote:
>> On 11.09.19 08:55, Kevin Wolf wrote:
>>> On 11.09.2019 at 08:20, Max Reitz wrote:
>>>> On 10.09.19 16:52, Kevin Wolf wrote:
>>>>> On 09.08.2019 at 18:13, Max Reitz wrote:
>>>>>> If the driver does not implement bdrv_get_allocated_file_size(), we
>>>>>> should fall back to cumulating the allocated size of all non-COW
>>>>>> children instead of just bs->file.
>>>>>>
>>>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>>>
>>>>> This smells like an overgeneralisation, but if we want to count all vmdk
>>>>> extents, the qcow2 external data file, etc. it's an improvement anyway.
>>>>> A driver that has a child that should not be counted must just remember
>>>>> to implement the callback.
>>>>>
>>>>> Let me think of an example... How about quorum, for a change? :-)
>>>>> Or the second blkverify child.
>>>>>
>>>>> Or eventually the block job filter nodes.
>>>>
>>>> I actually think it makes sense for all of these nodes to report the sum
>>>> of all of their children’s allocated sizes.
>>>
>>> Hm... Yes, in a way. But not much more than it would make sense to
>>> report the sum of the sizes of all images in the whole backing chain
>>> (this is a useful thing to ask for, just maybe not the right thing to
>>> return for a low-level interface). But I can accept that it's maybe a
>>> bit more expected for quorum and blkverify than for COW images.
>>>
>>> If you include the block job filter nodes, I have to disagree, though.
>>> If mirror_top_bs (or any other job filter) sits in the middle of the
>>> source chain, then I certainly don't want to see the target size added
>>> to it.
>>
>> Hm, I don’t care much either way.  I think it makes complete sense to
>> add the target size there, but OTOH it’s only temporary while the job
>> runs, so it may be a bit confusing if it suddenly goes up and then down
>> again.
> 
> I think the number that most users are interested in is knowing how much
> space the image for their /dev/vda takes up on the host.
> 
> I can see how they might be interested in not only that one image file,
> but all other image files connected to it, i.e. their /dev/vda with all
> of its snapshots. This would mean counting backing files. I think adding
> up the numbers for this should be done in the management layer.

My main argument against counting backing files is that we’ve never done it.

(Whereas for quorum, I’d argue we just forgot to adjust
bdrv_get_allocated_file_size() for it.)

> I can possibly also imagine users wanting to count everything that's
> even loosely connected to their /dev/vda, like copies of it. I doubt,
> however, they want to count only copies that are currently being made,
> but not snapshots and copies that have been completed earlier. So this
> is clearly a management layer thing, too.

OK.

>> But I think this is the special case, so this is what should be handled
>> in a driver callback.
> 
> It's a special case, yes. But see below.
> 
>>>> If a quorum node has three children with allocated sizes of 3 MB, 1 MB,
>>>> and 2 MB, respectively (totally possible if some have explicit zeroes
>>>> and others don’t; it may also depend on the protocol, the filesystem,
>>>> etc.), then I think it makes most sense to report indeed 6 MB for the
>>>> quorum subtree as a whole.  What would you report?  3 MB?
>>>
>>> Do it the quorum way: Just vote!
>>
>> Add an option for it?  Average, maximum, median, majority, sum? :-)
> 
> We could also introduce a mode with an Electoral College so that
> sometimes an image that missed the majority has a chance to win anyway.

That’s actually a good idea for a quorum mode in general.  Who says the
majority is right?  Better let someone with more authority cross-check
the result.

>>> No, you're right, of course. -ENOTSUP is probably the only other thing
>>> you could do then.
>>>
>>>>> Ehm... Maybe I should just take back what I said first. It almost feels
>>>>> like it would be better if qcow2 and vmdk explicitly used a handler that
>>>>> counts all children (could still be a generic one in block.c) rather
>>>>> than having to remember to disable the functionality everywhere where we
>>>>> don't want to have it.
>>>>
>>>> I don’t, because everywhere we don’t want this functionality, we still
>>>> need to choose a child.  This has to be done by the driver anyway.
>>>
>>> Well, by default the primary child, which should cover like 90% of the
>>> drivers?
>>
>> Hm, yes.
>>
>> But I still think that the drivers that do not want to count every
>> single non-COW child are the exception.
> 
> They are, but drivers that want to count more than their primary node
> are exceptions, too. And I think you're more likely to remember adding
> the callback when you want to have a certain feature, not when you don't
> want to have it.
> 
> I really think we're likely to forget adding the callback where we need
> to disable the feature.

Well, I mean, we did forget adding it for qcow2.

> I can see two options that should address both of our views:
> 
> 1. Just don't have a fallback at all, make the callback mandatory and
>    provide implementations in block.c that can be referred to in
>    BlockDriver. Not specifying the callback causes an assertion failure,
>    so we'd hopefully notice it quite early (assuming that we run either
>    'qemu-img info' or 'query-block' on a configuration with the block
>    driver, but I think that's fairly safe to assume).

Hm.  Seems a bit much, but if we can’t agree on what’s a good general
implementation that works for everything, this is probably the only
thing that would actually keep us from forgetting to add special cases.

Though I actually don’t know.  I’d probably add two globally available
helpers, one that returns the sum of everything but the backing node,
and one that just returns the primary node.

Now if I were to make qcow2 use the primary node helper function, would
we have remembered changing it once we added a data file?

Hmm.  Maybe not, but it should be OK to just make everything use the sum
helper, except the drivers that want the primary node.  That should work
for all cases.  (I think that whenever a format driver suddenly gains
more child nodes, we probably will want to count them.  OTOH, everything
that has nodes that shouldn’t be counted probably always wants to use
the primary node helper function from the start.)

> 2. Make the 90% solution a 100% solution: Allow drivers to have multiple
>    storage children (for vmdk) and then have the fallback add up the
>    primary child plus all storage children. This is what I suggested as
>    the documented semantics in my initial reply to this patch (that you
>    chose not to answer).

I didn’t answer that because I didn’t disagree.

>    Adding the size of storage children covers qcow2 and vmdk.

That’s of course exactly what we’re trying to do, but the question is,
how do we figure out those storage children?  Make it a per-BdrvChild
attribute?  That seems rather heavy-handed, because I think we’d need it
only here.

>    As the job filter won't declare the target or any other involved
>    nodes their storage nodes (I hope), this will do the right thing for
>    them, too.
> 
>    For quorum and blkverify both ways could be justifiable. I think they
>    probably shouldn't declare their children as storage nodes. They are
>    more like filters that don't have a single filtered node. So some
>    kind of almost-filters.

I don’t think quorum is a filter, and blkverify can only be justified to
be a filter because it quits qemu when there is a mismatch.

The better example is replication, but that has a clear filtered child
(the primary node).


So all in all I think it’s best to make the callback mandatory and add
two global helper functions.  That’s simple enough and should prevent us
from making mistakes by forgetting to adjust something in the future.
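
Roughly what I have in mind, as a self-contained toy model rather than
actual block-layer code.  All names here (SketchDriver, SketchBDS, the
sketch_*() functions) are invented for this mail; the real callback and
helpers would live in BlockDriver and block.c and may well look
different:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef struct SketchBDS SketchBDS;

/* Hypothetical miniature of BlockDriver; only the one callback is modelled. */
typedef struct SketchDriver {
    const char *format_name;
    int64_t (*get_allocated_file_size)(SketchBDS *bs);  /* mandatory here */
} SketchDriver;

struct SketchBDS {
    const SketchDriver *drv;
    int64_t leaf_size;     /* only meaningful for protocol (leaf) nodes */
    SketchBDS *primary;    /* the node's primary child, may be NULL */
    SketchBDS *backing;    /* COW child, may be NULL */
    SketchBDS **children;  /* every child, including primary and backing */
    size_t nb_children;
};

/* Entry point: no fallback, a missing callback is a programming error. */
int64_t sketch_get_allocated_file_size(SketchBDS *bs)
{
    assert(bs->drv->get_allocated_file_size != NULL);
    return bs->drv->get_allocated_file_size(bs);
}

/* Helper 1: size of the primary child only. */
int64_t sketch_primary_allocated_size(SketchBDS *bs)
{
    if (!bs->primary) {
        return -ENOTSUP;
    }
    return sketch_get_allocated_file_size(bs->primary);
}

/* Helper 2: sum over every child except the backing (COW) node. */
int64_t sketch_sum_allocated_size(SketchBDS *bs)
{
    int64_t sum = 0;

    for (size_t i = 0; i < bs->nb_children; i++) {
        int64_t size;

        if (bs->children[i] == bs->backing) {
            continue;
        }
        size = sketch_get_allocated_file_size(bs->children[i]);
        if (size < 0) {
            return size;  /* propagate the first error */
        }
        sum += size;
    }
    return sum;
}

/* What a protocol driver would plug in: it knows its own size. */
int64_t sketch_leaf_allocated_size(SketchBDS *bs)
{
    return bs->leaf_size;
}
```

A format driver would then point its callback at the sum helper, and
anything with children that must not be counted would point it at the
primary helper.  A driver that sets neither trips the assertion the
first time somebody queries its size.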

Max



* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-09-11 10:00             ` Max Reitz
@ 2019-09-11 10:31               ` Kevin Wolf
  2019-09-11 11:00                 ` Max Reitz
  0 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-11 10:31 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 11.09.2019 at 12:00, Max Reitz wrote:
> On 11.09.19 10:27, Kevin Wolf wrote:
> > On 11.09.2019 at 09:37, Max Reitz wrote:
> >> On 11.09.19 08:55, Kevin Wolf wrote:
> >>> Well, by default the primary child, which should cover like 90% of the
> >>> drivers?
> >>
> >> Hm, yes.
> >>
> >> But I still think that the drivers that do not want to count every
> >> single non-COW child are the exception.
> > 
> > They are, but drivers that want to count more than their primary node
> > are exceptions, too. And I think you're more likely to remember adding
> > the callback when you want to have a certain feature, not when you don't
> > want to have it.
> > 
> > I really think we're likely to forget adding the callback where we need
> > to disable the feature.
> 
> Well, I mean, we did forget adding it for qcow2.

I'm afraid I have to agree. So the conclusion is that we won't get it
right anyway?

> > I can see two options that should address both of our views:
> > 
> > 1. Just don't have a fallback at all, make the callback mandatory and
> >    provide implementations in block.c that can be referred to in
> >    BlockDriver. Not specifying the callback causes an assertion failure,
> >    so we'd hopefully notice it quite early (assuming that we run either
> >    'qemu-img info' or 'query-block' on a configuration with the block
> >    driver, but I think that's fairly safe to assume).
> 
> Hm.  Seems a bit much, but if we can’t agree on what’s a good general
> implementation that works for everything, this is probably the only
> thing that would actually keep us from forgetting to add special cases.
> 
> Though I actually don’t know.  I’d probably add two globally available
> helpers, one that returns the sum of everything but the backing node,
> and one that just returns the primary node.

Yes, I think this is the same as I meant by "provide implementations in
block.c".

> Now if I were to make qcow2 use the primary node helper function, would
> we have remembered changing it once we added a data file?
> 
> Hmm.  Maybe not, but it should be OK to just make everything use the sum
> helper, except the drivers that want the primary node.  That should work
> for all cases.  (I think that whenever a format driver suddenly gains
> more child nodes, we probably will want to count them.  OTOH, everything
> that has nodes that shouldn’t be counted probably always wants to use
> the primary node helper function from the start.)

The job filter nodes have only one child currently, which should be
counted. We'll add other children that shouldn't be counted only later.

But we already have an idea of what possible extensions look like, so we
can probably choose the right function from the start.

> > 2. Make the 90% solution a 100% solution: Allow drivers to have multiple
> >    storage children (for vmdk) and then have the fallback add up the
> >    primary child plus all storage children. This is what I suggested as
> >    the documented semantics in my initial reply to this patch (that you
> >    chose not to answer).
> 
> I didn’t answer that because I didn’t disagree.
> 
> >    Adding the size of storage children covers qcow2 and vmdk.
> 
> That’s of course exactly what we’re trying to do, but the question is,
> > how do we figure out those storage children?  Make it a per-BdrvChild
> attribute?  That seems rather heavy-handed, because I think we’d need it
> only here.

Well, you added bdrv_storage_child(). I'd argue this interface is wrong
because it assumes that only one storage child exists. You just didn't
implement it for vmdk so that the problem didn't become apparent. It
would have to return a list rather than a single child. So fixing the
interface and then using it is what I was thinking.

Now that you mention a per-BdrvChild attribute, however, I start to
wonder if the distinction between COW children, filter children, storage
children, metadata children, etc. isn't really what BdrvChildRole was
supposed to represent?

Maybe we want to split off child_storage from child_file, though it's
not strictly necessary for this specific case because we want to treat
both metadata and storage nodes the same. But it could be useful for
other users of bdrv_storage_child(), if there are any.

> >    As the job filter won't declare the target or any other involved
> >    nodes their storage nodes (I hope), this will do the right thing for
> >    them, too.
> > 
> >    For quorum and blkverify both ways could be justifiable. I think they
> >    probably shouldn't declare their children as storage nodes. They are
> >    more like filters that don't have a single filtered node. So some
> >    kind of almost-filters.
> 
> I don’t think quorum is a filter, and blkverify can only be justified to
> be a filter because it quits qemu when there is a mismatch.
> 
> The better example is replication, but that has a clear filtered child
> (the primary node).
> 
> 
> So all in all I think it’s best to make the callback mandatory and add
> two global helper functions.  That’s simple enough and should prevent
> us from making mistakes by forgetting to adjust something in the
> future.

Yes, that should work.

We should probably still figure out what the relationship between the
child access functions and child roles is, even if we don't need it for
this solution. But it feels like an important part of the design.

Kevin


* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-09-11 10:31               ` Kevin Wolf
@ 2019-09-11 11:00                 ` Max Reitz
  2019-09-12 10:34                   ` Kevin Wolf
  2019-11-14 13:11                   ` Max Reitz
  0 siblings, 2 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-11 11:00 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 11.09.19 12:31, Kevin Wolf wrote:
> On 11.09.2019 at 12:00, Max Reitz wrote:
>> On 11.09.19 10:27, Kevin Wolf wrote:
>>> On 11.09.2019 at 09:37, Max Reitz wrote:
>>>> On 11.09.19 08:55, Kevin Wolf wrote:
>>>>> Well, by default the primary child, which should cover like 90% of the
>>>>> drivers?
>>>>
>>>> Hm, yes.
>>>>
>>>> But I still think that the drivers that do not want to count every
>>>> single non-COW child are the exception.
>>>
>>> They are, but drivers that want to count more than their primary node
>>> are exceptions, too. And I think you're more likely to remember adding
>>> the callback when you want to have a certain feature, not when you don't
>>> want to have it.
>>>
>>> I really think we're likely to forget adding the callback where we need
>>> to disable the feature.
>>
>> Well, I mean, we did forget adding it for qcow2.
> 
> I'm afraid I have to agree. So the conclusion is that we won't get it
> right anyway?
> 
>>> I can see two options that should address both of our views:
>>>
>>> 1. Just don't have a fallback at all, make the callback mandatory and
>>>    provide implementations in block.c that can be referred to in
>>>    BlockDriver. Not specifying the callback causes an assertion failure,
>>>    so we'd hopefully notice it quite early (assuming that we run either
>>>    'qemu-img info' or 'query-block' on a configuration with the block
>>>    driver, but I think that's fairly safe to assume).
>>
>> Hm.  Seems a bit much, but if we can’t agree on what’s a good general
>> implementation that works for everything, this is probably the only
>> thing that would actually keep us from forgetting to add special cases.
>>
>> Though I actually don’t know.  I’d probably add two globally available
>> helpers, one that returns the sum of everything but the backing node,
>> and one that just returns the primary node.
> 
> Yes, I think this is the same as I meant by "provide implementations in
> block.c".
> 
>> Now if I were to make qcow2 use the primary node helper function, would
>> we have remembered changing it once we added a data file?
>>
>> Hmm.  Maybe not, but it should be OK to just make everything use the sum
>> helper, except the drivers that want the primary node.  That should work
>> for all cases.  (I think that whenever a format driver suddenly gains
>> more child nodes, we probably will want to count them.  OTOH, everything
>> that has nodes that shouldn’t be counted probably always wants to use
>> the primary node helper function from the start.)
> 
> The job filter nodes have only one child currently, which should be
> counted. We'll add other children that shouldn't be counted only later.
> 
> But we already have an idea of what possible extensions look like, so we
> can probably choose the right function from the start.

Yep.

>>> 2. Make the 90% solution a 100% solution: Allow drivers to have multiple
>>>    storage children (for vmdk) and then have the fallback add up the
>>>    primary child plus all storage children. This is what I suggested as
>>>    the documented semantics in my initial reply to this patch (that you
>>>    chose not to answer).
>>
>> I didn’t answer that because I didn’t disagree.
>>
>>>    Adding the size of storage children covers qcow2 and vmdk.
>>
>> That’s of course exactly what we’re trying to do, but the question is,
>> how do we figure out those storage children?  Make it a per-BdrvChild
>> attribute?  That seems rather heavy-handed, because I think we’d need it
>> only here.
> 
> Well, you added bdrv_storage_child(). I'd argue this interface is wrong

Yes, it probably is.

> because it assumes that only one storage child exists. You just didn't
> implement it for vmdk so that the problem didn't become apparent. It
> would have to return a list rather than a single child. So fixing the
> interface and then using it is what I was thinking.
> 
> Now that you mention a per-BdrvChild attribute, however, I start to
> wonder if the distinction between COW children, filter children, storage
> children, metadata children, etc. isn't really what BdrvChildRole was
> supposed to represent?

That’s a good point.

> Maybe we want to split off child_storage from child_file, though it's
> not strictly necessary for this specific case because we want to treat
> both metadata and storage nodes the same. But it could be useful for
> other users of bdrv_storage_child(), if there are any.

Possible.  Maybe it turns out that at least for this series I don’t need
bdrv_storage_child() at all.

>>>    As the job filter won't declare the target or any other involved
>>>    nodes their storage nodes (I hope), this will do the right thing for
>>>    them, too.
>>>
>>>    For quorum and blkverify both ways could be justifiable. I think they
>>>    probably shouldn't declare their children as storage nodes. They are
>>>    more like filters that don't have a single filtered node. So some
>>>    kind of almost-filters.
>>
>> I don’t think quorum is a filter, and blkverify can only be justified to
>> be a filter because it quits qemu when there is a mismatch.
>>
>> The better example is replication, but that has a clear filtered child
>> (the primary node).
>>
>>
>> So all in all I think it’s best to make the callback mandatory and add
>> two global helper functions.  That’s simple enough and should prevent
>> us from making mistakes by forgetting to adjust something in the
>> future.
> 
> Yes, that should work.
> 
> We should probably still figure out what the relationship between the
> child access functions and child roles is, even if we don't need it for
> this solution. But it feels like an important part of the design.

Hm.  It feels like something that should be done before this series,
actually.

So I think we should add at least a child role per child access function
so that they match?  And then maybe in bdrv_attach_child() assert that a
BDS never has more than one primary or filtered child (a filtered child
acts as a primary child, too), or more than one COW child.  (And that
these are always in bs->file or bs->backing so the child access
functions do work.)
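
As a toy illustration of the kind of check I mean (hypothetical names
throughout; the real thing would sit in bdrv_attach_child() and work on
BdrvChildRole, and this sketch leaves out the bs->file / bs->backing
part entirely):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical roles, one per child access function. */
typedef enum {
    SKETCH_CHILD_PRIMARY,
    SKETCH_CHILD_FILTERED,  /* also acts as the primary child */
    SKETCH_CHILD_COW,
    SKETCH_CHILD_OTHER
} SketchRole;

#define SKETCH_MAX_CHILDREN 8

typedef struct SketchNode {
    SketchRole roles[SKETCH_MAX_CHILDREN];
    size_t nb_children;
} SketchNode;

/* Count how many already-attached children have the given role. */
static size_t sketch_count(const SketchNode *n, SketchRole role)
{
    size_t count = 0;

    for (size_t i = 0; i < n->nb_children; i++) {
        if (n->roles[i] == role) {
            count++;
        }
    }
    return count;
}

/* At most one primary-or-filtered child and at most one COW child. */
bool sketch_may_attach(const SketchNode *n, SketchRole role)
{
    if (n->nb_children >= SKETCH_MAX_CHILDREN) {
        return false;
    }
    switch (role) {
    case SKETCH_CHILD_PRIMARY:
    case SKETCH_CHILD_FILTERED:
        return sketch_count(n, SKETCH_CHILD_PRIMARY) +
               sketch_count(n, SKETCH_CHILD_FILTERED) == 0;
    case SKETCH_CHILD_COW:
        return sketch_count(n, SKETCH_CHILD_COW) == 0;
    default:
        return true;  /* any number of other children is fine */
    }
}

/* Attaching a child that violates the invariant is a programming error. */
void sketch_attach_child(SketchNode *n, SketchRole role)
{
    assert(sketch_may_attach(n, role));
    n->roles[n->nb_children++] = role;
}
```

So a node that already has a filtered child cannot gain a primary one,
and nobody gets two COW children.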

Max



* Re: [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-09-11 11:00                 ` Max Reitz
@ 2019-09-12 10:34                   ` Kevin Wolf
  2019-11-14 13:11                   ` Max Reitz
  1 sibling, 0 replies; 132+ messages in thread
From: Kevin Wolf @ 2019-09-12 10:34 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 11.09.2019 at 13:00, Max Reitz wrote:
> On 11.09.19 12:31, Kevin Wolf wrote:
> > On 11.09.2019 at 12:00, Max Reitz wrote:
> >> So all in all I think it’s best to make the callback mandatory and add
> >> two global helper functions.  That’s simple enough and should prevent
> >> us from making mistakes by forgetting to adjust something in the
> >> future.
> > 
> > Yes, that should work.
> > 
> > We should probably still figure out what the relationship between the
> > child access functions and child roles is, even if we don't need it for
> > this solution. But it feels like an important part of the design.
> 
> Hm.  It feels like something that should be done before this series,
> actually.
> 
> So I think we should add at least a child role per child access function
> so that they match?  And then maybe in bdrv_attach_child() assert that a
> BDS never has more than one primary or filtered child (a filtered child
> acts as a primary child, too), or more than one COW child.  (And that
> these are always in bs->file or bs->backing so the child access
> functions do work.)

Makes sense to me.

Kevin


* Re: [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters Max Reitz
  2019-08-12 11:09   ` Vladimir Sementsov-Ogievskiy
  2019-08-31  9:57   ` Vladimir Sementsov-Ogievskiy
@ 2019-09-13 12:55   ` Kevin Wolf
  2019-09-16 10:26     ` Max Reitz
  2 siblings, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-13 12:55 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 09.08.2019 at 18:13, Max Reitz wrote:
> This includes some permission limiting (for example, we only need to
> take the RESIZE permission for active commits where the base is smaller
> than the top).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/mirror.c | 117 ++++++++++++++++++++++++++++++++++++++-----------
>  blockdev.c     |  47 +++++++++++++++++---
>  2 files changed, 131 insertions(+), 33 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 54bafdf176..6ddbfb9708 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -42,6 +42,7 @@ typedef struct MirrorBlockJob {
>      BlockBackend *target;
>      BlockDriverState *mirror_top_bs;
>      BlockDriverState *base;
> +    BlockDriverState *base_overlay;
>  
>      /* The name of the graph node to replace */
>      char *replaces;
> @@ -665,8 +666,10 @@ static int mirror_exit_common(Job *job)
>                               &error_abort);
>      if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
>          BlockDriverState *backing = s->is_none_mode ? src : s->base;
> -        if (backing_bs(target_bs) != backing) {
> -            bdrv_set_backing_hd(target_bs, backing, &local_err);
> +        BlockDriverState *unfiltered_target = bdrv_skip_rw_filters(target_bs);
> +
> +        if (bdrv_filtered_cow_bs(unfiltered_target) != backing) {
> +            bdrv_set_backing_hd(unfiltered_target, backing, &local_err);
>              if (local_err) {
>                  error_report_err(local_err);
>                  ret = -EPERM;
> @@ -715,7 +718,7 @@ static int mirror_exit_common(Job *job)
>       * valid.
>       */
>      block_job_remove_all_bdrv(bjob);
> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>  
>      /* We just changed the BDS the job BB refers to (with either or both of the
>       * bdrv_replace_node() calls), so switch the BB back so the cleanup does
> @@ -812,7 +815,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
>              return 0;
>          }
>  
> -        ret = bdrv_is_allocated_above(bs, base, false, offset, bytes, &count);
> +        ret = bdrv_is_allocated_above(bs, s->base_overlay, true, offset, bytes,
> +                                      &count);
>          if (ret < 0) {
>              return ret;
>          }
> @@ -908,7 +912,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
>      } else {
>          s->target_cluster_size = BDRV_SECTOR_SIZE;
>      }
> -    if (backing_filename[0] && !target_bs->backing &&
> +    if (backing_filename[0] && !bdrv_backing_chain_next(target_bs) &&
>          s->granularity < s->target_cluster_size) {
>          s->buf_size = MAX(s->buf_size, s->target_cluster_size);
>          s->cow_bitmap = bitmap_new(length);
> @@ -1088,8 +1092,9 @@ static void mirror_complete(Job *job, Error **errp)
>      if (s->backing_mode == MIRROR_OPEN_BACKING_CHAIN) {
>          int ret;
>  
> -        assert(!target->backing);
> -        ret = bdrv_open_backing_file(target, NULL, "backing", errp);
> +        assert(!bdrv_backing_chain_next(target));
> +        ret = bdrv_open_backing_file(bdrv_skip_rw_filters(target), NULL,
> +                                     "backing", errp);
>          if (ret < 0) {
>              return;
>          }
> @@ -1531,8 +1536,8 @@ static BlockJob *mirror_start_job(
>      MirrorBlockJob *s;
>      MirrorBDSOpaque *bs_opaque;
>      BlockDriverState *mirror_top_bs;
> -    bool target_graph_mod;
>      bool target_is_backing;
> +    uint64_t target_perms, target_shared_perms;
>      Error *local_err = NULL;
>      int ret;
>  
> @@ -1551,7 +1556,7 @@ static BlockJob *mirror_start_job(
>          buf_size = DEFAULT_MIRROR_BUF_SIZE;
>      }
>  
> -    if (bs == target) {
> +    if (bdrv_skip_rw_filters(bs) == bdrv_skip_rw_filters(target)) {
>          error_setg(errp, "Can't mirror node into itself");
>          return NULL;
>      }
> @@ -1615,15 +1620,50 @@ static BlockJob *mirror_start_job(
>       * In the case of active commit, things look a bit different, though,
>       * because the target is an already populated backing file in active use.
>       * We can allow anything except resize there.*/
> +
> +    target_perms = BLK_PERM_WRITE;
> +    target_shared_perms = BLK_PERM_WRITE_UNCHANGED;
> +
>      target_is_backing = bdrv_chain_contains(bs, target);
> -    target_graph_mod = (backing_mode != MIRROR_LEAVE_BACKING_CHAIN);
> +    if (target_is_backing) {
> +        int64_t bs_size, target_size;
> +        bs_size = bdrv_getlength(bs);
> +        if (bs_size < 0) {
> +            error_setg_errno(errp, -bs_size,
> +                             "Could not inquire top image size");
> +            goto fail;
> +        }
> +
> +        target_size = bdrv_getlength(target);
> +        if (target_size < 0) {
> +            error_setg_errno(errp, -target_size,
> +                             "Could not inquire base image size");
> +            goto fail;
> +        }
> +
> +        if (target_size < bs_size) {
> +            target_perms |= BLK_PERM_RESIZE;
> +        }
> +
> +        target_shared_perms |= BLK_PERM_CONSISTENT_READ
> +                            |  BLK_PERM_WRITE
> +                            |  BLK_PERM_GRAPH_MOD;
> +    } else if (bdrv_chain_contains(bs, bdrv_skip_rw_filters(target))) {
> +        /*
> +         * We may want to allow this in the future, but it would
> +         * require taking some extra care.
> +         */
> +        error_setg(errp, "Cannot mirror to a filter on top of a node in the "
> +                   "source's backing chain");
> +        goto fail;
> +    }
> +
> +    if (backing_mode != MIRROR_LEAVE_BACKING_CHAIN) {
> +        target_perms |= BLK_PERM_GRAPH_MOD;
> +    }

This is getting absurd. We keep moving GRAPH_MOD around, but still
nobody knows what it's actually supposed to mean. Maybe it would be
better to just remove it finally?

Of course, not a reason to stop this patch, after all it's moving the
nonsensical piece of code correctly...
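
For reference, stripped of the error handling, the logic being moved
boils down to roughly the following. This is only a standalone sketch:
the flag values, TargetPerms and toy_target_perms() are invented for
illustration and are not the real BLK_PERM_* definitions or any block
layer function; only the structure follows the hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented stand-ins for the BLK_PERM_* flags (values are arbitrary). */
enum {
    PERM_CONSISTENT_READ = 1u << 0,
    PERM_WRITE           = 1u << 1,
    PERM_WRITE_UNCHANGED = 1u << 2,
    PERM_RESIZE          = 1u << 3,
    PERM_GRAPH_MOD       = 1u << 4,
};

typedef struct TargetPerms {
    uint64_t perms;   /* what the job needs on the target */
    uint64_t shared;  /* what the job lets others do with the target */
} TargetPerms;

/* Hypothetical helper mirroring the structure of the quoted hunk. */
static TargetPerms toy_target_perms(bool target_is_backing,
                                    int64_t bs_size, int64_t target_size,
                                    bool modifies_backing_chain)
{
    TargetPerms p = {
        .perms  = PERM_WRITE,
        .shared = PERM_WRITE_UNCHANGED,
    };

    if (target_is_backing) {
        /* Active commit: only grow the base if it is too small */
        if (target_size < bs_size) {
            p.perms |= PERM_RESIZE;
        }
        p.shared |= PERM_CONSISTENT_READ | PERM_WRITE | PERM_GRAPH_MOD;
    }

    if (modifies_backing_chain) {
        p.perms |= PERM_GRAPH_MOD;
    }

    return p;
}
```

Written like this, GRAPH_MOD shows up once on each side, and neither
place says what it is supposed to protect against.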

>      s->target = blk_new(s->common.job.aio_context,
> -                        BLK_PERM_WRITE | BLK_PERM_RESIZE |
> -                        (target_graph_mod ? BLK_PERM_GRAPH_MOD : 0),
> -                        BLK_PERM_WRITE_UNCHANGED |
> -                        (target_is_backing ? BLK_PERM_CONSISTENT_READ |
> -                                             BLK_PERM_WRITE |
> -                                             BLK_PERM_GRAPH_MOD : 0));
> +                        target_perms, target_shared_perms);
>      ret = blk_insert_bs(s->target, target, errp);
>      if (ret < 0) {
>          goto fail;
> @@ -1647,6 +1687,7 @@ static BlockJob *mirror_start_job(
>      s->backing_mode = backing_mode;
>      s->copy_mode = copy_mode;
>      s->base = base;
> +    s->base_overlay = bdrv_find_overlay(bs, base);
>      s->granularity = granularity;
>      s->buf_size = ROUND_UP(buf_size, granularity);
>      s->unmap = unmap;
> @@ -1693,15 +1734,39 @@ static BlockJob *mirror_start_job(
>      /* In commit_active_start() all intermediate nodes disappear, so
>       * any jobs in them must be blocked */
>      if (target_is_backing) {
> -        BlockDriverState *iter;
> -        for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) {
> -            /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
> -             * ourselves at s->base (if writes are blocked for a node, they are
> -             * also blocked for its backing file). The other options would be a
> -             * second filter driver above s->base (== target). */
> +        BlockDriverState *iter, *filtered_target;
> +        uint64_t iter_shared_perms;
> +
> +        /*
> +         * The topmost node with
> +         * bdrv_skip_rw_filters(filtered_target) == bdrv_skip_rw_filters(target)
> +         */
> +        filtered_target = bdrv_filtered_cow_bs(bdrv_find_overlay(bs, target));
> +
> +        assert(bdrv_skip_rw_filters(filtered_target) ==
> +               bdrv_skip_rw_filters(target));
> +
> +        /*
> +         * XXX BLK_PERM_WRITE needs to be allowed so we don't block
> +         * ourselves at s->base (if writes are blocked for a node, they are
> +         * also blocked for its backing file). The other options would be a
> +         * second filter driver above s->base (== target).
> +         */
> +        iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
> +
> +        for (iter = bdrv_filtered_bs(bs); iter != target;
> +             iter = bdrv_filtered_bs(iter))
> +        {
> +            if (iter == filtered_target) {
> +                /*
> +                 * From here on, all nodes are filters on the base.
> +                 * This allows us to share BLK_PERM_CONSISTENT_READ.
> +                 */
> +                iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
> +            }
> +
>              ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
> -                                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
> -                                     errp);
> +                                     iter_shared_perms, errp);
>              if (ret < 0) {
>                  goto fail;
>              }
> @@ -1737,7 +1802,7 @@ fail:
>      bs_opaque->stop = true;
>      bdrv_child_refresh_perms(mirror_top_bs, mirror_top_bs->backing,
>                               &error_abort);
> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>  
>      bdrv_unref(mirror_top_bs);
>  
> @@ -1764,7 +1829,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
>          return;
>      }
>      is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
> -    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
> +    base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
>      mirror_start_job(job_id, bs, creation_flags, target, replaces,
>                       speed, granularity, buf_size, backing_mode,
>                       on_source_error, on_target_error, unmap, NULL, NULL,
> diff --git a/blockdev.c b/blockdev.c
> index c540802127..c451f553f7 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -3851,7 +3851,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>          return;
>      }
>  
> -    if (!bs->backing && sync == MIRROR_SYNC_MODE_TOP) {
> +    if (!bdrv_backing_chain_next(bs) && sync == MIRROR_SYNC_MODE_TOP) {
>          sync = MIRROR_SYNC_MODE_FULL;
>      }
>  
> @@ -3900,7 +3900,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>  
>  void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>  {
> -    BlockDriverState *bs;
> +    BlockDriverState *bs, *unfiltered_bs;
>      BlockDriverState *source, *target_bs;
>      AioContext *aio_context;
>      BlockMirrorBackingMode backing_mode;
> @@ -3909,6 +3909,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>      int flags;
>      int64_t size;
>      const char *format = arg->format;
> +    const char *replaces_node_name = NULL;
>      int ret;
>  
>      bs = qmp_get_root_bs(arg->device, errp);
> @@ -3921,6 +3922,16 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>          return;
>      }
>  
> +    /*
> +     * If the user has not instructed us otherwise, we should let the
> +     * block job run from @bs (thus taking into account all filters on
> +     * it) but replace @unfiltered_bs when it finishes (thus not
> +     * removing those filters).
> +     * (And if there are any explicit filters, we should assume the
> +     *  user knows how to use the @replaces option.)
> +     */
> +    unfiltered_bs = bdrv_skip_implicit_filters(bs);

Should this behaviour be documented in the QAPI schema for drive-mirror?

>      aio_context = bdrv_get_aio_context(bs);
>      aio_context_acquire(aio_context);
>  
> @@ -3934,8 +3945,14 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>      }
>  
>      flags = bs->open_flags | BDRV_O_RDWR;
> -    source = backing_bs(bs);
> +    source = bdrv_filtered_cow_bs(unfiltered_bs);
>      if (!source && arg->sync == MIRROR_SYNC_MODE_TOP) {
> +        if (bdrv_filtered_bs(unfiltered_bs)) {
> +            /* @unfiltered_bs is an explicit filter */
> +            error_setg(errp, "Cannot perform sync=top mirror through an "
> +                       "explicitly added filter node on the source");
> +            goto out;
> +        }
>          arg->sync = MIRROR_SYNC_MODE_FULL;
>      }
>      if (arg->sync == MIRROR_SYNC_MODE_NONE) {
> @@ -3954,6 +3971,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>                               " named node of the graph");
>              goto out;
>          }
> +        replaces_node_name = arg->replaces;
> +    } else if (unfiltered_bs != bs) {
> +        replaces_node_name = unfiltered_bs->node_name;
>      }
>  
>      if (arg->mode == NEW_IMAGE_MODE_ABSOLUTE_PATHS) {
> @@ -3973,6 +3993,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>          bdrv_img_create(arg->target, format,
>                          NULL, NULL, NULL, size, flags, false, &local_err);
>      } else {
> +        /* Implicit filters should not appear in the filename */
> +        BlockDriverState *explicit_backing = bdrv_skip_implicit_filters(source);
> +
>          switch (arg->mode) {
>          case NEW_IMAGE_MODE_EXISTING:
>              break;
> @@ -3980,8 +4003,8 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>              /* create new image with backing file */
>              bdrv_refresh_filename(source);
>              bdrv_img_create(arg->target, format,
> -                            source->filename,
> -                            source->drv->format_name,
> +                            explicit_backing->filename,
> +                            explicit_backing->drv->format_name,
>                              NULL, size, flags, false, &local_err);
>              break;
>          default:
> @@ -4017,7 +4040,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>      }
>  
>      blockdev_mirror_common(arg->has_job_id ? arg->job_id : NULL, bs, target_bs,
> -                           arg->has_replaces, arg->replaces, arg->sync,
> +                           !!replaces_node_name, replaces_node_name, arg->sync,
>                             backing_mode, arg->has_speed, arg->speed,
>                             arg->has_granularity, arg->granularity,
>                             arg->has_buf_size, arg->buf_size,
> @@ -4053,7 +4076,7 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
>                           bool has_auto_dismiss, bool auto_dismiss,
>                           Error **errp)
>  {
> -    BlockDriverState *bs;
> +    BlockDriverState *bs, *unfiltered_bs;
>      BlockDriverState *target_bs;
>      AioContext *aio_context;
>      BlockMirrorBackingMode backing_mode = MIRROR_LEAVE_BACKING_CHAIN;
> @@ -4065,6 +4088,16 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
>          return;
>      }
>  
> +    /*
> +     * Same as in qmp_drive_mirror(): We want to run the job from @bs,
> +     * but we want to replace @unfiltered_bs on completion.
> +     */
> +    unfiltered_bs = bdrv_skip_implicit_filters(bs);

Do we? I thought the idea with blockdev-mirror was that the client tells
us the exact node it is interested in, without any magic skipping nodes.

Skipping implicit nodes is a feature for compatibility with legacy
clients, but a client using blockdev-mirror isn't a legacy client.
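
To make the distinction concrete, here is a minimal standalone sketch
of the two policies. It uses toy types, not the QEMU API: Node,
skip_implicit_filters() and node_to_replace() are made-up names for
this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a graph node; NOT the real BlockDriverState. */
typedef struct Node {
    const char *name;
    bool implicit_filter;   /* filter inserted by QEMU, not by the user */
    struct Node *filtered;  /* child this node filters, or NULL */
} Node;

/* Walk down past implicit filters only (hypothetical helper). */
static Node *skip_implicit_filters(Node *n)
{
    while (n && n->implicit_filter && n->filtered) {
        n = n->filtered;
    }
    return n;
}

/*
 * Pick the node to replace on completion.  Legacy clients get the
 * compatibility magic; node-based clients get exactly what they named.
 */
static Node *node_to_replace(Node *bs, bool legacy_client)
{
    return legacy_client ? skip_implicit_filters(bs) : bs;
}
```

With an implicit filter on top of a format node, the legacy path
resolves to the format node, while the node-based path keeps the node
the client passed in.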

> +    if (!has_replaces && unfiltered_bs != bs) {
> +        replaces = unfiltered_bs->node_name;
> +        has_replaces = true;
> +    }
> +
>      target_bs = bdrv_lookup_bs(target, target, errp);
>      if (!target_bs) {
>          return;

Kevin



* Re: [Qemu-devel] [PATCH v6 28/42] stream: Deal with filters
  2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 28/42] stream: " Max Reitz
  2019-08-12 11:55   ` Vladimir Sementsov-Ogievskiy
@ 2019-09-13 14:16   ` Kevin Wolf
  2019-09-16  9:52     ` Max Reitz
  1 sibling, 1 reply; 132+ messages in thread
From: Kevin Wolf @ 2019-09-13 14:16 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 09.08.2019 at 18:13, Max Reitz wrote:
> Because of the recent changes that make the stream job independent of
> the base node and instead track the node above it, we have to split that
> "bottom" node into two cases: The bottom COW node, and the node directly
> above the base node (which may be an R/W filter or the bottom COW node).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  qapi/block-core.json |  4 ++++
>  block/stream.c       | 52 ++++++++++++++++++++++++++++----------------
>  blockdev.c           |  2 +-
>  3 files changed, 38 insertions(+), 20 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 38c4dbd7c3..3c54717870 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2516,6 +2516,10 @@
>  # On successful completion the image file is updated to drop the backing file
>  # and the BLOCK_JOB_COMPLETED event is emitted.
>  #
> +# In case @device is a filter node, block-stream modifies the first non-filter
> +# overlay node below it to point to base's backing node (or NULL if @base was
> +# not specified) instead of modifying @device itself.
> +#
>  # @job-id: identifier for the newly-created block job. If
>  #          omitted, the device name will be used. (Since 2.7)
>  #
> diff --git a/block/stream.c b/block/stream.c
> index 4c8b89884a..bd4a351dae 100644
> --- a/block/stream.c
> +++ b/block/stream.c
> @@ -31,7 +31,8 @@ enum {
>  
>  typedef struct StreamBlockJob {
>      BlockJob common;
> -    BlockDriverState *bottom;
> +    BlockDriverState *bottom_cow_node;
> +    BlockDriverState *above_base;

Confusing naming, especially because in commit you used above_base for
what is bottom_cow_node here. Vladimir already suggested using
base_overlay consistently, so we should do this here too (for
bottom_cow_node). above_base can keep its name because the different
above_base in commit is going to be renamed.

>      BlockdevOnError on_error;
>      char *backing_file_str;
>      bool bs_read_only;
> @@ -54,7 +55,7 @@ static void stream_abort(Job *job)
>  
>      if (s->chain_frozen) {
>          BlockJob *bjob = &s->common;
> -        bdrv_unfreeze_chain(blk_bs(bjob->blk), s->bottom);
> +        bdrv_unfreeze_chain(blk_bs(bjob->blk), s->above_base);
>      }
>  }
>  
> @@ -63,14 +64,15 @@ static int stream_prepare(Job *job)
>      StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
>      BlockJob *bjob = &s->common;
>      BlockDriverState *bs = blk_bs(bjob->blk);
> -    BlockDriverState *base = backing_bs(s->bottom);
> +    BlockDriverState *unfiltered_bs = bdrv_skip_rw_filters(bs);
> +    BlockDriverState *base = bdrv_filtered_bs(s->above_base);
>      Error *local_err = NULL;
>      int ret = 0;
>  
> -    bdrv_unfreeze_chain(bs, s->bottom);
> +    bdrv_unfreeze_chain(bs, s->above_base);
>      s->chain_frozen = false;
>  
> -    if (bs->backing) {
> +    if (bdrv_filtered_cow_child(unfiltered_bs)) {
>          const char *base_id = NULL, *base_fmt = NULL;
>          if (base) {
>              base_id = s->backing_file_str;
> @@ -78,8 +80,8 @@ static int stream_prepare(Job *job)
>                  base_fmt = base->drv->format_name;
>              }
>          }
> -        bdrv_set_backing_hd(bs, base, &local_err);
> -        ret = bdrv_change_backing_file(bs, base_id, base_fmt);
> +        bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
> +        ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt);
>          if (local_err) {
>              error_report_err(local_err);
>              return -EPERM;
> @@ -110,7 +112,8 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
>      StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
>      BlockBackend *blk = s->common.blk;
>      BlockDriverState *bs = blk_bs(blk);
> -    bool enable_cor = !backing_bs(s->bottom);
> +    BlockDriverState *unfiltered_bs = bdrv_skip_rw_filters(bs);
> +    bool enable_cor = !bdrv_filtered_bs(s->above_base);
>      int64_t len;
>      int64_t offset = 0;
>      uint64_t delay_ns = 0;
> @@ -119,7 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
>      int64_t n = 0; /* bytes */
>      void *buf;
>  
> -    if (bs == s->bottom) {
> +    if (unfiltered_bs == s->bottom_cow_node) {
>          /* Nothing to stream */
>          return 0;
>      }
> @@ -154,13 +157,14 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
>  
>          copy = false;
>  
> -        ret = bdrv_is_allocated(bs, offset, STREAM_BUFFER_SIZE, &n);
> +        ret = bdrv_is_allocated(unfiltered_bs, offset, STREAM_BUFFER_SIZE, &n);
>          if (ret == 1) {
>              /* Allocated in the top, no need to copy.  */
>          } else if (ret >= 0) {
>              /* Copy if allocated in the intermediate images.  Limit to the
>               * known-unallocated area [offset, offset+n*BDRV_SECTOR_SIZE).  */
> -            ret = bdrv_is_allocated_above(backing_bs(bs), s->bottom, true,
> +            ret = bdrv_is_allocated_above(bdrv_filtered_cow_bs(unfiltered_bs),
> +                                          s->bottom_cow_node, true,
>                                            offset, n, &n);
>              /* Finish early if end of backing file has been reached */
>              if (ret == 0 && n == 0) {
> @@ -231,9 +235,16 @@ void stream_start(const char *job_id, BlockDriverState *bs,
>      BlockDriverState *iter;
>      bool bs_read_only;
>      int basic_flags = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED;
> -    BlockDriverState *bottom = bdrv_find_overlay(bs, base);
> +    BlockDriverState *bottom_cow_node = bdrv_find_overlay(bs, base);
> +    BlockDriverState *above_base;

Do we need to check for bottom_cow_node == NULL?

I think you could get a bs that is a filter of bottom_cow_node, and then
bdrv_find_overlay() returns NULL and...

> -    if (bdrv_freeze_chain(bs, bottom, errp) < 0) {
> +    /* Find the node directly above @base */
> +    for (above_base = bottom_cow_node;
> +         bdrv_filtered_bs(above_base) != base;
> +         above_base = bdrv_filtered_bs(above_base))
> +    {}

...bottom_cow_node == NULL turns this into an endless loop.
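
Something along these lines would make the walk terminate in every
case. This is a standalone toy model, not actual block layer code:
Node and find_above_base() are invented for the example, and the
caller would still have to handle the NULL result (presumably as
"nothing to stream" or as an error).

```c
#include <assert.h>
#include <stddef.h>

/* Toy chain node; NOT the real BlockDriverState. */
typedef struct Node {
    struct Node *filtered;  /* next node down the chain, or NULL */
} Node;

/*
 * Return the node directly above @base, starting the walk at @start.
 * Returns NULL if @start is NULL or if @base cannot be reached from
 * @start, so the loop cannot run forever.  With @base == NULL this
 * returns the bottom node of the chain.
 */
static Node *find_above_base(Node *start, Node *base)
{
    Node *n = start;

    while (n && n->filtered != base) {
        n = n->filtered;
    }
    return n;
}
```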

> +    if (bdrv_freeze_chain(bs, above_base, errp) < 0) {
>          return;
>      }

Hm... This feels odd. There are two places where it would make obvious
sense to stop freezing the chain: At base, like we originally did; or at
base_overlay, like we (intend to) do since commit c624b015, because we
say that we don't actually mind if the user replaces the base image. I
don't see how stopping at the first filter above base makes sense.

So should this use bottom_cow_node/base_overlay instead of above_base?

You couldn't use StreamBlockJob.above_base any more then because it
could change, but you also don't really need it anywhere. It's only used
for unfreezing (which would change) and for finding the base (you can
still find bdrv_backing_chain_next(s->base_overlay)). I guess this would
even be a code simplification.
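
Roughly what I have in mind, as a standalone sketch with toy types.
Node, ToyStreamJob, skip_filters() and toy_job_base() are invented
here; the sketch assumes that finding the base means following the
overlay's COW child and then skipping any filters, which is how I
read bdrv_backing_chain_next() in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy chain node; NOT the real BlockDriverState. */
typedef struct Node {
    bool is_filter;
    struct Node *child;  /* filtered child for filters, backing otherwise */
} Node;

/* Skip down to the first non-filter node (hypothetical helper). */
static Node *skip_filters(Node *n)
{
    while (n && n->is_filter) {
        n = n->child;
    }
    return n;
}

/* Toy job state: only the base overlay is stored, never the base. */
typedef struct ToyStreamJob {
    Node *base_overlay;
} ToyStreamJob;

/*
 * Recompute the base whenever it is needed, so nodes below the
 * overlay may be replaced while the job is running.
 */
static Node *toy_job_base(const ToyStreamJob *job)
{
    Node *overlay = skip_filters(job->base_overlay);

    return overlay ? skip_filters(overlay->child) : NULL;
}
```

The job then never caches a pointer to anything below base_overlay,
so there is nothing that can go stale when the user swaps the base.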

> @@ -261,16 +272,19 @@ void stream_start(const char *job_id, BlockDriverState *bs,
>       * disappear from the chain after this operation. The streaming job reads
>       * every block only once, assuming that it doesn't change, so forbid writes
>       * and resizes. Reassign the base node pointer because the backing BS of the
> -     * bottom node might change after the call to bdrv_reopen_set_read_only()
> -     * due to parallel block jobs running.
> +     * above_base node might change after the call to
> +     * bdrv_reopen_set_read_only() due to parallel block jobs running.
>       */
> -    base = backing_bs(bottom);
> -    for (iter = backing_bs(bs); iter && iter != base; iter = backing_bs(iter)) {
> +    base = bdrv_filtered_bs(above_base);

We just calculated above_base such that it's the parent of base. Why
would base not already have the value we're assigning it again here?

> +    for (iter = bdrv_filtered_bs(bs); iter && iter != base;
> +         iter = bdrv_filtered_bs(iter))
> +    {
>          block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
>                             basic_flags, &error_abort);
>      }
>  
> -    s->bottom = bottom;
> +    s->bottom_cow_node = bottom_cow_node;
> +    s->above_base = above_base;
>      s->backing_file_str = g_strdup(backing_file_str);
>      s->bs_read_only = bs_read_only;
>      s->chain_frozen = true;
> @@ -284,5 +298,5 @@ fail:
>      if (bs_read_only) {
>          bdrv_reopen_set_read_only(bs, true, NULL);
>      }
> -    bdrv_unfreeze_chain(bs, bottom);
> +    bdrv_unfreeze_chain(bs, above_base);
>  }

Kevin



* Re: [Qemu-devel] [PATCH v6 28/42] stream: Deal with filters
  2019-09-13 14:16   ` Kevin Wolf
@ 2019-09-16  9:52     ` Max Reitz
  2019-09-16 14:47       ` Kevin Wolf
  0 siblings, 1 reply; 132+ messages in thread
From: Max Reitz @ 2019-09-16  9:52 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block


On 13.09.19 16:16, Kevin Wolf wrote:
> On 09.08.2019 at 18:13, Max Reitz wrote:
>> Because of the recent changes that make the stream job independent of
>> the base node and instead track the node above it, we have to split that
>> "bottom" node into two cases: The bottom COW node, and the node directly
>> above the base node (which may be an R/W filter or the bottom COW node).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  qapi/block-core.json |  4 ++++
>>  block/stream.c       | 52 ++++++++++++++++++++++++++++----------------
>>  blockdev.c           |  2 +-
>>  3 files changed, 38 insertions(+), 20 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 38c4dbd7c3..3c54717870 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2516,6 +2516,10 @@
>>  # On successful completion the image file is updated to drop the backing file
>>  # and the BLOCK_JOB_COMPLETED event is emitted.
>>  #
>> +# In case @device is a filter node, block-stream modifies the first non-filter
>> +# overlay node below it to point to base's backing node (or NULL if @base was
>> +# not specified) instead of modifying @device itself.
>> +#
>>  # @job-id: identifier for the newly-created block job. If
>>  #          omitted, the device name will be used. (Since 2.7)
>>  #
>> diff --git a/block/stream.c b/block/stream.c
>> index 4c8b89884a..bd4a351dae 100644
>> --- a/block/stream.c
>> +++ b/block/stream.c
>> @@ -31,7 +31,8 @@ enum {
>>  
>>  typedef struct StreamBlockJob {
>>      BlockJob common;
>> -    BlockDriverState *bottom;
>> +    BlockDriverState *bottom_cow_node;
>> +    BlockDriverState *above_base;
> 
> Confusing naming, especially because in commit you used above_base for
> what is bottom_cow_node here. Vladimir already suggested using
> base_overlay consistently, so we should do this here too (for
> bottom_cow_node). above_base can keep its name because the different
> above_base in commit is going to be renamed.

Sure.

>>      BlockdevOnError on_error;
>>      char *backing_file_str;
>>      bool bs_read_only;
>> @@ -54,7 +55,7 @@ static void stream_abort(Job *job)
>>  
>>      if (s->chain_frozen) {
>>          BlockJob *bjob = &s->common;
>> -        bdrv_unfreeze_chain(blk_bs(bjob->blk), s->bottom);
>> +        bdrv_unfreeze_chain(blk_bs(bjob->blk), s->above_base);
>>      }
>>  }
>>  
>> @@ -63,14 +64,15 @@ static int stream_prepare(Job *job)
>>      StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
>>      BlockJob *bjob = &s->common;
>>      BlockDriverState *bs = blk_bs(bjob->blk);
>> -    BlockDriverState *base = backing_bs(s->bottom);
>> +    BlockDriverState *unfiltered_bs = bdrv_skip_rw_filters(bs);
>> +    BlockDriverState *base = bdrv_filtered_bs(s->above_base);
>>      Error *local_err = NULL;
>>      int ret = 0;
>>  
>> -    bdrv_unfreeze_chain(bs, s->bottom);
>> +    bdrv_unfreeze_chain(bs, s->above_base);
>>      s->chain_frozen = false;
>>  
>> -    if (bs->backing) {
>> +    if (bdrv_filtered_cow_child(unfiltered_bs)) {
>>          const char *base_id = NULL, *base_fmt = NULL;
>>          if (base) {
>>              base_id = s->backing_file_str;
>> @@ -78,8 +80,8 @@ static int stream_prepare(Job *job)
>>                  base_fmt = base->drv->format_name;
>>              }
>>          }
>> -        bdrv_set_backing_hd(bs, base, &local_err);
>> -        ret = bdrv_change_backing_file(bs, base_id, base_fmt);
>> +        bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
>> +        ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt);
>>          if (local_err) {
>>              error_report_err(local_err);
>>              return -EPERM;
>> @@ -110,7 +112,8 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
>>      StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
>>      BlockBackend *blk = s->common.blk;
>>      BlockDriverState *bs = blk_bs(blk);
>> -    bool enable_cor = !backing_bs(s->bottom);
>> +    BlockDriverState *unfiltered_bs = bdrv_skip_rw_filters(bs);
>> +    bool enable_cor = !bdrv_filtered_bs(s->above_base);
>>      int64_t len;
>>      int64_t offset = 0;
>>      uint64_t delay_ns = 0;
>> @@ -119,7 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
>>      int64_t n = 0; /* bytes */
>>      void *buf;
>>  
>> -    if (bs == s->bottom) {
>> +    if (unfiltered_bs == s->bottom_cow_node) {
>>          /* Nothing to stream */
>>          return 0;
>>      }
>> @@ -154,13 +157,14 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
>>  
>>          copy = false;
>>  
>> -        ret = bdrv_is_allocated(bs, offset, STREAM_BUFFER_SIZE, &n);
>> +        ret = bdrv_is_allocated(unfiltered_bs, offset, STREAM_BUFFER_SIZE, &n);
>>          if (ret == 1) {
>>              /* Allocated in the top, no need to copy.  */
>>          } else if (ret >= 0) {
>>              /* Copy if allocated in the intermediate images.  Limit to the
>>               * known-unallocated area [offset, offset+n*BDRV_SECTOR_SIZE).  */
>> -            ret = bdrv_is_allocated_above(backing_bs(bs), s->bottom, true,
>> +            ret = bdrv_is_allocated_above(bdrv_filtered_cow_bs(unfiltered_bs),
>> +                                          s->bottom_cow_node, true,
>>                                            offset, n, &n);
>>              /* Finish early if end of backing file has been reached */
>>              if (ret == 0 && n == 0) {
>> @@ -231,9 +235,16 @@ void stream_start(const char *job_id, BlockDriverState *bs,
>>      BlockDriverState *iter;
>>      bool bs_read_only;
>>      int basic_flags = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED;
>> -    BlockDriverState *bottom = bdrv_find_overlay(bs, base);
>> +    BlockDriverState *bottom_cow_node = bdrv_find_overlay(bs, base);
>> +    BlockDriverState *above_base;
> 
> Do we need to check for bottom_cow_node == NULL?
> 
> I think you could get a bs that is a filter of bottom_cow_node, and then
> bdrv_find_overlay() returns NULL and...

Ah, yes.  It isn’t even about the infinite loop, it’s just a case of
“Nothing to stream” (if @bs is just a filter chain away from @base).

Also, I just noticed that bdrv_find_overlay() in this version of the
series won’t work if the BDS passed to it (@base here) is a filter, so
that’s something else to be fixed.
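
A check for the "nothing to stream" case could look roughly like the
following. This is a toy sketch only: Node, skip_filters() and
nothing_to_stream() are invented names, and comparing the two nodes
after skipping filters is just one possible way to do it (it also
copes with @base itself being a filter).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy chain node; NOT the real BlockDriverState. */
typedef struct Node {
    bool is_filter;
    struct Node *child;  /* filtered child for filters, backing otherwise */
} Node;

/* Skip down to the first non-filter node (hypothetical helper). */
static Node *skip_filters(Node *n)
{
    while (n && n->is_filter) {
        n = n->child;
    }
    return n;
}

/*
 * True if @top is only a chain of filters away from @base, i.e. there
 * is no COW node in between whose data could be streamed.
 */
static bool nothing_to_stream(Node *top, Node *base)
{
    return skip_filters(top) == skip_filters(base);
}
```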

>> -    if (bdrv_freeze_chain(bs, bottom, errp) < 0) {
>> +    /* Find the node directly above @base */
>> +    for (above_base = bottom_cow_node;
>> +         bdrv_filtered_bs(above_base) != base;
>> +         above_base = bdrv_filtered_bs(above_base))
>> +    {}
>
> ...bottom_cow_node == NULL turns this into an endless loop.
> 
>> +    if (bdrv_freeze_chain(bs, above_base, errp) < 0) {
>>          return;
>>      }
> 
> Hm... This feels odd. There are two places where it would make obvious
> sense to stop freezing the chain: At base, like we originally did; or at
> base_overlay, like we (intend to) do since commit c624b015, because we
> say that we don't actually mind if the user replaces the base image. I
> don't see how stopping at the first filter above base makes sense.
> 
> So should this use bottom_cow_node/base_overlay instead of above_base?

I suppose I thought “Better be safe than sorry”.

> You couldn't use StreamBlockJob.above_base any more then because it
> could change, but you also don't really need it anywhere. It's only used
> for unfreezing (which would change) and for finding the base (you can
> still find bdrv_backing_chain_next(s->base_overlay)). I guess this would
> even be a code simplification.

Great, I’ll see to it.

>> @@ -261,16 +272,19 @@ void stream_start(const char *job_id, BlockDriverState *bs,
>>       * disappear from the chain after this operation. The streaming job reads
>>       * every block only once, assuming that it doesn't change, so forbid writes
>>       * and resizes. Reassign the base node pointer because the backing BS of the
>> -     * bottom node might change after the call to bdrv_reopen_set_read_only()
>> -     * due to parallel block jobs running.
>> +     * above_base node might change after the call to
>> +     * bdrv_reopen_set_read_only() due to parallel block jobs running.
>>       */
>> -    base = backing_bs(bottom);
>> -    for (iter = backing_bs(bs); iter && iter != base; iter = backing_bs(iter)) {
>> +    base = bdrv_filtered_bs(above_base);
> 
> We just calculated above_base such that it's the parent of base. Why
> would base not already have the value we're assigning it again here?

That’s no change to existing code, whose reasoning is explained in the
comment above: bdrv_reopen_set_read_only() can yield, which might lead
to children of the bottom node changing.

If you feel like either that’s superfluous, or that if something like
that were to happen we’d have much bigger problems, be my guest to drop
both.

But in this series I’d rather just not change it.

Max




* Re: [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters
  2019-09-13 12:55   ` Kevin Wolf
@ 2019-09-16 10:26     ` Max Reitz
  0 siblings, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-09-16 10:26 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block


On 13.09.19 14:55, Kevin Wolf wrote:
> On 09.08.2019 at 18:13, Max Reitz wrote:
>> This includes some permission limiting (for example, we only need to
>> take the RESIZE permission for active commits where the base is smaller
>> than the top).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  block/mirror.c | 117 ++++++++++++++++++++++++++++++++++++++-----------
>>  blockdev.c     |  47 +++++++++++++++++---
>>  2 files changed, 131 insertions(+), 33 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 54bafdf176..6ddbfb9708 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -42,6 +42,7 @@ typedef struct MirrorBlockJob {
>>      BlockBackend *target;
>>      BlockDriverState *mirror_top_bs;
>>      BlockDriverState *base;
>> +    BlockDriverState *base_overlay;
>>  
>>      /* The name of the graph node to replace */
>>      char *replaces;
>> @@ -665,8 +666,10 @@ static int mirror_exit_common(Job *job)
>>                               &error_abort);
>>      if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
>>          BlockDriverState *backing = s->is_none_mode ? src : s->base;
>> -        if (backing_bs(target_bs) != backing) {
>> -            bdrv_set_backing_hd(target_bs, backing, &local_err);
>> +        BlockDriverState *unfiltered_target = bdrv_skip_rw_filters(target_bs);
>> +
>> +        if (bdrv_filtered_cow_bs(unfiltered_target) != backing) {
>> +            bdrv_set_backing_hd(unfiltered_target, backing, &local_err);
>>              if (local_err) {
>>                  error_report_err(local_err);
>>                  ret = -EPERM;
>> @@ -715,7 +718,7 @@ static int mirror_exit_common(Job *job)
>>       * valid.
>>       */
>>      block_job_remove_all_bdrv(bjob);
>> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
>> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>>  
>>      /* We just changed the BDS the job BB refers to (with either or both of the
>>       * bdrv_replace_node() calls), so switch the BB back so the cleanup does
>> @@ -812,7 +815,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
>>              return 0;
>>          }
>>  
>> -        ret = bdrv_is_allocated_above(bs, base, false, offset, bytes, &count);
>> +        ret = bdrv_is_allocated_above(bs, s->base_overlay, true, offset, bytes,
>> +                                      &count);
>>          if (ret < 0) {
>>              return ret;
>>          }
>> @@ -908,7 +912,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
>>      } else {
>>          s->target_cluster_size = BDRV_SECTOR_SIZE;
>>      }
>> -    if (backing_filename[0] && !target_bs->backing &&
>> +    if (backing_filename[0] && !bdrv_backing_chain_next(target_bs) &&
>>          s->granularity < s->target_cluster_size) {
>>          s->buf_size = MAX(s->buf_size, s->target_cluster_size);
>>          s->cow_bitmap = bitmap_new(length);
>> @@ -1088,8 +1092,9 @@ static void mirror_complete(Job *job, Error **errp)
>>      if (s->backing_mode == MIRROR_OPEN_BACKING_CHAIN) {
>>          int ret;
>>  
>> -        assert(!target->backing);
>> -        ret = bdrv_open_backing_file(target, NULL, "backing", errp);
>> +        assert(!bdrv_backing_chain_next(target));
>> +        ret = bdrv_open_backing_file(bdrv_skip_rw_filters(target), NULL,
>> +                                     "backing", errp);
>>          if (ret < 0) {
>>              return;
>>          }
>> @@ -1531,8 +1536,8 @@ static BlockJob *mirror_start_job(
>>      MirrorBlockJob *s;
>>      MirrorBDSOpaque *bs_opaque;
>>      BlockDriverState *mirror_top_bs;
>> -    bool target_graph_mod;
>>      bool target_is_backing;
>> +    uint64_t target_perms, target_shared_perms;
>>      Error *local_err = NULL;
>>      int ret;
>>  
>> @@ -1551,7 +1556,7 @@ static BlockJob *mirror_start_job(
>>          buf_size = DEFAULT_MIRROR_BUF_SIZE;
>>      }
>>  
>> -    if (bs == target) {
>> +    if (bdrv_skip_rw_filters(bs) == bdrv_skip_rw_filters(target)) {
>>          error_setg(errp, "Can't mirror node into itself");
>>          return NULL;
>>      }
>> @@ -1615,15 +1620,50 @@ static BlockJob *mirror_start_job(
>>       * In the case of active commit, things look a bit different, though,
>>       * because the target is an already populated backing file in active use.
>>       * We can allow anything except resize there.*/
>> +
>> +    target_perms = BLK_PERM_WRITE;
>> +    target_shared_perms = BLK_PERM_WRITE_UNCHANGED;
>> +
>>      target_is_backing = bdrv_chain_contains(bs, target);
>> -    target_graph_mod = (backing_mode != MIRROR_LEAVE_BACKING_CHAIN);
>> +    if (target_is_backing) {
>> +        int64_t bs_size, target_size;
>> +        bs_size = bdrv_getlength(bs);
>> +        if (bs_size < 0) {
>> +            error_setg_errno(errp, -bs_size,
>> +                             "Could not inquire top image size");
>> +            goto fail;
>> +        }
>> +
>> +        target_size = bdrv_getlength(target);
>> +        if (target_size < 0) {
>> +            error_setg_errno(errp, -target_size,
>> +                             "Could not inquire base image size");
>> +            goto fail;
>> +        }
>> +
>> +        if (target_size < bs_size) {
>> +            target_perms |= BLK_PERM_RESIZE;
>> +        }
>> +
>> +        target_shared_perms |= BLK_PERM_CONSISTENT_READ
>> +                            |  BLK_PERM_WRITE
>> +                            |  BLK_PERM_GRAPH_MOD;
>> +    } else if (bdrv_chain_contains(bs, bdrv_skip_rw_filters(target))) {
>> +        /*
>> +         * We may want to allow this in the future, but it would
>> +         * require taking some extra care.
>> +         */
>> +        error_setg(errp, "Cannot mirror to a filter on top of a node in the "
>> +                   "source's backing chain");
>> +        goto fail;
>> +    }
>> +
>> +    if (backing_mode != MIRROR_LEAVE_BACKING_CHAIN) {
>> +        target_perms |= BLK_PERM_GRAPH_MOD;
>> +    }
> 
> This is getting absurd. We keep moving GRAPH_MOD around, but still
> nobody knows what it's actually supposed to mean. Maybe it would be
> better to just remove it finally?

I suppose even if we ever needed it, we no longer do now with .freeze.

> Of course, not a reason to stop this patch, after all it's moving the
> nonsensical piece of code correctly...
> 
>>      s->target = blk_new(s->common.job.aio_context,
>> -                        BLK_PERM_WRITE | BLK_PERM_RESIZE |
>> -                        (target_graph_mod ? BLK_PERM_GRAPH_MOD : 0),
>> -                        BLK_PERM_WRITE_UNCHANGED |
>> -                        (target_is_backing ? BLK_PERM_CONSISTENT_READ |
>> -                                             BLK_PERM_WRITE |
>> -                                             BLK_PERM_GRAPH_MOD : 0));
>> +                        target_perms, target_shared_perms);
>>      ret = blk_insert_bs(s->target, target, errp);
>>      if (ret < 0) {
>>          goto fail;
>> @@ -1647,6 +1687,7 @@ static BlockJob *mirror_start_job(
>>      s->backing_mode = backing_mode;
>>      s->copy_mode = copy_mode;
>>      s->base = base;
>> +    s->base_overlay = bdrv_find_overlay(bs, base);
>>      s->granularity = granularity;
>>      s->buf_size = ROUND_UP(buf_size, granularity);
>>      s->unmap = unmap;
>> @@ -1693,15 +1734,39 @@ static BlockJob *mirror_start_job(
>>      /* In commit_active_start() all intermediate nodes disappear, so
>>       * any jobs in them must be blocked */
>>      if (target_is_backing) {
>> -        BlockDriverState *iter;
>> -        for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) {
>> -            /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
>> -             * ourselves at s->base (if writes are blocked for a node, they are
>> -             * also blocked for its backing file). The other options would be a
>> -             * second filter driver above s->base (== target). */
>> +        BlockDriverState *iter, *filtered_target;
>> +        uint64_t iter_shared_perms;
>> +
>> +        /*
>> +         * The topmost node with
>> +         * bdrv_skip_rw_filters(filtered_target) == bdrv_skip_rw_filters(target)
>> +         */
>> +        filtered_target = bdrv_filtered_cow_bs(bdrv_find_overlay(bs, target));
>> +
>> +        assert(bdrv_skip_rw_filters(filtered_target) ==
>> +               bdrv_skip_rw_filters(target));
>> +
>> +        /*
>> +         * XXX BLK_PERM_WRITE needs to be allowed so we don't block
>> +         * ourselves at s->base (if writes are blocked for a node, they are
>> +         * also blocked for its backing file). The other options would be a
>> +         * second filter driver above s->base (== target).
>> +         */
>> +        iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
>> +
>> +        for (iter = bdrv_filtered_bs(bs); iter != target;
>> +             iter = bdrv_filtered_bs(iter))
>> +        {
>> +            if (iter == filtered_target) {
>> +                /*
>> +                 * From here on, all nodes are filters on the base.
>> +                 * This allows us to share BLK_PERM_CONSISTENT_READ.
>> +                 */
>> +                iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
>> +            }
>> +
>>              ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
>> -                                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
>> -                                     errp);
>> +                                     iter_shared_perms, errp);
>>              if (ret < 0) {
>>                  goto fail;
>>              }
>> @@ -1737,7 +1802,7 @@ fail:
>>      bs_opaque->stop = true;
>>      bdrv_child_refresh_perms(mirror_top_bs, mirror_top_bs->backing,
>>                               &error_abort);
>> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), &error_abort);
>> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
>>  
>>      bdrv_unref(mirror_top_bs);
>>  
>> @@ -1764,7 +1829,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
>>          return;
>>      }
>>      is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
>> -    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
>> +    base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
>>      mirror_start_job(job_id, bs, creation_flags, target, replaces,
>>                       speed, granularity, buf_size, backing_mode,
>>                       on_source_error, on_target_error, unmap, NULL, NULL,
>> diff --git a/blockdev.c b/blockdev.c
>> index c540802127..c451f553f7 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -3851,7 +3851,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>          return;
>>      }
>>  
>> -    if (!bs->backing && sync == MIRROR_SYNC_MODE_TOP) {
>> +    if (!bdrv_backing_chain_next(bs) && sync == MIRROR_SYNC_MODE_TOP) {
>>          sync = MIRROR_SYNC_MODE_FULL;
>>      }
>>  
>> @@ -3900,7 +3900,7 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
>>  
>>  void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>  {
>> -    BlockDriverState *bs;
>> +    BlockDriverState *bs, *unfiltered_bs;
>>      BlockDriverState *source, *target_bs;
>>      AioContext *aio_context;
>>      BlockMirrorBackingMode backing_mode;
>> @@ -3909,6 +3909,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>      int flags;
>>      int64_t size;
>>      const char *format = arg->format;
>> +    const char *replaces_node_name = NULL;
>>      int ret;
>>  
>>      bs = qmp_get_root_bs(arg->device, errp);
>> @@ -3921,6 +3922,16 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>          return;
>>      }
>>  
>> +    /*
>> +     * If the user has not instructed us otherwise, we should let the
>> +     * block job run from @bs (thus taking into account all filters on
>> +     * it) but replace @unfiltered_bs when it finishes (thus not
>> +     * removing those filters).
>> +     * (And if there are any explicit filters, we should assume the
>> +     *  user knows how to use the @replaces option.)
>> +     */
>> +    unfiltered_bs = bdrv_skip_implicit_filters(bs);
> 
> Should this behaviour be documented in the QAPI schema for drive-mirror?

Hm.  I’d document it for @replaces.  But what would I write?  “By
default, @device is replaced, though implicitly created nodes on it are
kept”?

I feel bad about referencing implicit nodes, especially if our plan is
to remove them anyway.  OTOH, if I dropped special handling of implicit
nodes here, I should probably drop it in the whole series.  And I don’t
feel like that’s right as long as we haven’t actually removed implicit
nodes.

So I suppose I’ll go for the documentation addendum? :-/
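
For the documentation, the default could be described roughly as in the
sketch below.  It uses toy types instead of the real BlockDriverState,
and every toy_* name is made up for illustration; it only mirrors the
selection logic from the hunks in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a block node; not the real BlockDriverState. */
typedef struct ToyBDS {
    const char *node_name;
    bool implicit;              /* node was created implicitly */
    struct ToyBDS *filtered;    /* the node this one filters, if any */
} ToyBDS;

/* Walk down past implicitly created filter nodes. */
static ToyBDS *toy_skip_implicit_filters(ToyBDS *bs)
{
    while (bs && bs->implicit && bs->filtered) {
        bs = bs->filtered;
    }
    return bs;
}

/*
 * Choose the node name to replace on completion:
 * - an explicit @replaces from the user always wins;
 * - otherwise, if implicit filters sit on top of @bs, replace the
 *   first node below them so that those filters are kept;
 * - otherwise return NULL, i.e. keep the job's plain default.
 */
static const char *toy_pick_replaces(ToyBDS *bs, const char *user_replaces)
{
    ToyBDS *unfiltered;

    assert(bs);
    unfiltered = toy_skip_implicit_filters(bs);

    if (user_replaces) {
        return user_replaces;
    }
    if (unfiltered != bs) {
        return unfiltered->node_name;
    }
    return NULL;
}
```

With an implicit node on top of "img", the function returns "img"; for
a bare "img" it returns NULL.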

>>      aio_context = bdrv_get_aio_context(bs);
>>      aio_context_acquire(aio_context);
>>  
>> @@ -3934,8 +3945,14 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>      }
>>  
>>      flags = bs->open_flags | BDRV_O_RDWR;
>> -    source = backing_bs(bs);
>> +    source = bdrv_filtered_cow_bs(unfiltered_bs);
>>      if (!source && arg->sync == MIRROR_SYNC_MODE_TOP) {
>> +        if (bdrv_filtered_bs(unfiltered_bs)) {
>> +            /* @unfiltered_bs is an explicit filter */
>> +            error_setg(errp, "Cannot perform sync=top mirror through an "
>> +                       "explicitly added filter node on the source");
>> +            goto out;
>> +        }
>>          arg->sync = MIRROR_SYNC_MODE_FULL;
>>      }
>>      if (arg->sync == MIRROR_SYNC_MODE_NONE) {
>> @@ -3954,6 +3971,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>                               " named node of the graph");
>>              goto out;
>>          }
>> +        replaces_node_name = arg->replaces;
>> +    } else if (unfiltered_bs != bs) {
>> +        replaces_node_name = unfiltered_bs->node_name;
>>      }
>>  
>>      if (arg->mode == NEW_IMAGE_MODE_ABSOLUTE_PATHS) {
>> @@ -3973,6 +3993,9 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>          bdrv_img_create(arg->target, format,
>>                          NULL, NULL, NULL, size, flags, false, &local_err);
>>      } else {
>> +        /* Implicit filters should not appear in the filename */
>> +        BlockDriverState *explicit_backing = bdrv_skip_implicit_filters(source);
>> +
>>          switch (arg->mode) {
>>          case NEW_IMAGE_MODE_EXISTING:
>>              break;
>> @@ -3980,8 +4003,8 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>              /* create new image with backing file */
>>              bdrv_refresh_filename(source);
>>              bdrv_img_create(arg->target, format,
>> -                            source->filename,
>> -                            source->drv->format_name,
>> +                            explicit_backing->filename,
>> +                            explicit_backing->drv->format_name,
>>                              NULL, size, flags, false, &local_err);
>>              break;
>>          default:
>> @@ -4017,7 +4040,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>      }
>>  
>>      blockdev_mirror_common(arg->has_job_id ? arg->job_id : NULL, bs, target_bs,
>> -                           arg->has_replaces, arg->replaces, arg->sync,
>> +                           !!replaces_node_name, replaces_node_name, arg->sync,
>>                             backing_mode, arg->has_speed, arg->speed,
>>                             arg->has_granularity, arg->granularity,
>>                             arg->has_buf_size, arg->buf_size,
>> @@ -4053,7 +4076,7 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
>>                           bool has_auto_dismiss, bool auto_dismiss,
>>                           Error **errp)
>>  {
>> -    BlockDriverState *bs;
>> +    BlockDriverState *bs, *unfiltered_bs;
>>      BlockDriverState *target_bs;
>>      AioContext *aio_context;
>>      BlockMirrorBackingMode backing_mode = MIRROR_LEAVE_BACKING_CHAIN;
>> @@ -4065,6 +4088,16 @@ void qmp_blockdev_mirror(bool has_job_id, const char *job_id,
>>          return;
>>      }
>>  
>> +    /*
>> +     * Same as in qmp_drive_mirror(): We want to run the job from @bs,
>> +     * but we want to replace @unfiltered_bs on completion.
>> +     */
>> +    unfiltered_bs = bdrv_skip_implicit_filters(bs);
> 
> Do we? I thought the idea with blockdev-mirror was that the client tells
> us the exact node it is interested in, without any magic skipping nodes.
> 
> Skipping implicit nodes is a feature for compatibility with legacy
> clients, but a client using blockdev-mirror isn't a legacy client.

And I thought legacy clients didn’t let implicit nodes be created.

We could return an error if @bs is an implicit node.  But I don’t think
that would help anyone, and it’d be more complicated in my opinion.

Again, the best thing (IMO) is to remove the concept of implicit nodes
altogether and then we can drop this piece of code anyway.

Max

>> +    if (!has_replaces && unfiltered_bs != bs) {
>> +        replaces = unfiltered_bs->node_name;
>> +        has_replaces = true;
>> +    }
>> +
>>      target_bs = bdrv_lookup_bs(target, target, errp);
>>      if (!target_bs) {
>>          return;
> 
> Kevin
> 




* Re: [Qemu-devel] [PATCH v6 28/42] stream: Deal with filters
  2019-09-16  9:52     ` Max Reitz
@ 2019-09-16 14:47       ` Kevin Wolf
  0 siblings, 0 replies; 132+ messages in thread
From: Kevin Wolf @ 2019-09-16 14:47 UTC (permalink / raw)
  To: Max Reitz; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 16.09.2019 at 11:52, Max Reitz wrote:
> On 13.09.19 16:16, Kevin Wolf wrote:
> > On 09.08.2019 at 18:13, Max Reitz wrote:
> >> @@ -261,16 +272,19 @@ void stream_start(const char *job_id, BlockDriverState *bs,
> >>       * disappear from the chain after this operation. The streaming job reads
> >>       * every block only once, assuming that it doesn't change, so forbid writes
> >>       * and resizes. Reassign the base node pointer because the backing BS of the
> >> -     * bottom node might change after the call to bdrv_reopen_set_read_only()
> >> -     * due to parallel block jobs running.
> >> +     * above_base node might change after the call to
> >> +     * bdrv_reopen_set_read_only() due to parallel block jobs running.
> >>       */
> >> -    base = backing_bs(bottom);
> >> -    for (iter = backing_bs(bs); iter && iter != base; iter = backing_bs(iter)) {
> >> +    base = bdrv_filtered_bs(above_base);
> > 
> > We just calculated above_base such that it's the parent of base. Why
> > would base not already have the value we're assigning it again here?
> 
> That’s no change to existing code, whose reasoning is explained in the
> comment above: bdrv_reopen_set_read_only() can yield, which might lead
> to children of the bottom node changing.
> 
> If you feel like either that’s superfluous, or that if something like
> that were to happen we’d have much bigger problems, be my guest to drop
> both.
> 
> But in this series I’d rather just not change it.

Ah, you mean comments are there to be read?

But actually, I think iterating down to base is too much anyway. The
reasoning in the comment for block_job_add_bdrv() is that the nodes will
be dropped at the end. But base with all of its filters will be kept
after this patch.

So I think the for loop should stop after bs->base_overlay. And then
concurrently changing links aren't even a problem any more because
that's exactly the place up to which we've frozen the chain.
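
Roughly like the sketch below.  It uses a toy node type, not the real
block layer structures, and the toy_* names are invented for this
example; the point is only where the loop terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a node in the chain; not the real BlockDriverState. */
typedef struct ToyNode {
    const char *name;
    struct ToyNode *filtered;   /* next node down the chain */
    int blocked;                /* set when the job has claimed the node */
} ToyNode;

/*
 * Claim only the nodes that will disappear: everything below @top down
 * to and including @base_overlay.  Filters on top of base, and base
 * itself, are left alone.  Returns the number of nodes claimed.
 */
static int toy_block_intermediate_nodes(ToyNode *top, ToyNode *base_overlay)
{
    ToyNode *iter;
    int count = 0;

    assert(top && base_overlay);

    if (top == base_overlay) {
        /* No intermediate nodes at all */
        return 0;
    }

    for (iter = top->filtered; iter; iter = iter->filtered) {
        iter->blocked = 1;
        count++;
        if (iter == base_overlay) {
            break;
        }
    }
    return count;
}
```

For a chain top -> mid -> overlay -> filter -> base, this claims mid
and overlay only.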

Kevin


* Re: [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback
  2019-09-11 11:00                 ` Max Reitz
  2019-09-12 10:34                   ` Kevin Wolf
@ 2019-11-14 13:11                   ` Max Reitz
  1 sibling, 0 replies; 132+ messages in thread
From: Max Reitz @ 2019-11-14 13:11 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 11.09.19 13:00, Max Reitz wrote:
> On 11.09.19 12:31, Kevin Wolf wrote:
>> On 11.09.2019 at 12:00, Max Reitz wrote:
>>> On 11.09.19 10:27, Kevin Wolf wrote:
>>>> On 11.09.2019 at 09:37, Max Reitz wrote:
>>>>> On 11.09.19 08:55, Kevin Wolf wrote:
>>>>>> Well, by default the primary child, which should cover like 90% of the
>>>>>> drivers?
>>>>>
>>>>> Hm, yes.
>>>>>
>>>>> But I still think that the drivers that do not want to count every
>>>>> single non-COW child are the exception.
>>>>
>>>> They are, but drivers that want to count more than their primary node
>>>> are exceptions, too. And I think you're more likely to remember adding
>>>> the callback when you want to have a certain feature, not when you don't
>>>> want to have it.
>>>>
>>>> I really think we're likely to forget adding the callback where we need
>>>> to disable the feature.
>>>
>>> Well, I mean, we did forget adding it for qcow2.
>>
>> I'm afraid I have to agree. So the conclusion is that we won't get it
>> right anyway?
>>
>>>> I can see two options that should address both of our views:
>>>>
>>>> 1. Just don't have a fallback at all, make the callback mandatory and
>>>>    provide implementations in block.c that can be referred to in
>>>>    BlockDriver. Not specifying the callback causes an assertion failure,
>>>>    so we'd hopefully notice it quite early (assuming that we run either
>>>>    'qemu-img info' or 'query-block' on a configuration with the block
>>>>    driver, but I think that's fairly safe to assume).
>>>
>>> Hm.  Seems a bit much, but if we can’t agree on what’s a good general
>>> implementation that works for everything, this is probably the only
>>> thing that would actually keep us from forgetting to add special cases.
>>>
>>> Though I actually don’t know.  I’d probably add two globally available
>>> helpers, one that returns the sum of everything but the backing node,
>>> and one that just returns the primary node.
>>
>> Yes, I think this is the same as I meant by "provide implementations in
>> block.c".
>>
>>> Now if I were to make qcow2 use the primary node helper function, would
>>> we have remembered changing it once we added a data file?
>>>
>>> Hmm.  Maybe not, but it should be OK to just make everything use the sum
>>> helper, except the drivers that want the primary node.  That should work
>>> for all cases.  (I think that whenever a format driver suddenly gains
>>> more child nodes, we probably will want to count them.  OTOH, everything
>>> that has nodes that shouldn’t be counted probably always wants to use
>>> the primary node helper function from the start.)
>>
>> The job filter nodes have only one child currently, which should be
>> counted. We'll add other children that shouldn't be counted only later.
>>
>> But we already have an idea of what possible extensions look like, so we
>> can probably choose the right function from the start.
> 
> Yep.
> 
>>>> 2. Make the 90% solution a 100% solution: Allow drivers to have multiple
>>>>    storage children (for vmdk) and then have the fallback add up the
>>>>    primary child plus all storage children. This is what I suggested as
>>>>    the documented semantics in my initial reply to this patch (that you
>>>>    chose not to answer).
>>>
>>> I didn’t answer that because I didn’t disagree.
>>>
>>>>    Adding the size of storage children covers qcow2 and vmdk.
>>>
>>> That’s of course exactly what we’re trying to do, but the question is,
>>> how do we figure out those storage children?  Make it a per-BdrvChild
>>> attribute?  That seems rather heavy-handed, because I think we’d need it
>>> only here.
>>
>> Well, you added bdrv_storage_child(). I'd argue this interface is wrong
> 
> Yes, it probably is.
> 
>> because it assumes that only one storage child exists. You just didn't
>> implement it for vmdk so that the problem didn't become apparent. It
>> would have to return a list rather than a single child. So fixing the
>> interface and then using it is what I was thinking.
>>
>> Now that you mention a per-BdrvChild attribute, however, I start to
>> wonder if the distinction between COW children, filter children, storage
>> children, metadata children, etc. isn't really what BdrvChildRole was
>> supposed to represent?
> 
> That’s a good point.
> 
>> Maybe we want to split off child_storage from child_file, though it's
>> not strictly necessary for this specific case because we want to treat
>> both metadata and storage nodes the same. But it could be useful for
>> other users of bdrv_storage_child(), if there are any.
> 
> Possible.  Maybe it turns out that at least for this series I don’t need
> bdrv_storage_child() at all.
> 
>>>>    As the job filter won't declare the target or any other involved
>>>>    nodes their storage nodes (I hope), this will do the right thing for
>>>>    them, too.
>>>>
>>>>    For quorum and blkverify both ways could be justifiable. I think they
>>>>    probably shouldn't declare their children as storage nodes. They are
>>>>    more like filters that don't have a single filtered node. So some
>>>>    kind of almost-filters.
>>>
>>> I don’t think quorum is a filter, and blkverify can only be justified to
>>> be a filter because it quits qemu when there is a mismatch.
>>>
>>> The better example is replication, but that has a clear filtered child
>>> (the primary node).
>>>
>>>
>>> So all in all I think it’s best to make the callback mandatory and add
>>> two global helper functions.  That’s simple enough and should prevent
>>> us from making mistakes by forgetting to adjust something in the
>>> future.
>>
>> Yes, that should work.
>>
>> We should probably still figure out what the relationship between the
>> child access functions and child roles is, even if we don't need it for
>> this solution. But it feels like an important part of the design.
> 
> Hm.  It feels like something that should be done before this series,
> actually.
> 
> So I think we should add at least a child role per child access function
> so that they match?  And then maybe in bdrv_attach_child() assert that a
> BDS never has more than one primary or filtered child (a filtered child
> acts as a primary child, too), or more than one COW child.  (And that
> these are always in bs->file or bs->backing so the child access
> functions do work.)

I’ve been trying to make this work, but I don’t think it does.  It just
feels all wrong and I end up with things like
“child_metadata_and_data”.  The last straw was that blkverify should
have the raw file be the filtered child (because, well, it’s bs->file),
but then the format file would need to be a non-filtered child, and
those would default to BDRV_O_PROTOCOL (which we decidedly don’t want).

Anyway, I’m currently attempting to solve this differently:
BdrvChildRole isn’t suitable for the job, I think.  The name is exactly
what we want, but the structure itself does not look to me like
something that describes the child’s role.

Instead, I’m introducing a new BdrvChildRole enum mask that describes
how the child is going to be used: stay-at-node, cow, metadata, data, etc.

I’m going to rename the current BdrvChildRole structure to
BdrvChildParent (for want of a better name), because really most of what
it does is describe the parent, but precisely not the child.  I’m moving
.stay_at_node to the new BdrvChildRole enum.

I hope this lets me unify child_file, child_backing, and child_format
into a child_of_bds object.  The callbacks should then decide the
particularities based on the BdrvChildRole enum.
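
To give a rough idea, here is a sketch of such a mask.  The SKETCH_*
and sketch_* names are placeholders, not what the final code will use;
only the categories themselves (stay-at-node, cow, metadata, data) are
what I have in mind, and the allocated-size helper is just one possible
consumer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag names; one bit per way a child may be used. */
typedef enum SketchChildRole {
    SKETCH_CHILD_STAY_AT_NODE = 1u << 0,
    SKETCH_CHILD_COW          = 1u << 1,
    SKETCH_CHILD_METADATA     = 1u << 2,
    SKETCH_CHILD_DATA         = 1u << 3,
} SketchChildRole;

/* Toy child link; not the real BdrvChild. */
typedef struct SketchChild {
    const char *name;
    uint32_t role;          /* OR-ed SketchChildRole bits */
    int64_t allocated;      /* allocated bytes in this child */
} SketchChild;

/* Metadata and data children are treated the same for size purposes. */
static bool sketch_child_counts_as_storage(const SketchChild *c)
{
    return (c->role & (SKETCH_CHILD_METADATA | SKETCH_CHILD_DATA)) != 0;
}

/* Sum the allocated size over all storage children, skipping COW ones. */
static int64_t sketch_allocated_size(const SketchChild *children, int n)
{
    int64_t sum = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (sketch_child_counts_as_storage(&children[i])) {
            sum += children[i].allocated;
        }
    }
    return sum;
}
```

A qcow2-like node with a metadata child, a data child and a COW child
would then have only the first two counted.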

Hope that makes sense. (? :S)

At least I feel much happier implementing it this way, which I suppose
is a good sign.

Max



Thread overview: 132+ messages
2019-08-09 16:13 [Qemu-devel] [PATCH v6 00/42] block: Deal with filters Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 01/42] block: Mark commit and mirror as filter drivers Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 02/42] copy-on-read: Support compressed writes Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 03/42] throttle: " Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 04/42] block: Add child access functions Max Reitz
2019-08-09 16:56   ` Eric Blake
2019-09-04 16:16   ` Kevin Wolf
2019-09-09  7:56     ` Max Reitz
2019-09-09  9:36       ` Kevin Wolf
2019-09-09 14:04         ` Max Reitz
2019-09-09 16:13           ` Kevin Wolf
2019-09-10  9:14             ` Max Reitz
2019-09-10 10:47               ` Kevin Wolf
2019-09-10 11:36                 ` Max Reitz
2019-09-10 12:48                   ` Kevin Wolf
2019-09-10 12:59                     ` Max Reitz
2019-09-10 13:10                       ` Kevin Wolf
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 05/42] block: Add chain helper functions Max Reitz
2019-08-09 17:01   ` Eric Blake
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 06/42] qcow2: Implement .bdrv_storage_child() Max Reitz
2019-08-09 17:07   ` Eric Blake
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 07/42] block: *filtered_cow_child() for *has_zero_init() Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 08/42] block: bdrv_set_backing_hd() is about bs->backing Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 09/42] block: Include filters when freezing backing chain Max Reitz
2019-08-10 13:32   ` Vladimir Sementsov-Ogievskiy
2019-08-12 12:56     ` Max Reitz
2019-09-05 13:05   ` Kevin Wolf
2019-09-09  8:02     ` Max Reitz
2019-09-09  9:40       ` Kevin Wolf
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 10/42] block: Drop bdrv_is_encrypted() Max Reitz
2019-08-10 13:42   ` Vladimir Sementsov-Ogievskiy
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 11/42] block: Add bdrv_supports_compressed_writes() Max Reitz
2019-09-05 13:11   ` Kevin Wolf
2019-09-09  8:09     ` Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 12/42] block: Use bdrv_filtered_rw* where obvious Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 13/42] block: Use CAFs in block status functions Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 14/42] block: Use CAFs when working with backing chains Max Reitz
2019-08-10 15:19   ` Vladimir Sementsov-Ogievskiy
2019-09-05 14:05   ` Kevin Wolf
2019-09-09  8:25     ` Max Reitz
2019-09-09  9:55       ` Kevin Wolf
2019-09-09 14:08         ` Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 15/42] block: Re-evaluate backing file handling in reopen Max Reitz
2019-08-10 16:05   ` Vladimir Sementsov-Ogievskiy
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 16/42] block: Flush all children in generic code Max Reitz
2019-08-10 15:36   ` Vladimir Sementsov-Ogievskiy
2019-08-12 12:58     ` Max Reitz
2019-09-05 16:24       ` Kevin Wolf
2019-09-09  8:31         ` Max Reitz
2019-09-09 10:01           ` Kevin Wolf
2019-09-09 14:15             ` Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 17/42] block: Use CAFs in bdrv_refresh_limits() Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 18/42] block: Use CAFs in bdrv_refresh_filename() Max Reitz
2019-08-10 16:22   ` Vladimir Sementsov-Ogievskiy
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 19/42] block: Use CAF in bdrv_co_rw_vmstate() Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 20/42] block/snapshot: Fix fallback Max Reitz
2019-08-10 16:34   ` Vladimir Sementsov-Ogievskiy
2019-08-12 13:06     ` Max Reitz
2019-09-10 11:56   ` Kevin Wolf
2019-09-10 12:04     ` Max Reitz
2019-09-10 12:49       ` Kevin Wolf
2019-09-10 13:06         ` Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 21/42] block: Use CAFs for debug breakpoints Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 22/42] block: Fix bdrv_get_allocated_file_size's fallback Max Reitz
2019-08-10 16:41   ` Vladimir Sementsov-Ogievskiy
2019-08-12 13:09     ` Max Reitz
2019-08-12 17:14       ` Vladimir Sementsov-Ogievskiy
2019-08-12 19:15         ` Max Reitz
2019-09-10 14:52   ` Kevin Wolf
2019-09-11  6:20     ` Max Reitz
2019-09-11  6:55       ` Kevin Wolf
2019-09-11  7:37         ` Max Reitz
2019-09-11  8:27           ` Kevin Wolf
2019-09-11 10:00             ` Max Reitz
2019-09-11 10:31               ` Kevin Wolf
2019-09-11 11:00                 ` Max Reitz
2019-09-12 10:34                   ` Kevin Wolf
2019-11-14 13:11                   ` Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 23/42] blockdev: Use CAF in external_snapshot_prepare() Max Reitz
2019-09-10 15:02   ` Kevin Wolf
2019-09-11  6:21     ` Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 24/42] block: Use child access functions for QAPI queries Max Reitz
2019-08-10 16:57   ` Vladimir Sementsov-Ogievskiy
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 25/42] mirror: Deal with filters Max Reitz
2019-08-12 11:09   ` Vladimir Sementsov-Ogievskiy
2019-08-12 13:26     ` Max Reitz
2019-08-14 15:17       ` Vladimir Sementsov-Ogievskiy
2019-08-31  9:57   ` Vladimir Sementsov-Ogievskiy
2019-09-02 14:35     ` Max Reitz
2019-09-03  8:32       ` Vladimir Sementsov-Ogievskiy
2019-09-09  7:41         ` Max Reitz
2019-09-13 12:55   ` Kevin Wolf
2019-09-16 10:26     ` Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 26/42] backup: Deal with filters Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 27/42] commit: Deal with filters Max Reitz
2019-08-31 10:44   ` Vladimir Sementsov-Ogievskiy
2019-09-02 14:55     ` Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 28/42] stream: Deal with filters Max Reitz
2019-08-12 11:55   ` Vladimir Sementsov-Ogievskiy
2019-09-13 14:16   ` Kevin Wolf
2019-09-16  9:52     ` Max Reitz
2019-09-16 14:47       ` Kevin Wolf
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 29/42] nbd: Use CAF when looking for dirty bitmap Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 30/42] qemu-img: Use child access functions Max Reitz
2019-08-12 12:14   ` Vladimir Sementsov-Ogievskiy
2019-08-12 13:28     ` Max Reitz
2019-08-14 16:04   ` Vladimir Sementsov-Ogievskiy
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 31/42] block: Drop backing_bs() Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 32/42] block: Make bdrv_get_cumulative_perm() public Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 33/42] blockdev: Fix active commit choice Max Reitz
2019-08-09 16:13 ` [Qemu-devel] [PATCH v6 34/42] block: Inline bdrv_co_block_status_from_*() Max Reitz
2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 35/42] block: Fix check_to_replace_node() Max Reitz
2019-08-15 15:21   ` Vladimir Sementsov-Ogievskiy
2019-08-15 17:01     ` Max Reitz
2019-08-16 11:01       ` Vladimir Sementsov-Ogievskiy
2019-08-16 13:30         ` Max Reitz
2019-08-16 14:24           ` Vladimir Sementsov-Ogievskiy
2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 36/42] iotests: Add tests for mirror @replaces loops Max Reitz
2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 37/42] block: Leave BDS.backing_file constant Max Reitz
2019-08-16 16:16   ` Vladimir Sementsov-Ogievskiy
2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 38/42] iotests: Let complete_and_wait() work with commit Max Reitz
2019-08-23  5:59   ` Vladimir Sementsov-Ogievskiy
2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 39/42] iotests: Add filter commit test cases Max Reitz
2019-08-31 11:41   ` Vladimir Sementsov-Ogievskiy
2019-09-02 15:06     ` Max Reitz
2019-08-31 12:35   ` Vladimir Sementsov-Ogievskiy
2019-09-02 15:09     ` Max Reitz
2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 40/42] iotests: Add filter mirror test cases Max Reitz
2019-08-31 12:35   ` Vladimir Sementsov-Ogievskiy
2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 41/42] iotests: Add test for commit in sub directory Max Reitz
2019-08-09 16:14 ` [Qemu-devel] [PATCH v6 42/42] iotests: Test committing to overridden backing Max Reitz
2019-09-03  9:18   ` Vladimir Sementsov-Ogievskiy
