* [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup
@ 2019-05-29 15:46 Vladimir Sementsov-Ogievskiy
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 1/7] block: teach bdrv_debug_breakpoint skip filters with backing Vladimir Sementsov-Ogievskiy
                   ` (9 more replies)
  0 siblings, 10 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-05-29 15:46 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: fam, kwolf, vsementsov, mreitz, stefanha, den, jsnow

Hi all!

This series introduces the backup-top driver. It is a filter node which
performs the copy-before-write (CBW) operation. Mirror already uses a
filter node for handling guest writes; let's move backup from write
notifiers to a filter node too.

v8:
01-03: new
05: add Max's r-b
the others have changed; the change description is in the Notes section of
each patch mail.

v6 and v7 were preparatory refactoring, which is now in Max's pull request;
this series is based on it:
Based-on: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg06526.html

Vladimir Sementsov-Ogievskiy (7):
  block: teach bdrv_debug_breakpoint skip filters with backing
  block: swap operation order in bdrv_append
  block: allow not one child for implicit node
  block: introduce backup-top filter driver
  block/io: refactor wait_serialising_requests
  block: add lock/unlock range functions
  block/backup: use backup-top instead of write notifiers

 block/backup-top.h         |  64 ++++++++
 include/block/block_int.h  |   4 +
 block.c                    |  60 +++++--
 block/backup-top.c         | 322 +++++++++++++++++++++++++++++++++++++
 block/backup.c             | 171 ++++++++------------
 block/io.c                 |  68 ++++++--
 block/Makefile.objs        |   2 +
 tests/qemu-iotests/056     |   2 +-
 tests/qemu-iotests/085.out |   2 +-
 tests/qemu-iotests/129     |   1 -
 tests/qemu-iotests/141.out |   2 +-
 11 files changed, 567 insertions(+), 131 deletions(-)
 create mode 100644 block/backup-top.h
 create mode 100644 block/backup-top.c

-- 
2.18.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v8 1/7] block: teach bdrv_debug_breakpoint skip filters with backing
  2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
@ 2019-05-29 15:46 ` Vladimir Sementsov-Ogievskiy
  2019-06-13 13:43   ` Max Reitz
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 2/7] block: swap operation order in bdrv_append Vladimir Sementsov-Ogievskiy
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-05-29 15:46 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: fam, kwolf, vsementsov, mreitz, stefanha, den, jsnow

Teach bdrv_debug_breakpoint and bdrv_debug_remove_breakpoint to skip
filters with a backing child. This is needed so that the backup job can
implement and use its own backup_top filter driver (like mirror already
has one); without this improvement, breakpoint removal will fail at least
in iotest 55.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/block.c b/block.c
index 1a73e310c1..e6e9770704 100644
--- a/block.c
+++ b/block.c
@@ -4982,14 +4982,35 @@ void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
     bs->drv->bdrv_debug_event(bs, event);
 }
 
-int bdrv_debug_breakpoint(BlockDriverState *bs, const char *event,
-                          const char *tag)
+static BlockDriverState *bdrv_find_debug_node(BlockDriverState *bs)
 {
     while (bs && bs->drv && !bs->drv->bdrv_debug_breakpoint) {
-        bs = bs->file ? bs->file->bs : NULL;
+        if (bs->file) {
+            bs = bs->file->bs;
+            continue;
+        }
+
+        if (bs->drv->is_filter && bs->backing) {
+            bs = bs->backing->bs;
+            continue;
+        }
+
+        break;
     }
 
     if (bs && bs->drv && bs->drv->bdrv_debug_breakpoint) {
+        assert(bs->drv->bdrv_debug_remove_breakpoint);
+        return bs;
+    }
+
+    return NULL;
+}
+
+int bdrv_debug_breakpoint(BlockDriverState *bs, const char *event,
+                          const char *tag)
+{
+    bs = bdrv_find_debug_node(bs);
+    if (bs) {
         return bs->drv->bdrv_debug_breakpoint(bs, event, tag);
     }
 
@@ -4998,11 +5019,8 @@ int bdrv_debug_breakpoint(BlockDriverState *bs, const char *event,
 
 int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag)
 {
-    while (bs && bs->drv && !bs->drv->bdrv_debug_remove_breakpoint) {
-        bs = bs->file ? bs->file->bs : NULL;
-    }
-
-    if (bs && bs->drv && bs->drv->bdrv_debug_remove_breakpoint) {
+    bs = bdrv_find_debug_node(bs);
+    if (bs) {
         return bs->drv->bdrv_debug_remove_breakpoint(bs, tag);
     }
 
-- 
2.18.0




* [Qemu-devel] [PATCH v8 2/7] block: swap operation order in bdrv_append
  2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 1/7] block: teach bdrv_debug_breakpoint skip filters with backing Vladimir Sementsov-Ogievskiy
@ 2019-05-29 15:46 ` Vladimir Sementsov-Ogievskiy
  2019-06-13 13:45   ` Max Reitz
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 3/7] block: allow not one child for implicit node Vladimir Sementsov-Ogievskiy
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-05-29 15:46 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: fam, kwolf, vsementsov, mreitz, stefanha, den, jsnow

bs_top's parents may conflict with bs_new's backing-child permissions, so
let's do bdrv_replace_node first; it covers more possible cases.

This is needed for the upcoming implementation of the backup-top filter,
which does not want to share the write permission on its backing child.

A side effect is that we may set the backing hd when the device name is
already available, so the output of iotest 085 changes.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block.c                    | 11 ++++++++---
 tests/qemu-iotests/085.out |  2 +-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index e6e9770704..57216f4115 100644
--- a/block.c
+++ b/block.c
@@ -4088,22 +4088,27 @@ void bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
 {
     Error *local_err = NULL;
 
-    bdrv_set_backing_hd(bs_new, bs_top, &local_err);
+    bdrv_ref(bs_top);
+
+    bdrv_replace_node(bs_top, bs_new, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
+        error_prepend(errp, "Failed to replace node: ");
         goto out;
     }
 
-    bdrv_replace_node(bs_top, bs_new, &local_err);
+    bdrv_set_backing_hd(bs_new, bs_top, &local_err);
     if (local_err) {
+        bdrv_replace_node(bs_new, bs_top, &error_abort);
         error_propagate(errp, local_err);
-        bdrv_set_backing_hd(bs_new, NULL, &error_abort);
+        error_prepend(errp, "Failed to set backing: ");
         goto out;
     }
 
     /* bs_new is now referenced by its new parents, we don't need the
      * additional reference any more. */
 out:
+    bdrv_unref(bs_top);
     bdrv_unref(bs_new);
 }
 
diff --git a/tests/qemu-iotests/085.out b/tests/qemu-iotests/085.out
index 6edf107f55..e5a2645bf5 100644
--- a/tests/qemu-iotests/085.out
+++ b/tests/qemu-iotests/085.out
@@ -74,7 +74,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/
 
 === Invalid command - snapshot node used as backing hd ===
 
-{"error": {"class": "GenericError", "desc": "Node 'snap_11' is busy: node is used as backing hd of 'snap_12'"}}
+{"error": {"class": "GenericError", "desc": "Node 'snap_11' is busy: node is used as backing hd of 'virtio0'"}}
 
 === Invalid command - snapshot node has a backing image ===
 
-- 
2.18.0




* [Qemu-devel] [PATCH v8 3/7] block: allow not one child for implicit node
  2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 1/7] block: teach bdrv_debug_breakpoint skip filters with backing Vladimir Sementsov-Ogievskiy
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 2/7] block: swap operation order in bdrv_append Vladimir Sementsov-Ogievskiy
@ 2019-05-29 15:46 ` Vladimir Sementsov-Ogievskiy
  2019-06-13 13:51   ` Max Reitz
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver Vladimir Sementsov-Ogievskiy
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-05-29 15:46 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: fam, kwolf, vsementsov, mreitz, stefanha, den, jsnow

The upcoming backup-top filter wants to operate like a usual implicit
filter node with fall-through to the backing child, but it also needs an
additional target child; let's support that.

On the other hand, after backup completion (but before job dismiss) the
filter is still attached to the job blk, yet doesn't have any children.
Support this too.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 57216f4115..3f4de3ae32 100644
--- a/block.c
+++ b/block.c
@@ -6200,9 +6200,20 @@ void bdrv_refresh_filename(BlockDriverState *bs)
     }
 
     if (bs->implicit) {
-        /* For implicit nodes, just copy everything from the single child */
+        /*
+         * For implicit nodes, just copy everything from the single child or
+         * from backing, if there are several children.
+         * If there are no children for some reason (filter is still attached
+         * to block-job blk, but already removed from backing chain of device)
+         * do nothing.
+         */
         child = QLIST_FIRST(&bs->children);
-        assert(QLIST_NEXT(child, next) == NULL);
+        if (!child) {
+            return;
+        } else if (QLIST_NEXT(child, next)) {
+            assert(bs->backing);
+            child = bs->backing;
+        }
 
         pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
                 child->bs->exact_filename);
-- 
2.18.0




* [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 3/7] block: allow not one child for implicit node Vladimir Sementsov-Ogievskiy
@ 2019-05-29 15:46 ` Vladimir Sementsov-Ogievskiy
  2019-06-13 15:57   ` Max Reitz
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 5/7] block/io: refactor wait_serialising_requests Vladimir Sementsov-Ogievskiy
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-05-29 15:46 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: fam, kwolf, vsementsov, mreitz, stefanha, den, jsnow

The backup-top filter performs the copy-before-write (CBW) operation. It
is inserted above the active disk and has a target node for CBW, like the
following:

    +-------+
    | Guest |
    +-------+
        |r,w
        v
    +------------+  target   +---------------+
    | backup_top |---------->| target(qcow2) |
    +------------+   CBW     +---------------+
        |
backing |r,w
        v
    +-------------+
    | Active disk |
    +-------------+

The driver will be used by backup instead of write notifiers.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup-top.h  |  64 +++++++++
 block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
 block/Makefile.objs |   2 +
 3 files changed, 388 insertions(+)
 create mode 100644 block/backup-top.h
 create mode 100644 block/backup-top.c

diff --git a/block/backup-top.h b/block/backup-top.h
new file mode 100644
index 0000000000..788e18c358
--- /dev/null
+++ b/block/backup-top.h
@@ -0,0 +1,64 @@
+/*
+ * backup-top filter driver
+ *
+ * The driver performs Copy-Before-Write (CBW) operation: it is injected above
+ * some node, and before each write it copies _old_ data to the target node.
+ *
+ * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef BACKUP_TOP_H
+#define BACKUP_TOP_H
+
+#include "qemu/osdep.h"
+
+#include "block/block_int.h"
+
+typedef void (*BackupTopProgressCallback)(uint64_t done, void *opaque);
+typedef struct BDRVBackupTopState {
+    HBitmap *copy_bitmap; /* what should be copied to @target on guest write. */
+    BdrvChild *target;
+
+    BackupTopProgressCallback progress_cb;
+    void *progress_opaque;
+} BDRVBackupTopState;
+
+/*
+ * bdrv_backup_top_append
+ *
+ * Append a backup_top filter node above the @source node. The @target node
+ * will receive the data backed up during CBW operations. The new filter,
+ * together with the @target node, is attached to the aio context of @source.
+ *
+ * The resulting filter node is implicit.
+ *
+ * @copy_bitmap selects regions which need CBW. Furthermore, backup_top will
+ * use exactly this bitmap, so it may be used to control backup_top behavior
+ * dynamically. The caller must not release @copy_bitmap during the lifetime
+ * of backup_top. Progress is tracked by calling the @progress_cb function.
+ */
+BlockDriverState *bdrv_backup_top_append(
+        BlockDriverState *source, BlockDriverState *target,
+        HBitmap *copy_bitmap, Error **errp);
+void bdrv_backup_top_set_progress_callback(
+        BlockDriverState *bs, BackupTopProgressCallback progress_cb,
+        void *progress_opaque);
+void bdrv_backup_top_drop(BlockDriverState *bs);
+
+#endif /* BACKUP_TOP_H */
diff --git a/block/backup-top.c b/block/backup-top.c
new file mode 100644
index 0000000000..1daa02f539
--- /dev/null
+++ b/block/backup-top.c
@@ -0,0 +1,322 @@
+/*
+ * backup-top filter driver
+ *
+ * The driver performs Copy-Before-Write (CBW) operation: it is injected above
+ * some node, and before each write it copies _old_ data to the target node.
+ *
+ * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qemu/cutils.h"
+#include "qapi/error.h"
+#include "block/block_int.h"
+#include "block/qdict.h"
+
+#include "block/backup-top.h"
+
+static coroutine_fn int backup_top_co_preadv(
+        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+        QEMUIOVector *qiov, int flags)
+{
+    /*
+     * Features to be implemented:
+     * F1. COR. save read data to fleecing target for fast access
+     *     (to reduce reads). This possibly may be done with use of copy-on-read
+     *     filter, but we need an ability to make COR requests optional: for
+     *     example, if target is a ram-cache, and if it is full now, we should
+     *     skip doing COR request, as it is actually not necessary.
+     *
+     * F2. Feature for guest: read from fleecing target if data is in ram-cache
+     *     and is unchanged
+     */
+
+    return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
+}
+
+static coroutine_fn int backup_top_cbw(BlockDriverState *bs, uint64_t offset,
+                                       uint64_t bytes)
+{
+    int ret = 0;
+    BDRVBackupTopState *s = bs->opaque;
+    uint64_t gran = 1UL << hbitmap_granularity(s->copy_bitmap);
+    uint64_t end = QEMU_ALIGN_UP(offset + bytes, gran);
+    uint64_t off = QEMU_ALIGN_DOWN(offset, gran), len;
+    void *buf = qemu_blockalign(bs, end - off);
+
+    /*
+     * Features to be implemented:
+     * F3. parallelize copying loop
+     * F4. detect zeroes ? or, otherwise, drop detect zeroes from backup code
+     *     and just enable zeroes detecting on target
+     * F5. use block_status ?
+     * F6. don't copy clusters which are already cached by COR [see F1]
+     * F7. if target is ram-cache and it is full, there should be a possibility
+     *     to drop not necessary data (cached by COR [see F1]) to handle CBW
+     *     fast.
+     */
+
+    len = end - off;
+    while (hbitmap_next_dirty_area(s->copy_bitmap, &off, &len)) {
+        hbitmap_reset(s->copy_bitmap, off, len);
+
+        ret = bdrv_co_pread(bs->backing, off, len, buf,
+                            BDRV_REQ_NO_SERIALISING);
+        if (ret < 0) {
+            hbitmap_set(s->copy_bitmap, off, len);
+            goto out;
+        }
+
+        ret = bdrv_co_pwrite(s->target, off, len, buf, BDRV_REQ_SERIALISING);
+        if (ret < 0) {
+            hbitmap_set(s->copy_bitmap, off, len);
+            goto out;
+        }
+
+        if (s->progress_cb) {
+            s->progress_cb(len, s->progress_opaque);
+        }
+        off += len;
+        if (off >= end) {
+            break;
+        }
+        len = end - off;
+    }
+
+out:
+    qemu_vfree(buf);
+
+    /*
+     * F8. We fail the guest request in case of error. Alternatively, we
+     * could fail the copying process instead, retry several times, or
+     * perhaps pause the guest, etc.
+     */
+    return ret;
+}
+
+static int coroutine_fn backup_top_co_pdiscard(BlockDriverState *bs,
+                                               int64_t offset, int bytes)
+{
+    int ret = backup_top_cbw(bs, offset, bytes);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /*
+     * Features to be implemented:
+     * F9. possibility of lazy discard: just defer the discard after fleecing
+     *     completion. If write (or new discard) occurs to the same area, just
+     *     drop deferred discard.
+     */
+
+    return bdrv_co_pdiscard(bs->backing, offset, bytes);
+}
+
+static int coroutine_fn backup_top_co_pwrite_zeroes(BlockDriverState *bs,
+        int64_t offset, int bytes, BdrvRequestFlags flags)
+{
+    int ret = backup_top_cbw(bs, offset, bytes);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return bdrv_co_pwrite_zeroes(bs->backing, offset, bytes, flags);
+}
+
+static coroutine_fn int backup_top_co_pwritev(BlockDriverState *bs,
+                                              uint64_t offset,
+                                              uint64_t bytes,
+                                              QEMUIOVector *qiov, int flags)
+{
+    if (!(flags & BDRV_REQ_WRITE_UNCHANGED)) {
+        int ret = backup_top_cbw(bs, offset, bytes);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
+}
+
+static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
+{
+    if (!bs->backing) {
+        return 0;
+    }
+
+    return bdrv_co_flush(bs->backing->bs);
+}
+
+static void backup_top_refresh_filename(BlockDriverState *bs)
+{
+    if (bs->backing == NULL) {
+        /*
+         * we can be here after failed bdrv_attach_child in
+         * bdrv_set_backing_hd
+         */
+        return;
+    }
+    bdrv_refresh_filename(bs->backing->bs);
+    pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
+            bs->backing->bs->filename);
+}
+
+static void backup_top_child_perm(BlockDriverState *bs, BdrvChild *c,
+                                  const BdrvChildRole *role,
+                                  BlockReopenQueue *reopen_queue,
+                                  uint64_t perm, uint64_t shared,
+                                  uint64_t *nperm, uint64_t *nshared)
+{
+    /*
+     * We have HBitmap in the state, its size is fixed, so we never allow
+     * resize.
+     */
+    uint64_t rw = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
+                  BLK_PERM_WRITE;
+
+    bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared,
+                              nperm, nshared);
+
+    *nperm = *nperm & rw;
+    *nshared = *nshared & rw;
+
+    if (role == &child_file) {
+        /*
+         * Target child
+         *
+         * Share write to target (child_file), to not interfere
+         * with guest writes to its disk which may be in target backing chain.
+         */
+        if (perm & BLK_PERM_WRITE) {
+            *nshared = *nshared | BLK_PERM_WRITE;
+        }
+    } else {
+        /* Source child */
+        if (perm & BLK_PERM_WRITE) {
+            *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
+        }
+        *nshared =
+            *nshared & (BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED);
+    }
+}
+
+BlockDriver bdrv_backup_top_filter = {
+    .format_name = "backup-top",
+    .instance_size = sizeof(BDRVBackupTopState),
+
+    .bdrv_co_preadv             = backup_top_co_preadv,
+    .bdrv_co_pwritev            = backup_top_co_pwritev,
+    .bdrv_co_pwrite_zeroes      = backup_top_co_pwrite_zeroes,
+    .bdrv_co_pdiscard           = backup_top_co_pdiscard,
+    .bdrv_co_flush              = backup_top_co_flush,
+
+    .bdrv_co_block_status       = bdrv_co_block_status_from_backing,
+
+    .bdrv_refresh_filename      = backup_top_refresh_filename,
+
+    .bdrv_child_perm            = backup_top_child_perm,
+
+    .is_filter = true,
+};
+
+BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
+                                         BlockDriverState *target,
+                                         HBitmap *copy_bitmap,
+                                         Error **errp)
+{
+    Error *local_err = NULL;
+    BDRVBackupTopState *state;
+    BlockDriverState *top = bdrv_new_open_driver(&bdrv_backup_top_filter,
+                                                 NULL, BDRV_O_RDWR, errp);
+
+    if (!top) {
+        return NULL;
+    }
+
+    top->implicit = true;
+    top->total_sectors = source->total_sectors;
+    top->bl.opt_mem_alignment = MAX(bdrv_opt_mem_align(source),
+                                    bdrv_opt_mem_align(target));
+    top->opaque = state = g_new0(BDRVBackupTopState, 1);
+    state->copy_bitmap = copy_bitmap;
+
+    bdrv_ref(target);
+    state->target = bdrv_attach_child(top, target, "target", &child_file, errp);
+    if (!state->target) {
+        bdrv_unref(target);
+        bdrv_unref(top);
+        return NULL;
+    }
+
+    bdrv_set_aio_context(top, bdrv_get_aio_context(source));
+    bdrv_set_aio_context(target, bdrv_get_aio_context(source));
+
+    bdrv_drained_begin(source);
+
+    bdrv_ref(top);
+    bdrv_append(top, source, &local_err);
+    if (local_err) {
+        error_prepend(&local_err, "Cannot append backup-top filter: ");
+    }
+
+    bdrv_drained_end(source);
+
+    if (local_err) {
+        bdrv_unref_child(top, state->target);
+        bdrv_unref(top);
+        error_propagate(errp, local_err);
+        return NULL;
+    }
+
+    return top;
+}
+
+void bdrv_backup_top_set_progress_callback(
+        BlockDriverState *bs, BackupTopProgressCallback progress_cb,
+        void *progress_opaque)
+{
+    BDRVBackupTopState *s = bs->opaque;
+
+    s->progress_cb = progress_cb;
+    s->progress_opaque = progress_opaque;
+}
+
+void bdrv_backup_top_drop(BlockDriverState *bs)
+{
+    BDRVBackupTopState *s = bs->opaque;
+    AioContext *aio_context = bdrv_get_aio_context(bs);
+
+    aio_context_acquire(aio_context);
+
+    bdrv_drained_begin(bs);
+
+    bdrv_child_try_set_perm(bs->backing, 0, BLK_PERM_ALL, &error_abort);
+    bdrv_replace_node(bs, backing_bs(bs), &error_abort);
+    bdrv_set_backing_hd(bs, NULL, &error_abort);
+
+    bdrv_drained_end(bs);
+
+    if (s->target) {
+        bdrv_unref_child(bs, s->target);
+    }
+    bdrv_unref(bs);
+
+    aio_context_release(aio_context);
+}
diff --git a/block/Makefile.objs b/block/Makefile.objs
index ae11605c9f..dfbdfe6ab4 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -40,6 +40,8 @@ block-obj-y += throttle.o copy-on-read.o
 
 block-obj-y += crypto.o
 
+block-obj-y += backup-top.o
+
 common-obj-y += stream.o
 
 nfs.o-libs         := $(LIBNFS_LIBS)
-- 
2.18.0




* [Qemu-devel] [PATCH v8 5/7] block/io: refactor wait_serialising_requests
  2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver Vladimir Sementsov-Ogievskiy
@ 2019-05-29 15:46 ` Vladimir Sementsov-Ogievskiy
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 6/7] block: add lock/unlock range functions Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-05-29 15:46 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: fam, kwolf, vsementsov, mreitz, stefanha, den, jsnow

Split out do_wait_serialising_requests, with the additional possibility of
not actually waiting but just checking whether there is something to wait
for.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block/io.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/block/io.c b/block/io.c
index 3134a60a48..c347f90722 100644
--- a/block/io.c
+++ b/block/io.c
@@ -720,12 +720,13 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
     bdrv_wakeup(bs);
 }
 
-static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+static bool coroutine_fn do_wait_serialising_requests(BdrvTrackedRequest *self,
+                                                      bool wait)
 {
     BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
     bool retry;
-    bool waited = false;
+    bool found = false;
 
     if (!atomic_read(&bs->serialising_in_flight)) {
         return false;
@@ -751,11 +752,13 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
                  * will wait for us as soon as it wakes up, then just go on
                  * (instead of producing a deadlock in the former case). */
                 if (!req->waiting_for) {
-                    self->waiting_for = req;
-                    qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
-                    self->waiting_for = NULL;
-                    retry = true;
-                    waited = true;
+                    found = true;
+                    if (wait) {
+                        self->waiting_for = req;
+                        qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
+                        self->waiting_for = NULL;
+                        retry = true;
+                    }
                     break;
                 }
             }
@@ -763,7 +766,12 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
         qemu_co_mutex_unlock(&bs->reqs_lock);
     } while (retry);
 
-    return waited;
+    return found;
+}
+
+static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+{
+    return do_wait_serialising_requests(self, true);
 }
 
 static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
-- 
2.18.0




* [Qemu-devel] [PATCH v8 6/7] block: add lock/unlock range functions
  2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
                   ` (4 preceding siblings ...)
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 5/7] block/io: refactor wait_serialising_requests Vladimir Sementsov-Ogievskiy
@ 2019-05-29 15:46 ` Vladimir Sementsov-Ogievskiy
  2019-06-13 16:31   ` Max Reitz
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 7/7] block/backup: use backup-top instead of write notifiers Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-05-29 15:46 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: fam, kwolf, vsementsov, mreitz, stefanha, den, jsnow

From: Vladimir Sementsov-Ogievskiy <etendren@gmail.com>

Introduce lock/unlock range functionality based on serialising requests.
This is needed to refactor backup, dropping its local tracked-request-like
synchronization.

Signed-off-by: Vladimir Sementsov-Ogievskiy <etendren@gmail.com>
---
 include/block/block_int.h |  4 ++++
 block/io.c                | 44 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 1eebc7c8f3..9d9d4346e9 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -66,6 +66,7 @@ enum BdrvTrackedRequestType {
     BDRV_TRACKED_WRITE,
     BDRV_TRACKED_DISCARD,
     BDRV_TRACKED_TRUNCATE,
+    BDRV_TRACKED_LOCK,
 };
 
 typedef struct BdrvTrackedRequest {
@@ -928,6 +929,9 @@ int coroutine_fn bdrv_co_preadv(BdrvChild *child,
 int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
     int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags);
+void *coroutine_fn bdrv_co_try_lock(BlockDriverState *bs,
+                                    int64_t offset, unsigned int bytes);
+void coroutine_fn bdrv_co_unlock(void *opaque);
 
 static inline int coroutine_fn bdrv_co_pread(BdrvChild *child,
     int64_t offset, unsigned int bytes, void *buf, BdrvRequestFlags flags)
diff --git a/block/io.c b/block/io.c
index c347f90722..488e01e0a1 100644
--- a/block/io.c
+++ b/block/io.c
@@ -720,6 +720,15 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
     bdrv_wakeup(bs);
 }
 
+static bool ignore_intersection(BdrvTrackedRequest *a, BdrvTrackedRequest *b)
+{
+    return a == b || (!a->serialising && !b->serialising) ||
+        (a->type == BDRV_TRACKED_LOCK && b->type == BDRV_TRACKED_READ &&
+         !b->serialising) ||
+        (b->type == BDRV_TRACKED_LOCK && a->type == BDRV_TRACKED_READ &&
+         !a->serialising);
+}
+
 static bool coroutine_fn do_wait_serialising_requests(BdrvTrackedRequest *self,
                                                       bool wait)
 {
@@ -736,7 +745,7 @@ static bool coroutine_fn do_wait_serialising_requests(BdrvTrackedRequest *self,
         retry = false;
         qemu_co_mutex_lock(&bs->reqs_lock);
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
-            if (req == self || (!req->serialising && !self->serialising)) {
+            if (ignore_intersection(self, req)) {
                 continue;
             }
             if (tracked_request_overlaps(req, self->overlap_offset,
@@ -774,6 +783,12 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
     return do_wait_serialising_requests(self, true);
 }
 
+static bool coroutine_fn should_wait_serialising_requests(
+        BdrvTrackedRequest *self)
+{
+    return do_wait_serialising_requests(self, false);
+}
+
 static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
                                    size_t size)
 {
@@ -3184,3 +3199,30 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, PreallocMode prealloc,
 
     return tco.ret;
 }
+
+void *coroutine_fn bdrv_co_try_lock(BlockDriverState *bs,
+                                    int64_t offset, unsigned int bytes)
+{
+    BdrvTrackedRequest *req = g_new(BdrvTrackedRequest, 1);
+
+    tracked_request_begin(req, bs, offset, bytes, BDRV_TRACKED_LOCK);
+    mark_request_serialising(req, bdrv_get_cluster_size(bs));
+
+    if (should_wait_serialising_requests(req)) {
+        tracked_request_end(req);
+        g_free(req);
+        return NULL;
+    }
+
+    return req;
+}
+
+void coroutine_fn bdrv_co_unlock(void *opaque)
+{
+    BdrvTrackedRequest *req = opaque;
+
+    assert(req->type == BDRV_TRACKED_LOCK);
+
+    tracked_request_end(req);
+    g_free(req);
+}
-- 
2.18.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread
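[Editor's note, not part of the patch: the semantics of the new
bdrv_co_try_lock()/bdrv_co_unlock() pair above can be sketched with a
self-contained model. The names Range, try_lock and unlock are
hypothetical; real QEMU tracks every request (reads, writes, locks) in
bs->tracked_requests under a CoMutex and can wait on coroutine queues,
whereas this model only covers the non-waiting try-lock path, which
fails on conflict instead of blocking.]

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Stand-in for the tracked-request list: a fixed table of locked ranges. */
typedef struct Range {
    int64_t offset;
    int64_t bytes;
    bool in_use;
} Range;

#define MAX_LOCKS 16
static Range lock_table[MAX_LOCKS];

static bool ranges_overlap(int64_t a_off, int64_t a_bytes,
                           int64_t b_off, int64_t b_bytes)
{
    return a_off < b_off + b_bytes && b_off < a_off + a_bytes;
}

/* Mimics bdrv_co_try_lock(): returns an opaque handle on success,
 * NULL if the range intersects an already-locked range. */
void *try_lock(int64_t offset, int64_t bytes)
{
    int free_slot = -1;

    for (int i = 0; i < MAX_LOCKS; i++) {
        if (lock_table[i].in_use) {
            if (ranges_overlap(offset, bytes,
                               lock_table[i].offset, lock_table[i].bytes)) {
                return NULL;    /* conflict: caller must not proceed */
            }
        } else if (free_slot < 0) {
            free_slot = i;
        }
    }
    if (free_slot < 0) {
        return NULL;            /* table full */
    }
    lock_table[free_slot] = (Range){ .offset = offset, .bytes = bytes,
                                     .in_use = true };
    return &lock_table[free_slot];
}

/* Mimics bdrv_co_unlock(): releases the range for other users. */
void unlock(void *opaque)
{
    ((Range *)opaque)->in_use = false;
}
```

The backup job in patch 7 relies on exactly this contract: a cluster whose
dirty bit is set has no in-flight writes, so its try-lock must succeed,
and the lock is dropped as soon as the cluster has been copied.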

* [Qemu-devel] [PATCH v8 7/7] block/backup: use backup-top instead of write notifiers
  2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
                   ` (5 preceding siblings ...)
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 6/7] block: add lock/unlock range functions Vladimir Sementsov-Ogievskiy
@ 2019-05-29 15:46 ` Vladimir Sementsov-Ogievskiy
  2019-06-13 18:02   ` Max Reitz
  2019-05-30 13:25 ` [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-05-29 15:46 UTC (permalink / raw)
  To: qemu-devel, qemu-block
  Cc: fam, kwolf, vsementsov, mreitz, stefanha, den, jsnow

Drop write notifiers and use a filter node instead. Changes:

1. Copy-before-write is now handled by the filter node, so drop all
   is_write_notifier arguments.

2. We no longer have intersecting requests, so their handling is dropped.
Instead, synchronization works as follows:
when backup or backup-top starts copying some area, it first clears the
corresponding copy-bitmap bits, and nobody touches areas not marked
dirty in the copy-bitmap, so there is no intersection. Also, backup
job copy operations are surrounded by a bdrv range lock, which is
effectively a serializing request, so as not to interfere with guest
writes and not to read changed data from the source (before reading we
clear the corresponding bit in the copy-bitmap, so the area is no longer
handled by backup-top).

3. To sync with in-flight requests at job finish we now have drained
removal of the filter, so we don't need the rw-lock.
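[Editor's note, not part of the patch: the copy-bitmap handshake described
in point 2 can be illustrated with a self-contained toy model. All names
here (copy_bitmap, do_cow, copy_count) are hypothetical simplifications;
QEMU's real HBitmap and backup_do_cow() are considerably more involved.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model, cluster granularity: both the backup job and the backup-top
 * filter "claim" a cluster by clearing its dirty bit before copying, so
 * each cluster is copied by exactly one party and copy operations on the
 * same area never intersect. */
#define CLUSTERS 8
static bool copy_bitmap[CLUSTERS];
static int copy_count[CLUSTERS];

static void bitmap_fill_dirty(void)
{
    for (int i = 0; i < CLUSTERS; i++) {
        copy_bitmap[i] = true;
    }
    memset(copy_count, 0, sizeof(copy_count));
}

/* Shared by the job's copy loop and the filter's copy-before-write path. */
static void do_cow(int cluster)
{
    if (!copy_bitmap[cluster]) {
        return;                    /* already copied elsewhere: skip */
    }
    copy_bitmap[cluster] = false;  /* claim first ... */
    copy_count[cluster]++;         /* ... then copy to target */
}
```

With this invariant, a guest write serviced by the filter and the job's own
loop can both call do_cow() on any cluster in any order, and each cluster
still reaches the target exactly once.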

== RFC part ==

iotests changed:

56: The op-blocker doesn't fire now, as we set it on the source but then
check it on the filter when trying to start a second backup... Should I
work around this somehow?

129: Hmm, now the job is not busy at this moment... But it's illegal to
check busy anyway, as the job has pause points and sets busy to false at
those points. Why do we assert it in this test?

141: Obvious: drv0 is no longer a root node, but the backing of the
filter, when we try to remove it.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/backup.c             | 171 ++++++++++++++-----------------------
 tests/qemu-iotests/056     |   2 +-
 tests/qemu-iotests/129     |   1 -
 tests/qemu-iotests/141.out |   2 +-
 4 files changed, 68 insertions(+), 108 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 00f4f8af53..a5b8e04c9c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -2,6 +2,7 @@
  * QEMU backup
  *
  * Copyright (C) 2013 Proxmox Server Solutions
+ * Copyright (c) 2019 Virtuozzo International GmbH. All rights reserved.
  *
  * Authors:
  *  Dietmar Maurer (dietmar@proxmox.com)
@@ -26,14 +27,9 @@
 #include "qemu/bitmap.h"
 #include "qemu/error-report.h"
 
-#define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)
+#include "block/backup-top.h"
 
-typedef struct CowRequest {
-    int64_t start_byte;
-    int64_t end_byte;
-    QLIST_ENTRY(CowRequest) list;
-    CoQueue wait_queue; /* coroutines blocked on this request */
-} CowRequest;
+#define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)
 
 typedef struct BackupBlockJob {
     BlockJob common;
@@ -43,13 +39,10 @@ typedef struct BackupBlockJob {
     MirrorSyncMode sync_mode;
     BlockdevOnError on_source_error;
     BlockdevOnError on_target_error;
-    CoRwlock flush_rwlock;
     uint64_t len;
     uint64_t bytes_read;
     int64_t cluster_size;
     bool compress;
-    NotifierWithReturn before_write;
-    QLIST_HEAD(, CowRequest) inflight_reqs;
 
     HBitmap *copy_bitmap;
     bool use_copy_range;
@@ -60,56 +53,17 @@ typedef struct BackupBlockJob {
 
 static const BlockJobDriver backup_job_driver;
 
-/* See if in-flight requests overlap and wait for them to complete */
-static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
-                                                       int64_t start,
-                                                       int64_t end)
-{
-    CowRequest *req;
-    bool retry;
-
-    do {
-        retry = false;
-        QLIST_FOREACH(req, &job->inflight_reqs, list) {
-            if (end > req->start_byte && start < req->end_byte) {
-                qemu_co_queue_wait(&req->wait_queue, NULL);
-                retry = true;
-                break;
-            }
-        }
-    } while (retry);
-}
-
-/* Keep track of an in-flight request */
-static void cow_request_begin(CowRequest *req, BackupBlockJob *job,
-                              int64_t start, int64_t end)
-{
-    req->start_byte = start;
-    req->end_byte = end;
-    qemu_co_queue_init(&req->wait_queue);
-    QLIST_INSERT_HEAD(&job->inflight_reqs, req, list);
-}
-
-/* Forget about a completed request */
-static void cow_request_end(CowRequest *req)
-{
-    QLIST_REMOVE(req, list);
-    qemu_co_queue_restart_all(&req->wait_queue);
-}
-
 /* Copy range to target with a bounce buffer and return the bytes copied. If
  * error occurred, return a negative error number */
 static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
                                                       int64_t start,
                                                       int64_t end,
-                                                      bool is_write_notifier,
                                                       bool *error_is_read,
                                                       void **bounce_buffer)
 {
     int ret;
     BlockBackend *blk = job->common.blk;
     int nbytes;
-    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
     int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
 
     assert(QEMU_IS_ALIGNED(start, job->cluster_size));
@@ -119,7 +73,7 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
         *bounce_buffer = blk_blockalign(blk, job->cluster_size);
     }
 
-    ret = blk_co_pread(blk, start, nbytes, *bounce_buffer, read_flags);
+    ret = blk_co_pread(blk, start, nbytes, *bounce_buffer, 0);
     if (ret < 0) {
         trace_backup_do_cow_read_fail(job, start, ret);
         if (error_is_read) {
@@ -154,15 +108,12 @@ fail:
 /* Copy range to target and return the bytes copied. If error occurred, return a
  * negative error number. */
 static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
-                                                int64_t start,
-                                                int64_t end,
-                                                bool is_write_notifier)
+                                                int64_t start, int64_t end)
 {
     int ret;
     int nr_clusters;
     BlockBackend *blk = job->common.blk;
     int nbytes;
-    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
     int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
 
     assert(QEMU_IS_ALIGNED(job->copy_range_size, job->cluster_size));
@@ -171,7 +122,7 @@ static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
     nr_clusters = DIV_ROUND_UP(nbytes, job->cluster_size);
     hbitmap_reset(job->copy_bitmap, start, job->cluster_size * nr_clusters);
     ret = blk_co_copy_range(blk, start, job->target, start, nbytes,
-                            read_flags, write_flags);
+                            0, write_flags);
     if (ret < 0) {
         trace_backup_do_cow_copy_range_fail(job, start, ret);
         hbitmap_set(job->copy_bitmap, start, job->cluster_size * nr_clusters);
@@ -183,24 +134,17 @@ static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
 
 static int coroutine_fn backup_do_cow(BackupBlockJob *job,
                                       int64_t offset, uint64_t bytes,
-                                      bool *error_is_read,
-                                      bool is_write_notifier)
+                                      bool *error_is_read)
 {
-    CowRequest cow_request;
     int ret = 0;
     int64_t start, end; /* bytes */
     void *bounce_buffer = NULL;
 
-    qemu_co_rwlock_rdlock(&job->flush_rwlock);
-
     start = QEMU_ALIGN_DOWN(offset, job->cluster_size);
     end = QEMU_ALIGN_UP(bytes + offset, job->cluster_size);
 
     trace_backup_do_cow_enter(job, start, offset, bytes);
 
-    wait_for_overlapping_requests(job, start, end);
-    cow_request_begin(&cow_request, job, start, end);
-
     while (start < end) {
         if (!hbitmap_get(job->copy_bitmap, start)) {
             trace_backup_do_cow_skip(job, start);
@@ -211,13 +155,13 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
         trace_backup_do_cow_process(job, start);
 
         if (job->use_copy_range) {
-            ret = backup_cow_with_offload(job, start, end, is_write_notifier);
+            ret = backup_cow_with_offload(job, start, end);
             if (ret < 0) {
                 job->use_copy_range = false;
             }
         }
         if (!job->use_copy_range) {
-            ret = backup_cow_with_bounce_buffer(job, start, end, is_write_notifier,
+            ret = backup_cow_with_bounce_buffer(job, start, end,
                                                 error_is_read, &bounce_buffer);
         }
         if (ret < 0) {
@@ -237,29 +181,11 @@ static int coroutine_fn backup_do_cow(BackupBlockJob *job,
         qemu_vfree(bounce_buffer);
     }
 
-    cow_request_end(&cow_request);
-
     trace_backup_do_cow_return(job, offset, bytes, ret);
 
-    qemu_co_rwlock_unlock(&job->flush_rwlock);
-
     return ret;
 }
 
-static int coroutine_fn backup_before_write_notify(
-        NotifierWithReturn *notifier,
-        void *opaque)
-{
-    BackupBlockJob *job = container_of(notifier, BackupBlockJob, before_write);
-    BdrvTrackedRequest *req = opaque;
-
-    assert(req->bs == blk_bs(job->common.blk));
-    assert(QEMU_IS_ALIGNED(req->offset, BDRV_SECTOR_SIZE));
-    assert(QEMU_IS_ALIGNED(req->bytes, BDRV_SECTOR_SIZE));
-
-    return backup_do_cow(job, req->offset, req->bytes, NULL, true);
-}
-
 static void backup_cleanup_sync_bitmap(BackupBlockJob *job, int ret)
 {
     BdrvDirtyBitmap *bm;
@@ -295,14 +221,30 @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
 {
     BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
+    BlockJob *bjob = &s->common;
+    BlockDriverState *backup_top = blk_bs(s->common.blk);
+    BlockDriverState *src = backup_top->backing->bs;
+
     assert(s->target);
     blk_unref(s->target);
     s->target = NULL;
 
+    bdrv_ref(backup_top);
+    bdrv_ref(src);
+
     if (s->copy_bitmap) {
         hbitmap_free(s->copy_bitmap);
         s->copy_bitmap = NULL;
     }
+
+    bdrv_backup_top_drop(backup_top);
+
+    blk_remove_bs(bjob->blk);
+    blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
+    blk_insert_bs(bjob->blk, backup_top, &error_abort);
+
+    bdrv_unref(src);
+    bdrv_unref(backup_top);
 }
 
 void backup_do_checkpoint(BlockJob *job, Error **errp)
@@ -391,28 +333,41 @@ static int coroutine_fn backup_loop(BackupBlockJob *job)
     int64_t offset;
     HBitmapIter hbi;
     BlockDriverState *bs = blk_bs(job->common.blk);
+    void *lock;
 
     hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
     while ((offset = hbitmap_iter_next(&hbi)) != -1) {
+        lock = bdrv_co_try_lock(backing_bs(blk_bs(job->common.blk)), offset,
+                                job->cluster_size);
+        /*
+         * Dirty bit is set, which means that there are no in-flight
+         * write requests on this area. We must succeed.
+         */
+        assert(lock);
+
         if (job->sync_mode == MIRROR_SYNC_MODE_TOP &&
             bdrv_is_unallocated_range(bs, offset, job->cluster_size))
         {
             hbitmap_reset(job->copy_bitmap, offset, job->cluster_size);
+            bdrv_co_unlock(lock);
             continue;
         }
 
         do {
             if (yield_and_check(job)) {
+                bdrv_co_unlock(lock);
                 return 0;
             }
-            ret = backup_do_cow(job, offset,
-                                job->cluster_size, &error_is_read, false);
+            ret = backup_do_cow(job, offset, job->cluster_size, &error_is_read);
             if (ret < 0 && backup_error_action(job, error_is_read, -ret) ==
                            BLOCK_ERROR_ACTION_REPORT)
             {
+                bdrv_co_unlock(lock);
                 return ret;
             }
         } while (ret < 0);
+
+        bdrv_co_unlock(lock);
     }
 
     return 0;
@@ -444,12 +399,8 @@ static void backup_incremental_init_copy_bitmap(BackupBlockJob *job)
 static int coroutine_fn backup_run(Job *job, Error **errp)
 {
     BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-    BlockDriverState *bs = blk_bs(s->common.blk);
     int ret = 0;
 
-    QLIST_INIT(&s->inflight_reqs);
-    qemu_co_rwlock_init(&s->flush_rwlock);
-
     job_progress_set_remaining(job, s->len);
 
     if (s->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
@@ -458,27 +409,20 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
         hbitmap_set(s->copy_bitmap, 0, s->len);
     }
 
-    s->before_write.notify = backup_before_write_notify;
-    bdrv_add_before_write_notifier(bs, &s->before_write);
-
     if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
         /* All bits are set in copy_bitmap to allow any cluster to be copied.
          * This does not actually require them to be copied. */
         while (!job_is_cancelled(job)) {
-            /* Yield until the job is cancelled.  We just let our before_write
-             * notify callback service CoW requests. */
+            /*
+             * Yield until the job is cancelled.  We just let our backup-top
+             * filter driver service CbW requests.
+             */
             job_yield(job);
         }
     } else {
         ret = backup_loop(s);
     }
 
-    notifier_with_return_remove(&s->before_write);
-
-    /* wait until pending backup_do_cow() calls have completed */
-    qemu_co_rwlock_wrlock(&s->flush_rwlock);
-    qemu_co_rwlock_unlock(&s->flush_rwlock);
-
     return ret;
 }
 
@@ -533,6 +477,11 @@ static int64_t backup_calculate_cluster_size(BlockDriverState *target,
     return MAX(BACKUP_CLUSTER_SIZE_DEFAULT, bdi.cluster_size);
 }
 
+static void backup_top_progress(uint64_t done, void *opaque)
+{
+    job_progress_update((Job *)opaque, done);
+}
+
 BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
                   BlockDriverState *target, int64_t speed,
                   MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
@@ -548,6 +497,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     int ret;
     int64_t cluster_size;
     HBitmap *copy_bitmap = NULL;
+    BlockDriverState *backup_top = NULL;
 
     assert(bs);
     assert(target);
@@ -616,8 +566,13 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
 
     copy_bitmap = hbitmap_alloc(len, ctz32(cluster_size));
 
+    backup_top = bdrv_backup_top_append(bs, target, copy_bitmap, errp);
+    if (!backup_top) {
+        goto error;
+    }
+
     /* job->len is fixed, so we can't allow resize */
-    job = block_job_create(job_id, &backup_job_driver, txn, bs,
+    job = block_job_create(job_id, &backup_job_driver, txn, backup_top,
                            BLK_PERM_CONSISTENT_READ,
                            BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
                            BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD,
@@ -626,6 +581,9 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
         goto error;
     }
 
+    bdrv_backup_top_set_progress_callback(backup_top, backup_top_progress,
+                                          &job->common.job);
+
     /* The target must match the source in size, so no resize here either */
     job->target = blk_new(BLK_PERM_WRITE,
                           BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
@@ -662,10 +620,6 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     return &job->common;
 
  error:
-    if (copy_bitmap) {
-        assert(!job || !job->copy_bitmap);
-        hbitmap_free(copy_bitmap);
-    }
     if (sync_bitmap) {
         bdrv_reclaim_dirty_bitmap(bs, sync_bitmap, NULL);
     }
@@ -673,6 +627,13 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
         backup_clean(&job->common.job);
         job_early_fail(&job->common.job);
     }
+    if (backup_top) {
+        bdrv_backup_top_drop(backup_top);
+    }
+    if (copy_bitmap) {
+        assert(!job || !job->copy_bitmap);
+        hbitmap_free(copy_bitmap);
+    }
 
     return NULL;
 }
diff --git a/tests/qemu-iotests/056 b/tests/qemu-iotests/056
index f40fc11a09..d3e6fe4b11 100755
--- a/tests/qemu-iotests/056
+++ b/tests/qemu-iotests/056
@@ -214,7 +214,7 @@ class BackupTest(iotests.QMPTestCase):
         res = self.vm.qmp('query-block-jobs')
         self.assert_qmp(res, 'return[0]/status', 'concluded')
         # Leave zombie job un-dismissed, observe a failure:
-        res = self.qmp_backup_and_wait(serror="Node 'drive0' is busy: block device is in use by block job: backup",
+        res = self.qmp_backup_and_wait(serror='Failed to get "write" lock',
                                        device='drive0', format=iotests.imgfmt,
                                        sync='full', target=self.dest_img,
                                        auto_dismiss=False)
diff --git a/tests/qemu-iotests/129 b/tests/qemu-iotests/129
index 9e87e1c8d9..d719492deb 100755
--- a/tests/qemu-iotests/129
+++ b/tests/qemu-iotests/129
@@ -66,7 +66,6 @@ class TestStopWithBlockJob(iotests.QMPTestCase):
         result = self.vm.qmp("stop")
         self.assert_qmp(result, 'return', {})
         result = self.vm.qmp("query-block-jobs")
-        self.assert_qmp(result, 'return[0]/busy', True)
         self.assert_qmp(result, 'return[0]/ready', False)
 
     def test_drive_mirror(self):
diff --git a/tests/qemu-iotests/141.out b/tests/qemu-iotests/141.out
index 41c7291258..a2645bea1a 100644
--- a/tests/qemu-iotests/141.out
+++ b/tests/qemu-iotests/141.out
@@ -11,7 +11,7 @@ Formatting 'TEST_DIR/o.IMGFMT', fmt=IMGFMT size=1048576 backing_file=TEST_DIR/t.
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "job0"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "job0"}}
 {"return": {}}
-{"error": {"class": "GenericError", "desc": "Node drv0 is in use"}}
+{"error": {"class": "GenericError", "desc": "Node 'drv0' is busy: node is used as backing hd of 'NODE_NAME'"}}
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "job0"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "job0", "len": 1048576, "offset": 0, "speed": 0, "type": "backup"}}
-- 
2.18.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup
  2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
                   ` (6 preceding siblings ...)
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 7/7] block/backup: use backup-top instead of write notifiers Vladimir Sementsov-Ogievskiy
@ 2019-05-30 13:25 ` Vladimir Sementsov-Ogievskiy
  2019-06-13 16:08 ` no-reply
  2019-06-13 16:41 ` no-reply
  9 siblings, 0 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-05-30 13:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: fam, kwolf, Denis Lunev, mreitz, stefanha, jsnow

29.05.2019 18:46, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> These series introduce backup-top driver. It's a filter-node, which
> do copy-before-write operation. Mirror uses filter-node for handling
> guest writes, let's move to filter-node (from write-notifiers) for
> backup too
> 
> v8:
> 01-03: new
> 05: add Max's r-b
> others changed, change description in each patch mail in Notes section.
> 
> v6-v7 was preparing refactoring, which now is in Max's pull request, and
> these series based on it:
> Based-on: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg06526.html

It has merged, so the series is now based on master.

> 
> Vladimir Sementsov-Ogievskiy (7):
>    block: teach bdrv_debug_breakpoint skip filters with backing
>    block: swap operation order in bdrv_append
>    block: allow not one child for implicit node
>    block: introduce backup-top filter driver
>    block/io: refactor wait_serialising_requests
>    block: add lock/unlock range functions
>    block/backup: use backup-top instead of write notifiers
> 
>   block/backup-top.h         |  64 ++++++++
>   include/block/block_int.h  |   4 +
>   block.c                    |  60 +++++--
>   block/backup-top.c         | 322 +++++++++++++++++++++++++++++++++++++
>   block/backup.c             | 171 ++++++++------------
>   block/io.c                 |  68 ++++++--
>   block/Makefile.objs        |   2 +
>   tests/qemu-iotests/056     |   2 +-
>   tests/qemu-iotests/085.out |   2 +-
>   tests/qemu-iotests/129     |   1 -
>   tests/qemu-iotests/141.out |   2 +-
>   11 files changed, 567 insertions(+), 131 deletions(-)
>   create mode 100644 block/backup-top.h
>   create mode 100644 block/backup-top.c
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v8 1/7] block: teach bdrv_debug_breakpoint skip filters with backing
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 1/7] block: teach bdrv_debug_breakpoint skip filters with backing Vladimir Sementsov-Ogievskiy
@ 2019-06-13 13:43   ` Max Reitz
  0 siblings, 0 replies; 32+ messages in thread
From: Max Reitz @ 2019-06-13 13:43 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block
  Cc: fam, kwolf, stefanha, den, jsnow


[-- Attachment #1.1: Type: text/plain, Size: 652 bytes --]

On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
> Teach bdrv_debug_breakpoint and bdrv_debug_remove_breakpoint skip
> filters with backing. This is needed to implement and use in backup job
> it's own backup_top filter driver (like mirror already has one), and
> without this improvement, breakpoint removal will fail at least in 55
> iotest.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block.c | 34 ++++++++++++++++++++++++++--------
>  1 file changed, 26 insertions(+), 8 deletions(-)

Well, it would work in the meantime, but the real fix of course is to
use bdrv_primary_bs().

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v8 2/7] block: swap operation order in bdrv_append
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 2/7] block: swap operation order in bdrv_append Vladimir Sementsov-Ogievskiy
@ 2019-06-13 13:45   ` Max Reitz
  2019-06-13 14:02     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2019-06-13 13:45 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block
  Cc: fam, kwolf, stefanha, den, jsnow


[-- Attachment #1.1: Type: text/plain, Size: 2910 bytes --]

On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
> bs_top parents may conflict with bs_new backing child permissions, so
> let's do bdrv_replace_node first, it covers more possible cases.
> 
> It is needed for further implementation of backup-top filter, which
> don't want to share write permission on its backing child.
> 
> Side effect is that we may set backing hd when device name is already
> available, so 085 iotest output is changed.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block.c                    | 11 ++++++++---
>  tests/qemu-iotests/085.out |  2 +-
>  2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/block.c b/block.c
> index e6e9770704..57216f4115 100644
> --- a/block.c
> +++ b/block.c
> @@ -4088,22 +4088,27 @@ void bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
>  {
>      Error *local_err = NULL;
>  
> -    bdrv_set_backing_hd(bs_new, bs_top, &local_err);
> +    bdrv_ref(bs_top);
> +
> +    bdrv_replace_node(bs_top, bs_new, &local_err);
>      if (local_err) {
>          error_propagate(errp, local_err);
> +        error_prepend(errp, "Failed to replace node: ");
>          goto out;
>      }
>  
> -    bdrv_replace_node(bs_top, bs_new, &local_err);
> +    bdrv_set_backing_hd(bs_new, bs_top, &local_err);
>      if (local_err) {
> +        bdrv_replace_node(bs_new, bs_top, &error_abort);

Hm.  I see the need for switching this stuff around, but this
&error_abort is much more difficult to verify than the previous one for
bdrv_set_backing_hd(..., NULL, ...).  I think it at least warrants a
comment why the order has to be this way (using git blame has the
disadvantage of fading over time as other people modify a piece of
code), and why this &error_abort is safe.

Max

>          error_propagate(errp, local_err);
> -        bdrv_set_backing_hd(bs_new, NULL, &error_abort);
> +        error_prepend(errp, "Failed to set backing: ");
>          goto out;
>      }
>  
>      /* bs_new is now referenced by its new parents, we don't need the
>       * additional reference any more. */
>  out:
> +    bdrv_unref(bs_top);
>      bdrv_unref(bs_new);
>  }
>  
> diff --git a/tests/qemu-iotests/085.out b/tests/qemu-iotests/085.out
> index 6edf107f55..e5a2645bf5 100644
> --- a/tests/qemu-iotests/085.out
> +++ b/tests/qemu-iotests/085.out
> @@ -74,7 +74,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/
>  
>  === Invalid command - snapshot node used as backing hd ===
>  
> -{"error": {"class": "GenericError", "desc": "Node 'snap_11' is busy: node is used as backing hd of 'snap_12'"}}
> +{"error": {"class": "GenericError", "desc": "Node 'snap_11' is busy: node is used as backing hd of 'virtio0'"}}
>  
>  === Invalid command - snapshot node has a backing image ===
>  
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v8 3/7] block: allow not one child for implicit node
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 3/7] block: allow not one child for implicit node Vladimir Sementsov-Ogievskiy
@ 2019-06-13 13:51   ` Max Reitz
  0 siblings, 0 replies; 32+ messages in thread
From: Max Reitz @ 2019-06-13 13:51 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block
  Cc: fam, kwolf, stefanha, den, jsnow


[-- Attachment #1.1: Type: text/plain, Size: 1756 bytes --]

On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
> Upcoming backup-top filter wants to operate like usual implicit filter
> node with fall-through to backing child. But also needs additional
> target child, let's support that.
> 
> On the other hand, after backup completion (before job dismiss) filter
> is still attached to job blk, but don't have any children. Support this
> too.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 57216f4115..3f4de3ae32 100644
> --- a/block.c
> +++ b/block.c
> @@ -6200,9 +6200,20 @@ void bdrv_refresh_filename(BlockDriverState *bs)
>      }
>  
>      if (bs->implicit) {
> -        /* For implicit nodes, just copy everything from the single child */
> +        /*
> +         * For implicit nodes, just copy everything from the single child or
> +         * from backing, if there are several children.
> +         * If there are no children for some reason (filter is still attached
> +         * to block-job blk, but already removed from backing chain of device)
> +         * do nothing.
> +         */
>          child = QLIST_FIRST(&bs->children);
> -        assert(QLIST_NEXT(child, next) == NULL);
> +        if (!child) {
> +            return;
> +        } else if (QLIST_NEXT(child, next)) {
> +            assert(bs->backing);
> +            child = bs->backing;
> +        }
>  
>          pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
>                  child->bs->exact_filename);

Reviewed-by: Max Reitz <mreitz@redhat.com>

(To be changed to bdrv_filtered_rw_bs(), of course)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v8 2/7] block: swap operation order in bdrv_append
  2019-06-13 13:45   ` Max Reitz
@ 2019-06-13 14:02     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-06-13 14:02 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block
  Cc: fam, kwolf, Denis Lunev, stefanha, jsnow

13.06.2019 16:45, Max Reitz wrote:
> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>> bs_top parents may conflict with bs_new backing child permissions, so
>> let's do bdrv_replace_node first, it covers more possible cases.
>>
>> It is needed for further implementation of backup-top filter, which
>> don't want to share write permission on its backing child.
>>
>> Side effect is that we may set backing hd when device name is already
>> available, so 085 iotest output is changed.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block.c                    | 11 ++++++++---
>>   tests/qemu-iotests/085.out |  2 +-
>>   2 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index e6e9770704..57216f4115 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -4088,22 +4088,27 @@ void bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
>>   {
>>       Error *local_err = NULL;
>>   
>> -    bdrv_set_backing_hd(bs_new, bs_top, &local_err);
>> +    bdrv_ref(bs_top);
>> +
>> +    bdrv_replace_node(bs_top, bs_new, &local_err);
>>       if (local_err) {
>>           error_propagate(errp, local_err);
>> +        error_prepend(errp, "Failed to replace node: ");
>>           goto out;
>>       }
>>   
>> -    bdrv_replace_node(bs_top, bs_new, &local_err);
>> +    bdrv_set_backing_hd(bs_new, bs_top, &local_err);
>>       if (local_err) {
>> +        bdrv_replace_node(bs_new, bs_top, &error_abort);
> 
> Hm.  I see the need for switching this stuff around, but this
> &error_abort is much more difficult to verify than the previous one for
> bdrv_set_backing_hd(..., NULL, ...).  I think it at least warrants a
> comment why the order has to be this way (using git blame has the
> disadvantage of fading over time as other people modify a piece of

So I always use git log -p -- <filepath> instead and search through it :)

> code), and why this &error_abort is safe.

ok, will add

> 
> Max
> 
>>           error_propagate(errp, local_err);
>> -        bdrv_set_backing_hd(bs_new, NULL, &error_abort);
>> +        error_prepend(errp, "Failed to set backing: ");
>>           goto out;
>>       }
>>   
>>       /* bs_new is now referenced by its new parents, we don't need the
>>        * additional reference any more. */
>>   out:
>> +    bdrv_unref(bs_top);
>>       bdrv_unref(bs_new);
>>   }
>>   
>> diff --git a/tests/qemu-iotests/085.out b/tests/qemu-iotests/085.out
>> index 6edf107f55..e5a2645bf5 100644
>> --- a/tests/qemu-iotests/085.out
>> +++ b/tests/qemu-iotests/085.out
>> @@ -74,7 +74,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/
>>   
>>   === Invalid command - snapshot node used as backing hd ===
>>   
>> -{"error": {"class": "GenericError", "desc": "Node 'snap_11' is busy: node is used as backing hd of 'snap_12'"}}
>> +{"error": {"class": "GenericError", "desc": "Node 'snap_11' is busy: node is used as backing hd of 'virtio0'"}}
>>   
>>   === Invalid command - snapshot node has a backing image ===
>>   
>>
> 
> 


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver Vladimir Sementsov-Ogievskiy
@ 2019-06-13 15:57   ` Max Reitz
  2019-06-14  9:04     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2019-06-13 15:57 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block
  Cc: fam, kwolf, stefanha, den, jsnow



On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
> The backup-top filter performs the copy-before-write operation. It should
> be inserted above the active disk and has a target node for CBW, like the
> following:
> 
>     +-------+
>     | Guest |
>     +-------+
>         |r,w
>         v
>     +------------+  target   +---------------+
>     | backup_top |---------->| target(qcow2) |
>     +------------+   CBW     +---------------+
>         |
> backing |r,w
>         v
>     +-------------+
>     | Active disk |
>     +-------------+
> 
> The driver will be used in backup instead of write-notifiers.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup-top.h  |  64 +++++++++
>  block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
>  block/Makefile.objs |   2 +
>  3 files changed, 388 insertions(+)
>  create mode 100644 block/backup-top.h
>  create mode 100644 block/backup-top.c
> 
> diff --git a/block/backup-top.h b/block/backup-top.h
> new file mode 100644
> index 0000000000..788e18c358
> --- /dev/null
> +++ b/block/backup-top.h
> @@ -0,0 +1,64 @@
> +/*
> + * backup-top filter driver
> + *
> + * The driver performs Copy-Before-Write (CBW) operation: it is injected above
> + * some node, and before each write it copies _old_ data to the target node.
> + *
> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef BACKUP_TOP_H
> +#define BACKUP_TOP_H
> +
> +#include "qemu/osdep.h"
> +
> +#include "block/block_int.h"
> +
> +typedef void (*BackupTopProgressCallback)(uint64_t done, void *opaque);
> +typedef struct BDRVBackupTopState {
> +    HBitmap *copy_bitmap; /* what should be copied to @target on guest write. */
> +    BdrvChild *target;
> +
> +    BackupTopProgressCallback progress_cb;
> +    void *progress_opaque;
> +} BDRVBackupTopState;
> +
> +/*
> + * bdrv_backup_top_append
> + *
> + * Append the backup_top filter node above the @source node. The @target node
> + * will receive the data backed up during CBW operations. The new filter,
> + * together with @target, is attached to @source's aio context.
> + *
> + * The resulting filter node is implicit.

Why?  It’s just as easy for the caller to just make it implicit if it
should be.  (And usually the caller should decide that.)

> + *
> + * @copy_bitmap selects regions which need CBW. Furthermore, backup_top will
> + * use exactly this bitmap, so it may be used to control backup_top behavior
> + * dynamically. The caller must not release @copy_bitmap during the lifetime
> + * of backup_top. Progress is tracked by calling the @progress_cb function.
> + */
> +BlockDriverState *bdrv_backup_top_append(
> +        BlockDriverState *source, BlockDriverState *target,
> +        HBitmap *copy_bitmap, Error **errp);
> +void bdrv_backup_top_set_progress_callback(
> +        BlockDriverState *bs, BackupTopProgressCallback progress_cb,
> +        void *progress_opaque);
> +void bdrv_backup_top_drop(BlockDriverState *bs);
> +
> +#endif /* BACKUP_TOP_H */
> diff --git a/block/backup-top.c b/block/backup-top.c
> new file mode 100644
> index 0000000000..1daa02f539
> --- /dev/null
> +++ b/block/backup-top.c
> @@ -0,0 +1,322 @@
> +/*
> + * backup-top filter driver
> + *
> + * The driver performs Copy-Before-Write (CBW) operation: it is injected above
> + * some node, and before each write it copies _old_ data to the target node.
> + *
> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qemu/cutils.h"
> +#include "qapi/error.h"
> +#include "block/block_int.h"
> +#include "block/qdict.h"
> +
> +#include "block/backup-top.h"
> +
> +static coroutine_fn int backup_top_co_preadv(
> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
> +        QEMUIOVector *qiov, int flags)
> +{
> +    /*
> +     * Features to be implemented:
> +     * F1. COR. save read data to fleecing target for fast access
> +     *     (to reduce reads). This possibly may be done with use of copy-on-read
> +     *     filter, but we need an ability to make COR requests optional: for
> +     *     example, if target is a ram-cache, and if it is full now, we should
> +     *     skip doing COR request, as it is actually not necessary.
> +     *
> +     * F2. Feature for guest: read from fleecing target if data is in ram-cache
> +     *     and is unchanged
> +     */
> +
> +    return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
> +}
> +
> +static coroutine_fn int backup_top_cbw(BlockDriverState *bs, uint64_t offset,
> +                                       uint64_t bytes)
> +{
> +    int ret = 0;
> +    BDRVBackupTopState *s = bs->opaque;
> +    uint64_t gran = 1UL << hbitmap_granularity(s->copy_bitmap);
> +    uint64_t end = QEMU_ALIGN_UP(offset + bytes, gran);
> +    uint64_t off = QEMU_ALIGN_DOWN(offset, gran), len;

The “, len” is a bit weirdly placed there; why not define len on a
separate line as "uint64_t len = end - off"?

> +    void *buf = qemu_blockalign(bs, end - off);
> +
> +    /*
> +     * Features to be implemented:
> +     * F3. parallelize copying loop
> +     * F4. detect zeroes ? or, otherwise, drop detect zeroes from backup code
> +     *     and just enable zeroes detecting on target
> +     * F5. use block_status ?
> +     * F6. don't copy clusters which are already cached by COR [see F1]
> +     * F7. if target is ram-cache and it is full, there should be a possibility
> +     *     to drop not necessary data (cached by COR [see F1]) to handle CBW
> +     *     fast.

Also “drop BDRV_REQ_SERIALISING from writes to s->target unless necessary”?

> +     */
> +
> +    len = end - off;
> +    while (hbitmap_next_dirty_area(s->copy_bitmap, &off, &len)) {
> +        hbitmap_reset(s->copy_bitmap, off, len);
> +
> +        ret = bdrv_co_pread(bs->backing, off, len, buf,
> +                            BDRV_REQ_NO_SERIALISING);
> +        if (ret < 0) {
> +            hbitmap_set(s->copy_bitmap, off, len);
> +            goto out;
> +        }
> +
> +        ret = bdrv_co_pwrite(s->target, off, len, buf, BDRV_REQ_SERIALISING);
> +        if (ret < 0) {
> +            hbitmap_set(s->copy_bitmap, off, len);
> +            goto out;
> +        }
> +
> +        if (s->progress_cb) {
> +            s->progress_cb(len, s->progress_opaque);
> +        }
> +        off += len;
> +        if (off >= end) {
> +            break;
> +        }
> +        len = end - off;
> +    }
> +
> +out:
> +    qemu_vfree(buf);
> +
> +    /*
> +     * F8. we fail guest request in case of error. We can alter it by
> +     * possibility to fail copying process instead, or retry several times, or
> +     * may be guest pause, etc.
> +     */
> +    return ret;
> +}
> +
> +static int coroutine_fn backup_top_co_pdiscard(BlockDriverState *bs,
> +                                               int64_t offset, int bytes)
> +{
> +    int ret = backup_top_cbw(bs, offset, bytes);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    /*
> +     * Features to be implemented:
> +     * F9. possibility of lazy discard: just defer the discard after fleecing
> +     *     completion. If write (or new discard) occurs to the same area, just
> +     *     drop deferred discard.
> +     */
> +
> +    return bdrv_co_pdiscard(bs->backing, offset, bytes);
> +}
> +
> +static int coroutine_fn backup_top_co_pwrite_zeroes(BlockDriverState *bs,
> +        int64_t offset, int bytes, BdrvRequestFlags flags)
> +{
> +    int ret = backup_top_cbw(bs, offset, bytes);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return bdrv_co_pwrite_zeroes(bs->backing, offset, bytes, flags);
> +}
> +
> +static coroutine_fn int backup_top_co_pwritev(BlockDriverState *bs,
> +                                              uint64_t offset,
> +                                              uint64_t bytes,
> +                                              QEMUIOVector *qiov, int flags)
> +{
> +    if (!(flags & BDRV_REQ_WRITE_UNCHANGED)) {
> +        int ret = backup_top_cbw(bs, offset, bytes);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }
> +
> +    return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
> +}
> +
> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
> +{
> +    if (!bs->backing) {
> +        return 0;
> +    }
> +
> +    return bdrv_co_flush(bs->backing->bs);

Should we flush the target, too?

> +}
> +
> +static void backup_top_refresh_filename(BlockDriverState *bs)
> +{
> +    if (bs->backing == NULL) {
> +        /*
> +         * we can be here after failed bdrv_attach_child in
> +         * bdrv_set_backing_hd
> +         */
> +        return;
> +    }
> +    bdrv_refresh_filename(bs->backing->bs);

bdrv_refresh_filename() should have done this already.

> +    pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
> +            bs->backing->bs->filename);
> +}
> +
> +static void backup_top_child_perm(BlockDriverState *bs, BdrvChild *c,
> +                                  const BdrvChildRole *role,
> +                                  BlockReopenQueue *reopen_queue,
> +                                  uint64_t perm, uint64_t shared,
> +                                  uint64_t *nperm, uint64_t *nshared)
> +{
> +    /*
> +     * We have HBitmap in the state, its size is fixed, so we never allow
> +     * resize.
> +     */
> +    uint64_t rw = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
> +                  BLK_PERM_WRITE;
> +
> +    bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared,
> +                              nperm, nshared);
> +
> +    *nperm = *nperm & rw;
> +    *nshared = *nshared & rw;

Somehow, that bdrv_filter_default_perms() call doesn’t make this function
easier for me.

It seems like this is basically just “*nperm = perm & rw” and
“*nshared = shared & rw”.

> +
> +    if (role == &child_file) {
> +        /*
> +         * Target child
> +         *
> +         * Share write to target (child_file), to not interfere
> +         * with guest writes to its disk which may be in target backing chain.
> +         */
> +        if (perm & BLK_PERM_WRITE) {
> +            *nshared = *nshared | BLK_PERM_WRITE;

Why not always share WRITE on the target?

> +        }
> +    } else {
> +        /* Source child */
> +        if (perm & BLK_PERM_WRITE) {

Or WRITE_UNCHANGED, no?

> +            *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
> +        }
> +        *nshared =
> +            *nshared & (BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED);

And here it isn’t “*nshared = shared & rw” but “rw & ~WRITE”.

I feel like this function would be simpler if you just set *nperm and
*nshared separately in the two branches of this condition, without
setting a “default” first.

> +    }
> +}
> +
> +BlockDriver bdrv_backup_top_filter = {
> +    .format_name = "backup-top",
> +    .instance_size = sizeof(BDRVBackupTopState),
> +
> +    .bdrv_co_preadv             = backup_top_co_preadv,
> +    .bdrv_co_pwritev            = backup_top_co_pwritev,
> +    .bdrv_co_pwrite_zeroes      = backup_top_co_pwrite_zeroes,
> +    .bdrv_co_pdiscard           = backup_top_co_pdiscard,
> +    .bdrv_co_flush              = backup_top_co_flush,
> +
> +    .bdrv_co_block_status       = bdrv_co_block_status_from_backing,
> +
> +    .bdrv_refresh_filename      = backup_top_refresh_filename,
> +
> +    .bdrv_child_perm            = backup_top_child_perm,
> +
> +    .is_filter = true,
> +};
> +
> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
> +                                         BlockDriverState *target,
> +                                         HBitmap *copy_bitmap,
> +                                         Error **errp)
> +{
> +    Error *local_err = NULL;
> +    BDRVBackupTopState *state;
> +    BlockDriverState *top = bdrv_new_open_driver(&bdrv_backup_top_filter,
> +                                                 NULL, BDRV_O_RDWR, errp);
> +
> +    if (!top) {
> +        return NULL;
> +    }
> +
> +    top->implicit = true;

As I said above, I think the caller should set this.

> +    top->total_sectors = source->total_sectors;
> +    top->bl.opt_mem_alignment = MAX(bdrv_opt_mem_align(source),
> +                                    bdrv_opt_mem_align(target));
> +    top->opaque = state = g_new0(BDRVBackupTopState, 1);
> +    state->copy_bitmap = copy_bitmap;
> +
> +    bdrv_ref(target);
> +    state->target = bdrv_attach_child(top, target, "target", &child_file, errp);
> +    if (!state->target) {
> +        bdrv_unref(target);
> +        bdrv_unref(top);
> +        return NULL;
> +    }
> +
> +    bdrv_set_aio_context(top, bdrv_get_aio_context(source));
> +    bdrv_set_aio_context(target, bdrv_get_aio_context(source));

I suppose these calls aren’t necessary anymore (compare d0ee0204f4009).
 (I’m not fully sure, though.  In any case, they would need to be
bdrv_try_set_aio_context() if they were still necessary.)

> +
> +    bdrv_drained_begin(source);
> +
> +    bdrv_ref(top);
> +    bdrv_append(top, source, &local_err);
> +    if (local_err) {
> +        error_prepend(&local_err, "Cannot append backup-top filter: ");
> +    }
> +
> +    bdrv_drained_end(source);
> +
> +    if (local_err) {
> +        bdrv_unref_child(top, state->target);
> +        bdrv_unref(top);
> +        error_propagate(errp, local_err);
> +        return NULL;
> +    }
> +
> +    return top;
> +}

I guess it would be nice if users could blockdev-add a backup-top
node to basically get a backup with sync=none.

(The bdrv_open implementation should then create a new bitmap and
initialize it to be fully set.)

But maybe it wouldn’t be so nice and I just have feverish dreams.

> +void bdrv_backup_top_set_progress_callback(
> +        BlockDriverState *bs, BackupTopProgressCallback progress_cb,
> +        void *progress_opaque)
> +{
> +    BDRVBackupTopState *s = bs->opaque;
> +
> +    s->progress_cb = progress_cb;
> +    s->progress_opaque = progress_opaque;
> +}
> +
> +void bdrv_backup_top_drop(BlockDriverState *bs)
> +{
> +    BDRVBackupTopState *s = bs->opaque;
> +    AioContext *aio_context = bdrv_get_aio_context(bs);
> +
> +    aio_context_acquire(aio_context);
> +
> +    bdrv_drained_begin(bs);
> +
> +    bdrv_child_try_set_perm(bs->backing, 0, BLK_PERM_ALL, &error_abort);

Pre-existing in other jobs, but I think calling this function is
dangerous.  (Which is why I remove it in my “block: Ignore loosening
perm restrictions failures” series.)

> +    bdrv_replace_node(bs, backing_bs(bs), &error_abort);
> +    bdrv_set_backing_hd(bs, NULL, &error_abort);

I think some of this function should be in a .bdrv_close()
implementation, for example this bdrv_set_backing_hd() call.

> +    bdrv_drained_end(bs);
> +
> +    if (s->target) {
> +        bdrv_unref_child(bs, s->target);

And this.  Well, neither needs to be in a .bdrv_close() implementation,
actually, because bdrv_close() will just do it by itself.

I suppose you’re trying to remove the children so the node is no longer
usable after this point; but that isn’t quite right, then.  The filter
functions still refer to s->target and bs->backing, so you need to stop
them from doing anything then.

At this point, you might as well add a variable to BDRVBackupTopState
that says whether the filter is still supposed to be usable (like the
“stop” variable I added to mirror in “block/mirror: Fix child
permissions”).  If so, the permission function should signal 0/ALL for
the permissions on backing (and probably target), and all filter
functions always immediately return an error.

So I don’t think backing and target need to be removed here.  We can
wait for that until bdrv_close(), but we should ensure that the filter
really is unusable after this function has been called.

Max

> +    }
> +    bdrv_unref(bs);
> +
> +    aio_context_release(aio_context);
> +}
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index ae11605c9f..dfbdfe6ab4 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -40,6 +40,8 @@ block-obj-y += throttle.o copy-on-read.o
>  
>  block-obj-y += crypto.o
>  
> +block-obj-y += backup-top.o
> +
>  common-obj-y += stream.o
>  
>  nfs.o-libs         := $(LIBNFS_LIBS)
> 




* Re: [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup
  2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
                   ` (7 preceding siblings ...)
  2019-05-30 13:25 ` [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
@ 2019-06-13 16:08 ` no-reply
  2019-06-13 16:41 ` no-reply
  9 siblings, 0 replies; 32+ messages in thread
From: no-reply @ 2019-06-13 16:08 UTC (permalink / raw)
  To: vsementsov
  Cc: fam, kwolf, vsementsov, qemu-block, qemu-devel, mreitz, stefanha,
	den, jsnow

Patchew URL: https://patchew.org/QEMU/20190529154654.95870-1-vsementsov@virtuozzo.com/



Hi,

This series failed the asan build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      crypto/cipher.o
  CC      crypto/tlscreds.o
  CC      crypto/tlscredsanon.o
/tmp/qemu-test/src/block/backup-top.c:268:5: error: implicit declaration of function 'bdrv_set_aio_context' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
    bdrv_set_aio_context(top, bdrv_get_aio_context(source));
    ^
/tmp/qemu-test/src/block/backup-top.c:268:5: note: did you mean 'bdrv_get_aio_context'?
/tmp/qemu-test/src/include/block/block.h:579:13: note: 'bdrv_get_aio_context' declared here
AioContext *bdrv_get_aio_context(BlockDriverState *bs);
            ^
/tmp/qemu-test/src/block/backup-top.c:268:5: error: this function declaration is not a prototype [-Werror,-Wstrict-prototypes]
    bdrv_set_aio_context(top, bdrv_get_aio_context(source));
    ^
2 errors generated.


The full log is available at
http://patchew.org/logs/20190529154654.95870-1-vsementsov@virtuozzo.com/testing.asan/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com


* Re: [Qemu-devel] [PATCH v8 6/7] block: add lock/unlock range functions
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 6/7] block: add lock/unlock range functions Vladimir Sementsov-Ogievskiy
@ 2019-06-13 16:31   ` Max Reitz
  0 siblings, 0 replies; 32+ messages in thread
From: Max Reitz @ 2019-06-13 16:31 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block
  Cc: fam, kwolf, stefanha, den, jsnow



On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
> From: Vladimir Sementsov-Ogievskiy <etendren@gmail.com>
> 
> Introduce lock/unlock range functionality, based on serialized
> requests. This is needed to refactor backup, dropping local
> tracked-request-like synchronization.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <etendren@gmail.com>
> ---
>  include/block/block_int.h |  4 ++++
>  block/io.c                | 44 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 47 insertions(+), 1 deletion(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>



* Re: [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup
  2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
                   ` (8 preceding siblings ...)
  2019-06-13 16:08 ` no-reply
@ 2019-06-13 16:41 ` no-reply
  9 siblings, 0 replies; 32+ messages in thread
From: no-reply @ 2019-06-13 16:41 UTC (permalink / raw)
  To: vsementsov
  Cc: fam, kwolf, vsementsov, qemu-block, qemu-devel, mreitz, stefanha,
	den, jsnow

Patchew URL: https://patchew.org/QEMU/20190529154654.95870-1-vsementsov@virtuozzo.com/



Hi,

This series failed build test on s390x host. Please find the details below.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
# Testing script will be invoked under the git checkout with
# HEAD pointing to a commit that has the patches applied on top of "base"
# branch
set -e
CC=$HOME/bin/cc
INSTALL=$PWD/install
BUILD=$PWD/build
mkdir -p $BUILD $INSTALL
SRC=$PWD
cd $BUILD
$SRC/configure --cc=$CC --prefix=$INSTALL
make -j4
# XXX: we need reliable clean up
# make check -j4 V=1
make install

echo
echo "=== ENV ==="
env

echo
echo "=== PACKAGES ==="
rpm -qa
=== TEST SCRIPT END ===

  CC      block/replication.o
  CC      block/throttle.o
/var/tmp/patchew-tester-tmp-eg3dw7w5/src/block/backup.c: In function ‘backup_run’:
/var/tmp/patchew-tester-tmp-eg3dw7w5/src/block/backup.c:283:8: error: ‘error_is_read’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  283 |     if (read) {
      |        ^
/var/tmp/patchew-tester-tmp-eg3dw7w5/src/block/backup.c:332:10: note: ‘error_is_read’ was declared here


The full log is available at
http://patchew.org/logs/20190529154654.95870-1-vsementsov@virtuozzo.com/testing.s390x/?type=message.


* Re: [Qemu-devel] [PATCH v8 7/7] block/backup: use backup-top instead of write notifiers
  2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 7/7] block/backup: use backup-top instead of write notifiers Vladimir Sementsov-Ogievskiy
@ 2019-06-13 18:02   ` Max Reitz
  2019-06-14  9:14     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2019-06-13 18:02 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block
  Cc: fam, kwolf, stefanha, den, jsnow



On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
> Drop write notifiers and use filter node instead. Changes:
> 
> 1. Copy-before-write is now handled by the filter node, so drop all
>    is_write_notifier arguments.
> 
> 2. We don't have intersecting requests anymore, so their handling is
> dropped. Instead, synchronization works as follows: when backup or
> backup-top starts copying some area, it first clears the copy-bitmap
> bits, and nobody touches areas not marked with dirty bits in the
> copy-bitmap, so there is no intersection. Also, backup job copy
> operations are surrounded by a bdrv region lock, which is actually a
> serializing request, so as not to interfere with guest writes and not
> to read changed data from the source (before reading we clear the
> corresponding bit in the copy-bitmap, so the area is no longer handled
> by backup-top).
> 
> 3. To sync with in-flight requests at job finish, we now remove the
> filter in a drained section, so we don't need an rw-lock.
> 
> == RFC part ==
> 
> iotests changed:
> 56: the op-blocker doesn't fire now, as we set it on the source but then
> check it on the filter when trying to start a second backup... Should I
> work around it somehow?

Hm.  Where does that error message even come from?  The fact that the
target image is in use already (due to file locks)?

It appears that way indeed.

It seems reasonable to me that you can now run a backup on top of
another backup.  Well, I mean, it is a stupid thing to do, but I don’t
see why the block layer would forbid doing so.

So the test seems superfluous to me.  If we want to keep it (why not),
it should test the opposite, namely that a backup to a different image
(with a different job ID) works.  (It seems simple enough to modify the
job that way, so why not.)

> 129: Hmm, now it is not busy at this moment... But it's illegal to check
> busy, as the job has pause points and sets busy to false at those
> points. Why do we assert it in this test?

Nobody knows, it’s probably wrong.  All I know is that 129 is just
broken anyway.

> 141: Obvious, as drv0 is not a root node now, but the backing of the
> filter, when we try to remove it.

I get a failed assertion in 256.  That is probably because the
bdrv_set_aio_context() calls weren’t as unnecessary as I deemed them to be.

> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup.c             | 171 ++++++++++++++-----------------------
>  tests/qemu-iotests/056     |   2 +-
>  tests/qemu-iotests/129     |   1 -
>  tests/qemu-iotests/141.out |   2 +-
>  4 files changed, 68 insertions(+), 108 deletions(-)

For some reason, my gcc starts to complain that backup_loop() may not
initialize error_is_read after this patch.  I don’t know why that is.
Perhaps it inlines backup_do_cow() now?  (So before it just saw that a
pointer to error_is_read was passed to backup_do_cow() and took it as an
opaque function, so it surely would set this value somewhere.  Now it
inlines it and it can’t find whether that will definitely happen, so it
complains.)

I don’t think it is strictly necessary to initialize error_is_read, but,
well, it won’t hurt.

> diff --git a/block/backup.c b/block/backup.c
> index 00f4f8af53..a5b8e04c9c 100644
> --- a/block/backup.c
> +++ b/block/backup.c

[...]

> @@ -60,56 +53,17 @@ typedef struct BackupBlockJob {
>  
>  static const BlockJobDriver backup_job_driver;
>  
> -/* See if in-flight requests overlap and wait for them to complete */
> -static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
> -                                                       int64_t start,
> -                                                       int64_t end)
> -{
> -    CowRequest *req;
> -    bool retry;
> -
> -    do {
> -        retry = false;
> -        QLIST_FOREACH(req, &job->inflight_reqs, list) {
> -            if (end > req->start_byte && start < req->end_byte) {
> -                qemu_co_queue_wait(&req->wait_queue, NULL);
> -                retry = true;
> -                break;
> -            }
> -        }
> -    } while (retry);
> -}
> -
> -/* Keep track of an in-flight request */
> -static void cow_request_begin(CowRequest *req, BackupBlockJob *job,
> -                              int64_t start, int64_t end)
> -{
> -    req->start_byte = start;
> -    req->end_byte = end;
> -    qemu_co_queue_init(&req->wait_queue);
> -    QLIST_INSERT_HEAD(&job->inflight_reqs, req, list);
> -}
> -
> -/* Forget about a completed request */
> -static void cow_request_end(CowRequest *req)
> -{
> -    QLIST_REMOVE(req, list);
> -    qemu_co_queue_restart_all(&req->wait_queue);
> -}
> -
>  /* Copy range to target with a bounce buffer and return the bytes copied. If
>   * error occurred, return a negative error number */
>  static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
>                                                        int64_t start,
>                                                        int64_t end,
> -                                                      bool is_write_notifier,
>                                                        bool *error_is_read,
>                                                        void **bounce_buffer)

Future feature: Somehow get this functionality done with backup-top, I
suppose.  (This is effectively just backup_top_cbw() with some bells and
whistles, isn’t it?)

>  {
>      int ret;
>      BlockBackend *blk = job->common.blk;
>      int nbytes;
> -    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
>      int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
>  
>      assert(QEMU_IS_ALIGNED(start, job->cluster_size));

[...]

> @@ -154,15 +108,12 @@ fail:
>  /* Copy range to target and return the bytes copied. If error occurred, return a
>   * negative error number. */
>  static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
> -                                                int64_t start,
> -                                                int64_t end,
> -                                                bool is_write_notifier)
> +                                                int64_t start, int64_t end)

And I suppose this is something backup-top maybe should support, too.

>  {
>      int ret;
>      int nr_clusters;
>      BlockBackend *blk = job->common.blk;
>      int nbytes;
> -    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
>      int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
>  
>      assert(QEMU_IS_ALIGNED(job->copy_range_size, job->cluster_size));

[...]

> @@ -391,28 +333,41 @@ static int coroutine_fn backup_loop(BackupBlockJob *job)
>      int64_t offset;
>      HBitmapIter hbi;
>      BlockDriverState *bs = blk_bs(job->common.blk);
> +    void *lock;
>  
>      hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>      while ((offset = hbitmap_iter_next(&hbi)) != -1) {
> +        lock = bdrv_co_try_lock(backing_bs(blk_bs(job->common.blk)), offset,
> +                                job->cluster_size);
> +        /*
> +         * Dirty bit is set, which means that there are no in-flight
> +         * write requests on this area. We must succeed.
> +         */
> +        assert(lock);
> +

Hm.  It makes me uneasy but I suppose you’re right.

>          if (job->sync_mode == MIRROR_SYNC_MODE_TOP &&
>              bdrv_is_unallocated_range(bs, offset, job->cluster_size))

This can yield, right?  If it does, the bitmap is still set.  backup-top
will see this, unset the bitmap and try to start its CBW operation.
That is halted by the lock just taken, but the progress will still be
published after completion, so the job can go beyond 100 %, I think.

Even if it doesn’t, copying the data twice is weird.  It may even get
weirder if one of the two requests fails.

Can we lock the backup-top node instead?  I don’t know whether locking
would always succeed there, though...

Max

>          {
>              hbitmap_reset(job->copy_bitmap, offset, job->cluster_size);
> +            bdrv_co_unlock(lock);
>              continue;
>          }
>  
>          do {
>              if (yield_and_check(job)) {
> +                bdrv_co_unlock(lock);
>                  return 0;
>              }
> -            ret = backup_do_cow(job, offset,
> -                                job->cluster_size, &error_is_read, false);
> +            ret = backup_do_cow(job, offset, job->cluster_size, &error_is_read);
>              if (ret < 0 && backup_error_action(job, error_is_read, -ret) ==
>                             BLOCK_ERROR_ACTION_REPORT)
>              {
> +                bdrv_co_unlock(lock);
>                  return ret;
>              }
>          } while (ret < 0);
> +
> +        bdrv_co_unlock(lock);
>      }
>  
>      return 0;



* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-13 15:57   ` Max Reitz
@ 2019-06-14  9:04     ` Vladimir Sementsov-Ogievskiy
  2019-06-14 12:57       ` Max Reitz
  0 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-06-14  9:04 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block
  Cc: fam, kwolf, Denis Lunev, stefanha, jsnow

13.06.2019 18:57, Max Reitz wrote:
> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>> Backup-top filter does copy-before-write operation. It should be
>> inserted above active disk and has a target node for CBW, like the
>> following:
>>
>>      +-------+
>>      | Guest |
>>      +-------+
>>          |r,w
>>          v
>>      +------------+  target   +---------------+
>>      | backup_top |---------->| target(qcow2) |
>>      +------------+   CBW     +---------------+
>>          |
>> backing |r,w
>>          v
>>      +-------------+
>>      | Active disk |
>>      +-------------+
>>
>> The driver will be used in backup instead of write-notifiers.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/backup-top.h  |  64 +++++++++
>>   block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
>>   block/Makefile.objs |   2 +
>>   3 files changed, 388 insertions(+)
>>   create mode 100644 block/backup-top.h
>>   create mode 100644 block/backup-top.c
>>
>> diff --git a/block/backup-top.h b/block/backup-top.h
>> new file mode 100644
>> index 0000000000..788e18c358
>> --- /dev/null
>> +++ b/block/backup-top.h
>> @@ -0,0 +1,64 @@
>> +/*
>> + * backup-top filter driver
>> + *
>> + * The driver performs Copy-Before-Write (CBW) operation: it is injected above
>> + * some node, and before each write it copies _old_ data to the target node.
>> + *
>> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
>> + *
>> + * Author:
>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef BACKUP_TOP_H
>> +#define BACKUP_TOP_H
>> +
>> +#include "qemu/osdep.h"
>> +
>> +#include "block/block_int.h"
>> +
>> +typedef void (*BackupTopProgressCallback)(uint64_t done, void *opaque);
>> +typedef struct BDRVBackupTopState {
>> +    HBitmap *copy_bitmap; /* what should be copied to @target on guest write. */
>> +    BdrvChild *target;
>> +
>> +    BackupTopProgressCallback progress_cb;
>> +    void *progress_opaque;
>> +} BDRVBackupTopState;
>> +
>> +/*
>> + * bdrv_backup_top_append
>> + *
>> + * Append backup_top filter node above @source node. @target node will receive
>> + * the data backed up during CBE operations. New filter together with @target
>> + * node are attached to @source aio context.
>> + *
>> + * The resulting filter node is implicit.
> 
> Why?  It’s just as easy for the caller to just make it implicit if it
> should be.  (And usually the caller should decide that.)

Personally, I don't know what the reasons are for filters to be implicit or not;
I just made it like the other job filters. I can move making it implicit to the
caller, or drop it altogether (if that works).

> 
>> + *
>> + * @copy_bitmap selects regions which needs CBW. Furthermore, backup_top will
>> + * use exactly this bitmap, so it may be used to control backup_top behavior
>> + * dynamically. Caller should not release @copy_bitmap during life-time of
>> + * backup_top. Progress is tracked by calling @progress_cb function.
>> + */
>> +BlockDriverState *bdrv_backup_top_append(
>> +        BlockDriverState *source, BlockDriverState *target,
>> +        HBitmap *copy_bitmap, Error **errp);
>> +void bdrv_backup_top_set_progress_callback(
>> +        BlockDriverState *bs, BackupTopProgressCallback progress_cb,
>> +        void *progress_opaque);
>> +void bdrv_backup_top_drop(BlockDriverState *bs);
>> +
>> +#endif /* BACKUP_TOP_H */
>> diff --git a/block/backup-top.c b/block/backup-top.c
>> new file mode 100644
>> index 0000000000..1daa02f539
>> --- /dev/null
>> +++ b/block/backup-top.c
>> @@ -0,0 +1,322 @@
>> +/*
>> + * backup-top filter driver
>> + *
>> + * The driver performs Copy-Before-Write (CBW) operation: it is injected above
>> + * some node, and before each write it copies _old_ data to the target node.
>> + *
>> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
>> + *
>> + * Author:
>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +
>> +#include "qemu/cutils.h"
>> +#include "qapi/error.h"
>> +#include "block/block_int.h"
>> +#include "block/qdict.h"
>> +
>> +#include "block/backup-top.h"
>> +
>> +static coroutine_fn int backup_top_co_preadv(
>> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
>> +        QEMUIOVector *qiov, int flags)
>> +{
>> +    /*
>> +     * Features to be implemented:
>> +     * F1. COR. save read data to fleecing target for fast access
>> +     *     (to reduce reads). This possibly may be done with use of copy-on-read
>> +     *     filter, but we need an ability to make COR requests optional: for
>> +     *     example, if target is a ram-cache, and if it is full now, we should
>> +     *     skip doing COR request, as it is actually not necessary.
>> +     *
>> +     * F2. Feature for guest: read from fleecing target if data is in ram-cache
>> +     *     and is unchanged
>> +     */
>> +
>> +    return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
>> +}
>> +
>> +static coroutine_fn int backup_top_cbw(BlockDriverState *bs, uint64_t offset,
>> +                                       uint64_t bytes)
>> +{
>> +    int ret = 0;
>> +    BDRVBackupTopState *s = bs->opaque;
>> +    uint64_t gran = 1UL << hbitmap_granularity(s->copy_bitmap);
>> +    uint64_t end = QEMU_ALIGN_UP(offset + bytes, gran);
>> +    uint64_t off = QEMU_ALIGN_DOWN(offset, gran), len;
> 
> The “, len” is a bit weirdly placed there, why not define it on a
> separate line as "uint64_t len = end - off"?
> 
>> +    void *buf = qemu_blockalign(bs, end - off);
>> +
>> +    /*
>> +     * Features to be implemented:
>> +     * F3. parallelize copying loop
>> +     * F4. detect zeroes ? or, otherwise, drop detect zeroes from backup code
>> +     *     and just enable zeroes detecting on target
>> +     * F5. use block_status ?
>> +     * F6. don't copy clusters which are already cached by COR [see F1]
>> +     * F7. if target is ram-cache and it is full, there should be a possibility
>> +     *     to drop not necessary data (cached by COR [see F1]) to handle CBW
>> +     *     fast.
> 
> Also “drop BDRV_REQ_SERIALISING from writes to s->target unless necessary”?

Hmm, yes. It could be a filter parameter: whether to serialize writes or not.

> 
>> +     */
>> +
>> +    len = end - off;
>> +    while (hbitmap_next_dirty_area(s->copy_bitmap, &off, &len)) {
>> +        hbitmap_reset(s->copy_bitmap, off, len);
>> +
>> +        ret = bdrv_co_pread(bs->backing, off, len, buf,
>> +                            BDRV_REQ_NO_SERIALISING);
>> +        if (ret < 0) {
>> +            hbitmap_set(s->copy_bitmap, off, len);
>> +            goto out;
>> +        }
>> +
>> +        ret = bdrv_co_pwrite(s->target, off, len, buf, BDRV_REQ_SERIALISING);
>> +        if (ret < 0) {
>> +            hbitmap_set(s->copy_bitmap, off, len);
>> +            goto out;
>> +        }
>> +
>> +        if (s->progress_cb) {
>> +            s->progress_cb(len, s->progress_opaque);
>> +        }
>> +        off += len;
>> +        if (off >= end) {
>> +            break;
>> +        }
>> +        len = end - off;
>> +    }
>> +
>> +out:
>> +    qemu_vfree(buf);
>> +
>> +    /*
>> +     * F8. we fail guest request in case of error. We can alter it by
>> +     * possibility to fail copying process instead, or retry several times, or
>> +     * may be guest pause, etc.
>> +     */
>> +    return ret;
>> +}
>> +
>> +static int coroutine_fn backup_top_co_pdiscard(BlockDriverState *bs,
>> +                                               int64_t offset, int bytes)
>> +{
>> +    int ret = backup_top_cbw(bs, offset, bytes);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    /*
>> +     * Features to be implemented:
>> +     * F9. possibility of lazy discard: just defer the discard after fleecing
>> +     *     completion. If write (or new discard) occurs to the same area, just
>> +     *     drop deferred discard.
>> +     */
>> +
>> +    return bdrv_co_pdiscard(bs->backing, offset, bytes);
>> +}
>> +
>> +static int coroutine_fn backup_top_co_pwrite_zeroes(BlockDriverState *bs,
>> +        int64_t offset, int bytes, BdrvRequestFlags flags)
>> +{
>> +    int ret = backup_top_cbw(bs, offset, bytes);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    return bdrv_co_pwrite_zeroes(bs->backing, offset, bytes, flags);
>> +}
>> +
>> +static coroutine_fn int backup_top_co_pwritev(BlockDriverState *bs,
>> +                                              uint64_t offset,
>> +                                              uint64_t bytes,
>> +                                              QEMUIOVector *qiov, int flags)
>> +{
>> +    if (!(flags & BDRV_REQ_WRITE_UNCHANGED)) {
>> +        int ret = backup_top_cbw(bs, offset, bytes);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
>> +}
>> +
>> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
>> +{
>> +    if (!bs->backing) {
>> +        return 0;
>> +    }
>> +
>> +    return bdrv_co_flush(bs->backing->bs);
> 
> Should we flush the target, too?

Hm, you've already asked this on a previous version :) Backup doesn't do it,
so I just kept the old behavior. And what is the reason to flush the backup
target on every guest flush?
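
If the target did get flushed here, the interesting part would just be error
aggregation between the two children. A minimal sketch of only that part (toy
code; combine_flush_ret() and the error-keeping policy are my assumptions, not
QEMU API):

```c
#include <assert.h>

/*
 * Toy model of flushing both children of backup-top: the result of the
 * target flush must not mask an error from the backing (source) flush.
 * Names are illustrative only.
 */
static int combine_flush_ret(int backing_ret, int target_ret)
{
    /* Keep the first failure, in child order: backing, then target. */
    return backing_ret < 0 ? backing_ret : target_ret;
}
```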

> 
>> +}
>> +
>> +static void backup_top_refresh_filename(BlockDriverState *bs)
>> +{
>> +    if (bs->backing == NULL) {
>> +        /*
>> +         * we can be here after failed bdrv_attach_child in
>> +         * bdrv_set_backing_hd
>> +         */
>> +        return;
>> +    }
>> +    bdrv_refresh_filename(bs->backing->bs);
> 
> bdrv_refresh_filename() should have done this already.
> 
>> +    pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
>> +            bs->backing->bs->filename);
>> +}
>> +
>> +static void backup_top_child_perm(BlockDriverState *bs, BdrvChild *c,
>> +                                  const BdrvChildRole *role,
>> +                                  BlockReopenQueue *reopen_queue,
>> +                                  uint64_t perm, uint64_t shared,
>> +                                  uint64_t *nperm, uint64_t *nshared)
>> +{
>> +    /*
>> +     * We have HBitmap in the state, its size is fixed, so we never allow
>> +     * resize.
>> +     */
>> +    uint64_t rw = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
>> +                  BLK_PERM_WRITE;
>> +
>> +    bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared,
>> +                              nperm, nshared);
>> +
>> +    *nperm = *nperm & rw;
>> +    *nshared = *nshared & rw;
> 
> Somehow, that bdrv_filter_default_perm() call doesn’t make this function
> easier for me.
> 
> It seems like this is basically just “*nperm = perm & rw” and
> “*nshared = shared & rw”.
> 
>> +
>> +    if (role == &child_file) {
>> +        /*
>> +         * Target child
>> +         *
>> +         * Share write to target (child_file), to not interfere
>> +         * with guest writes to its disk which may be in target backing chain.
>> +         */
>> +        if (perm & BLK_PERM_WRITE) {
>> +            *nshared = *nshared | BLK_PERM_WRITE;
> 
> Why not always share WRITE on the target?

Hmm, sharing writes on the target is a bad thing, so I'm trying to reduce the
number of cases where we have to share it.

> 
>> +        }
>> +    } else {
>> +        /* Source child */
>> +        if (perm & BLK_PERM_WRITE) {
> 
> Or WRITE_UNCHANGED, no?

Why? We don't need to do CBW for unchanged writes.

> 
>> +            *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
>> +        }
>> +        *nshared =
>> +            *nshared & (BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED);
> 
> And here it isn’t “*nshared = shared & rw” but “rw & ~WRITE”.
> 
> I feel like this function would be simpler if you just set *nperm and
> *nshared separately in the two branches of this condition, without
> setting a “default” first.

OK, I'll try.
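
For the record, the branch-local version might look like this minimal sketch
(plain C, with made-up flag values standing in for QEMU's BLK_PERM_* constants;
only the branch logic mirrors the code under review):

```c
#include <assert.h>
#include <stdint.h>

/* Toy flag values; the real BLK_PERM_* values differ. */
enum {
    PERM_CONSISTENT_READ = 1 << 0,
    PERM_WRITE           = 1 << 1,
    PERM_WRITE_UNCHANGED = 1 << 2,
};

/* Source child: take CONSISTENT_READ only when the parent may write
 * (we must read old data for CBW); share only read/unchanged writes. */
static uint64_t source_nperm(uint64_t perm)
{
    return (perm & PERM_WRITE) ? PERM_CONSISTENT_READ : 0;
}

static uint64_t source_nshared(void)
{
    return PERM_CONSISTENT_READ | PERM_WRITE_UNCHANGED;
}

/* Target child: share WRITE only when the parent itself writes,
 * mirroring the condition in the code under review. */
static uint64_t target_nshared(uint64_t perm, uint64_t shared)
{
    uint64_t rw = PERM_CONSISTENT_READ | PERM_WRITE_UNCHANGED | PERM_WRITE;
    uint64_t nshared = shared & rw;

    if (perm & PERM_WRITE) {
        nshared |= PERM_WRITE;
    }
    return nshared;
}
```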

> 
>> +    }
>> +}
>> +
>> +BlockDriver bdrv_backup_top_filter = {
>> +    .format_name = "backup-top",
>> +    .instance_size = sizeof(BDRVBackupTopState),
>> +
>> +    .bdrv_co_preadv             = backup_top_co_preadv,
>> +    .bdrv_co_pwritev            = backup_top_co_pwritev,
>> +    .bdrv_co_pwrite_zeroes      = backup_top_co_pwrite_zeroes,
>> +    .bdrv_co_pdiscard           = backup_top_co_pdiscard,
>> +    .bdrv_co_flush              = backup_top_co_flush,
>> +
>> +    .bdrv_co_block_status       = bdrv_co_block_status_from_backing,
>> +
>> +    .bdrv_refresh_filename      = backup_top_refresh_filename,
>> +
>> +    .bdrv_child_perm            = backup_top_child_perm,
>> +
>> +    .is_filter = true,
>> +};
>> +
>> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
>> +                                         BlockDriverState *target,
>> +                                         HBitmap *copy_bitmap,
>> +                                         Error **errp)
>> +{
>> +    Error *local_err = NULL;
>> +    BDRVBackupTopState *state;
>> +    BlockDriverState *top = bdrv_new_open_driver(&bdrv_backup_top_filter,
>> +                                                 NULL, BDRV_O_RDWR, errp);
>> +
>> +    if (!top) {
>> +        return NULL;
>> +    }
>> +
>> +    top->implicit = true;
> 
> As I said above, I think the caller should set this.
> 
>> +    top->total_sectors = source->total_sectors;
>> +    top->bl.opt_mem_alignment = MAX(bdrv_opt_mem_align(source),
>> +                                    bdrv_opt_mem_align(target));
>> +    top->opaque = state = g_new0(BDRVBackupTopState, 1);
>> +    state->copy_bitmap = copy_bitmap;
>> +
>> +    bdrv_ref(target);
>> +    state->target = bdrv_attach_child(top, target, "target", &child_file, errp);
>> +    if (!state->target) {
>> +        bdrv_unref(target);
>> +        bdrv_unref(top);
>> +        return NULL;
>> +    }
>> +
>> +    bdrv_set_aio_context(top, bdrv_get_aio_context(source));
>> +    bdrv_set_aio_context(target, bdrv_get_aio_context(source));
> 
> I suppose these calls aren’t necessary anymore (compare d0ee0204f4009).

Hmm, I see it; agreed.

>   (I’m not fully sure, though.  In any case, they would need to be
> bdrv_try_set_aio_context() if they were still necessary.)
> 
>> +
>> +    bdrv_drained_begin(source);
>> +
>> +    bdrv_ref(top);
>> +    bdrv_append(top, source, &local_err);
>> +    if (local_err) {
>> +        error_prepend(&local_err, "Cannot append backup-top filter: ");
>> +    }
>> +
>> +    bdrv_drained_end(source);
>> +
>> +    if (local_err) {
>> +        bdrv_unref_child(top, state->target);
>> +        bdrv_unref(top);
>> +        error_propagate(errp, local_err);
>> +        return NULL;
>> +    }
>> +
>> +    return top;
>> +}
> 
> I guess it would be nice if it users could blockdev-add a backup-top
> node to basically get a backup with sync=none.
> 
> (The bdrv_open implementation should then create a new bitmap and
> initialize it to be fully set.)
> 
> But maybe it wouldn’t be so nice and I just have feverish dreams.

When the series began, I was trying to make exactly this: a user-available
filter which can be used on its own, but you were against it :)

Maybe not totally against, but I decided not to argue for long, and instead made
the filter implicit and dropped the public API (like the mirror and commit
filters), but placed it in a separate file (no one will argue against putting a
large enough, complete new feature, represented by an object, into a separate
file :). This actually makes it easy to publish the filter at some point. And
now I think it was a good decision anyway, as we postponed arguing about the API
and about shared dirty bitmaps.

So publishing the filter is a future step.
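
If it ever does get published, the .bdrv_open part Max describes is simple to
model: start with a fully-set copy bitmap, so every first guest write triggers
CBW (backup with sync=none semantics). A toy sketch (a byte-per-cluster array
stands in for HBitmap; all names are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* "Open" a copy bitmap covering nb_clusters clusters, fully set:
 * everything still needs copying on first write. */
static bool *open_copy_bitmap(size_t nb_clusters)
{
    bool *bitmap = malloc(nb_clusters * sizeof(bool));
    if (bitmap) {
        for (size_t i = 0; i < nb_clusters; i++) {
            bitmap[i] = true;
        }
    }
    return bitmap;
}

/* Count set bits right after "open"; should equal nb_clusters. */
static size_t count_set(size_t nb_clusters)
{
    bool *bitmap = open_copy_bitmap(nb_clusters);
    size_t n = 0;

    if (!bitmap) {
        return 0;
    }
    for (size_t i = 0; i < nb_clusters; i++) {
        n += bitmap[i];
    }
    free(bitmap);
    return n;
}
```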

> 
>> +void bdrv_backup_top_set_progress_callback(
>> +        BlockDriverState *bs, BackupTopProgressCallback progress_cb,
>> +        void *progress_opaque)
>> +{
>> +    BDRVBackupTopState *s = bs->opaque;
>> +
>> +    s->progress_cb = progress_cb;
>> +    s->progress_opaque = progress_opaque;
>> +}
>> +
>> +void bdrv_backup_top_drop(BlockDriverState *bs)
>> +{
>> +    BDRVBackupTopState *s = bs->opaque;
>> +    AioContext *aio_context = bdrv_get_aio_context(bs);
>> +
>> +    aio_context_acquire(aio_context);
>> +
>> +    bdrv_drained_begin(bs);
>> +
>> +    bdrv_child_try_set_perm(bs->backing, 0, BLK_PERM_ALL, &error_abort);
> 
> Pre-existing in other jobs, but I think calling this function is
> dangerous.  (Which is why I remove it in my “block: Ignore loosening
> perm restrictions failures” series.)

Hmm, a good point. Should I rebase on it?

> 
>> +    bdrv_replace_node(bs, backing_bs(bs), &error_abort);
>> +    bdrv_set_backing_hd(bs, NULL, &error_abort);
> 
> I think some of this function should be in a .bdrv_close()
> implementation, for example this bdrv_set_backing_hd() call.

Why? We don't have .bdrv_open, so why have .bdrv_close? I think that when we
publish this filter, most of _add() and _drop() will be refactored into open()
and close(). Is there a real reason to implement .close() now?

> 
>> +    bdrv_drained_end(bs);
>> +
>> +    if (s->target) {
>> +        bdrv_unref_child(bs, s->target);
> 
> And this.  Well, neither needs to be in a .bdrv_close() implementation,
> actually, because bdrv_close() will just do it by itself.
> 
> I suppose you’re trying to remove the children so the node is no longer
> usable after this point; but that isn’t quite right, then.  The filter
> functions still refer to s->target and bs->backing, so you need to stop
> them from doing anything then.
> 
> At this point, you might as well add a variable to BDRVBackupTopState
> that says whether the filter is still supposed to be usable (like the
> “stop” variable I added to mirror in “block/mirror: Fix child
> permissions”).  If so, the permission function should signal 0/ALL for
> the permissions on backing (and probably target), and all filter
> functions always immediately return an error.
> 
> So I don’t think backing and target need to be removed here.  We can
> wait for that until bdrv_close(), but we should ensure that the filter
> really is unusable after this function has been called.

OK, thank you for reviewing!
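
The "stop" idea can be modeled in a few lines. A toy sketch (plain C, no QEMU
types; all names are made up), just to show that once the drop path sets the
flag, filter functions refuse to touch their children:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for BDRVBackupTopState with the suggested "stop" flag. */
typedef struct {
    bool stop;        /* set by the drop path before detaching children */
    int writes_done;  /* counts writes that actually went through */
} FilterState;

static int filter_pwritev(FilterState *s)
{
    if (s->stop) {
        return -EIO;   /* filter already dismantled: fail fast */
    }
    s->writes_done++;  /* stands in for CBW + write to backing */
    return 0;
}

/* One write while active, then "drop", then one more write. */
static int simulate_drop_then_write(void)
{
    FilterState s = { .stop = false, .writes_done = 0 };
    int ret = filter_pwritev(&s);   /* succeeds while active */

    s.stop = true;                  /* drop path sets the flag */
    if (ret == 0) {
        ret = filter_pwritev(&s);   /* now refused */
    }
    return ret;
}
```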

> 
> Max
> 
>> +    }
>> +    bdrv_unref(bs);
>> +
>> +    aio_context_release(aio_context);
>> +}
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index ae11605c9f..dfbdfe6ab4 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -40,6 +40,8 @@ block-obj-y += throttle.o copy-on-read.o
>>   
>>   block-obj-y += crypto.o
>>   
>> +block-obj-y += backup-top.o
>> +
>>   common-obj-y += stream.o
>>   
>>   nfs.o-libs         := $(LIBNFS_LIBS)
>>
> 
> 


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v8 7/7] block/backup: use backup-top instead of write notifiers
  2019-06-13 18:02   ` Max Reitz
@ 2019-06-14  9:14     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-06-14  9:14 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block
  Cc: fam, kwolf, Denis Lunev, stefanha, jsnow

13.06.2019 21:02, Max Reitz wrote:
> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>> Drop write notifiers and use filter node instead. Changes:
>>
>> 1. copy-before-writes now handled by filter node, so, drop all
>>     is_write_notifier arguments.
>>
>> 2. we don't have intersecting requests, so their handling is dropped.
>> Instead, synchronization works as follows:
>> when backup or backup-top starts copying of some area it firstly
>> clears copy-bitmap bits, and nobody touches areas, not marked with
>> dirty bits in copy-bitmap, so there is no intersection. Also, backup
>> job copy operations are surrounded by bdrv region lock, which is
>> actually serializing request, to not interfere with guest writes and
>> not read changed data from source (before reading we clear
>> corresponding bit in copy-bitmap, so, this area is not more handled by
>> backup-top).
>>
>> 3. To sync with in-flight requests at job finish we now have drained
>> removing of the filter, we don't need rw-lock.
>>
>> == RFC part ==
>>
>> iotests changed:
>> 56: op-blocker doesn't trigger now, as we set it on source, but then check
>> on filter, when trying to start second backup... Should I workaround it
>> somehow?
> 
> Hm.  Where does that error message even come from?  The fact that the
> target image is in use already (Due to file locks)?
> 
> It appears that way indeed.
> 
> It seems reasonable to me that you can now run a backup on top of
> another backup.  Well, I mean, it is a stupid thing to do, but I don’t
> see why the block layer would forbid doing so.
> 
> So the test seems superfluous to me.  If we want to keep it (why not),
> it should test the opposite, namely that a backup to a different image
> (with a different job ID) works.  (It seems simple enough to modify the
> job that way, so why not.)
> 
>> 129: Hmm, it is no longer busy at this moment. But it's not valid to check
>> busy, as the job has pause points and sets busy to false at those points.
>> Why do we assert it in this test?
> 
> Nobody knows, it’s probably wrong.  All I know is that 129 is just
> broken anyway.
> 
>> 141: Obvious, as drv0 is not root node now, but backing of the filter,
>> when we try to remove it.
> 
> I get a failed assertion in 256.  That is probably because the
> bdrv_set_aio_context() calls weren’t as unnecessary as I deemed them to be.

Hmm, I will check.

> 
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/backup.c             | 171 ++++++++++++++-----------------------
>>   tests/qemu-iotests/056     |   2 +-
>>   tests/qemu-iotests/129     |   1 -
>>   tests/qemu-iotests/141.out |   2 +-
>>   4 files changed, 68 insertions(+), 108 deletions(-)
> 
> For some reason, my gcc starts to complain that backup_loop() may not
> initialize error_is_read after this patch.  I don’t know why that is.
> Perhaps it inlines backup_do_cow() now?  (So before it just saw that a
> pointer to error_is_read was passed to backup_do_cow() and took it as an
> opaque function, so it surely would set this value somewhere.  Now it
> inlines it and it can’t find whether that will definitely happen, so it
> complains.)
> 
> I don’t think it is strictly necessary to initialize error_is_read, but,
> well, it won’t hurt.
> 
>> diff --git a/block/backup.c b/block/backup.c
>> index 00f4f8af53..a5b8e04c9c 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
> 
> [...]
> 
>> @@ -60,56 +53,17 @@ typedef struct BackupBlockJob {
>>   
>>   static const BlockJobDriver backup_job_driver;
>>   
>> -/* See if in-flight requests overlap and wait for them to complete */
>> -static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
>> -                                                       int64_t start,
>> -                                                       int64_t end)
>> -{
>> -    CowRequest *req;
>> -    bool retry;
>> -
>> -    do {
>> -        retry = false;
>> -        QLIST_FOREACH(req, &job->inflight_reqs, list) {
>> -            if (end > req->start_byte && start < req->end_byte) {
>> -                qemu_co_queue_wait(&req->wait_queue, NULL);
>> -                retry = true;
>> -                break;
>> -            }
>> -        }
>> -    } while (retry);
>> -}
>> -
>> -/* Keep track of an in-flight request */
>> -static void cow_request_begin(CowRequest *req, BackupBlockJob *job,
>> -                              int64_t start, int64_t end)
>> -{
>> -    req->start_byte = start;
>> -    req->end_byte = end;
>> -    qemu_co_queue_init(&req->wait_queue);
>> -    QLIST_INSERT_HEAD(&job->inflight_reqs, req, list);
>> -}
>> -
>> -/* Forget about a completed request */
>> -static void cow_request_end(CowRequest *req)
>> -{
>> -    QLIST_REMOVE(req, list);
>> -    qemu_co_queue_restart_all(&req->wait_queue);
>> -}
>> -
>>   /* Copy range to target with a bounce buffer and return the bytes copied. If
>>    * error occurred, return a negative error number */
>>   static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
>>                                                         int64_t start,
>>                                                         int64_t end,
>> -                                                      bool is_write_notifier,
>>                                                         bool *error_is_read,
>>                                                         void **bounce_buffer)
> 
> Future feature: Somehow get this functionality done with backup-top, I
> suppose.  (This is effectively just backup_top_cbw() with some bells and
> whistles, isn’t it?)

Or maybe separate it out as bdrv_co_pcopy or something like that.
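
The core of such a bdrv_co_pcopy would just be the read-into-bounce-buffer,
then write pattern that backup_cow_with_bounce_buffer() uses. A toy model (flat
byte arrays stand in for block nodes; this illustrates the pattern, not the
proposed API):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy [offset, offset + bytes) from src to dst via a bounce buffer. */
static int pcopy(const uint8_t *src, uint8_t *dst,
                 size_t offset, size_t bytes)
{
    uint8_t *bounce = malloc(bytes);

    if (!bounce) {
        return -1;
    }
    memcpy(bounce, src + offset, bytes);   /* "read" from source */
    memcpy(dst + offset, bounce, bytes);   /* "write" to target */
    free(bounce);
    return 0;
}

/* Copy a sub-range and verify it arrived; 0 on success. */
static int pcopy_demo(void)
{
    uint8_t src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    uint8_t dst[8] = {0};

    if (pcopy(src, dst, 2, 4) < 0) {
        return -1;
    }
    return memcmp(dst + 2, src + 2, 4) == 0 ? 0 : 1;
}
```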

> 
>>   {
>>       int ret;
>>       BlockBackend *blk = job->common.blk;
>>       int nbytes;
>> -    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
>>       int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
>>   
>>       assert(QEMU_IS_ALIGNED(start, job->cluster_size));
> 
> [...]
> 
>> @@ -154,15 +108,12 @@ fail:
>>   /* Copy range to target and return the bytes copied. If error occurred, return a
>>    * negative error number. */
>>   static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
>> -                                                int64_t start,
>> -                                                int64_t end,
>> -                                                bool is_write_notifier)
>> +                                                int64_t start, int64_t end)
> 
> And I suppose this is something backup-top maybe should support, too.
> 
>>   {
>>       int ret;
>>       int nr_clusters;
>>       BlockBackend *blk = job->common.blk;
>>       int nbytes;
>> -    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
>>       int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
>>   
>>       assert(QEMU_IS_ALIGNED(job->copy_range_size, job->cluster_size));
> 
> [...]
> 
>> @@ -391,28 +333,41 @@ static int coroutine_fn backup_loop(BackupBlockJob *job)
>>       int64_t offset;
>>       HBitmapIter hbi;
>>       BlockDriverState *bs = blk_bs(job->common.blk);
>> +    void *lock;
>>   
>>       hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>       while ((offset = hbitmap_iter_next(&hbi)) != -1) {
>> +        lock = bdrv_co_try_lock(backing_bs(blk_bs(job->common.blk)), offset,
>> +                                job->cluster_size);
>> +        /*
>> +         * Dirty bit is set, which means that there are no in-flight
>> +         * write requests on this area. We must succeed.
>> +         */
>> +        assert(lock);
>> +
> 
> Hm.  It makes me uneasy but I suppose you’re right.
> 
>>           if (job->sync_mode == MIRROR_SYNC_MODE_TOP &&
>>               bdrv_is_unallocated_range(bs, offset, job->cluster_size))
> 
> This can yield, right?  If it does, the bitmap is still set.  backup-top
> will see this, unset the bitmap and try to start its CBW operation.
> That is halted by the lock just taken, but the progress will still be
> published after completion, so the job can go beyond 100 %, I think.
> 
> Even if it doesn’t, copying the data twice is weird.  It may even get
> weirder if one of both requests fails.
> 
> Can we lock the backup-top node instead?  I don’t know whether locking
> would always succeed there, though...
> 

Hmm, I'll look closely at the code, but it seems we'd better reset the bit
before yielding.


> 
>>           {
>>               hbitmap_reset(job->copy_bitmap, offset, job->cluster_size);
>> +            bdrv_co_unlock(lock);
>>               continue;
>>           }
>>   
>>           do {
>>               if (yield_and_check(job)) {
>> +                bdrv_co_unlock(lock);
>>                   return 0;
>>               }
>> -            ret = backup_do_cow(job, offset,
>> -                                job->cluster_size, &error_is_read, false);
>> +            ret = backup_do_cow(job, offset, job->cluster_size, &error_is_read);
>>               if (ret < 0 && backup_error_action(job, error_is_read, -ret) ==
>>                              BLOCK_ERROR_ACTION_REPORT)
>>               {
>> +                bdrv_co_unlock(lock);
>>                   return ret;
>>               }
>>           } while (ret < 0);
>> +
>> +        bdrv_co_unlock(lock);
>>       }
>>   
>>       return 0;
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-14  9:04     ` Vladimir Sementsov-Ogievskiy
@ 2019-06-14 12:57       ` Max Reitz
  2019-06-14 16:22         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2019-06-14 12:57 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block
  Cc: fam, kwolf, Denis Lunev, stefanha, jsnow



On 14.06.19 11:04, Vladimir Sementsov-Ogievskiy wrote:
> 13.06.2019 18:57, Max Reitz wrote:
>> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>>> Backup-top filter does copy-before-write operation. It should be
>>> inserted above active disk and has a target node for CBW, like the
>>> following:
>>>
>>>      +-------+
>>>      | Guest |
>>>      +-------+
>>>          |r,w
>>>          v
>>>      +------------+  target   +---------------+
>>>      | backup_top |---------->| target(qcow2) |
>>>      +------------+   CBW     +---------------+
>>>          |
>>> backing |r,w
>>>          v
>>>      +-------------+
>>>      | Active disk |
>>>      +-------------+
>>>
>>> The driver will be used in backup instead of write-notifiers.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/backup-top.h  |  64 +++++++++
>>>   block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
>>>   block/Makefile.objs |   2 +
>>>   3 files changed, 388 insertions(+)
>>>   create mode 100644 block/backup-top.h
>>>   create mode 100644 block/backup-top.c
>>>
>>> diff --git a/block/backup-top.h b/block/backup-top.h
>>> new file mode 100644
>>> index 0000000000..788e18c358
>>> --- /dev/null
>>> +++ b/block/backup-top.h

[...]

>>> +/*
>>> + * bdrv_backup_top_append
>>> + *
>>> + * Append backup_top filter node above @source node. @target node will receive
>>> + * the data backed up during CBE operations. New filter together with @target
>>> + * node are attached to @source aio context.
>>> + *
>>> + * The resulting filter node is implicit.
>>
>> Why?  It’s just as easy for the caller to just make it implicit if it
>> should be.  (And usually the caller should decide that.)
> 
> Personally, I don't know the reasons for filters to be implicit or not;
> I just made it like the other job filters. I can move making it implicit to
> the caller, or drop it entirely (if that works).

Nodes are implicit if they haven’t been added consciously by the user.
A node added by a block job can be non-implicit, too, as mirror shows;
If the user specifies the filter-node-name option, they will know about
the node, thus it is no longer implicit.

If the user doesn’t know about the node (they didn’t give the
filter-node-name option), the node is implicit.

[...]

>>> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
>>> +{
>>> +    if (!bs->backing) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    return bdrv_co_flush(bs->backing->bs);
>>
>> Should we flush the target, too?
> 
> Hm, you've asked it already, on previous version :)

I wasn’t sure...

> Backup doesn't do it, so I just keep the old behavior. And what is the
> reason to flush the backup target on any guest flush?

Hm, well, what’s the reason not to do it?
Also, there are not only guest flushes.  bdrv_flush_all() exists, which
is called when the guest is stopped.  So who is going to flush the
target if not its parent?

[...]

>>> +
>>> +    if (role == &child_file) {
>>> +        /*
>>> +         * Target child
>>> +         *
>>> +         * Share write to target (child_file), to not interfere
>>> +         * with guest writes to its disk which may be in target backing chain.
>>> +         */
>>> +        if (perm & BLK_PERM_WRITE) {
>>> +            *nshared = *nshared | BLK_PERM_WRITE;
>>
>> Why not always share WRITE on the target?
> 
> Hmm, it's a bad thing to share writes on the target, so I'm trying to reduce
> the number of cases where we have to share it.

Is it?  First of all, this filter doesn’t care.  It doesn’t even read
from the target (related note: we probably don’t need CONSISTENT_READ on
the target).

Second, there is generally going to be a parent on backup-top that has
the WRITE permission taken.  So this does not really effectively reduce
that number of cases.

>>> +        }
>>> +    } else {
>>> +        /* Source child */
>>> +        if (perm & BLK_PERM_WRITE) {
>>
>> Or WRITE_UNCHANGED, no?
> 
> Why? We don't need doing CBW for unchanged write.

But we will do it still, won’t we?

(If an unchanging write comes in, this driver will handle it just like a
normal write, will it not?)

[...]

>>> +
>>> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
>>> +                                         BlockDriverState *target,
>>> +                                         HBitmap *copy_bitmap,
>>> +                                         Error **errp)
>>> +{

[...]

>>> +}
>>
>> I guess it would be nice if it users could blockdev-add a backup-top
>> node to basically get a backup with sync=none.
>>
>> (The bdrv_open implementation should then create a new bitmap and
>> initialize it to be fully set.)
>>
>> But maybe it wouldn’t be so nice and I just have feverish dreams.
> 
> When the series began, I was trying to make exactly this - a user-available
> filter, which may be used on its own - but you were against it)

Past me is stupid.

> Maybe not totally against, but I decided not to argue for long, and instead
> made the filter implicit and dropped the public API (like the mirror and
> commit filters), but placed it in a separate file (no one will argue against
> putting a large and complete new feature, represented by an object, into a
> separate file :). This actually makes it easy to publish the filter at some
> point. And now I think it was a good decision anyway, as we postponed arguing
> about the API and about shared dirty bitmaps.
> 
> So publishing the filter is a future step.

OK, sure.

>>> +void bdrv_backup_top_set_progress_callback(
>>> +        BlockDriverState *bs, BackupTopProgressCallback progress_cb,
>>> +        void *progress_opaque)
>>> +{
>>> +    BDRVBackupTopState *s = bs->opaque;
>>> +
>>> +    s->progress_cb = progress_cb;
>>> +    s->progress_opaque = progress_opaque;
>>> +}
>>> +
>>> +void bdrv_backup_top_drop(BlockDriverState *bs)
>>> +{
>>> +    BDRVBackupTopState *s = bs->opaque;
>>> +    AioContext *aio_context = bdrv_get_aio_context(bs);
>>> +
>>> +    aio_context_acquire(aio_context);
>>> +
>>> +    bdrv_drained_begin(bs);
>>> +
>>> +    bdrv_child_try_set_perm(bs->backing, 0, BLK_PERM_ALL, &error_abort);
>>
>> Pre-existing in other jobs, but I think calling this function is
>> dangerous.  (Which is why I remove it in my “block: Ignore loosening
>> perm restrictions failures” series.)
> 
> Hmm, good thing.. Should I rebase on it?

It would help me at least.

>>> +    bdrv_replace_node(bs, backing_bs(bs), &error_abort);
>>> +    bdrv_set_backing_hd(bs, NULL, &error_abort);
>>
>> I think some of this function should be in a .bdrv_close()
>> implementation, for example this bdrv_set_backing_hd() call.
> 
> Why? We don't have .bdrv_open, so why have .bdrv_close? I think when we
> publish this filter, most of _add() and _drop() will be refactored into
> open() and close(). Is there a real reason to implement .close() now?

Not really if it isn’t a usable block driver yet, no.

Max




* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-14 12:57       ` Max Reitz
@ 2019-06-14 16:22         ` Vladimir Sementsov-Ogievskiy
  2019-06-14 20:03           ` Max Reitz
  0 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-06-14 16:22 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block
  Cc: fam, kwolf, Denis Lunev, stefanha, jsnow

14.06.2019 15:57, Max Reitz wrote:
> On 14.06.19 11:04, Vladimir Sementsov-Ogievskiy wrote:
>> 13.06.2019 18:57, Max Reitz wrote:
>>> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>>>> Backup-top filter does copy-before-write operation. It should be
>>>> inserted above active disk and has a target node for CBW, like the
>>>> following:
>>>>
>>>>       +-------+
>>>>       | Guest |
>>>>       +-------+
>>>>           |r,w
>>>>           v
>>>>       +------------+  target   +---------------+
>>>>       | backup_top |---------->| target(qcow2) |
>>>>       +------------+   CBW     +---------------+
>>>>           |
>>>> backing |r,w
>>>>           v
>>>>       +-------------+
>>>>       | Active disk |
>>>>       +-------------+
>>>>
>>>> The driver will be used in backup instead of write-notifiers.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> ---
>>>>    block/backup-top.h  |  64 +++++++++
>>>>    block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
>>>>    block/Makefile.objs |   2 +
>>>>    3 files changed, 388 insertions(+)
>>>>    create mode 100644 block/backup-top.h
>>>>    create mode 100644 block/backup-top.c
>>>>
>>>> diff --git a/block/backup-top.h b/block/backup-top.h
>>>> new file mode 100644
>>>> index 0000000000..788e18c358
>>>> --- /dev/null
>>>> +++ b/block/backup-top.h
> 
> [...]
> 
>>>> +/*
>>>> + * bdrv_backup_top_append
>>>> + *
>>>> + * Append backup_top filter node above @source node. @target node will receive
>>>> + * the data backed up during CBE operations. New filter together with @target
>>>> + * node are attached to @source aio context.
>>>> + *
>>>> + * The resulting filter node is implicit.
>>>
>>> Why?  It’s just as easy for the caller to just make it implicit if it
>>> should be.  (And usually the caller should decide that.)
>>
>> Personally, I don't know the reasons for filters to be implicit or not;
>> I just made it like the other job filters. I can move making it implicit to
>> the caller, or drop it entirely (if that works).
> 
> Nodes are implicit if they haven’t been added consciously by the user.
> A node added by a block job can be non-implicit, too, as mirror shows;
> If the user specifies the filter-node-name option, they will know about
> the node, thus it is no longer implicit.
> 
> If the user doesn’t know about the node (they didn’t give the
> filter-node-name option), the node is implicit.
> 

Ok, I understand that. But it doesn't show why the node should be implicit.
Isn't it less buggy to make all filters explicit? We don't hide implicit nodes
from query-named-block-nodes (currently the only interface to explore the whole
graph) anyway. And we can't completely hide the side effects of an additional
node in the graph.

So, is there any real benefit to supporting the separation into implicit and
explicit filters? It seems to me that it only complicates things...
In other words, what would break if we made all filters explicit?

> [...]
> 
>>>> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
>>>> +{
>>>> +    if (!bs->backing) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    return bdrv_co_flush(bs->backing->bs);
>>>
>>> Should we flush the target, too?
>>
>> Hm, you've asked it already, on previous version :)
> 
> I wasn’t sure...
> 
>> Backup doesn't do it, so I just keep the old behavior. And what is the
>> reason to flush the backup target on any guest flush?
> 
> Hm, well, what’s the reason not to do it?

guest flushes will be slowed down?

> Also, there are not only guest flushes.  bdrv_flush_all() exists, which
> is called when the guest is stopped.  So who is going to flush the
> target if not its parent?
> 
> [...]

The backup job? But the filter should not rely on it...

But should the filter really do that job, or is it here only to do CBW? Maybe
the target should have another parent which will take care of flushing.

Ok, I think I'll flush the target here too for safety, and leave a comment that
something smarter would be better.
(Or, if we decide to flush all children by default in generic code, I'll drop
this handler.)

> 
>>>> +
>>>> +    if (role == &child_file) {
>>>> +        /*
>>>> +         * Target child
>>>> +         *
>>>> +         * Share write to target (child_file), to not interfere
>>>> +         * with guest writes to its disk which may be in target backing chain.
>>>> +         */
>>>> +        if (perm & BLK_PERM_WRITE) {
>>>> +            *nshared = *nshared | BLK_PERM_WRITE;
>>>
>>> Why not always share WRITE on the target?
>>
>> Hmm, it's a bad thing to share writes on the target, so I'm trying to reduce
>> the number of cases where we have to share it.
> 
> Is it?  First of all, this filter doesn’t care.  It doesn’t even read
> from the target (related note: we probably don’t need CONSISTENT_READ on
> the target).
> 
> Second, there is generally going to be a parent on backup-top that has
> the WRITE permission taken.  So this does not really effectively reduce
> that number of cases.

Ok.

> 
>>>> +        }
>>>> +    } else {
>>>> +        /* Source child */
>>>> +        if (perm & BLK_PERM_WRITE) {
>>>
>>> Or WRITE_UNCHANGED, no?
>>
>> Why? We don't need doing CBW for unchanged write.
> 
> But we will do it still, won’t we?
> 
> (If an unchanging write comes in, this driver will handle it just like a
> normal write, will it not?)

No, it will not; I check this flag in backup_top_co_pwritev.

> 
> [...]
> 
>>>> +
>>>> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
>>>> +                                         BlockDriverState *target,
>>>> +                                         HBitmap *copy_bitmap,
>>>> +                                         Error **errp)
>>>> +{
> 
> [...]
> 
>>>> +}
>>>
>>> I guess it would be nice if it users could blockdev-add a backup-top
>>> node to basically get a backup with sync=none.
>>>
>>> (The bdrv_open implementation should then create a new bitmap and
>>> initialize it to be fully set.)
>>>
>>> But maybe it wouldn’t be so nice and I just have feverish dreams.
>>
>> When the series began, I was trying to make exactly this - a user-available
>> filter, which may be used on its own - but you were against it)
> 
> Past me is stupid.
> 
>> Maybe not totally against, but I decided not to argue for long, and instead
>> made the filter implicit and dropped the public API (like the mirror and
>> commit filters), but placed it in a separate file (no one will argue against
>> putting a large and complete new feature, represented by an object, into a
>> separate file :). This actually makes it easy to publish the filter at some
>> point. And now I think it was a good decision anyway, as we postponed arguing
>> about the API and about shared dirty bitmaps.
>>
>> So publishing the filter is a future step.
> 
> OK, sure.
> 
>>>> +void bdrv_backup_top_set_progress_callback(
>>>> +        BlockDriverState *bs, BackupTopProgressCallback progress_cb,
>>>> +        void *progress_opaque)
>>>> +{
>>>> +    BDRVBackupTopState *s = bs->opaque;
>>>> +
>>>> +    s->progress_cb = progress_cb;
>>>> +    s->progress_opaque = progress_opaque;
>>>> +}
>>>> +
>>>> +void bdrv_backup_top_drop(BlockDriverState *bs)
>>>> +{
>>>> +    BDRVBackupTopState *s = bs->opaque;
>>>> +    AioContext *aio_context = bdrv_get_aio_context(bs);
>>>> +
>>>> +    aio_context_acquire(aio_context);
>>>> +
>>>> +    bdrv_drained_begin(bs);
>>>> +
>>>> +    bdrv_child_try_set_perm(bs->backing, 0, BLK_PERM_ALL, &error_abort);
>>>
>>> Pre-existing in other jobs, but I think calling this function is
>>> dangerous.  (Which is why I remove it in my “block: Ignore loosening
>>> perm restrictions failures” series.)
>>
>> Hmm, good thing.. Should I rebase on it?
> 
> It would help me at least.

And now it's already staged, so I should.

> 
>>>> +    bdrv_replace_node(bs, backing_bs(bs), &error_abort);
>>>> +    bdrv_set_backing_hd(bs, NULL, &error_abort);
>>>
>>> I think some of this function should be in a .bdrv_close()
>>> implementation, for example this bdrv_set_backing_hd() call.
>>
>> Why? We don't have .bdrv_open, so why have .bdrv_close? I think when we
>> publish this filter, most of _add() and _drop() will be refactored into
>> open() and close(). Is there a real reason to implement .close() now?
> 
> Not really if it isn’t a usable block driver yet, no.
> 
> Max
> 


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-14 16:22         ` Vladimir Sementsov-Ogievskiy
@ 2019-06-14 20:03           ` Max Reitz
  2019-06-17 10:36             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2019-06-14 20:03 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block
  Cc: fam, kwolf, Denis Lunev, stefanha, jsnow



On 14.06.19 18:22, Vladimir Sementsov-Ogievskiy wrote:
> 14.06.2019 15:57, Max Reitz wrote:
>> On 14.06.19 11:04, Vladimir Sementsov-Ogievskiy wrote:
>>> 13.06.2019 18:57, Max Reitz wrote:
>>>> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Backup-top filter does copy-before-write operation. It should be
>>>>> inserted above active disk and has a target node for CBW, like the
>>>>> following:
>>>>>
>>>>>       +-------+
>>>>>       | Guest |
>>>>>       +-------+
>>>>>           |r,w
>>>>>           v
>>>>>       +------------+  target   +---------------+
>>>>>       | backup_top |---------->| target(qcow2) |
>>>>>       +------------+   CBW     +---------------+
>>>>>           |
>>>>> backing |r,w
>>>>>           v
>>>>>       +-------------+
>>>>>       | Active disk |
>>>>>       +-------------+
>>>>>
>>>>> The driver will be used in backup instead of write-notifiers.
>>>>>
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>> ---
>>>>>    block/backup-top.h  |  64 +++++++++
>>>>>    block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>    block/Makefile.objs |   2 +
>>>>>    3 files changed, 388 insertions(+)
>>>>>    create mode 100644 block/backup-top.h
>>>>>    create mode 100644 block/backup-top.c
>>>>>
>>>>> diff --git a/block/backup-top.h b/block/backup-top.h
>>>>> new file mode 100644
>>>>> index 0000000000..788e18c358
>>>>> --- /dev/null
>>>>> +++ b/block/backup-top.h
>>
>> [...]
>>
>>>>> +/*
>>>>> + * bdrv_backup_top_append
>>>>> + *
>>>>> + * Append backup_top filter node above @source node. @target node will receive
>>>>> + * the data backed up during CBE operations. New filter together with @target
>>>>> + * node are attached to @source aio context.
>>>>> + *
>>>>> + * The resulting filter node is implicit.
>>>>
>>>> Why?  It’s just as easy for the caller to just make it implicit if it
>>>> should be.  (And usually the caller should decide that.)
>>>
>>> Personally, I don't know the reasons for filters to be implicit or not;
>>> I just made it like the other job filters. I can move making it implicit to
>>> the caller, or drop it entirely (if that works).
>>
>> Nodes are implicit if they haven’t been added consciously by the user.
>> A node added by a block job can be non-implicit, too, as mirror shows;
>> If the user specifies the filter-node-name option, they will know about
>> the node, thus it is no longer implicit.
>>
>> If the user doesn’t know about the node (they didn’t give the
>> filter-node-name option), the node is implicit.
>>
> 
> Ok, I understand that. But it doesn't show why the node should be implicit.
> Isn't it less buggy to make all filters explicit? We don't hide implicit nodes
> from query-named-block-nodes (currently the only interface to explore the
> whole graph) anyway. And we can't completely hide the side effects of an
> additional node in the graph.

Well, we try, at least.  At least we hide them from query-block.

> So, is there any real benefit to supporting the separation into implicit and
> explicit filters? It seems to me that it only complicates things...
> In other words, what would break if we made all filters explicit?

The theory is that qemu may decide to add nodes at any point, but at
least when managing chains etc., they may not be visible to the user.  I
don’t think we can get rid of them so easily.

One example that isn’t implemented yet is copy-on-read.  In theory,
specifying copy-on-read=on for -drive should create an implicit COR node
on top.  The user shouldn’t see that node when inspecting the drive or
when issuing block jobs on it, etc.  And this node has to stay there
when the user does e.g. an active commit somewhere down the chain.

That sounds like a horrible ordeal to implement, so it hasn’t been done
yet.  Maybe it never will.  It isn’t that bad for the job filters,
because they generally freeze the block graph, so there is no problem
with potential modifications.

All in all I do think having implicit nodes makes sense.  Maybe not so
much now, but in the future (if someone implements converting -drive COR
and throttle options to implicit nodes...).

>> [...]
>>
>>>>> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
>>>>> +{
>>>>> +    if (!bs->backing) {
>>>>> +        return 0;
>>>>> +    }
>>>>> +
>>>>> +    return bdrv_co_flush(bs->backing->bs);
>>>>
>>>> Should we flush the target, too?
>>>
>>> Hm, you've asked it already, on previous version :)
>>
>> I wasn’t sure...
>>
>>> Backup doesn't do it, so I just keep the old behavior. And what is the
>>> reason to flush the backup target on any guest flush?
>>
>> Hm, well, what’s the reason not to do it?
> 
> guest flushes will be slowed down?

Hm, the user could specify cache=unsafe if they don’t care.  Which gives
me second thoughs... [1]

>> Also, there are not only guest flushes.  bdrv_flush_all() exists, which
>> is called when the guest is stopped.  So who is going to flush the
>> target if not its parent?
>>
>> [...]
> 
> The backup job? But the filter should not rely on it...

Hm.  Backup job actually doesn’t sound that wrong.

> But should the filter really do that job, or is it here only to do CBW? Maybe
> the target should have another parent which will take care of flushing.
> 
> Ok, I think I'll flush the target here too for safety, and leave a comment
> that something smarter would be better.
> (Or, if we decide to flush all children by default in generic code, I'll drop
> this handler.)

[1] Thinking more about this, for normal backups the target file does
not reflect a valid disk state until the backup is done; just like for
qemu-img convert.  Just like qemu-img convert, there is therefore no
reason to flush the target until the job is done.

But there is also the other case of image fleecing.  In this case, the
target will have another parent, so bdrv_flush_all() will be done by
someone.  Still, it intuitively makes sense to me that in this case, the
backup-top filter should flush the target if the guest decides to flush
the device (so that the target stays consistent on disk).

backup-top currently cannot differentiate between the cases, but maybe
that is generally a useful hint to give to it?  (That is, whether the
target has a consistent state or not.)

[...]

>>>>> +        }
>>>>> +    } else {
>>>>> +        /* Source child */
>>>>> +        if (perm & BLK_PERM_WRITE) {
>>>>
>>>> Or WRITE_UNCHANGED, no?
>>>
>>> Why? We don't need doing CBW for unchanged write.
>>
>> But we will do it still, won’t we?
>>
>> (If an unchanging write comes in, this driver will handle it just like a
>> normal write, will it not?)
> 
> No, it will not; I check this flag in backup_top_co_pwritev.

Oops.  My bad.

Max




* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-14 20:03           ` Max Reitz
@ 2019-06-17 10:36             ` Vladimir Sementsov-Ogievskiy
  2019-06-17 14:56               ` Max Reitz
  0 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-06-17 10:36 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block
  Cc: fam, kwolf, Denis Lunev, stefanha, jsnow

14.06.2019 23:03, Max Reitz wrote:
> On 14.06.19 18:22, Vladimir Sementsov-Ogievskiy wrote:
>> 14.06.2019 15:57, Max Reitz wrote:
>>> On 14.06.19 11:04, Vladimir Sementsov-Ogievskiy wrote:
>>>> 13.06.2019 18:57, Max Reitz wrote:
>>>>> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Backup-top filter does copy-before-write operation. It should be
>>>>>> inserted above active disk and has a target node for CBW, like the
>>>>>> following:
>>>>>>
>>>>>>        +-------+
>>>>>>        | Guest |
>>>>>>        +-------+
>>>>>>            |r,w
>>>>>>            v
>>>>>>        +------------+  target   +---------------+
>>>>>>        | backup_top |---------->| target(qcow2) |
>>>>>>        +------------+   CBW     +---------------+
>>>>>>            |
>>>>>> backing |r,w
>>>>>>            v
>>>>>>        +-------------+
>>>>>>        | Active disk |
>>>>>>        +-------------+
>>>>>>
>>>>>> The driver will be used in backup instead of write-notifiers.
>>>>>>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>> ---
>>>>>>     block/backup-top.h  |  64 +++++++++
>>>>>>     block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>     block/Makefile.objs |   2 +
>>>>>>     3 files changed, 388 insertions(+)
>>>>>>     create mode 100644 block/backup-top.h
>>>>>>     create mode 100644 block/backup-top.c
>>>>>>
>>>>>> diff --git a/block/backup-top.h b/block/backup-top.h
>>>>>> new file mode 100644
>>>>>> index 0000000000..788e18c358
>>>>>> --- /dev/null
>>>>>> +++ b/block/backup-top.h
>>>
>>> [...]
>>>
>>>>>> +/*
>>>>>> + * bdrv_backup_top_append
>>>>>> + *
>>>>>> + * Append backup_top filter node above @source node. @target node will receive
>>>>>> + * the data backed up during CBE operations. New filter together with @target
>>>>>> + * node are attached to @source aio context.
>>>>>> + *
>>>>>> + * The resulting filter node is implicit.
>>>>>
>>>>> Why?  It’s just as easy for the caller to just make it implicit if it
>>>>> should be.  (And usually the caller should decide that.)
>>>>
>>>> Personally, I don't know the reasons for filters to be implicit or not;
>>>> I just made it like the other job filters. I can move making it implicit
>>>> to the caller, or drop it entirely (if that works).
>>>
>>> Nodes are implicit if they haven’t been added consciously by the user.
>>> A node added by a block job can be non-implicit, too, as mirror shows;
>>> If the user specifies the filter-node-name option, they will know about
>>> the node, thus it is no longer implicit.
>>>
>>> If the user doesn’t know about the node (they didn’t give the
>>> filter-node-name option), the node is implicit.
>>>
>>
>> Ok, I understand that. But it doesn't show why the node should be implicit.
>> Isn't it less buggy to make all filters explicit? We don't hide implicit
>> nodes from query-named-block-nodes (currently the only interface to explore
>> the whole graph) anyway. And we can't completely hide the side effects of an
>> additional node in the graph.
> 
> Well, we try, at least.  At least we hide them from query-block.
> 
>> So, is there any real benefit to supporting the separation into implicit and
>> explicit filters? It seems to me that it only complicates things...
>> In other words, what would break if we made all filters explicit?
> 
> The theory is that qemu may decide to add nodes at any point, but at
> least when managing chains etc., they may not be visible to the user.  I
> don’t think we can get rid of them so easily.
> 
> One example that isn’t implemented yet is copy-on-read.  In theory,
> specifying copy-on-read=on for -drive should create an implicit COR node
> on top.  The user shouldn’t see that node when inspecting the drive or
> when issuing block jobs on it, etc.  And this node has to stay there
> when the user does e.g. an active commit somewhere down the chain.

Why not instead just write in the documentation that yes, a filter is created
when you enable copy-on-read? And do the same for all operations which may
create a filter; we don't have a lot of them. And add an optional parameter to
set the node-name of the created filter.

> 
> That sounds like a horrible ordeal to implement, so it hasn’t been done
> yet.  Maybe it never will.  It isn’t that bad for the job filters,
> because they generally freeze the block graph, so there is no problem
> with potential modifications.
> 
> All in all I do think having implicit nodes makes sense.  Maybe not so
> much now, but in the future (if someone implements converting -drive COR
> and throttle options to implicit nodes...).

But do we at least have a strong definition of how implicit nodes should behave
under graph-modifying operations around them? Should new implicit/explicit
filters be created above or below them? We should specify this in the
documentation of all such operations; otherwise the effect of implicit nodes
may change unpredictably for the user. I'm afraid implicit filters will just
extend my list of features which should not be used if we want to be able to
maintain all this mess:

1. If something around filenames doesn't work, use node-names; better yet,
don't use filename-based APIs.

2. If qemu-img can't do what you need, use a paused qemu process; better yet,
don't use qemu-img.

and a new one:

3. If something is broken around jobs, filters, or any other block operations,
set filter-node-name in the parameters to make the filter node explicit;
better yet, always set it.

So at the least, I think we should make it possible for every filter to be
explicit if the user wants, by letting them set its node name like in mirror.

(And then I will not really care about the implicit-filter logic, just as I
don't really care about the filename-API logic.)

Also, implicit filters are currently available for direct manipulation by the
user, as their node-names are exported in query-named-block-nodes and nothing
prevents using these names in different APIs.

> 
>>> [...]
>>>
>>>>>> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
>>>>>> +{
>>>>>> +    if (!bs->backing) {
>>>>>> +        return 0;
>>>>>> +    }
>>>>>> +
>>>>>> +    return bdrv_co_flush(bs->backing->bs);
>>>>>
>>>>> Should we flush the target, too?
>>>>
>>>> Hm, you've asked it already, on previous version :)
>>>
>>> I wasn’t sure...
>>>
>>>> Backup don't do it,
>>>> so I just keep old behavior. And what is the reason to flush backup target
>>>> on any guest flush?
>>>
>>> Hm, well, what’s the reason not to do it?
>>
>> guest flushes will be slowed down?
> 
> Hm, the user could specify cache=unsafe if they don’t care.  Which gives
> me second thoughs... [1]
> 
>>> Also, there are not only guest flushes.  bdrv_flush_all() exists, which
>>> is called when the guest is stopped.  So who is going to flush the
>>> target if not its parent?
>>>
>>> [...]
>>
>> Backup job? But filter should not relay on it..
> 
> Hm.  Backup job actually doesn’t sound that wrong.
> 
>> But should really filter do that job, or it is here only to do CBW? Maybe target
>> must have another parent which will care about flushing.
>>
>> Ok, I think I'll flush target here too for safety, and leave a comment, that something
>> smarter would be better.
>> (or, if we decide to flush all children by default in generic code, I'll drop this handler)
> 
> [1] Thinking more about this, for normal backups the target file does
> not reflect a valid disk state until the backup is done; just like for
> qemu-img convert.  Just like qemu-img convert, there is therefore no
> reason to flush the target until the job is done.
> 
> But there is also the other case of image fleecing.  In this case, the
> target will have another parent, so bdrv_flush_all() will be done by
> someone.  Still, it intuitively makes sense to me that in this case, the
> backup-top filter should flush the target if the guest decides to flush
> the device (so that the target stays consistent on disk).
> 
> backup-top currently cannot differentiate between the cases, but maybe
> that is generally a useful hint to give to it?  (That is, whether the
> target has a consistent state or not.)

Hmm, for fleecing we don't need a consistent state of the temporary image. If something
fails, the whole operation is considered failed. And anyway, I don't see the relation
between the consistency of the temporary fleecing image and guest flushes, so why should
we bind them?
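Still, if we go the safe route I mentioned, the handler could look roughly like this (only a sketch, not the final patch; it assumes the target child is kept in the backup-top state as s->target, as elsewhere in this series):

```c
static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
{
    BDRVBackupTopState *s = bs->opaque;
    int ret;

    if (!bs->backing) {
        return 0;
    }

    ret = bdrv_co_flush(bs->backing->bs);
    if (ret < 0) {
        return ret;
    }

    /*
     * Flush the target too, so that already-copied data survives e.g.
     * a host crash.  Something smarter (the job, or another parent of
     * the target, doing the flushing) may be better in the long run.
     */
    if (s->target) {
        ret = bdrv_co_flush(s->target->bs);
    }

    return ret;
}
```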

> 
> [...]
> 
>>>>>> +        }
>>>>>> +    } else {
>>>>>> +        /* Source child */
>>>>>> +        if (perm & BLK_PERM_WRITE) {
>>>>>
>>>>> Or WRITE_UNCHANGED, no?
>>>>
>>>> Why? We don't need doing CBW for unchanged write.
>>>
>>> But we will do it still, won’t we?
>>>
>>> (If an unchanging write comes in, this driver will handle it just like a
>>> normal write, will it not?)
>>
>> No, it will not, I check this flag in backup_top_co_pwritev
> 
> Oops.  My bad.
> 
> Max
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-17 10:36             ` Vladimir Sementsov-Ogievskiy
@ 2019-06-17 14:56               ` Max Reitz
  2019-06-17 15:53                 ` Kevin Wolf
  2019-06-18  7:25                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 2 replies; 32+ messages in thread
From: Max Reitz @ 2019-06-17 14:56 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block
  Cc: fam, kwolf, Denis Lunev, stefanha, jsnow



On 17.06.19 12:36, Vladimir Sementsov-Ogievskiy wrote:
> 14.06.2019 23:03, Max Reitz wrote:
>> On 14.06.19 18:22, Vladimir Sementsov-Ogievskiy wrote:
>>> 14.06.2019 15:57, Max Reitz wrote:
>>>> On 14.06.19 11:04, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 13.06.2019 18:57, Max Reitz wrote:
>>>>>> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> Backup-top filter does copy-before-write operation. It should be
>>>>>>> inserted above active disk and has a target node for CBW, like the
>>>>>>> following:
>>>>>>>
>>>>>>>        +-------+
>>>>>>>        | Guest |
>>>>>>>        +-------+
>>>>>>>            |r,w
>>>>>>>            v
>>>>>>>        +------------+  target   +---------------+
>>>>>>>        | backup_top |---------->| target(qcow2) |
>>>>>>>        +------------+   CBW     +---------------+
>>>>>>>            |
>>>>>>> backing |r,w
>>>>>>>            v
>>>>>>>        +-------------+
>>>>>>>        | Active disk |
>>>>>>>        +-------------+
>>>>>>>
>>>>>>> The driver will be used in backup instead of write-notifiers.
>>>>>>>
>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>> ---
>>>>>>>     block/backup-top.h  |  64 +++++++++
>>>>>>>     block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>     block/Makefile.objs |   2 +
>>>>>>>     3 files changed, 388 insertions(+)
>>>>>>>     create mode 100644 block/backup-top.h
>>>>>>>     create mode 100644 block/backup-top.c
>>>>>>>
>>>>>>> diff --git a/block/backup-top.h b/block/backup-top.h
>>>>>>> new file mode 100644
>>>>>>> index 0000000000..788e18c358
>>>>>>> --- /dev/null
>>>>>>> +++ b/block/backup-top.h
>>>>
>>>> [...]
>>>>
>>>>>>> +/*
>>>>>>> + * bdrv_backup_top_append
>>>>>>> + *
>>>>>>> + * Append backup_top filter node above @source node. @target node will receive
>>>>>>> + * the data backed up during CBE operations. New filter together with @target
>>>>>>> + * node are attached to @source aio context.
>>>>>>> + *
>>>>>>> + * The resulting filter node is implicit.
>>>>>>
>>>>>> Why?  It’s just as easy for the caller to just make it implicit if it
>>>>>> should be.  (And usually the caller should decide that.)
>>>>>
>>>>> Personally, I don't know what are the reasons for filters to bi implicit or not,
>>>>> I just made it like other job-filters.. I can move making-implicit to the caller
>>>>> or drop it at all (if it will work).
>>>>
>>>> Nodes are implicit if they haven’t been added consciously by the user.
>>>> A node added by a block job can be non-implicit, too, as mirror shows;
>>>> If the user specifies the filter-node-name option, they will know about
>>>> the node, thus it is no longer implicit.
>>>>
>>>> If the user doesn’t know about the node (they didn’t give the
>>>> filter-node-name option), the node is implicit.
>>>>
>>>
>>> Ok, I understand it. But it doesn't show, why it should be implicit?
>>> Isn't it less buggy to make all filters explicit? We don't hide implicit nodes
>>> from query-named-block-nodes (the only interface to explore the whole graph for now)
>>> anyway. And we can't absolutely hide side effects of additional node in the graph.
>>
>> Well, we try, at least.  At least we hide them from query-block.
>>
>>> So, is there any real benefit of supporting separation into implicit and explicit filters?
>>> It seems for me that it only complicates things...
>>> In other words, what will break if we make all filters explicit?
>>
>> The theory is that qemu may decide to add nodes at any point, but at
>> least when managing chains etc., they may not be visible to the user.  I
>> don’t think we can get rid of them so easily.
>>
>> One example that isn’t implemented yet is copy-on-read.  In theory,
>> specifying copy-on-read=on for -drive should create an implicit COR node
>> on top.  The user shouldn’t see that node when inspecting the drive or
>> when issuing block jobs on it, etc.  And this node has to stay there
>> when the user does e.g. an active commit somewhere down the chain.
> 
> Why instead not to just write in doc that yes, filter is created when you
> enable copy-on-read?

Because the problem is that existing users are not aware of that.

If they were, they could simply create a COR filter already.

I suppose we could interpret the deprecation policy in a way that we
could change the behavior of -drive copy-on-read=on, but as I already
said, what’s the point of supporting -drive copy-on-read=on, when you
can simply explicitly create a COR filter on top?

> And do same for all operations which may create filter,
> we don't have a lot of them? And add optional parameter to set node-name for
> created filter.

That optional parameter exists, and if it is given, the user shows that
they are aware of the node.  Hence the node becomes explicit then.

The problem is with legacy users that use an existing interface and
suddenly the externally visible interface of qemu changes in a way that
could break them.  I suppose additional entries in
query-named-block-nodes are not breakage.  Maybe they are; that would be
a bug then.

(If nobody has noticed so far, that may be because these legacy
applications didn’t use query-named-block-nodes.  But now maybe libvirt
does, and so if we converted copy-on-read=on to a COR node, it would
notice now and thus break.  So just because our handling of implicit
nodes is broken right now does not mean we can abandon the concept
altogether.)

>> That sounds like a horrible ordeal to implement, so it hasn’t been done
>> yet.  Maybe it never will.  It isn’t that bad for the job filters,
>> because they generally freeze the block graph, so there is no problem
>> with potential modifications.
>>
>> All in all I do think having implicit nodes makes sense.  Maybe not so
>> much now, but in the future (if someone implements converting -drive COR
>> and throttle options to implicit nodes...).
> 
> But do we have at least strong definition of how implicit nodes should behave
> on any graph-modifying operations around them?

No.  Depends on the node.

> Should new implicit/explicit
> filters be created above or under them?

That was always the most difficult question we had when we introduced
filters.

The problem is that we never answered it in our code base.

One day, we just added filters (“let’s see what happens”), and in the
beginning, they were all explicit, so we still didn’t have to answer
this question (in code).  Then job filters were added, and we still
didn’t have to, because they set blockers so the graph couldn’t be
reorganized with them in place anyway.

If we converted copy-on-read=on to a COR node, we would have to answer
that question.

> We should specify it in doc about all
> such operations, otherwise effect of implicit nodes may change unpredictably for
> the user. I'm afraid, that implicit filters will just continue my list of
> features which should not be used if we want to be able to maintain all this mess..

Yeah, well, that is nothing new at all.  For example, note the “Split
I/O throttling off into own BDS – Requires some care with snapshots
etc.” here:
https://wiki.qemu.org/ToDo/Block#Dynamic_graph_reconfiguration_.28e.g._adding_block_filters.2C_taking_snapshots.2C_etc..29

This was always a problem.  It was always the biggest problem with
filters.  We never answered it because it never became an acute problem,
as I described above.

We just said “For explicit filters, they just stay where they are, and
that’s OK because the user can take care of them.”  We never did add the
implicit filters that were actually going to become a problem (COR and
throttle).

So as I said above, the behavior of implicit nodes needs to be examined
on a case-by-case basis.  Well, actually, this is only about COR and
throttle, really, and both would behave the same: Always stay at the
BlockBackend, because that is the behavior when you currently use
BB-level throttling or COR.

The jobs that now use implicit nodes freeze the graph, so there is no
problem there.

(Also, by the way, we do not need the option to give COR/throttle nodes
created through BB options a node name.  As I said above, at that point
you can just create the node yourself.)
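(For completeness, a user can already create such an explicit COR node today via blockdev-add; the node names here are made up:)

```
{ "execute": "blockdev-add",
  "arguments": { "driver": "copy-on-read",
                 "node-name": "cor0",
                 "file": "fmt0" } }
```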

> 1. If you something around filenames don't work, use node-names, and better
> don't use filename-based APIs

I think we agree that filename-based APIs were a mistake.  (Though only
in hindsight.  They were added when there were no node names, so it did
make sense then.)

So now you should not use them, correct.

> 2. If you qemu-img can't do what you need, use paused qemu process, and better,
> don't use qemu-img

There was always a plan to give qemu-img feature parity with qemu proper.
But often the interface is just a problem.  Node names really don’t
make much sense with qemu-img, I think.

(I mean, feel free to add something that automatically names the backing
nodes e.g. backing.0, backing.1, etc., and then add an interface to
qemu-img commit to use node names, but I find that weird.)

So I don’t quite see the problem.  Of course qemu has more functionality
than qemu-img.  qemu-img is fine as long as you have a simple set-up,
but once it gets more complicated, you need management software; and
that software shouldn’t care much whether it uses qemu-img or qemu itself.

> and a new one:
> 
> 3. If something is broken around jobs and filters and any block operations,
> set filter-node-name in parameters to make filter node explicit. And better,
> always set it..

Uh, yeah?  I don’t see the problem here.

This is like saying “If I want to use new features, I should use
-blockdev and not -hda”.  Yes, that is correct.  You should set
filter-node-name if you care about the exact state of the graph.

> So at least, I think we should make it possible for all filters to be explicit
> if user wants, by setting its name like in mirror.

For all filters that the user has no other way of creating, yes.
(That is, the filters created by the block jobs.)

> (and then, I will not really care of implicit-filters related logic, like I don't
> really care about filename-api related logic)
> 
> Also, now, implict filters are actually available for direct manipulation by user,
> as their node-names are exported in query-named-block-nodes and nothing prevents
> using these names in differet APIs.

Hm, yes.  Is that a problem?

>>>> [...]
>>>>
>>>>>>> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
>>>>>>> +{
>>>>>>> +    if (!bs->backing) {
>>>>>>> +        return 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return bdrv_co_flush(bs->backing->bs);
>>>>>>
>>>>>> Should we flush the target, too?
>>>>>
>>>>> Hm, you've asked it already, on previous version :)
>>>>
>>>> I wasn’t sure...
>>>>
>>>>> Backup don't do it,
>>>>> so I just keep old behavior. And what is the reason to flush backup target
>>>>> on any guest flush?
>>>>
>>>> Hm, well, what’s the reason not to do it?
>>>
>>> guest flushes will be slowed down?
>>
>> Hm, the user could specify cache=unsafe if they don’t care.  Which gives
>> me second thoughs... [1]
>>
>>>> Also, there are not only guest flushes.  bdrv_flush_all() exists, which
>>>> is called when the guest is stopped.  So who is going to flush the
>>>> target if not its parent?
>>>>
>>>> [...]
>>>
>>> Backup job? But filter should not relay on it..
>>
>> Hm.  Backup job actually doesn’t sound that wrong.
>>
>>> But should really filter do that job, or it is here only to do CBW? Maybe target
>>> must have another parent which will care about flushing.
>>>
>>> Ok, I think I'll flush target here too for safety, and leave a comment, that something
>>> smarter would be better.
>>> (or, if we decide to flush all children by default in generic code, I'll drop this handler)
>>
>> [1] Thinking more about this, for normal backups the target file does
>> not reflect a valid disk state until the backup is done; just like for
>> qemu-img convert.  Just like qemu-img convert, there is therefore no
>> reason to flush the target until the job is done.
>>
>> But there is also the other case of image fleecing.  In this case, the
>> target will have another parent, so bdrv_flush_all() will be done by
>> someone.  Still, it intuitively makes sense to me that in this case, the
>> backup-top filter should flush the target if the guest decides to flush
>> the device (so that the target stays consistent on disk).
>>
>> backup-top currently cannot differentiate between the cases, but maybe
>> that is generally a useful hint to give to it?  (That is, whether the
>> target has a consistent state or not.)
> 
> Hmm, for fleecing we don't need consistent state of temporary image. If something fails,
> the whole operation is considered to be failed. And anyway, I don't see relations
> between consistency of temporary fleecing image and guest flushes, why should we
> bind them?

Guest flushes are about clearing the cache and keeping the disk state
consistent in case of e.g. power failure.  So if the target should be
consistent, it too should then survive a power failure.

Hm, but then again, if we don’t enforce on the host that the target is
always flushed before the source, that consistency is quickly lost when
the host decides to flush the cache just because it wants to.

Hmmm.  I guess we could say that the target is always inconsistent on
e.g. power failure.

My problem is that the target is also, well, a backup, you know.  So
it’s kind of a pain if we don’t care about it.  The user probably wanted
to save that old data, and now it may be lost just because we didn’t
flush it when the user entered “sync” in the VM.  I don’t know.

Max




* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-17 14:56               ` Max Reitz
@ 2019-06-17 15:53                 ` Kevin Wolf
  2019-06-17 16:01                   ` Max Reitz
  2019-06-18  7:25                 ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2019-06-17 15:53 UTC (permalink / raw)
  To: Max Reitz
  Cc: fam, Vladimir Sementsov-Ogievskiy, Denis Lunev, qemu-block,
	qemu-devel, stefanha, jsnow


Am 17.06.2019 um 16:56 hat Max Reitz geschrieben:
> On 17.06.19 12:36, Vladimir Sementsov-Ogievskiy wrote:
> > 14.06.2019 23:03, Max Reitz wrote:
> >> On 14.06.19 18:22, Vladimir Sementsov-Ogievskiy wrote:
> >>> 14.06.2019 15:57, Max Reitz wrote:
> >>>> On 14.06.19 11:04, Vladimir Sementsov-Ogievskiy wrote:
> >>>>> 13.06.2019 18:57, Max Reitz wrote:
> >>>>>> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
> >>>>>>> Backup-top filter does copy-before-write operation. It should be
> >>>>>>> inserted above active disk and has a target node for CBW, like the
> >>>>>>> following:
> >>>>>>>
> >>>>>>>        +-------+
> >>>>>>>        | Guest |
> >>>>>>>        +-------+
> >>>>>>>            |r,w
> >>>>>>>            v
> >>>>>>>        +------------+  target   +---------------+
> >>>>>>>        | backup_top |---------->| target(qcow2) |
> >>>>>>>        +------------+   CBW     +---------------+
> >>>>>>>            |
> >>>>>>> backing |r,w
> >>>>>>>            v
> >>>>>>>        +-------------+
> >>>>>>>        | Active disk |
> >>>>>>>        +-------------+
> >>>>>>>
> >>>>>>> The driver will be used in backup instead of write-notifiers.
> >>>>>>>
> >>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>>>>> ---
> >>>>>>>     block/backup-top.h  |  64 +++++++++
> >>>>>>>     block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>     block/Makefile.objs |   2 +
> >>>>>>>     3 files changed, 388 insertions(+)
> >>>>>>>     create mode 100644 block/backup-top.h
> >>>>>>>     create mode 100644 block/backup-top.c
> >>>>>>>
> >>>>>>> diff --git a/block/backup-top.h b/block/backup-top.h
> >>>>>>> new file mode 100644
> >>>>>>> index 0000000000..788e18c358
> >>>>>>> --- /dev/null
> >>>>>>> +++ b/block/backup-top.h
> >>>>
> >>>> [...]
> >>>>
> >>>>>>> +/*
> >>>>>>> + * bdrv_backup_top_append
> >>>>>>> + *
> >>>>>>> + * Append backup_top filter node above @source node. @target node will receive
> >>>>>>> + * the data backed up during CBE operations. New filter together with @target
> >>>>>>> + * node are attached to @source aio context.
> >>>>>>> + *
> >>>>>>> + * The resulting filter node is implicit.
> >>>>>>
> >>>>>> Why?  It’s just as easy for the caller to just make it implicit if it
> >>>>>> should be.  (And usually the caller should decide that.)
> >>>>>
> >>>>> Personally, I don't know what are the reasons for filters to bi implicit or not,
> >>>>> I just made it like other job-filters.. I can move making-implicit to the caller
> >>>>> or drop it at all (if it will work).
> >>>>
> >>>> Nodes are implicit if they haven’t been added consciously by the user.
> >>>> A node added by a block job can be non-implicit, too, as mirror shows;
> >>>> If the user specifies the filter-node-name option, they will know about
> >>>> the node, thus it is no longer implicit.
> >>>>
> >>>> If the user doesn’t know about the node (they didn’t give the
> >>>> filter-node-name option), the node is implicit.
> >>>>
> >>>
> >>> Ok, I understand it. But it doesn't show, why it should be implicit?
> >>> Isn't it less buggy to make all filters explicit? We don't hide implicit nodes
> >>> from query-named-block-nodes (the only interface to explore the whole graph for now)
> >>> anyway. And we can't absolutely hide side effects of additional node in the graph.
> >>
> >> Well, we try, at least.  At least we hide them from query-block.
> >>
> >>> So, is there any real benefit of supporting separation into implicit and explicit filters?
> >>> It seems for me that it only complicates things...
> >>> In other words, what will break if we make all filters explicit?
> >>
> >> The theory is that qemu may decide to add nodes at any point, but at
> >> least when managing chains etc., they may not be visible to the user.  I
> >> don’t think we can get rid of them so easily.

I think the important point to understand here is that implicit filters
are a concept to maintain compatibility with legacy (pre-blockdev)
management tools which simply don't manage stuff on the node level. So
changing the structure of the graph is not a problem for them and we
just need to make sure that when they do something with a BlockBackend,
we perform the action on the node that they actually mean.

Modern management tools are expected to manage the graph on a node level
and to assign node names to everything.

Note that libvirt is close to becoming a modern management tool, so it's
probably a good idea to now make all block jobs add filters where we'll
need them in the long run.

> >> One example that isn’t implemented yet is copy-on-read.  In theory,
> >> specifying copy-on-read=on for -drive should create an implicit COR node
> >> on top.  The user shouldn’t see that node when inspecting the drive or
> >> when issuing block jobs on it, etc.  And this node has to stay there
> >> when the user does e.g. an active commit somewhere down the chain.
> > 
> > Why instead not to just write in doc that yes, filter is created when you
> > enable copy-on-read?
> 
> Because the problem is that existing users are not aware of that.
> 
> If they were, they could simply create a COR filter already.
> 
> I suppose we could interpret the deprecation policy in a way that we
> could change the behavior of -drive copy-on-read=on, but as I already
> said, what’s the point of supporting -drive copy-on-read=on, when you
> can simply explicitly create a COR filter on top?

I actually changed my opinion on how to do this since we introduced
implicit filters. I now think that the right way to move forward is to
make sure that the copy-on-read filter can do everything it needs to do,
and then just completely deprecate -drive copy-on-read=on instead of
adding compatibility magic to turn it into an implicit copy-on-read
filter internally.

> >> That sounds like a horrible ordeal to implement, so it hasn’t been done
> >> yet.  Maybe it never will.  It isn’t that bad for the job filters,
> >> because they generally freeze the block graph, so there is no problem
> >> with potential modifications.
> >>
> >> All in all I do think having implicit nodes makes sense.  Maybe not so
> >> much now, but in the future (if someone implements converting -drive COR
> >> and throttle options to implicit nodes...).
> > 
> > But do we have at least strong definition of how implicit nodes should behave
> > on any graph-modifying operations around them?
> 
> No.  Depends on the node.
> 
> > Should new implicit/explicit
> > filters be created above or under them?
> 
> That was always the most difficult question we had when we introduced
> filters.
> 
> The problem is that we never answered it in our code base.
> 
> One day, we just added filters (“let’s see what happens”), and in the
> beginning, they were all explicit, so we still didn’t have to answer
> this question (in code).  Then job filters were added, and we still
> didn’t have to, because they set blockers so the graph couldn’t be
> reorganized with them in place anyway.
> 
> If we converted copy-on-read=on to a COR node, we would have to answer
> that question.

That's why we shouldn't do that, but just remove copy-on-read=on. :-)

> >>> But should really filter do that job, or it is here only to do CBW? Maybe target
> >>> must have another parent which will care about flushing.
> >>>
> >>> Ok, I think I'll flush target here too for safety, and leave a comment, that something
> >>> smarter would be better.
> >>> (or, if we decide to flush all children by default in generic code, I'll drop this handler)
> >>
> >> [1] Thinking more about this, for normal backups the target file does
> >> not reflect a valid disk state until the backup is done; just like for
> >> qemu-img convert.  Just like qemu-img convert, there is therefore no
> >> reason to flush the target until the job is done.

Depends, the target could have the source as its backing file (like
fleecing, but without exporting it right now), and then you could always
have a consistent view on the target. Well, almost.

Almost because to guarantee this, we'd have to flush between each CBW
and the corresponding write to the same block, to make sure that the old
data is backed up before it is overwritten.

Of course, this would perform about as badly as internal COW in qcow2...
So probably not a guarantee we should be making by default. But it might
make sense as an option.

Kevin



* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-17 15:53                 ` Kevin Wolf
@ 2019-06-17 16:01                   ` Max Reitz
  2019-06-17 16:25                     ` Kevin Wolf
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2019-06-17 16:01 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: fam, Vladimir Sementsov-Ogievskiy, Denis Lunev, qemu-block,
	qemu-devel, stefanha, jsnow



On 17.06.19 17:53, Kevin Wolf wrote:
> Am 17.06.2019 um 16:56 hat Max Reitz geschrieben:
>> On 17.06.19 12:36, Vladimir Sementsov-Ogievskiy wrote:
>>> 14.06.2019 23:03, Max Reitz wrote:
>>>> On 14.06.19 18:22, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 14.06.2019 15:57, Max Reitz wrote:
>>>>>> On 14.06.19 11:04, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 13.06.2019 18:57, Max Reitz wrote:
>>>>>>>> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>>> Backup-top filter does copy-before-write operation. It should be
>>>>>>>>> inserted above active disk and has a target node for CBW, like the
>>>>>>>>> following:
>>>>>>>>>
>>>>>>>>>        +-------+
>>>>>>>>>        | Guest |
>>>>>>>>>        +-------+
>>>>>>>>>            |r,w
>>>>>>>>>            v
>>>>>>>>>        +------------+  target   +---------------+
>>>>>>>>>        | backup_top |---------->| target(qcow2) |
>>>>>>>>>        +------------+   CBW     +---------------+
>>>>>>>>>            |
>>>>>>>>> backing |r,w
>>>>>>>>>            v
>>>>>>>>>        +-------------+
>>>>>>>>>        | Active disk |
>>>>>>>>>        +-------------+
>>>>>>>>>
>>>>>>>>> The driver will be used in backup instead of write-notifiers.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>>>> ---
>>>>>>>>>     block/backup-top.h  |  64 +++++++++
>>>>>>>>>     block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>     block/Makefile.objs |   2 +
>>>>>>>>>     3 files changed, 388 insertions(+)
>>>>>>>>>     create mode 100644 block/backup-top.h
>>>>>>>>>     create mode 100644 block/backup-top.c
>>>>>>>>>
>>>>>>>>> diff --git a/block/backup-top.h b/block/backup-top.h
>>>>>>>>> new file mode 100644
>>>>>>>>> index 0000000000..788e18c358
>>>>>>>>> --- /dev/null
>>>>>>>>> +++ b/block/backup-top.h
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>>> +/*
>>>>>>>>> + * bdrv_backup_top_append
>>>>>>>>> + *
>>>>>>>>> + * Append backup_top filter node above @source node. @target node will receive
>>>>>>>>> + * the data backed up during CBE operations. New filter together with @target
>>>>>>>>> + * node are attached to @source aio context.
>>>>>>>>> + *
>>>>>>>>> + * The resulting filter node is implicit.
>>>>>>>>
>>>>>>>> Why?  It’s just as easy for the caller to just make it implicit if it
>>>>>>>> should be.  (And usually the caller should decide that.)
>>>>>>>
>>>>>>> Personally, I don't know what are the reasons for filters to bi implicit or not,
>>>>>>> I just made it like other job-filters.. I can move making-implicit to the caller
>>>>>>> or drop it at all (if it will work).
>>>>>>
>>>>>> Nodes are implicit if they haven’t been added consciously by the user.
>>>>>> A node added by a block job can be non-implicit, too, as mirror shows;
>>>>>> If the user specifies the filter-node-name option, they will know about
>>>>>> the node, thus it is no longer implicit.
>>>>>>
>>>>>> If the user doesn’t know about the node (they didn’t give the
>>>>>> filter-node-name option), the node is implicit.
>>>>>>
>>>>>
>>>>> Ok, I understand it. But it doesn't show, why it should be implicit?
>>>>> Isn't it less buggy to make all filters explicit? We don't hide implicit nodes
>>>>> from query-named-block-nodes (the only interface to explore the whole graph for now)
>>>>> anyway. And we can't absolutely hide side effects of additional node in the graph.
>>>>
>>>> Well, we try, at least.  At least we hide them from query-block.
>>>>
>>>>> So, is there any real benefit of supporting separation into implicit and explicit filters?
>>>>> It seems for me that it only complicates things...
>>>>> In other words, what will break if we make all filters explicit?
>>>>
>>>> The theory is that qemu may decide to add nodes at any point, but at
>>>> least when managing chains etc., they may not be visible to the user.  I
>>>> don’t think we can get rid of them so easily.
> 
> I think the important point to understand here is that implicit filters
> are a concept to maintain compatibility with legacy (pre-blockdev)
> management tools which simply don't manage stuff on the node level. So
> changing the structure of the graph is not a problem for them and we
> just need to make sure that when they do something with a BlockBackend,
> we perform the action on the node that they actually mean.
> 
> Modern management tools are expected to manage the graph on a node level
> and to assign node names to everything.
> 
> Note that libvirt is close to becoming a modern management tool, so it's
> probably a good idea to now make all block jobs add filters where we'll
> need them in the long run.
> 
>>>> One example that isn’t implemented yet is copy-on-read.  In theory,
>>>> specifying copy-on-read=on for -drive should create an implicit COR node
>>>> on top.  The user shouldn’t see that node when inspecting the drive or
>>>> when issuing block jobs on it, etc.  And this node has to stay there
>>>> when the user does e.g. an active commit somewhere down the chain.
>>>
>>> Why instead not to just write in doc that yes, filter is created when you
>>> enable copy-on-read?
>>
>> Because the problem is that existing users are not aware of that.
>>
>> If they were, they could simply create a COR filter already.
>>
>> I suppose we could interpret the deprecation policy in a way that we
>> could change the behavior of -drive copy-on-read=on, but as I already
>> said, what’s the point of supporting -drive copy-on-read=on, when you
>> can simply explicitly create a COR filter on top?
> 
> I actually changed my opinion on how to do this since we introduced
> implicit filters. I now think that the right way to move forward is to
> make sure that the copy-on-read filter can do everything it needs to do,
> and then just completely deprecate -drive copy-on-read=on instead of
> adding compatibility magic to turn it into an implicit copy-on-read
> filter internally.

Sure, so your answer to “what’s the point of supporting -drive
copy-on-read=on” is “there is none”.  Fair enough. :-)

>>>> That sounds like a horrible ordeal to implement, so it hasn’t been done
>>>> yet.  Maybe it never will.  It isn’t that bad for the job filters,
>>>> because they generally freeze the block graph, so there is no problem
>>>> with potential modifications.
>>>>
>>>> All in all I do think having implicit nodes makes sense.  Maybe not so
>>>> much now, but in the future (if someone implements converting -drive COR
>>>> and throttle options to implicit nodes...).
>>>
>>> But do we at least have a strong definition of how implicit nodes should behave
>>> under any graph-modifying operations around them?
>>
>> No.  Depends on the node.
>>
>>> Should new implicit/explicit
>>> filters be created above or under them?
>>
>> That was always the most difficult question we had when we introduced
>> filters.
>>
>> The problem is that we never answered it in our code base.
>>
>> One day, we just added filters (“let’s see what happens”), and in the
>> beginning, they were all explicit, so we still didn’t have to answer
>> this question (in code).  Then job filters were added, and we still
>> didn’t have to, because they set blockers so the graph couldn’t be
>> reorganized with them in place anyway.
>>
>> If we converted copy-on-read=on to a COR node, we would have to answer
>> that question.
> 
> That's why we shouldn't do that, but just remove copy-on-read=on. :-)

And BB-level throttling, fair enough.

(Although the first step would probably be to make throttle groups
non-experimental?  Like, drop the x- prefix for all their parameters.)

>>>>> But should the filter really do that job, or is it here only to do CBW? Maybe the
>>>>> target must have another parent which will take care of flushing.
>>>>>
>>>>> Ok, I think I'll flush the target here too for safety, and leave a comment that
>>>>> something smarter would be better.
>>>>> (or, if we decide to flush all children by default in generic code, I'll drop this handler)
>>>>
>>>> [1] Thinking more about this, for normal backups the target file does
>>>> not reflect a valid disk state until the backup is done; just like for
>>>> qemu-img convert.  Just like qemu-img convert, there is therefore no
>>>> reason to flush the target until the job is done.
> 
> Depends, the target could have the source as its backing file (like
> fleecing, but without exporting it right now), and then you could always
> have a consistent view on the target. Well, almost.
> 
> Almost because to guarantee this, we'd have to flush between each CBW
> and the corresponding write to the same block, to make sure that the old
> data is backed up before it is overwritten.

Yes, that’s what I meant by “enforce on the host that the target is
always flushed before the source”.  Well, I meant to say there is no
good way of doing that, and I kind of don’t consider this a good way.

> Of course, this would perform about as badly as internal COW in qcow2...
> So probably not a guarantee we should be making by default. But it might
> make sense as an option.

I don’t know.  “Use this option so your data isn’t lost in case of a
power failure, but it makes everything pretty slow”?  Who is going to
explicitly enable that (before their first data loss)?
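
(As a sketch of the ordering being discussed, at the file level: the point is
only that a flush of the target has to land between backing up the old data and
overwriting the source. This is an illustrative standalone model, not QEMU
code; the file names are made up.)

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
target = os.open(os.path.join(tmp, "target.img"), os.O_CREAT | os.O_RDWR, 0o600)
source = os.open(os.path.join(tmp, "source.img"), os.O_CREAT | os.O_RDWR, 0o600)
os.pwrite(source, b"OLD0", 0)            # initial source contents

def cbw_then_write(offset, new_data):
    old = os.pread(source, len(new_data), offset)
    os.pwrite(target, old, offset)       # 1. copy-before-write to the target
    os.fsync(target)                     # 2. barrier: old data durable first
    os.pwrite(source, new_data, offset)  # 3. only now overwrite the source

cbw_then_write(0, b"NEW0")
```

The per-write fsync() in step 2 is exactly the cost in question: without it the
kernel may reorder the two writes on their way to disk; with it, every guest
write pays for a full flush of the target.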

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-17 16:01                   ` Max Reitz
@ 2019-06-17 16:25                     ` Kevin Wolf
  2019-06-18  7:19                       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2019-06-17 16:25 UTC (permalink / raw)
  To: Max Reitz
  Cc: fam, Vladimir Sementsov-Ogievskiy, Denis Lunev, qemu-block,
	qemu-devel, stefanha, jsnow


Am 17.06.2019 um 18:01 hat Max Reitz geschrieben:
> >>> Should new implicit/explicit
> >>> filters be created above or under them?
> >>
> >> That was always the most difficult question we had when we introduced
> >> filters.
> >>
> >> The problem is that we never answered it in our code base.
> >>
> >> One day, we just added filters (“let’s see what happens”), and in the
> >> beginning, they were all explicit, so we still didn’t have to answer
> >> this question (in code).  Then job filters were added, and we still
> >> didn’t have to, because they set blockers so the graph couldn’t be
> >> reorganized with them in place anyway.
> >>
> >> If we converted copy-on-read=on to a COR node, we would have to answer
> >> that question.
> > 
> > That's why we shouldn't do that, but just remove copy-on-read=on. :-)
> 
> And BB-level throttling, fair enough.
> 
> (Although the first step would probably be to make throttle groups
> non-experimental?  Like, drop the x- prefix for all their parameters.)

The first step would be making the block drivers full replacements of
the old things, which unfortunately isn't true today.

After your "deal with filters" series, we should be a lot closer for the
core infrastructure at least.

Not sure about copy-on-read, but I know that throttle still doesn't have
feature parity with -drive throttling. At least I'm pretty sure that
something about changing the group or group options at runtime (and not
just dropping the x-) was missing.

> >>>>> But should the filter really do that job, or is it here only to do CBW? Maybe the
> >>>>> target must have another parent which will take care of flushing.
> >>>>>
> >>>>> Ok, I think I'll flush the target here too for safety, and leave a comment that
> >>>>> something smarter would be better.
> >>>>> (or, if we decide to flush all children by default in generic code, I'll drop this handler)
> >>>>
> >>>> [1] Thinking more about this, for normal backups the target file does
> >>>> not reflect a valid disk state until the backup is done; just like for
> >>>> qemu-img convert.  Just like qemu-img convert, there is therefore no
> >>>> reason to flush the target until the job is done.
> > 
> > Depends, the target could have the source as its backing file (like
> > fleecing, but without exporting it right now), and then you could always
> > have a consistent view on the target. Well, almost.
> > 
> > Almost because to guarantee this, we'd have to flush between each CBW
> > and the corresponding write to the same block, to make sure that the old
> > data is backed up before it is overwritten.
> 
> Yes, that’s what I meant by “enforce on the host that the target is
> always flushed before the source”.  Well, I meant to say there is no
> good way of doing that, and I kind of don’t consider this a good way.
> 
> > Of course, this would perform about as badly as internal COW in qcow2...
> > So probably not a guarantee we should be making by default. But it might
> > make sense as an option.
> 
> I don’t know.  “Use this option so your data isn’t lost in case of a
> power failure, but it makes everything pretty slow”?  Who is going to
> explicitly enable that (before their first data loss)?

Maybe the backup job wasn't that clever after all. At least if you care
about keeping the point-in-time snapshot at the start of the backup job
instead of just retrying and getting a snapshot of a different point in
time that is just as good.

If you do care about the specific point in time, the safe way to do it
is to take a snapshot and copy that away, and then delete the snapshot
again.

The only problem is that merging external snapshots is slow and you end
up writing the new data twice. If only we had a COW image format... :-)
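
(A sketch of that snapshot-based flow over QMP; "drive0" and the file names are
made up for illustration, and the exact argument sets vary between QEMU
versions:)

```json
{ "execute": "blockdev-snapshot-sync",
  "arguments": { "device": "drive0",
                 "snapshot-file": "snap.qcow2",
                 "format": "qcow2" } }
```

Guest writes now go to snap.qcow2, leaving the old image as a frozen
point-in-time backing file that can be copied away at leisure, e.g. with
qemu-img convert. Afterwards, a block-commit on drive0 merges the overlay back
down (for an active commit, wait for BLOCK_JOB_READY and then issue
block-job-complete).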

Kevin



* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-17 16:25                     ` Kevin Wolf
@ 2019-06-18  7:19                       ` Vladimir Sementsov-Ogievskiy
  2019-06-18  8:20                         ` Kevin Wolf
  0 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-06-18  7:19 UTC (permalink / raw)
  To: Kevin Wolf, Max Reitz
  Cc: fam, Denis Lunev, qemu-block, qemu-devel, stefanha, jsnow

17.06.2019 19:25, Kevin Wolf wrote:
> Am 17.06.2019 um 18:01 hat Max Reitz geschrieben:
>>>>> Should new implicit/explicit
>>>>> filters be created above or under them?
>>>>
>>>> That was always the most difficult question we had when we introduced
>>>> filters.
>>>>
>>>> The problem is that we never answered it in our code base.
>>>>
>>>> One day, we just added filters (“let’s see what happens”), and in the
>>>> beginning, they were all explicit, so we still didn’t have to answer
>>>> this question (in code).  Then job filters were added, and we still
>>>> didn’t have to, because they set blockers so the graph couldn’t be
>>>> reorganized with them in place anyway.
>>>>
>>>> If we converted copy-on-read=on to a COR node, we would have to answer
>>>> that question.
>>>
>>> That's why we shouldn't do that, but just remove copy-on-read=on. :-)
>>
>> And BB-level throttling, fair enough.
>>
>> (Although the first step would probably be to make throttle groups
>> non-experimental?  Like, drop the x- prefix for all their parameters.)
> 
> The first step would be making the block drivers full replacements of
> the old things, which unfortunately isn't true today.
> 
> After your "deal with filters" series, we should be a lot closer for the
> core infrastructure at least.
> 
> Not sure about copy-on-read, but I know that throttle still doesn't have
> feature parity with -drive throttling. At least I'm pretty sure that
> something about changing the group or group options at runtime (and not
> just dropping the x-) was missing.

OK, thanks, I understand now that implicit filters are not just a feature but a
compatibility mechanism.

So, can we at some point deprecate the "optionality" of the filter-node-name
parameters of jobs, in addition to deprecating things like the old copy-on-read
option? And thereby actually deprecate implicit filters?

> 
>>>>>>> But should the filter really do that job, or is it here only to do CBW? Maybe the
>>>>>>> target must have another parent which will take care of flushing.
>>>>>>>
>>>>>>> Ok, I think I'll flush the target here too for safety, and leave a comment that
>>>>>>> something smarter would be better.
>>>>>>> (or, if we decide to flush all children by default in generic code, I'll drop this handler)
>>>>>>
>>>>>> [1] Thinking more about this, for normal backups the target file does
>>>>>> not reflect a valid disk state until the backup is done; just like for
>>>>>> qemu-img convert.  Just like qemu-img convert, there is therefore no
>>>>>> reason to flush the target until the job is done.
>>>
>>> Depends, the target could have the source as its backing file (like
>>> fleecing, but without exporting it right now), and then you could always
>>> have a consistent view on the target. Well, almost.
>>>
>>> Almost because to guarantee this, we'd have to flush between each CBW
>>> and the corresponding write to the same block, to make sure that the old
>>> data is backed up before it is overwritten.
>>
>> Yes, that’s what I meant by “enforce on the host that the target is
>> always flushed before the source”.  Well, I meant to say there is no
>> good way of doing that, and I kind of don’t consider this a good way.
>>
>>> Of course, this would perform about as badly as internal COW in qcow2...
>>> So probably not a guarantee we should be making by default. But it might
>>> make sense as an option.
>>
>> I don’t know.  “Use this option so your data isn’t lost in case of a
> >> power failure, but it makes everything pretty slow”?  Who is going to
> >> explicitly enable that (before their first data loss)?
> 
> Maybe the backup job wasn't that clever after all. At least if you care
> about keeping the point-in-time snapshot at the start of the backup job
> instead of just retrying and getting a snapshot of a different point in
> time that is just as good.
> 
> If you do care about the specific point in time, the safe way to do it
> is to take a snapshot and copy that away, and then delete the snapshot
> again.
> 
> The only problem is that merging external snapshots is slow and you end
> up writing the new data twice. If only we had a COW image format... :-)
> 

So, I still don't understand the reason for flushing the backup target in the
meantime. The backup target remains invalid until the successful end of the job,
at which point we definitely flush the target, but only once.

If the node crashes during the backup, the backup would be invalid regardless of
whether there were flushes after each write (or better, just enabling O_DIRECT).


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-17 14:56               ` Max Reitz
  2019-06-17 15:53                 ` Kevin Wolf
@ 2019-06-18  7:25                 ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-06-18  7:25 UTC (permalink / raw)
  To: Max Reitz, qemu-devel, qemu-block
  Cc: fam, kwolf, Denis Lunev, stefanha, jsnow

17.06.2019 17:56, Max Reitz wrote:
> On 17.06.19 12:36, Vladimir Sementsov-Ogievskiy wrote:
>> 14.06.2019 23:03, Max Reitz wrote:
>>> On 14.06.19 18:22, Vladimir Sementsov-Ogievskiy wrote:
>>>> 14.06.2019 15:57, Max Reitz wrote:
>>>>> On 14.06.19 11:04, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 13.06.2019 18:57, Max Reitz wrote:
>>>>>>> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>> The backup-top filter does the copy-before-write operation. It should be
>>>>>>>> inserted above the active disk and have a target node for CBW, like the
>>>>>>>> following:
>>>>>>>>
>>>>>>>>         +-------+
>>>>>>>>         | Guest |
>>>>>>>>         +-------+
>>>>>>>>             |r,w
>>>>>>>>             v
>>>>>>>>         +------------+  target   +---------------+
>>>>>>>>         | backup_top |---------->| target(qcow2) |
>>>>>>>>         +------------+   CBW     +---------------+
>>>>>>>>             |
>>>>>>>> backing |r,w
>>>>>>>>             v
>>>>>>>>         +-------------+
>>>>>>>>         | Active disk |
>>>>>>>>         +-------------+
>>>>>>>>
>>>>>>>> The driver will be used in backup instead of write-notifiers.
>>>>>>>>
>>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>>> ---
>>>>>>>>      block/backup-top.h  |  64 +++++++++
>>>>>>>>      block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      block/Makefile.objs |   2 +
>>>>>>>>      3 files changed, 388 insertions(+)
>>>>>>>>      create mode 100644 block/backup-top.h
>>>>>>>>      create mode 100644 block/backup-top.c
>>>>>>>>
>>>>>>>> diff --git a/block/backup-top.h b/block/backup-top.h
>>>>>>>> new file mode 100644
>>>>>>>> index 0000000000..788e18c358
>>>>>>>> --- /dev/null
>>>>>>>> +++ b/block/backup-top.h
>>>>>
>>>>> [...]
>>>>>
>>>>>>>> +/*
>>>>>>>> + * bdrv_backup_top_append
>>>>>>>> + *
>>>>>>>> + * Append backup_top filter node above @source node. @target node will receive
>>>>>>>> + * the data backed up during CBW operations. New filter together with @target
>>>>>>>> + * node are attached to @source aio context.
>>>>>>>> + *
>>>>>>>> + * The resulting filter node is implicit.
>>>>>>>
>>>>>>> Why?  It’s just as easy for the caller to just make it implicit if it
>>>>>>> should be.  (And usually the caller should decide that.)
>>>>>>
>>>>>> Personally, I don't know what the reasons are for filters to be implicit or not;
>>>>>> I just made it like the other job filters. I can move making-implicit to the caller
>>>>>> or drop it altogether (if that will work).
>>>>>
>>>>> Nodes are implicit if they haven’t been added consciously by the user.
>>>>> A node added by a block job can be non-implicit, too, as mirror shows;
>>>>> If the user specifies the filter-node-name option, they will know about
>>>>> the node, thus it is no longer implicit.
>>>>>
>>>>> If the user doesn’t know about the node (they didn’t give the
>>>>> filter-node-name option), the node is implicit.
>>>>>
>>>>
>>>> Ok, I understand it. But it doesn't show why it should be implicit.
>>>> Isn't it less buggy to make all filters explicit? We don't hide implicit nodes
>>>> from query-named-block-nodes (the only interface to explore the whole graph for now)
>>>> anyway. And we can't completely hide the side effects of an additional node in the graph.
>>>
>>> Well, we try, at least.  At least we hide them from query-block.
>>>
>>>> So, is there any real benefit to supporting the separation into implicit and explicit filters?
>>>> It seems to me that it only complicates things...
>>>> In other words, what will break if we make all filters explicit?
>>>
>>> The theory is that qemu may decide to add nodes at any point, but at
>>> least when managing chains etc., they may not be visible to the user.  I
>>> don’t think we can get rid of them so easily.
>>>
>>> One example that isn’t implemented yet is copy-on-read.  In theory,
>>> specifying copy-on-read=on for -drive should create an implicit COR node
>>> on top.  The user shouldn’t see that node when inspecting the drive or
>>> when issuing block jobs on it, etc.  And this node has to stay there
>>> when the user does e.g. an active commit somewhere down the chain.
>>
>> Why not instead just write in the doc that yes, a filter is created when you
>> enable copy-on-read?
> 
> Because the problem is that existing users are not aware of that.
> 
> If they were, they could simply create a COR filter already.
> 
> I suppose we could interpret the deprecation policy in a way that we
> could change the behavior of -drive copy-on-read=on, but as I already
> said, what’s the point of supporting -drive copy-on-read=on, when you
> can simply explicitly create a COR filter on top?
> 
>> And do the same for all operations which may create a filter;
>> we don't have a lot of them, do we? And add an optional parameter to set the
>> node-name for the created filter.
> 
> That optional parameter exists, and if it is given, the user shows that
> they are aware of the node.  Hence the node becomes explicit then.
> 
> The problem is with legacy users that use an existing interface and
> suddenly the externally visible interface of qemu changes in a way that
> could break them.  I suppose additional entries in
> query-named-block-nodes is not breakage.  Maybe it is, that would be a
> bug then.
> 
> (If nobody has noticed so far, that may be because these legacy
> applications didn’t use query-named-block-nodes.  But now maybe libvirt
> does, and so if we converted copy-on-read=on to a COR node, it would
> notice now and thus break.  So just because our handling of implicit
> nodes is broken right now does not mean we can abandon the concept
> altogether.)
> 
>>> That sounds like a horrible ordeal to implement, so it hasn’t been done
>>> yet.  Maybe it never will.  It isn’t that bad for the job filters,
>>> because they generally freeze the block graph, so there is no problem
>>> with potential modifications.
>>>
>>> All in all I do think having implicit nodes makes sense.  Maybe not so
>>> much now, but in the future (if someone implements converting -drive COR
>>> and throttle options to implicit nodes...).
>>
>> But do we at least have a strong definition of how implicit nodes should behave
>> under any graph-modifying operations around them?
> 
> No.  Depends on the node.
> 
>> Should new implicit/explicit
>> filters be created above or under them?
> 
> That was always the most difficult question we had when we introduced
> filters.
> 
> The problem is that we never answered it in our code base.
> 
> One day, we just added filters (“let’s see what happens”), and in the
> beginning, they were all explicit, so we still didn’t have to answer
> this question (in code).  Then job filters were added, and we still
> didn’t have to, because they set blockers so the graph couldn’t be
> reorganized with them in place anyway.
> 
> If we converted copy-on-read=on to a COR node, we would have to answer
> that question.
> 
>> We should specify this in the docs for all
>> such operations; otherwise the effect of implicit nodes may change unpredictably for
>> the user. I'm afraid that implicit filters will just extend my list of
>> features which should not be used if we want to be able to maintain all this mess...
> 
> Yeah, well, that is nothing new at all.  For example, note the “Split
> I/O throttling off into own BDS – Requires some care with snapshots
> etc.” here:
> https://wiki.qemu.org/ToDo/Block#Dynamic_graph_reconfiguration_.28e.g._adding_block_filters.2C_taking_snapshots.2C_etc..29
> 
> This was always a problem.  It was always the biggest problem with
> filters.  We never answered it because it never became an acute problem,
> as I described above.
> 
> We just said “For explicit filters, they just stay where they are, and
> that’s OK because the user can take care of them.”  We never did add the
> implicit filters that were actually going to become a problem (COR and
> throttle).
> 
> So as I said above, the behavior of implicit nodes needs to be examined
> on a case-by-case basis.  Well, actually, this is only about COR and
> throttle, really, and both would behave the same: Always stay at the
> BlockBackend, because that is the behavior when you currently use
> BB-level throttling or COR.
> 
> The jobs that now use implicit nodes freeze the graph, so there is no
> problem there.
> 
> (Also, by the way, we do not need the option to give COR/throttle nodes
> created through BB options a node name.  As I said above, at that point
> you can just create the node yourself.)
> 
>> 1. If something around filenames doesn't work, use node-names; better yet,
>> don't use filename-based APIs
> 
> I think we agree that filename-based APIs were a mistake.  (Though only
> in hindsight.  They were added when there were no node names, so it did
> make sense then.)
> 
> So now you should not use them, correct.
> 
>> 2. If qemu-img can't do what you need, use a paused qemu process; better yet,
>> don't use qemu-img
> 
> There was always a plan to give qemu-img feature parity to qemu proper.
>   But often the interface is just a problem.  Node names really don’t
> make much sense with qemu-img, I think.
> 
> (I mean, feel free to add something that automatically names the backing
> nodes e.g. backing.0, backing.1, etc., and then add an interface to
> qemu-img commit to use node names, but I find that weird.)
> 
> So I don’t quite see the problem.  Of course qemu has more functionality
> than qemu-img.  qemu-img is fine as long as you have a simple set-up,
> but once it gets more complicated, you need management software; and
> that software shouldn’t care much whether it uses qemu-img or qemu itself.
> 
>> and a new one:
>>
>> 3. If something is broken around jobs, filters, or any block operations,
>> set filter-node-name in the parameters to make the filter node explicit. Better
>> yet, always set it.
> 
> Uh, yeah?  I don’t see the problem here.
> 
> This is like saying “If I want to use new features, I should use
> -blockdev and not -hda”.  Yes, that is correct.  You should set
> filter-node-name if you care about the exact state of the graph.

OK, yes, I understand now. Sorry for listing quite unrelated things, but the
picture is clearer for me now, thanks.

> 
>> So at least, I think we should make it possible for all filters to be explicit
>> if the user wants, by setting their names as in mirror.
> 
> For all filters that the user has no other way of creating them, yes.
> (That is, the filters created by the block jobs.)
> 
>> (and then I will not really care about implicit-filter-related logic, just as I
>> don't really care about filename-API-related logic)
>>
>> Also, now, implicit filters are actually available for direct manipulation by the
>> user, as their node-names are exported in query-named-block-nodes and nothing
>> prevents using these names in different APIs.
> 
> Hm, yes.  Is that a problem?

No. I tried to prove that implicit filters are a "bad feature", but now that I see
they are for compatibility, these reasons become weak.

[..]


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-18  7:19                       ` Vladimir Sementsov-Ogievskiy
@ 2019-06-18  8:20                         ` Kevin Wolf
  2019-06-18  8:29                           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2019-06-18  8:20 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, Denis Lunev, qemu-block, qemu-devel, Max Reitz, stefanha, jsnow

Am 18.06.2019 um 09:19 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 17.06.2019 19:25, Kevin Wolf wrote:
> > The first step would be making the block drivers full replacements of
> > the old things, which unfortunately isn't true today.
> > 
> > After your "deal with filters" series, we should be a lot closer for the
> > core infrastructure at least.
> > 
> > Not sure about copy-on-read, but I know that throttle still doesn't have
> > feature parity with -drive throttling. At least I'm pretty sure that
> > something about changing the group or group options at runtime (and not
> > just dropping the x-) was missing.
> 
> OK, thanks, I understand now that implicit filters are not just a
> feature but a compatibility mechanism.
> 
> So, can we at some point deprecate the "optionality" of the filter-node-name
> parameters of jobs, in addition to deprecating things like the old
> copy-on-read option?  And thereby actually deprecate implicit filters?

Hm, I'm not sure if we have ever moved an optional feature to required,
or how to communicate this to libvirt, but this would be ideal, yes.

> >>>>>>> But should the filter really do that job, or is it here only to do CBW? Maybe the
> >>>>>>> target must have another parent which will take care of flushing.
> >>>>>>>
> >>>>>>> Ok, I think I'll flush the target here too for safety, and leave a comment that
> >>>>>>> something smarter would be better.
> >>>>>>> (or, if we decide to flush all children by default in generic code, I'll drop this handler)
> >>>>>>
> >>>>>> [1] Thinking more about this, for normal backups the target file does
> >>>>>> not reflect a valid disk state until the backup is done; just like for
> >>>>>> qemu-img convert.  Just like qemu-img convert, there is therefore no
> >>>>>> reason to flush the target until the job is done.
> >>>
> >>> Depends, the target could have the source as its backing file (like
> >>> fleecing, but without exporting it right now), and then you could always
> >>> have a consistent view on the target. Well, almost.
> >>>
> >>> Almost because to guarantee this, we'd have to flush between each CBW
> >>> and the corresponding write to the same block, to make sure that the old
> >>> data is backed up before it is overwritten.
> >>
> >> Yes, that’s what I meant by “enforce on the host that the target is
> >> always flushed before the source”.  Well, I meant to say there is no
> >> good way of doing that, and I kind of don’t consider this a good way.
> >>
> >>> Of course, this would perform about as badly as internal COW in qcow2...
> >>> So probably not a guarantee we should be making by default. But it might
> >>> make sense as an option.
> >>
> >> I don’t know.  “Use this option so your data isn’t lost in case of a
> >> power failure, but it makes everything pretty slow”?  Who is going to
> >> explicitly enable that (before their first data loss)?
> > 
> > Maybe the backup job wasn't that clever after all. At least if you care
> > about keeping the point-in-time snapshot at the start of the backup job
> > instead of just retrying and getting a snapshot of a different point in
> > time that is just as good.
> > 
> > If you do care about the specific point in time, the safe way to do it
> > is to take a snapshot and copy that away, and then delete the snapshot
> > again.
> > 
> > The only problem is that merging external snapshots is slow and you end
> > up writing the new data twice. If only we had a COW image format... :-)
> 
> So, I still don't understand the reason for flushing the backup target in the
> meantime.  The backup target remains invalid until the successful end of
> the job, at which point we definitely flush the target, but only once.
> 
> If the node crashes during the backup, the backup would be invalid regardless
> of whether there were flushes after each write (or better, just enabling
> O_DIRECT).

Say your VM is running on disk.qcow2 and you use a backup job to copy
data to backup.qcow2. At some point, the host crashes. If we made sure
that every CBW to backup.qcow2 has completed before we write new data to
disk.qcow2, backup.qcow2 contains valid data that represents the state
at the start of the backup job as long as you use disk.qcow2 as its
backing file.

The only way to ensure the right order is flushing between the two
operations. If you don't do that, then yes, your backup is invalid
before it has completed.
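
(That guarantee can be modelled with a toy copy-before-write filter whose reads
fall through to the source as a backing file. This is an illustrative sketch
only, not QEMU's backup-top implementation:)

```python
class CbwFilter:
    """Toy model: source is the guest-visible image, target the backup."""
    def __init__(self, source):
        self.source = source      # list of blocks
        self.target = {}          # sparse target: offset -> old data

    def guest_write(self, offset, data):
        # Copy-before-write: save the old data first.  On real storage, a
        # flush of the target belongs here, between the CBW and the write.
        if offset not in self.target:
            self.target[offset] = self.source[offset]
        self.source[offset] = data

    def backup_read(self, offset):
        # Read through the target, falling back to the source as backing.
        return self.target.get(offset, self.source[offset])

f = CbwFilter(["A", "B", "C"])
point_in_time = list(f.source)    # state when the backup job starts

f.guest_write(1, "X")             # guest overwrites block 1
f.guest_write(1, "Y")             # old data is only copied once

view = [f.backup_read(i) for i in range(3)]
assert view == point_in_time      # backup still shows the job-start state
```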

O_DIRECT doesn't guarantee that the data is on disk, only a flush does
that. Maybe your chances that things actually make it to the disk in
case of a host crash are higher if it sits in the disk cache rather than
in the host's RAM, but there is no guarantee without a flush.

Now flushing the target when the guest flushes its disk doesn't give you
new guarantees. Maybe it increases your chances that you're lucky and
your data is correct, but you won't be able to tell. So... not really a
reason not to do it, but it's probably kind of useless.

Kevin



* Re: [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver
  2019-06-18  8:20                         ` Kevin Wolf
@ 2019-06-18  8:29                           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-06-18  8:29 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: fam, Denis Lunev, qemu-block, qemu-devel, Max Reitz, stefanha, jsnow

18.06.2019 11:20, Kevin Wolf wrote:
> Am 18.06.2019 um 09:19 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 17.06.2019 19:25, Kevin Wolf wrote:
>>> The first step would be making the block drivers full replacements of
>>> the old things, which unfortunately isn't true today.
>>>
>>> After your "deal with filters" series, we should be a lot closer for the
>>> core infrastructure at least.
>>>
>>> Not sure about copy-on-read, but I know that throttle still doesn't have
>>> feature parity with -drive throttling. At least I'm pretty sure that
>>> something about changing the group or group options at runtime (and not
>>> just dropping the x-) was missing.
>>
>> OK, thanks, I understand now that implicit filters are not just a
>> feature but a compatibility mechanism.
>>
>> So, can we at some point deprecate the "optionality" of the filter-node-name
>> parameters of jobs, in addition to deprecating things like the old
>> copy-on-read option?  And thereby actually deprecate implicit filters?
> 
> Hm, I'm not sure if we have ever moved an optional feature to required,
> and how to communicate this to libvirt, but this would be ideal, yes.
> 
>>>>>>>>> But should really filter do that job, or it is here only to do CBW? Maybe target
>>>>>>>>> must have another parent which will care about flushing.
>>>>>>>>>
>>>>>>>>> Ok, I think I'll flush target here too for safety, and leave a comment, that something
>>>>>>>>> smarter would be better.
>>>>>>>>> (or, if we decide to flush all children by default in generic code, I'll drop this handler)
>>>>>>>>
>>>>>>>> [1] Thinking more about this, for normal backups the target file does
>>>>>>>> not reflect a valid disk state until the backup is done; just like for
>>>>>>>> qemu-img convert.  Just like qemu-img convert, there is therefore no
>>>>>>>> reason to flush the target until the job is done.
>>>>>
>>>>> Depends, the target could have the source as its backing file (like
>>>>> fleecing, but without exporting it right now), and then you could always
>>>>> have a consistent view on the target. Well, almost.
>>>>>
>>>>> Almost because to guarantee this, we'd have to flush between each CBW
>>>>> and the corresponding write to the same block, to make sure that the old
>>>>> data is backed up before it is overwritten.
>>>>
>>>> Yes, that’s what I meant by “enforce on the host that the target is
>>>> always flushed before the source”.  Well, I meant to say there is no
>>>> good way of doing that, and I kind of don’t consider this a good way.
>>>>
>>>>> Of course, this would perform about as badly as internal COW in qcow2...
>>>>> So probably not a guarantee we should be making by default. But it might
>>>>> make sense as an option.
>>>>
>>>> I don’t know.  “Use this option so your data isn’t lost in case of a
>>>> power failure, but it makes everything pretty slow”?  Who is going to
>>>> explicitly enable that (before their first data loss)?
>>>
>>> Maybe the backup job wasn't that clever after all. At least if you care
>>> about keeping the point-in-time snapshot at the start of the backup job
>>> instead of just retrying and getting a snapshot of a different point in
>>> time that is just as good.
>>>
>>> If you do care about the specific point in time, the safe way to do it
>>> is to take a snapshot and copy that away, and then delete the snapshot
>>> again.
>>>
>>> The only problem is that merging external snapshots is slow and you end
>>> up writing the new data twice. If only we had a COW image format... :-)
>>
>> So, I still don't understand the reason for flushing the backup target
>> in the meantime.  The backup target remains invalid until the successful
>> end of the job, at which point we definitely flush the target, but only
>> once.
>>
>> If the node crashes during backup, the backup would be invalid regardless
>> of whether there were flushes after each write (or, better, just enabling
>> O_DIRECT) or not.
> 
> Say your VM is running on disk.qcow2 and you use a backup job to copy
> data to backup.qcow2. At some point, the host crashes. If we made sure
> that every CBW to backup.qcow2 has completed before we write new data to
> disk.qcow2, backup.qcow2 contains valid data that represents the state
> at the start of the backup job as long as you use disk.qcow2 as its
> backing file.
> 
> The only way to ensure the right order is flushing between the two
> operations. If you don't do that, then yes, your backup is invalid
> before the job has completed.
> 
> O_DIRECT doesn't guarantee that the data is on disk, only a flush does
> that. Maybe your chances that things actually make it to the disk in
> case of a host crash are higher if it sits in the disk cache rather than
> in the host's RAM, but there is no guarantee without a flush.
> 
> Now flushing the target when the guest flushes its disk doesn't give you
> new guarantees. Maybe it increases your chances that you're lucky and
> your data is correct, but you won't be able to tell. So... not really a
> reason not to do it, but it's probably kind of useless.
> 

All clear now, thanks!


-- 
Best regards,
Vladimir


end of thread, other threads:[~2019-06-18  8:31 UTC | newest]

Thread overview: 32+ messages
2019-05-29 15:46 [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 1/7] block: teach bdrv_debug_breakpoint skip filters with backing Vladimir Sementsov-Ogievskiy
2019-06-13 13:43   ` Max Reitz
2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 2/7] block: swap operation order in bdrv_append Vladimir Sementsov-Ogievskiy
2019-06-13 13:45   ` Max Reitz
2019-06-13 14:02     ` Vladimir Sementsov-Ogievskiy
2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 3/7] block: allow not one child for implicit node Vladimir Sementsov-Ogievskiy
2019-06-13 13:51   ` Max Reitz
2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 4/7] block: introduce backup-top filter driver Vladimir Sementsov-Ogievskiy
2019-06-13 15:57   ` Max Reitz
2019-06-14  9:04     ` Vladimir Sementsov-Ogievskiy
2019-06-14 12:57       ` Max Reitz
2019-06-14 16:22         ` Vladimir Sementsov-Ogievskiy
2019-06-14 20:03           ` Max Reitz
2019-06-17 10:36             ` Vladimir Sementsov-Ogievskiy
2019-06-17 14:56               ` Max Reitz
2019-06-17 15:53                 ` Kevin Wolf
2019-06-17 16:01                   ` Max Reitz
2019-06-17 16:25                     ` Kevin Wolf
2019-06-18  7:19                       ` Vladimir Sementsov-Ogievskiy
2019-06-18  8:20                         ` Kevin Wolf
2019-06-18  8:29                           ` Vladimir Sementsov-Ogievskiy
2019-06-18  7:25                 ` Vladimir Sementsov-Ogievskiy
2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 5/7] block/io: refactor wait_serialising_requests Vladimir Sementsov-Ogievskiy
2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 6/7] block: add lock/unlock range functions Vladimir Sementsov-Ogievskiy
2019-06-13 16:31   ` Max Reitz
2019-05-29 15:46 ` [Qemu-devel] [PATCH v8 7/7] block/backup: use backup-top instead of write notifiers Vladimir Sementsov-Ogievskiy
2019-06-13 18:02   ` Max Reitz
2019-06-14  9:14     ` Vladimir Sementsov-Ogievskiy
2019-05-30 13:25 ` [Qemu-devel] [PATCH v8 0/7] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
2019-06-13 16:08 ` no-reply
2019-06-13 16:41 ` no-reply
