* [PATCH v4 0/3] qcow2: fix parallel rewrite and discard (rw-lock)
@ 2021-03-19 10:08 Vladimir Sementsov-Ogievskiy
  2021-03-19 10:08 ` [PATCH v4 1/3] qemu-io: add aio_discard Vladimir Sementsov-Ogievskiy
From: Vladimir Sementsov-Ogievskiy @ 2021-03-19 10:08 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

Look at 03 for the problem and fix. 01 is preparation and 02 is the
test.

Actually, the previous version of this work is
   [PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard

Still,
   [PATCH v3 0/6] qcow2: compressed write cache
includes another (more complicated) fix for the bug, so this series is
called v4.

So, what's new:

It's still a CoRwlock-based solution, as suggested by Kevin.

Now I think that the "writer" side of the lock should be the code in
update_refcount() that wants to set a refcount to zero. If we consider
only guest discard requests as "writers", we may miss other sources of
discarding host clusters (such as rewriting a compressed cluster as a
normal one, maybe some snapshot operations, who knows what else).

And this means that we want to take the rw-lock under the qcow2
s->lock. This brings an ordering restriction for the two locks: if we
want both locks taken, we must always take s->lock first, and never
take s->lock while the rw-lock is already held (otherwise we get a
classic deadlock).

This leads us to taking the rd-lock for in-flight writes under s->lock,
in the same critical section where the cluster is allocated (or just
obtained from the metadata), and releasing it after the data write
completes.

This in turn leads to somewhat tricky logic around transferring the
rd-lock to the task coroutine on the normal write path (see 03). But
this is still simpler than the in-flight-write-counters solution in v3.
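
To illustrate the resulting protocol, here is a rough sketch distilled
from patch 03 (not verbatim code; error handling and the task-coroutine
hand-off are omitted):

    /* Write path (qcow2_co_pwritev_part), sketch: */
    qemu_co_mutex_lock(&s->lock);
    ret = qcow2_alloc_host_offset(bs, offset, &cur_bytes,
                                  &host_offset, &l2meta);
    qemu_co_rwlock_rdlock(&s->discard_rw_lock); /* rd-lock under s->lock */
    qemu_co_mutex_unlock(&s->lock);

    ret = bdrv_co_pwritev_part(s->data_file, host_offset, cur_bytes,
                               qiov, qiov_offset, 0); /* the data write */

    qemu_co_rwlock_unlock(&s->discard_rw_lock); /* release after the write */
    qemu_co_mutex_lock(&s->lock);

    /* Discard path (update_refcount), when a refcount drops to zero: */
    qemu_co_rwlock_wrlock(&s->discard_rw_lock); /* waits for in-flight writes */
    qemu_co_rwlock_unlock(&s->discard_rw_lock);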

Vladimir Sementsov-Ogievskiy (3):
  qemu-io: add aio_discard
  iotests: add qcow2-discard-during-rewrite
  block/qcow2: introduce discard_rw_lock: fix discarding host clusters

 block/qcow2.h                                 |  20 +++
 block/qcow2-refcount.c                        |  22 ++++
 block/qcow2.c                                 |  73 +++++++++--
 qemu-io-cmds.c                                | 117 ++++++++++++++++++
 .../tests/qcow2-discard-during-rewrite        |  99 +++++++++++++++
 .../tests/qcow2-discard-during-rewrite.out    |  17 +++
 6 files changed, 341 insertions(+), 7 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

-- 
2.29.2




* [PATCH v4 1/3] qemu-io: add aio_discard
  2021-03-19 10:08 [PATCH v4 0/3] qcow2: fix parallel rewrite and discard (rw-lock) Vladimir Sementsov-Ogievskiy
@ 2021-03-19 10:08 ` Vladimir Sementsov-Ogievskiy
  2021-03-19 10:08 ` [PATCH v4 2/3] iotests: add qcow2-discard-during-rewrite Vladimir Sementsov-Ogievskiy
From: Vladimir Sementsov-Ogievskiy @ 2021-03-19 10:08 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

Add an aio_discard command, modeled on the existing aio_write. It will
be used in the following test.
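
For example (a usage sketch; the image name is just a placeholder):

    $ qemu-io -c 'aio_discard 0 64k' test.qcow2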

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 qemu-io-cmds.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 97611969cb..28b5c3c092 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1332,6 +1332,7 @@ struct aio_ctx {
     BlockBackend *blk;
     QEMUIOVector qiov;
     int64_t offset;
+    int64_t discard_bytes;
     char *buf;
     bool qflag;
     bool vflag;
@@ -1343,6 +1344,34 @@ struct aio_ctx {
     struct timespec t1;
 };
 
+static void aio_discard_done(void *opaque, int ret)
+{
+    struct aio_ctx *ctx = opaque;
+    struct timespec t2;
+
+    clock_gettime(CLOCK_MONOTONIC, &t2);
+
+    if (ret < 0) {
+        printf("aio_discard failed: %s\n", strerror(-ret));
+        block_acct_failed(blk_get_stats(ctx->blk), &ctx->acct);
+        goto out;
+    }
+
+    block_acct_done(blk_get_stats(ctx->blk), &ctx->acct);
+
+    if (ctx->qflag) {
+        goto out;
+    }
+
+    /* Finally, report back -- -C gives a parsable format */
+    t2 = tsub(t2, ctx->t1);
+    print_report("discarded", &t2, ctx->offset, ctx->discard_bytes,
+                 ctx->discard_bytes, 1, ctx->Cflag);
+out:
+    g_free(ctx);
+}
+
 static void aio_write_done(void *opaque, int ret)
 {
     struct aio_ctx *ctx = opaque;
@@ -1671,6 +1700,93 @@ static int aio_write_f(BlockBackend *blk, int argc, char **argv)
     return 0;
 }
 
+static void aio_discard_help(void)
+{
+    printf(
+"\n"
+" asynchronously discards a range of bytes from the given offset\n"
+"\n"
+" Example:\n"
+" 'aio_discard 0 64k' - discards 64K at start of a disk\n"
+"\n"
+" Note that due to its asynchronous nature, this command will be\n"
+" considered successful once the request is submitted, independently\n"
+" of potential I/O errors or pattern mismatches.\n"
+" -C, -- report statistics in a machine parsable format\n"
+" -i, -- treat request as invalid, for exercising stats\n"
+" -q, -- quiet mode, do not show I/O statistics\n"
+"\n");
+}
+
+static int aio_discard_f(BlockBackend *blk, int argc, char **argv);
+
+static const cmdinfo_t aio_discard_cmd = {
+    .name       = "aio_discard",
+    .cfunc      = aio_discard_f,
+    .perm       = BLK_PERM_WRITE,
+    .argmin     = 2,
+    .argmax     = -1,
+    .args       = "[-Ciq] off len",
+    .oneline    = "asynchronously discards a number of bytes",
+    .help       = aio_discard_help,
+};
+
+static int aio_discard_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int c;
+    struct aio_ctx *ctx = g_new0(struct aio_ctx, 1);
+
+    ctx->blk = blk;
+    while ((c = getopt(argc, argv, "Ciq")) != -1) {
+        switch (c) {
+        case 'C':
+            ctx->Cflag = true;
+            break;
+        case 'q':
+            ctx->qflag = true;
+            break;
+        case 'i':
+            printf("injecting invalid discard request\n");
+            block_acct_invalid(blk_get_stats(blk), BLOCK_ACCT_UNMAP);
+            g_free(ctx);
+            return 0;
+        default:
+            g_free(ctx);
+            qemuio_command_usage(&aio_discard_cmd);
+            return -EINVAL;
+        }
+    }
+
+    if (optind != argc - 2) {
+        g_free(ctx);
+        qemuio_command_usage(&aio_discard_cmd);
+        return -EINVAL;
+    }
+
+    ctx->offset = cvtnum(argv[optind]);
+    if (ctx->offset < 0) {
+        ret = ctx->offset;
+        print_cvtnum_err(ret, argv[optind]);
+        g_free(ctx);
+        return ret;
+    }
+    optind++;
+
+    ctx->discard_bytes = cvtnum(argv[optind]);
+    if (ctx->discard_bytes < 0) {
+        ret = ctx->discard_bytes;
+        print_cvtnum_err(ret, argv[optind]);
+        g_free(ctx);
+        return ret;
+    }
+
+    blk_aio_pdiscard(blk, ctx->offset, ctx->discard_bytes,
+                     aio_discard_done, ctx);
+
+    return 0;
+}
+
 static int aio_flush_f(BlockBackend *blk, int argc, char **argv)
 {
     BlockAcctCookie cookie;
@@ -2494,6 +2610,7 @@ static void __attribute((constructor)) init_qemuio_commands(void)
     qemuio_add_command(&readv_cmd);
     qemuio_add_command(&write_cmd);
     qemuio_add_command(&writev_cmd);
+    qemuio_add_command(&aio_discard_cmd);
     qemuio_add_command(&aio_read_cmd);
     qemuio_add_command(&aio_write_cmd);
     qemuio_add_command(&aio_flush_cmd);
-- 
2.29.2




* [PATCH v4 2/3] iotests: add qcow2-discard-during-rewrite
  2021-03-19 10:08 [PATCH v4 0/3] qcow2: fix parallel rewrite and discard (rw-lock) Vladimir Sementsov-Ogievskiy
  2021-03-19 10:08 ` [PATCH v4 1/3] qemu-io: add aio_discard Vladimir Sementsov-Ogievskiy
@ 2021-03-19 10:08 ` Vladimir Sementsov-Ogievskiy
  2021-03-19 10:08 ` [PATCH v4 3/3] block/qcow2: introduce discard_rw_lock: fix discarding host clusters Vladimir Sementsov-Ogievskiy
  2021-03-25 19:12 ` [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock) Vladimir Sementsov-Ogievskiy
From: Vladimir Sementsov-Ogievskiy @ 2021-03-19 10:08 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

Simple test:
 - start writing to an allocated cluster A
 - discard this cluster
 - write to another, unallocated cluster B (it gets allocated at the
   same host offset where A was allocated)
 - continue writing to A

Currently, the last action pollutes cluster B, which is the bug fixed
by the following commit.

For now, add the test to the "disabled" group so that it doesn't run
automatically.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 .../tests/qcow2-discard-during-rewrite        | 99 +++++++++++++++++++
 .../tests/qcow2-discard-during-rewrite.out    | 17 ++++
 2 files changed, 116 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
new file mode 100755
index 0000000000..dd9964ef3f
--- /dev/null
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
@@ -0,0 +1,99 @@
+#!/usr/bin/env bash
+# group: quick disabled
+#
+# Test discarding (and reusing) host cluster during writing data to it.
+#
+# Copyright (c) 2021 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=vsementsov@virtuozzo.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+status=1	# failure is the default!
+
+_cleanup()
+{
+    _cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./../common.rc
+. ./../common.filter
+
+_supported_fmt qcow2
+_supported_proto file fuse
+_supported_os Linux
+
+size=1M
+_make_test_img $size
+
+(
+cat <<EOF
+write -P 1 0 64K
+
+break pwritev A
+aio_write -P 2 0 64K
+wait_break A
+
+aio_discard 0 64K
+EOF
+
+# Now the discard must be blocked by discard_rw_lock.
+# Before discard_rw_lock was introduced, the discard
+# would run and discard the cluster we are writing
+# to. Give it a chance to do so.
+sleep 1
+
+# Write another cluster. The data write should be
+# blocked by discard_rw_lock (as there is a pending
+# writer). Before the lock was introduced, this
+# write would allocate the same cluster that was
+# discarded and successfully write to it.
+# Give it a chance.
+cat <<EOF
+aio_write -P 3 128K 64K
+EOF
+
+# Give the write a chance to finish
+sleep 1
+
+cat <<EOF
+resume A
+EOF
+
+# In the good scenario, wait for:
+# 1. the first write to finish
+# 2. the discard to finish
+# 3. the second write to finish
+# In the bad scenario, the discard and second write have already
+# finished; wait only for the first write, which will pollute
+# the data of the second write.
+sleep 1
+
+cat <<EOF
+read -P 0 0 64K
+read -P 3 128K 64K
+EOF
+) | $QEMU_IO blkdebug::$TEST_IMG | _filter_qemu_io
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out
new file mode 100644
index 0000000000..dd54e928a6
--- /dev/null
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out
@@ -0,0 +1,17 @@
+QA output created by qcow2-discard-during-rewrite
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Suspended request 'A'
+blkdebug: Resuming request 'A'
+discarded 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+*** done
-- 
2.29.2




* [PATCH v4 3/3] block/qcow2: introduce discard_rw_lock: fix discarding host clusters
  2021-03-19 10:08 [PATCH v4 0/3] qcow2: fix parallel rewrite and discard (rw-lock) Vladimir Sementsov-Ogievskiy
  2021-03-19 10:08 ` [PATCH v4 1/3] qemu-io: add aio_discard Vladimir Sementsov-Ogievskiy
  2021-03-19 10:08 ` [PATCH v4 2/3] iotests: add qcow2-discard-during-rewrite Vladimir Sementsov-Ogievskiy
@ 2021-03-19 10:08 ` Vladimir Sementsov-Ogievskiy
  2021-03-25 19:12 ` [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock) Vladimir Sementsov-Ogievskiy
From: Vladimir Sementsov-Ogievskiy @ 2021-03-19 10:08 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

There is a bug in qcow2: a host cluster can be discarded (its refcount
becomes 0) and reused while data is still being written to it. In this
case the data write may pollute another (recently allocated) data
cluster or even metadata.

To fix the issue, introduce an rw-lock: hold the read-lock during data
writes and take the write-lock when a refcount becomes 0.

Enable qcow2-discard-during-rewrite now that the bug is fixed.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.h                                 | 20 +++++
 block/qcow2-refcount.c                        | 22 ++++++
 block/qcow2.c                                 | 73 +++++++++++++++++--
 .../tests/qcow2-discard-during-rewrite        |  2 +-
 4 files changed, 109 insertions(+), 8 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 0fe5f74ed3..7d387fe39d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -420,6 +420,26 @@ typedef struct BDRVQcow2State {
      * is to convert the image with the desired compression type set.
      */
     Qcow2CompressionType compression_type;
+
+    /*
+     * discard_rw_lock prevents discarding a host cluster while there are
+     * in-flight writes to it. So, in-flight writes are "readers" and the code
+     * reducing a cluster refcount to zero in update_refcount() is the "writer".
+     *
+     * Data writes should work as follows:
+     * 1. The rd-lock must be acquired in the same s->lock critical section
+     *    where the corresponding host clusters were allocated or otherwise
+     *    obtained from the metadata. Otherwise we risk the cluster being
+     *    discarded while s->lock is unlocked (unlocking should only schedule
+     *    another coroutine, not enter it, but better to be absolutely safe).
+     *
+     * 2. The rd-lock must be released after the data write finishes.
+     *
+     * To reduce a refcount to zero (and therefore allow reallocating the
+     * host cluster for other needs) it is enough to take the wr-lock (to wait
+     * for all in-flight writes) and immediately release it (see update_refcount()).
+     */
+    CoRwlock discard_rw_lock;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8e649b008e..bb6842cd8f 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -878,6 +878,28 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         } else {
             refcount += addend;
         }
+
+        if (qemu_in_coroutine() && refcount == 0) {
+            /*
+             * The refcount becomes zero, but there may still be in-flight
+             * writes putting data into the cluster (which is done without
+             * the qcow2 mutex held).
+             *
+             * Data writes are protected by the rd-locked discard_rw_lock, so
+             * here it is enough to take and immediately release the wr-lock
+             * on it. We can release it immediately because new write requests
+             * can't come to a cluster whose refcount became 0 (there should
+             * no longer be any links from L2 tables to it).
+             *
+             * We can't do this if we are not in a coroutine. But not being in
+             * a coroutine also means that we are modifying metadata without
+             * taking the qcow2 s->lock mutex, which is wrong as well... So,
+             * let's hope for a bright future.
+             */
+            qemu_co_rwlock_wrlock(&s->discard_rw_lock);
+            qemu_co_rwlock_unlock(&s->discard_rw_lock);
+        }
+
         if (refcount == 0 && cluster_index < s->free_cluster_index) {
             s->free_cluster_index = cluster_index;
         }
diff --git a/block/qcow2.c b/block/qcow2.c
index 0db1227ac9..aea7aea334 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1899,6 +1899,7 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
 
     /* Initialise locks */
     qemu_co_mutex_init(&s->lock);
+    qemu_co_rwlock_init(&s->discard_rw_lock);
 
     if (qemu_in_coroutine()) {
         /* From bdrv_co_create.  */
@@ -2182,9 +2183,17 @@ typedef struct Qcow2AioTask {
     QEMUIOVector *qiov;
     uint64_t qiov_offset;
     QCowL2Meta *l2meta; /* only for write */
+    bool rdlock; /* only for write */
 } Qcow2AioTask;
 
 static coroutine_fn int qcow2_co_preadv_task_entry(AioTask *task);
+
+/*
+ * @rdlock: If true, it means that qcow2_add_task is called with discard_rw_lock
+ * rd-locked, and this rd-lock must be transferred to the task. The task itself
+ * will release the lock. The caller expects that after qcow2_add_task() call
+ * the lock is already released.
+ */
 static coroutine_fn int qcow2_add_task(BlockDriverState *bs,
                                        AioTaskPool *pool,
                                        AioTaskFunc func,
@@ -2194,8 +2203,10 @@ static coroutine_fn int qcow2_add_task(BlockDriverState *bs,
                                        uint64_t bytes,
                                        QEMUIOVector *qiov,
                                        size_t qiov_offset,
-                                       QCowL2Meta *l2meta)
+                                       QCowL2Meta *l2meta,
+                                       bool rdlock)
 {
+    BDRVQcow2State *s = bs->opaque;
     Qcow2AioTask local_task;
     Qcow2AioTask *task = pool ? g_new(Qcow2AioTask, 1) : &local_task;
 
@@ -2209,6 +2220,7 @@ static coroutine_fn int qcow2_add_task(BlockDriverState *bs,
         .bytes = bytes,
         .qiov_offset = qiov_offset,
         .l2meta = l2meta,
+        .rdlock = rdlock,
     };
 
     trace_qcow2_add_task(qemu_coroutine_self(), bs, pool,
@@ -2217,10 +2229,24 @@ static coroutine_fn int qcow2_add_task(BlockDriverState *bs,
                          qiov, qiov_offset);
 
     if (!pool) {
+        /*
+         * func will release the rd-lock if needed, satisfying the caller's
+         * expectation, so there is nothing more to do here.
+         */
         return func(&task->task);
     }
 
+    /*
+     * We are going to run the task in a different coroutine. We can't acquire
+     * a lock in one coroutine and release it in another. So, the new
+     * coroutine must take its own rd-lock, and we must release ours.
+     */
+    task->rdlock = false;
     aio_task_pool_start_task(pool, &task->task);
+    if (rdlock) {
+        assert(task->rdlock); /* the task has taken its own rd-lock */
+        qemu_co_rwlock_unlock(&s->discard_rw_lock);
+    }
 
     return 0;
 }
@@ -2274,6 +2300,7 @@ static coroutine_fn int qcow2_co_preadv_task_entry(AioTask *task)
     Qcow2AioTask *t = container_of(task, Qcow2AioTask, task);
 
     assert(!t->l2meta);
+    assert(!t->rdlock);
 
     return qcow2_co_preadv_task(t->bs, t->subcluster_type,
                                 t->host_offset, t->offset, t->bytes,
@@ -2320,7 +2347,7 @@ static coroutine_fn int qcow2_co_preadv_part(BlockDriverState *bs,
             }
             ret = qcow2_add_task(bs, aio, qcow2_co_preadv_task_entry, type,
                                  host_offset, offset, cur_bytes,
-                                 qiov, qiov_offset, NULL);
+                                 qiov, qiov_offset, NULL, false);
             if (ret < 0) {
                 goto out;
             }
@@ -2483,19 +2510,32 @@ static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
  * Called with s->lock unlocked
  * l2meta  - if not NULL, qcow2_co_pwritev_task() will consume it. Caller must
  *           not use it somehow after qcow2_co_pwritev_task() call
+ *
+ * @rdlock must be non-NULL.
+ * If *@rdlock is true, it means that discard_rw_lock is already taken; we
+ * must not reacquire it, but the caller expects us to release it.
+ * If *@rdlock is false, we must take it ourselves (and still release it in
+ * the end). When the rd-lock is taken, we set *@rdlock to true so that the
+ * parent coroutine can check it.
  */
 static coroutine_fn int qcow2_co_pwritev_task(BlockDriverState *bs,
                                               uint64_t host_offset,
                                               uint64_t offset, uint64_t bytes,
                                               QEMUIOVector *qiov,
                                               uint64_t qiov_offset,
-                                              QCowL2Meta *l2meta)
+                                              QCowL2Meta *l2meta,
+                                              bool *rdlock)
 {
     int ret;
     BDRVQcow2State *s = bs->opaque;
     void *crypt_buf = NULL;
     QEMUIOVector encrypted_qiov;
 
+    if (!*rdlock) {
+        qemu_co_rwlock_rdlock(&s->discard_rw_lock);
+        *rdlock = true;
+    }
+
     if (bs->encrypted) {
         assert(s->crypto);
         assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
@@ -2538,12 +2578,14 @@ static coroutine_fn int qcow2_co_pwritev_task(BlockDriverState *bs,
         }
     }
 
+    qemu_co_rwlock_unlock(&s->discard_rw_lock);
     qemu_co_mutex_lock(&s->lock);
 
     ret = qcow2_handle_l2meta(bs, &l2meta, true);
     goto out_locked;
 
 out_unlocked:
+    qemu_co_rwlock_unlock(&s->discard_rw_lock);
     qemu_co_mutex_lock(&s->lock);
 
 out_locked:
@@ -2563,7 +2605,7 @@ static coroutine_fn int qcow2_co_pwritev_task_entry(AioTask *task)
 
     return qcow2_co_pwritev_task(t->bs, t->host_offset,
                                  t->offset, t->bytes, t->qiov, t->qiov_offset,
-                                 t->l2meta);
+                                 t->l2meta, &t->rdlock);
 }
 
 static coroutine_fn int qcow2_co_pwritev_part(
@@ -2607,6 +2649,8 @@ static coroutine_fn int qcow2_co_pwritev_part(
             goto out_locked;
         }
 
+        qemu_co_rwlock_rdlock(&s->discard_rw_lock);
+
         qemu_co_mutex_unlock(&s->lock);
 
         if (!aio && cur_bytes != bytes) {
@@ -2614,7 +2658,10 @@ static coroutine_fn int qcow2_co_pwritev_part(
         }
         ret = qcow2_add_task(bs, aio, qcow2_co_pwritev_task_entry, 0,
                              host_offset, offset,
-                             cur_bytes, qiov, qiov_offset, l2meta);
+                             cur_bytes, qiov, qiov_offset, l2meta, true);
+        /*
+         * now discard_rw_lock is released and we are safe to take s->lock again
+         */
         l2meta = NULL; /* l2meta is consumed by qcow2_co_pwritev_task() */
         if (ret < 0) {
             goto fail_nometa;
@@ -4094,10 +4141,15 @@ qcow2_co_copy_range_to(BlockDriverState *bs,
             goto fail;
         }
 
+        qemu_co_rwlock_rdlock(&s->discard_rw_lock);
         qemu_co_mutex_unlock(&s->lock);
+
         ret = bdrv_co_copy_range_to(src, src_offset, s->data_file, host_offset,
                                     cur_bytes, read_flags, write_flags);
+
+        qemu_co_rwlock_unlock(&s->discard_rw_lock);
         qemu_co_mutex_lock(&s->lock);
+
         if (ret < 0) {
             goto fail;
         }
@@ -4533,13 +4585,19 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
     }
 
     ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len, true);
-    qemu_co_mutex_unlock(&s->lock);
     if (ret < 0) {
+        qemu_co_mutex_unlock(&s->lock);
         goto fail;
     }
 
+    qemu_co_rwlock_rdlock(&s->discard_rw_lock);
+    qemu_co_mutex_unlock(&s->lock);
+
     BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
     ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
+
+    qemu_co_rwlock_unlock(&s->discard_rw_lock);
+
     if (ret < 0) {
         goto fail;
     }
@@ -4608,7 +4666,8 @@ qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
         }
 
         ret = qcow2_add_task(bs, aio, qcow2_co_pwritev_compressed_task_entry,
-                             0, 0, offset, chunk_size, qiov, qiov_offset, NULL);
+                             0, 0, offset, chunk_size, qiov, qiov_offset, NULL,
+                             false);
         if (ret < 0) {
             break;
         }
diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
index dd9964ef3f..5df0048757 100755
--- a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# group: quick disabled
+# group: quick
 #
 # Test discarding (and reusing) host cluster during writing data to it.
 #
-- 
2.29.2




* Re: [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock)
  2021-03-19 10:08 [PATCH v4 0/3] qcow2: fix parallel rewrite and discard (rw-lock) Vladimir Sementsov-Ogievskiy
  2021-03-19 10:08 ` [PATCH v4 3/3] block/qcow2: introduce discard_rw_lock: fix discarding host clusters Vladimir Sementsov-Ogievskiy
@ 2021-03-25 19:12 ` Vladimir Sementsov-Ogievskiy
  2021-03-30  9:49   ` Max Reitz
From: Vladimir Sementsov-Ogievskiy @ 2021-03-25 19:12 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, den

ping. Do we want it for 6.0?

19.03.2021 13:08, Vladimir Sementsov-Ogievskiy wrote:
> Look at 03 for the problem and fix. 01 is preparation and 02 is the
> test.
> 
> Actually previous version of this thing is
>     [PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard
> 
> Still
>     [PATCH v3 0/6] qcow2: compressed write cache
> includes another fix (more complicated) for the bug, so this is called
> v4.
> 
> So, what's new:
> 
> It's still a CoRwlock based solution as suggested by Kevin.
> 
> Now I think that "writer" of the lock should be code in
> update_refcount() which wants to set refcount to zero. If we consider
> only guest discard request as "writer" we may miss other sources of
> discarding host clusters (like rewriting compressed cluster to normal,
> maybe some snapshot operations, who knows what's more).
> 
> And this means that we want to take rw-lock under qcow2 s->lock. And
> this brings ordering restriction for the two locks: if we want both
> locks taken, we should always take s->lock first, and never take s->lock
> when rw-lock is already taken (otherwise we get classic deadlock).
> 
> This leads us to taking rd-lock for in-flight writes under s->lock in
> same critical section where cluster is allocated (or just got from
> metadata) and releasing after data writing completion.
> 
> This in turn leads to a bit tricky logic around transferring rd-lock to
> task coroutine on normal write path (see 03).. But this is still simpler
> than inflight-write-counters solution in v3..
> 
> Vladimir Sementsov-Ogievskiy (3):
>    qemu-io: add aio_discard
>    iotests: add qcow2-discard-during-rewrite
>    block/qcow2: introduce discard_rw_lock: fix discarding host clusters
> 
>   block/qcow2.h                                 |  20 +++
>   block/qcow2-refcount.c                        |  22 ++++
>   block/qcow2.c                                 |  73 +++++++++--
>   qemu-io-cmds.c                                | 117 ++++++++++++++++++
>   .../tests/qcow2-discard-during-rewrite        |  99 +++++++++++++++
>   .../tests/qcow2-discard-during-rewrite.out    |  17 +++
>   6 files changed, 341 insertions(+), 7 deletions(-)
>   create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
>   create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out
> 


-- 
Best regards,
Vladimir



* Re: [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock)
  2021-03-25 19:12 ` [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock) Vladimir Sementsov-Ogievskiy
@ 2021-03-30  9:49   ` Max Reitz
  2021-03-30 10:51     ` Vladimir Sementsov-Ogievskiy
From: Max Reitz @ 2021-03-30  9:49 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel

On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
> ping. Do we want it for 6.0?

I’d rather wait.  I think the conclusion was that guests shouldn’t hit 
this because they serialize discards?

There’s also something Kevin wrote on IRC a couple of weeks ago, for 
which I had hoped he’d send an email, but I don’t think he did, so I’ll 
try to remember and paraphrase as well as I can...

He basically asked whether it wouldn’t be conceptually simpler to take a 
reference to some cluster in get_cluster_offset() and later release it 
with a to-be-added put_cluster_offset().

He also noted that reading is problematic, too, because if you read a 
discarded and reused cluster, this might result in an information leak 
(some guest application might be able to read data it isn’t allowed to 
read); that’s why making get_cluster_offset() the point of locking 
clusters against discarding would be better.

This would probably work with both of your solutions.  For the in-memory 
solutions, you’d take a refcount to an actual cluster; in the CoRwLock 
solution, you’d take that lock.
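
To make that concrete, here is a hypothetical sketch of the pairing 
(the put function does not exist yet; treat all names as placeholders):

    /* protect the returned cluster(s) against concurrent discard/reuse */
    ret = qcow2_get_host_offset(bs, offset, &bytes, &host_offset, &type);

    /* ... read or write guest data at host_offset ... */

    /* drop the protection; a pending discard may now proceed */
    qcow2_put_host_offset(bs, host_offset, bytes);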

What do you think?

Max




* Re: [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock)
  2021-03-30  9:49   ` Max Reitz
@ 2021-03-30 10:51     ` Vladimir Sementsov-Ogievskiy
  2021-03-30 12:51       ` Max Reitz
From: Vladimir Sementsov-Ogievskiy @ 2021-03-30 10:51 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, kwolf, den

30.03.2021 12:49, Max Reitz wrote:
> On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
>> ping. Do we want it for 6.0?
> 
> I’d rather wait.  I think the conclusion was that guests shouldn’t hit this because they serialize discards?

I think that we have never seen actual bugs from this, so of course we can wait.

> 
> There’s also something Kevin wrote on IRC a couple of weeks ago, for which I had hoped he’d sent an email but I don’t think he did, so I’ll try to remember and paraphrase as well as I can...
> 
> He basically asked whether it wouldn’t be conceptually simpler to take a reference to some cluster in get_cluster_offset() and later release it with a to-be-added put_cluster_offset().
> 
> He also noted that reading is problematic, too, because if you read a discarded and reused cluster, this might result in an information leak (some guest application might be able to read data it isn’t allowed to read); that’s why making get_cluster_offset() the point of locking clusters against discarding would be better.

Yes, I thought about reads too (RFCed in the cover letter of [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless))

> 
> This would probably work with both of your solutions.  For the in-memory solutions, you’d take a refcount to an actual cluster; in the CoRwLock solution, you’d take that lock.
> 
> What do you think?
> 

Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and qcow2_inflight_writes_dec() to get_cluster_offset()/put_cluster_offset(), to make them more natural to use for read operations as well?

Or to update every kind of "getting a cluster offset" in the whole qcow2 driver to take a kind of "dynamic reference count" via get_cluster_offset() and then call a corresponding put() somewhere? In that case I'm afraid it's a lot more work... There would also be the problem that a lot of paths in qcow2 are not in coroutine context and don't even take s->lock when they actually should. This would also mean that we do the same job that normal qcow2 refcounts already do: there is no sense in keeping an additional "dynamic refcount" for an L2 table cluster while reading it, as we already have a non-zero normal qcow2 refcount for it...


-- 
Best regards,
Vladimir



* Re: [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock)
  2021-03-30 10:51     ` Vladimir Sementsov-Ogievskiy
@ 2021-03-30 12:51       ` Max Reitz
  2021-03-30 13:25         ` Vladimir Sementsov-Ogievskiy
From: Max Reitz @ 2021-03-30 12:51 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel

On 30.03.21 12:51, Vladimir Sementsov-Ogievskiy wrote:
> 30.03.2021 12:49, Max Reitz wrote:
>> On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
>>> ping. Do we want it for 6.0?
>>
>> I’d rather wait.  I think the conclusion was that guests shouldn’t hit 
>> this because they serialize discards?
> 
> I think, that we never had bugs, so we of course can wait.
> 
>>
>> There’s also something Kevin wrote on IRC a couple of weeks ago, for 
>> which I had hoped he’d sent an email but I don’t think he did, so I’ll 
>> try to remember and paraphrase as well as I can...
>>
>> He basically asked whether it wouldn’t be conceptually simpler to take 
>> a reference to some cluster in get_cluster_offset() and later release 
>> it with a to-be-added put_cluster_offset().
>>
>> He also noted that reading is problematic, too, because if you read a 
>> discarded and reused cluster, this might result in an information leak 
>> (some guest application might be able to read data it isn’t allowed to 
>> read); that’s why making get_cluster_offset() the point of locking 
>> clusters against discarding would be better.
> 
> Yes, I thought about read too, (RFCed in cover letter of [PATCH v5 0/6] 
> qcow2: fix parallel rewrite and discard (lockless))
> 
>>
>> This would probably work with both of your solutions.  For the 
>> in-memory solutions, you’d take a refcount to an actual cluster; in 
>> the CoRwLock solution, you’d take that lock.
>>
>> What do you think?
>>
> 
> Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and 
> qcow2_inflight_writes_dec() to 
> get_cluster_offset()/put_cluster_offset(), to make it more native to use 
> for read operations as well?

Hm.  Our discussion wasn’t so detailed.

I interpreted it to mean all qcow2 functions that find an offset to a 
qcow2 cluster, namely qcow2_get_host_offset(), 
qcow2_alloc_host_offset(), and qcow2_alloc_compressed_cluster_offset().

When those functions return an offset (in)to some cluster, that cluster 
(or the image as a whole) should be locked against discards.  Every 
offset received this way would require an accompanying 
qcow2_put_host_offset().

> Or to update any kind of "getting cluster offset" in the whole qcow2 
> driver to take a kind of "dynamic reference count" by 
> get_cluster_offset() and then call corresponding put() somewhere? In 
> this case I'm afraid it's a lot more work..

Hm, really?  I would have assumed we need to do some locking in all 
functions that get a cluster offset this way, so it should be less work 
to take the lock in the functions they invoke to get the offset.

> It would be also the problem 
> that a lot of paths in qcow2 are not in coroutine and don't even take 
> s->lock when they actually should.

I’m not sure what you mean here, because all functions that invoke any 
of the three functions I listed above are coroutine_fns (or, well, I 
didn’t look it up, but they all have *_co_* in their name).

> This will also mean that we do same 
> job as normal qcow2 refcounts already do: no sense in keeping additional 
> "dynamic refcount" for L2 table cluster while reading it, as we already 
> have non-zero qcow2 normal refcount for it..

I’m afraid I don’t understand how normal refcounts relate to this.  For 
example, qcow2_get_host_offset() doesn’t touch refcounts at all.

Max




* Re: [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock)
  2021-03-30 12:51       ` Max Reitz
@ 2021-03-30 13:25         ` Vladimir Sementsov-Ogievskiy
  2021-03-30 16:39           ` Max Reitz
From: Vladimir Sementsov-Ogievskiy @ 2021-03-30 13:25 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, kwolf, den

30.03.2021 15:51, Max Reitz wrote:
> On 30.03.21 12:51, Vladimir Sementsov-Ogievskiy wrote:
>> 30.03.2021 12:49, Max Reitz wrote:
>>> On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
>>>> ping. Do we want it for 6.0?
>>>
>>> I’d rather wait.  I think the conclusion was that guests shouldn’t hit this because they serialize discards?
>>
>> I think, that we never had bugs, so we of course can wait.
>>
>>>
>>> There’s also something Kevin wrote on IRC a couple of weeks ago, for which I had hoped he’d sent an email but I don’t think he did, so I’ll try to remember and paraphrase as well as I can...
>>>
>>> He basically asked whether it wouldn’t be conceptually simpler to take a reference to some cluster in get_cluster_offset() and later release it with a to-be-added put_cluster_offset().
>>>
>>> He also noted that reading is problematic, too, because if you read a discarded and reused cluster, this might result in an information leak (some guest application might be able to read data it isn’t allowed to read); that’s why making get_cluster_offset() the point of locking clusters against discarding would be better.
>>
>> Yes, I thought about read too, (RFCed in cover letter of [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless))
>>
>>>
>>> This would probably work with both of your solutions.  For the in-memory solutions, you’d take a refcount to an actual cluster; in the CoRwLock solution, you’d take that lock.
>>>
>>> What do you think?
>>>
>>
>> Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and qcow2_inflight_writes_dec() to get_cluster_offset()/put_cluster_offset(), to make it more native to use for read operations as well?
> 
> Hm.  Our discussion wasn’t so detailed.
> 
> I interpreted it to mean all qcow2 functions that find an offset to a qcow2 cluster, namely qcow2_get_host_offset(), qcow2_alloc_host_offset(), and qcow2_alloc_compressed_cluster_offset().

What about qcow2_alloc_clusters()?

> 
> When those functions return an offset (in)to some cluster, that cluster (or the image as a whole) should be locked against discards.  Every offset received this way would require an accompanying qcow2_put_host_offset().
> 
>> Or to update any kind of "getting cluster offset" in the whole qcow2 driver to take a kind of "dynamic reference count" by get_cluster_offset() and then call corresponding put() somewhere? In this case I'm afraid it's a lot more work..
> 
> Hm, really?  I would have assumed we need to do some locking in all functions that get a cluster offset this way, so it should be less work to take the lock in the functions they invoke to get the offset.
> 
>> It would be also the problem that a lot of paths in qcow2 are not in coroutine and don't even take s->lock when they actually should.
> 
> I’m not sure what you mean here, because all functions that invoke any of the three functions I listed above are coroutine_fns (or, well, I didn’t look it up, but they all have *_co_* in their name).

qcow2_alloc_clusters() has a lot more callers..

> 
>> This will also mean that we do same job as normal qcow2 refcounts already do: no sense in keeping additional "dynamic refcount" for L2 table cluster while reading it, as we already have non-zero qcow2 normal refcount for it..
> 
> I’m afraid I don’t understand how normal refcounts relate to this.  For example, qcow2_get_host_offset() doesn’t touch refcounts at all.
> 

I mean the following: remember our discussion about what a free cluster is. If we add a "dynamic refcount" or "in-flight write counter" only to count in-flight data-write operations (or, as discussed, data reads as well), then the "full reference count" of the cluster is inflight-write-count + qcow2-metadata-refcount.

But if we add a kind of "dynamic refcount" for any use of a host cluster, for example reading an L2 table, then we duplicate the reference to this L2 table that already exists in qcow2 metadata (represented as a refcount) with our "dynamic refcount", and we don't have a concept of a "full reference count" as the sum above. We should still treat a cluster as free when both the "dynamic refcount" and the qcow2-metadata-refcount are zero, but their sum doesn't have a good meaning. Maybe not a problem... But it looks like a complication with no benefit.


==

OK, I think now that you didn't mean qcow2_alloc_clusters(). So we are talking only about functions returning an offset to a cluster with "guest data", not to any kind of host cluster. Then what you propose looks like this to me:

  - take my v5
  - rename qcow2_inflight_writes_dec() to put_cluster_offset()
  - call qcow2_inflight_writes_inc() from the three functions you mention

That makes sense to me. Still, the put_cluster_offset() name doesn't make it obvious that it's only for clusters with "guest data" and that we shouldn't call it when working with metadata clusters.

-- 
Best regards,
Vladimir



* Re: [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock)
  2021-03-30 13:25         ` Vladimir Sementsov-Ogievskiy
@ 2021-03-30 16:39           ` Max Reitz
  2021-03-30 18:33             ` Vladimir Sementsov-Ogievskiy
  2021-03-31  6:23             ` Vladimir Sementsov-Ogievskiy
From: Max Reitz @ 2021-03-30 16:39 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel

On 30.03.21 15:25, Vladimir Sementsov-Ogievskiy wrote:
> 30.03.2021 15:51, Max Reitz wrote:
>> On 30.03.21 12:51, Vladimir Sementsov-Ogievskiy wrote:
>>> 30.03.2021 12:49, Max Reitz wrote:
>>>> On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
>>>>> ping. Do we want it for 6.0?
>>>>
>>>> I’d rather wait.  I think the conclusion was that guests shouldn’t 
>>>> hit this because they serialize discards?
>>>
>>> I think, that we never had bugs, so we of course can wait.
>>>
>>>>
>>>> There’s also something Kevin wrote on IRC a couple of weeks ago, for 
>>>> which I had hoped he’d sent an email but I don’t think he did, so 
>>>> I’ll try to remember and paraphrase as well as I can...
>>>>
>>>> He basically asked whether it wouldn’t be conceptually simpler to 
>>>> take a reference to some cluster in get_cluster_offset() and later 
>>>> release it with a to-be-added put_cluster_offset().
>>>>
>>>> He also noted that reading is problematic, too, because if you read 
>>>> a discarded and reused cluster, this might result in an information 
>>>> leak (some guest application might be able to read data it isn’t 
>>>> allowed to read); that’s why making get_cluster_offset() the point 
>>>> of locking clusters against discarding would be better.
>>>
>>> Yes, I thought about read too, (RFCed in cover letter of [PATCH v5 
>>> 0/6] qcow2: fix parallel rewrite and discard (lockless))
>>>
>>>>
>>>> This would probably work with both of your solutions.  For the 
>>>> in-memory solutions, you’d take a refcount to an actual cluster; in 
>>>> the CoRwLock solution, you’d take that lock.
>>>>
>>>> What do you think?
>>>>
>>>
>>> Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and 
>>> qcow2_inflight_writes_dec() to 
>>> get_cluster_offset()/put_cluster_offset(), to make it more native to 
>>> use for read operations as well?
>>
>> Hm.  Our discussion wasn’t so detailed.
>>
>> I interpreted it to mean all qcow2 functions that find an offset to a 
>> qcow2 cluster, namely qcow2_get_host_offset(), 
>> qcow2_alloc_host_offset(), and qcow2_alloc_compressed_cluster_offset().
> 
> What about qcow2_alloc_clusters() ?

Seems like all callers of it except do_alloc_cluster_offset() use it to 
allocate metadata clusters, which cannot be discarded from the guest.

Is it really possible that some metadata cluster is used while qcow2 
discards it internally at the same time, or isn’t all of this only a 
problem for data clusters?

>> When those functions return an offset (in)to some cluster, that 
>> cluster (or the image as a whole) should be locked against discards.  
>> Every offset received this way would require an accompanying 
>> qcow2_put_host_offset().
>>
>>> Or to update any kind of "getting cluster offset" in the whole qcow2 
>>> driver to take a kind of "dynamic reference count" by 
>>> get_cluster_offset() and then call corresponding put() somewhere? In 
>>> this case I'm afraid it's a lot more work..
>>
>> Hm, really?  I would have assumed we need to do some locking in all 
>> functions that get a cluster offset this way, so it should be less 
>> work to take the lock in the functions they invoke to get the offset.
>>
>>> It would be also the problem that a lot of paths in qcow2 are not in 
>>> coroutine and don't even take s->lock when they actually should.
>>
>> I’m not sure what you mean here, because all functions that invoke any 
>> of the three functions I listed above are coroutine_fns (or, well, I 
>> didn’t look it up, but they all have *_co_* in their name).
> 
> qcow2_alloc_clusters() has a lot more callers..

Let’s hope we don’t need to worry about it then. O:)

>>> This will also mean that we do same job as normal qcow2 refcounts 
>>> already do: no sense in keeping additional "dynamic refcount" for L2 
>>> table cluster while reading it, as we already have non-zero qcow2 
>>> normal refcount for it..
>>
>> I’m afraid I don’t understand how normal refcounts relate to this.  
>> For example, qcow2_get_host_offset() doesn’t touch refcounts at all.
>>
> 
> I mean the following: remember our discussion about what is 
> free-cluster. If we add "dynamic-refcount", or "infligth-write-counter" 
> thing only to count inflight data-writing (or, as discussed, we should 
> count data-reads as well) operations, than "full reference count" of the 
> cluster is inflight-write-count + qcow2-metadata-refcount.
> 
> But if we add a kind of "dynamic refcount" for any use of host cluster, 
> for example reading of L2 table, than we duplicate the reference in 
> qcow2-metadata to this L2 table (represented as refcount) by our 
> "dynamic refcount", and we don't have a concept of "full reference 
> count" as the sum above.. We still should treat a cluster as free when 
> both "dynamic refcount" and qcow2-metadata-refcount are zero, but their 
> sum doesn't have a good sense. Not a problem maybe.. But looks like a 
> complication with no benefit.

You’re right, but I don’t think that’s a real problem.  Perhaps the sum 
was even a false equivalency.  There is a difference between the dynamic 
refcount and the on-disk refcount: We must wait with discarding until 
the dynamic refcount is 0, and discarding will then drop the on-disk 
refcount to 0, too.  I think.  So I’m not sure whether the sum really 
means anything.

But if metadata isn’t a problem and that means we don’t have to ask these 
questions at all, well, that’ll be even better.

> ==
> 
> OK, I think now that you didn't mean qcow2_alloc_clusters(). So, we are 
> saying about only functions returning an offset to cluster with "guest 
> data", not to any kind of host cluster. Than what you propose looks like 
> this to me:
> 
>   - take my v5
>   - rename qcow2_inflight_writes_dec() to put_cluster_offset()
>   - call qcow2_inflight_writes_inc() from the three functions you mention

Yes, I think so.  Or you take the CoRwLock in those three functions, 
depending on which implementation we want.

Sorry if we’ve discussed this before and I just forgot, but: What are 
the performance implications of either solution?  As far as I remember, 
the inflight-write-counter solution had the problem of always doing 
stuff on every I/O access.  You said the impact was small and yes, it 
is, but it’s still there.
I haven’t looked at the CoRwLock solution, but I would assume it’s 
basically zero cost for common cases, right?  I.e. the case where the 
guest already serializes discards from other accesses, which I thought 
is what e.g. Linux does.  (And even if it doesn’t, I would assume that 
concurrent I/O and discards are rather rare.)

> That make sense to me. Still, put_cluster_offset() name doesn't make it 
> obvious that it's only for clusters with "guest data", and we shouldn't 
> call it when work with metadata clusters.

Yeah, it was meant for symmetry with qcow2_get_host_offset() (i.e. it 
would be named “qcow2_put_host_offset()”).  Now that there are three 
functions that would take a reference, it should get some other name.  I 
don’t know.  qcow2_put_data_cluster_offset()?

Max




* Re: [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock)
  2021-03-30 16:39           ` Max Reitz
@ 2021-03-30 18:33             ` Vladimir Sementsov-Ogievskiy
  2021-03-31  6:23             ` Vladimir Sementsov-Ogievskiy
From: Vladimir Sementsov-Ogievskiy @ 2021-03-30 18:33 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, kwolf, den

30.03.2021 19:39, Max Reitz wrote:
> On 30.03.21 15:25, Vladimir Sementsov-Ogievskiy wrote:
>> 30.03.2021 15:51, Max Reitz wrote:
>>> On 30.03.21 12:51, Vladimir Sementsov-Ogievskiy wrote:
>>>> 30.03.2021 12:49, Max Reitz wrote:
>>>>> On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> ping. Do we want it for 6.0?
>>>>>
>>>>> I’d rather wait.  I think the conclusion was that guests shouldn’t hit this because they serialize discards?
>>>>
>>>> I think, that we never had bugs, so we of course can wait.
>>>>
>>>>>
>>>>> There’s also something Kevin wrote on IRC a couple of weeks ago, for which I had hoped he’d sent an email but I don’t think he did, so I’ll try to remember and paraphrase as well as I can...
>>>>>
>>>>> He basically asked whether it wouldn’t be conceptually simpler to take a reference to some cluster in get_cluster_offset() and later release it with a to-be-added put_cluster_offset().
>>>>>
>>>>> He also noted that reading is problematic, too, because if you read a discarded and reused cluster, this might result in an information leak (some guest application might be able to read data it isn’t allowed to read); that’s why making get_cluster_offset() the point of locking clusters against discarding would be better.
>>>>
>>>> Yes, I thought about read too, (RFCed in cover letter of [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless))
>>>>
>>>>>
>>>>> This would probably work with both of your solutions.  For the in-memory solutions, you’d take a refcount to an actual cluster; in the CoRwLock solution, you’d take that lock.
>>>>>
>>>>> What do you think?
>>>>>
>>>>
>>>> Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and qcow2_inflight_writes_dec() to get_cluster_offset()/put_cluster_offset(), to make it more native to use for read operations as well?
>>>
>>> Hm.  Our discussion wasn’t so detailed.
>>>
>>> I interpreted it to mean all qcow2 functions that find an offset to a qcow2 cluster, namely qcow2_get_host_offset(), qcow2_alloc_host_offset(), and qcow2_alloc_compressed_cluster_offset().
>>
>> What about qcow2_alloc_clusters() ?
> 
> Seems like all callers for that but do_alloc_cluster_offset() call it to allocate metadata clusters, which cannot be discarded from the guest.
> 
> Is it really possible that some metadata cluster is used while qcow2 discards it internally at the same time, or isn’t all of this only a problem for data clusters?

It's a good question, and I think it doesn't have a quick answer.

I think:

1. On normal code paths we usually handle qcow2 metadata under the s->lock mutex. So we can't decrease a refcount to zero in parallel, as update_refcount() should be called under s->lock too.

2. [1] is violated at least by code paths that are not in coroutine context (for example qcow2 snapshots). But it seems that on these paths we are guarded by other things (like drained sections)... To be safe, we should move the whole qcow2 driver to coroutine context and do every metadata update under s->lock.

3. Whether [1] is violated on some coroutine paths, I don't know. But I hope it isn't; otherwise we would have bugs...

In the future we'll probably want to work with metadata without s->lock, at least for I/O. Then we'll need the same mechanisms we are now inventing for data writes... But I'm not sure it will be as simple for metadata: for now the mutex prevents not only parallel write & discard of metadata, but also parallel updates. (And for parallel guest writes to the same cluster we don't care; that's the guest's problem.)

> 
>>> When those functions return an offset (in)to some cluster, that cluster (or the image as a whole) should be locked against discards. Every offset received this way would require an accompanying qcow2_put_host_offset().
>>>
>>>> Or to update any kind of "getting cluster offset" in the whole qcow2 driver to take a kind of "dynamic reference count" by get_cluster_offset() and then call corresponding put() somewhere? In this case I'm afraid it's a lot more work..
>>>
>>> Hm, really?  I would have assumed we need to do some locking in all functions that get a cluster offset this way, so it should be less work to take the lock in the functions they invoke to get the offset.
>>>
>>>> It would be also the problem that a lot of paths in qcow2 are not in coroutine and don't even take s->lock when they actually should.
>>>
>>> I’m not sure what you mean here, because all functions that invoke any of the three functions I listed above are coroutine_fns (or, well, I didn’t look it up, but they all have *_co_* in their name).
>>
>> qcow2_alloc_clusters() has a lot more callers..
> 
> Let’s hope we don’t need to worry about it then. O:)
> 
>>>> This will also mean that we do same job as normal qcow2 refcounts already do: no sense in keeping additional "dynamic refcount" for L2 table cluster while reading it, as we already have non-zero qcow2 normal refcount for it..
>>>
>>> I’m afraid I don’t understand how normal refcounts relate to this. For example, qcow2_get_host_offset() doesn’t touch refcounts at all.
>>>
>>
>> I mean the following: remember our discussion about what is free-cluster. If we add "dynamic-refcount", or "infligth-write-counter" thing only to count inflight data-writing (or, as discussed, we should count data-reads as well) operations, than "full reference count" of the cluster is inflight-write-count + qcow2-metadata-refcount.
>>
>> But if we add a kind of "dynamic refcount" for any use of host cluster, for example reading of L2 table, than we duplicate the reference in qcow2-metadata to this L2 table (represented as refcount) by our "dynamic refcount", and we don't have a concept of "full reference count" as the sum above.. We still should treat a cluster as free when both "dynamic refcount" and qcow2-metadata-refcount are zero, but their sum doesn't have a good sense. Not a problem maybe.. But looks like a complication with no benefit.
> 
> You’re right, but I don’t think that’s a real problem.  Perhaps the sum was even a false equivalency.  There is a difference between the dynamic refcount and the on-disk refcount: We must wait with discarding until the the dynamic refcount is 0, and discarding will then drop the on-disk refcount to 0, too.  I think.  So I’m not sure whether the sum really means anything.
> 
> But if metadata isn’t a problem and that means we don’t have ask these questions at all, well, that’ll be even better.
> 
>> ==
>>
>> OK, I think now that you didn't mean qcow2_alloc_clusters(). So, we are saying about only functions returning an offset to cluster with "guest data", not to any kind of host cluster. Than what you propose looks like this to me:
>>
>>   - take my v5
>>   - rename qcow2_inflight_writes_dec() to put_cluster_offset()
>>   - call qcow2_inflight_writes_inc() from the three functions you mention
> 
> Yes, I think so.  Or you take the CoRwLock in those three functions, depending on which implementation we want.

Yes.

> 
> Sorry if we’ve discussed this before and I just forgot, but: What are the performance implications of either solution?  As far as I remember, the inflight-write-counter solution had the problem of always doing stuff on every I/O access.  You said the impact was small and yes, it is, but it’s still there.

Yes. I think I can measure it on tmpfs, and if the degradation is less than 1%, I'd just ignore it. Still, the inflight-counter solution is more complicated.

Hmm, note that the CoRwLock solution has one non-trivial thing too: moving the lock to the task coroutine. With in-flight counters we don't need such a thing.

> I haven’t looked at the CoRwLock solution but it I would assume it’s basically zero cost for common cases, right?  I.e. the case where the guest already serializes discards from other accesses, which I thought is what e.g. Linux does.  (And even if it doesn’t, I would assume that concurrent I/O and discards are rather rare.)

With CoRwLock, any kind of host-cluster discard becomes blocking. We block the guest I/O (remember that we want to protect reads too), wait until all operations have completed, then do the discard, then unfreeze the whole I/O... What about removing a persistent dirty bitmap? It becomes such an operation... Hmm. Can I imagine another example? Switch-to-snapshot should be in a drained section anyway... Rewriting a compressed cluster as a normal cluster is a rare case. Hmm, what else?

We don't have it now, but what about a feature like this:

  During block-commit, discard clusters that are already committed to the base from all intermediate images; thus we avoid disk-usage growth during the commit job.

But anyway, we don't have such a feature yet.

So, I think that in general the lockless solution is better... But currently we don't have a significant use case that would be a show-stopper for the CoRwLock solution. Or at least I can't imagine one.

Ha, or I have one in mind: mirror. Mirror does discards (it has MIRROR_METHOD_DISCARD). Honestly, I have always had doubts about that. I think the mirror job should not use discard at all; it should do either writes or write-zeroes, nothing else... But that's my opinion; probably I don't know the use case. Anyway, with the current implementation of the mirror job, a write-lock on discard may reduce performance of the mirror job (and of the guest as well) in cases when mirror does discard operations. Mirror does a discard when block_status returns neither ZERO nor DATA. And according to my old mail https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg05136.html , only iscsi may reasonably report such a block status... Hmm. Any thoughts?


> 
>> That make sense to me. Still, put_cluster_offset() name doesn't make it obvious that it's only for clusters with "guest data", and we shouldn't call it when work with metadata clusters.
> 
> Yeah, it was meant for symmetry with qcow2_get_host_offset() (i.e. it would be named “qcow2_put_host_offset()”).  Now that there are three functions that would take a reference, it should get some other name.  I don’t know.  qcow2_put_data_cluster_offset()?
> 
> Max
> 


-- 
Best regards,
Vladimir



* Re: [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock)
  2021-03-30 16:39           ` Max Reitz
  2021-03-30 18:33             ` Vladimir Sementsov-Ogievskiy
@ 2021-03-31  6:23             ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-31  6:23 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, kwolf, den

30.03.2021 19:39, Max Reitz wrote:
>> ==
>>
>> OK, I think now that you didn't mean qcow2_alloc_clusters(). So, we are saying about only functions returning an offset to cluster with "guest data", not to any kind of host cluster. Than what you propose looks like this to me:
>>
>>   - take my v5
>>   - rename qcow2_inflight_writes_dec() to put_cluster_offset()
>>   - call qcow2_inflight_writes_inc() from the three functions you mention
> 
> Yes, I think so.  Or you take the CoRwLock in those three functions, depending on which implementation we want.

CoRwLock brings one inconvenience: to pass ownership of the reference to another coroutine (for example, to the task coroutine in qcow2_co_pwritev()) we have to unlock the lock in the original coroutine and lock it in the other, as CoRwLock doesn't support transferring ownership between coroutines. So we would do put_cluster_offset() in qcow2_co_pwritev() and get_cluster_offset() in qcow2_co_pwritev_task() (and we must be sure that the second lock is taken before the first one is released). So in this case it is probably better to keep the direct CoRwLock interface rather than shadowing it with get/put.
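
Roughly, as in patch 03 (a simplified sketch of the hand-off on the task-pool path):

    /* parent coroutine, currently holding the rd-lock: */
    task->rdlock = false;
    aio_task_pool_start_task(pool, &task->task); /* task takes its own
                                                  * rd-lock before yielding */
    assert(task->rdlock);                        /* second lock is held... */
    qemu_co_rwlock_unlock(&s->discard_rw_lock);  /* ...before we drop ours */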

-- 
Best regards,
Vladimir


