All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/6] qcow2: compressed write cache
@ 2021-03-05 17:35 Vladimir Sementsov-Ogievskiy
  2021-03-05 17:35 ` [PATCH v3 1/6] block-jobs: flush target at the end of .run() Vladimir Sementsov-Ogievskiy
                   ` (5 more replies)
  0 siblings, 6 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-05 17:35 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, crosa, ehabkost, mreitz, kwolf, vsementsov, jsnow

The series inherits "[PATCH 0/7] qcow2: compressed write cache"
(changed a lot, the cache is rewritten), and
"[PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and "
(the fix solution is taken from v1 instead, as the are problems with v2,
 described in cover letter)

Core changes in v3:

 - cache is rewritten and now is separated even from block-layer
 - I'm tired of trying to catch degradation when use both pagecache and
   my new cache.. So, I decided that using both caches is just a bad
   idea and in v3 cache is enabled only when qcow2 file child opened in
   O_DIRECT

Vladimir Sementsov-Ogievskiy (6):
  block-jobs: flush target at the end of .run()
  iotests: add qcow2-discard-during-rewrite
  block/qcow2: introduce inflight writes counters: fix discard
  util: implement seqcache
  block-coroutine-wrapper: allow non bdrv_ prefix
  block/qcow2: use seqcache for compressed writes

 block/coroutines.h                            |   3 +
 block/qcow2.h                                 |  13 +
 include/block/blockjob_int.h                  |  18 +
 include/qemu/seqcache.h                       |  42 ++
 block/backup.c                                |   8 +-
 block/commit.c                                |   2 +
 block/mirror.c                                |   2 +
 block/qcow2-refcount.c                        | 159 +++++++-
 block/qcow2.c                                 | 178 ++++++++-
 block/stream.c                                |   2 +
 blockjob.c                                    |  16 +
 util/seqcache.c                               | 361 ++++++++++++++++++
 MAINTAINERS                                   |   6 +
 scripts/block-coroutine-wrapper.py            |   7 +-
 .../tests/qcow2-discard-during-rewrite        |  72 ++++
 .../tests/qcow2-discard-during-rewrite.out    |  21 +
 util/meson.build                              |   1 +
 17 files changed, 893 insertions(+), 18 deletions(-)
 create mode 100644 include/qemu/seqcache.h
 create mode 100644 util/seqcache.c
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

-- 
2.29.2



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v3 1/6] block-jobs: flush target at the end of .run()
  2021-03-05 17:35 [PATCH v3 0/6] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
@ 2021-03-05 17:35 ` Vladimir Sementsov-Ogievskiy
  2021-03-11 16:57   ` Max Reitz
  2021-03-05 17:35 ` [PATCH v3 2/6] iotests: add qcow2-discard-during-rewrite Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-05 17:35 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, crosa, ehabkost, mreitz, kwolf, vsementsov, jsnow

We are going to implement compressed write cache to improve performance
of compressed backup when target is opened in O_DIRECT mode. We
definitely want to flush the cache at backup finish, and if flush fails
it should be reported as block-job failure, not simply ignored in
bdrv_close(). So, teach all block-jobs to flush their targets at the
end.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/block/blockjob_int.h | 18 ++++++++++++++++++
 block/backup.c               |  8 +++++---
 block/commit.c               |  2 ++
 block/mirror.c               |  2 ++
 block/stream.c               |  2 ++
 blockjob.c                   | 16 ++++++++++++++++
 6 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/include/block/blockjob_int.h b/include/block/blockjob_int.h
index 6633d83da2..6ef3123120 100644
--- a/include/block/blockjob_int.h
+++ b/include/block/blockjob_int.h
@@ -119,4 +119,22 @@ int64_t block_job_ratelimit_get_delay(BlockJob *job, uint64_t n);
 BlockErrorAction block_job_error_action(BlockJob *job, BlockdevOnError on_err,
                                         int is_read, int error);
 
+/**
+ * block_job_final_target_flush:
+ * @job: The job to signal an error for if flush failed.
+ * @target_bs: The bs to flush.
+ * @ret: Will be updated (to return code of bdrv_flush()) only if it is zero
+ *       now. This is a bit unusual interface but all callers are comfortable
+ *       with it.
+ *
+ * The function is intended to be called at the end of .run() for any data
+ * copying job.
+ *
+ * There are may be some internal caches in format layers of target,
+ * like compressed_cache in qcow2 format. So we should call flush to
+ * be sure that all data reached the destination protocol layer.
+ */
+void block_job_final_target_flush(BlockJob *job, BlockDriverState *target_bs,
+                                  int *ret);
+
 #endif
diff --git a/block/backup.c b/block/backup.c
index 94e6dcd72e..d3ba8e0f75 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -255,7 +255,7 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job)
 static int coroutine_fn backup_run(Job *job, Error **errp)
 {
     BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-    int ret;
+    int ret = 0;
 
     backup_init_bcs_bitmap(s);
 
@@ -297,10 +297,12 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
             job_yield(job);
         }
     } else {
-        return backup_loop(s);
+        ret = backup_loop(s);
     }
 
-    return 0;
+    block_job_final_target_flush(&s->common, s->target_bs, &ret);
+
+    return ret;
 }
 
 static void coroutine_fn backup_pause(Job *job)
diff --git a/block/commit.c b/block/commit.c
index dd9ba87349..1b61b60ccd 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -193,6 +193,8 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
     ret = 0;
 
 out:
+    block_job_final_target_flush(&s->common, blk_bs(s->base), &ret);
+
     qemu_vfree(buf);
 
     return ret;
diff --git a/block/mirror.c b/block/mirror.c
index 1803c6988b..bc559bd053 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1095,6 +1095,8 @@ immediate_exit:
     g_free(s->in_flight_bitmap);
     bdrv_dirty_iter_free(s->dbi);
 
+    block_job_final_target_flush(&s->common, blk_bs(s->target), &ret);
+
     if (need_drain) {
         s->in_drain = true;
         bdrv_drained_begin(bs);
diff --git a/block/stream.c b/block/stream.c
index 1fa742b0db..cda41e4a64 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -182,6 +182,8 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
         }
     }
 
+    block_job_final_target_flush(&s->common, s->target_bs, &error);
+
     /* Do not remove the backing file if an error was there but ignored. */
     return error;
 }
diff --git a/blockjob.c b/blockjob.c
index f2feff051d..e226bfbbfb 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -525,3 +525,19 @@ BlockErrorAction block_job_error_action(BlockJob *job, BlockdevOnError on_err,
     }
     return action;
 }
+
+void block_job_final_target_flush(BlockJob *job, BlockDriverState *target_bs,
+                                  int *ret)
+{
+    int flush_ret = bdrv_flush(target_bs);
+
+    if (flush_ret < 0 && !block_job_is_internal(job)) {
+        qapi_event_send_block_job_error(job->job.id,
+                                        IO_OPERATION_TYPE_WRITE,
+                                        BLOCK_ERROR_ACTION_REPORT);
+    }
+
+    if (*ret == 0) {
+        *ret = flush_ret;
+    }
+}
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 2/6] iotests: add qcow2-discard-during-rewrite
  2021-03-05 17:35 [PATCH v3 0/6] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
  2021-03-05 17:35 ` [PATCH v3 1/6] block-jobs: flush target at the end of .run() Vladimir Sementsov-Ogievskiy
@ 2021-03-05 17:35 ` Vladimir Sementsov-Ogievskiy
  2021-03-05 17:35 ` [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-05 17:35 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, crosa, ehabkost, mreitz, kwolf, vsementsov, jsnow

Simple test:
 - start writing to allocated cluster A
 - discard this cluster
 - write to another unallocated cluster B (it's allocated in same place
   where A was allocated)
 - continue writing to A

For now last action pollutes cluster B which is a bug fixed by the
following commit.

For now, add test to "disabled" group, so that it doesn't run
automatically.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 .../tests/qcow2-discard-during-rewrite        | 72 +++++++++++++++++++
 .../tests/qcow2-discard-during-rewrite.out    | 21 ++++++
 2 files changed, 93 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
new file mode 100755
index 0000000000..7f0d8a107a
--- /dev/null
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
@@ -0,0 +1,72 @@
+#!/usr/bin/env bash
+# group: quick disabled
+#
+# Test discarding (and reusing) host cluster during writing data to it.
+#
+# Copyright (c) 2021 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=vsementsov@virtuozzo.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+status=1	# failure is the default!
+
+_cleanup()
+{
+    _cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./../common.rc
+. ./../common.filter
+
+_supported_fmt qcow2
+_supported_proto file fuse
+_supported_os Linux
+
+size=1M
+_make_test_img $size
+
+(
+cat <<EOF
+write -P 1 0 64K
+
+break pwritev A
+aio_write -P 2 0 64K
+wait_break A
+
+discard 0 64K
+write -P 3 128K 64K
+read -P 3 128K 64K
+
+break pwritev_done B
+resume A
+wait_break B
+resume B
+
+read -P 0 0 64K
+read -P 3 128K 64K
+EOF
+) | $QEMU_IO blkdebug::$TEST_IMG | _filter_qemu_io
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out
new file mode 100644
index 0000000000..8e75b2fbff
--- /dev/null
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out
@@ -0,0 +1,21 @@
+QA output created by qcow2-discard-during-rewrite
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Suspended request 'A'
+discard 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+blkdebug: Suspended request 'B'
+blkdebug: Resuming request 'B'
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+*** done
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-05 17:35 [PATCH v3 0/6] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
  2021-03-05 17:35 ` [PATCH v3 1/6] block-jobs: flush target at the end of .run() Vladimir Sementsov-Ogievskiy
  2021-03-05 17:35 ` [PATCH v3 2/6] iotests: add qcow2-discard-during-rewrite Vladimir Sementsov-Ogievskiy
@ 2021-03-05 17:35 ` Vladimir Sementsov-Ogievskiy
  2021-03-11 19:58   ` Max Reitz
  2021-03-05 17:35 ` [PATCH v3 4/6] util: implement seqcache Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-05 17:35 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, crosa, ehabkost, mreitz, kwolf, vsementsov, jsnow

There is a bug in qcow2: host cluster can be discarded (refcount
becomes 0) and reused during data write. In this case data write may
pollute another cluster (recently allocated) or even metadata.

To fix the issue let's track inflight writes to host cluster in the
hash table and consider new counter when discarding and reusing host
clusters.

Enable qcow2-discard-during-rewrite as it is fixed.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.h                                 |   9 ++
 block/qcow2-refcount.c                        | 149 +++++++++++++++++-
 block/qcow2.c                                 |  26 ++-
 .../tests/qcow2-discard-during-rewrite        |   2 +-
 4 files changed, 181 insertions(+), 5 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 0678073b74..e18d8dfbae 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
      * is to convert the image with the desired compression type set.
      */
     Qcow2CompressionType compression_type;
+
+    GHashTable *inflight_writes_counters;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
@@ -896,6 +898,13 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
 int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
 int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
 
+int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
+                              int64_t length);
+int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+                              int64_t length);
+int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
+                                     int64_t length);
+
 /* qcow2-cluster.c functions */
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                         bool exact_size);
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8e649b008e..464d133368 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -799,6 +799,140 @@ found:
     }
 }
 
+/*
+ * Qcow2InFlightRefcount is a type for values of s->inflight_writes_counters
+ * hasm map. And it's keys are cluster indices.
+ */
+typedef struct Qcow2InFlightRefcount {
+    /*
+     * Number of in-flight writes to the cluster, always > 0, as when becomes
+     * 0 the entry is removed from s->inflight_writes_counters.
+     */
+    uint64_t inflight_writes_cnt;
+
+    /* Cluster refcount is known to be zero */
+    bool refcount_zero;
+
+    /* Cluster refcount was made zero with this discard-type */
+    enum qcow2_discard_type type;
+} Qcow2InFlightRefcount;
+
+static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
+                                           int64_t cluster_index)
+{
+    Qcow2InFlightRefcount *infl;
+
+    if (!s->inflight_writes_counters) {
+        return NULL;
+    }
+
+    infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
+
+    if (infl) {
+        assert(infl->inflight_writes_cnt > 0);
+    }
+
+    return infl;
+}
+
+/*
+ * Returns true if there are any in-flight writes to the cluster blocking
+ * its reallocation.
+ */
+static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
+{
+    return !!find_infl_wr(s, cluster_index);
+}
+
+static int update_inflight_write_cnt(BlockDriverState *bs,
+                                     int64_t offset, int64_t length,
+                                     bool decrease, bool locked)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int64_t start, last, cluster_offset;
+
+    if (locked) {
+        qemu_co_mutex_assert_locked(&s->lock);
+    }
+
+    start = start_of_cluster(s, offset);
+    last = start_of_cluster(s, offset + length - 1);
+    for (cluster_offset = start; cluster_offset <= last;
+         cluster_offset += s->cluster_size)
+    {
+        int64_t cluster_index = cluster_offset >> s->cluster_bits;
+        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+        if (!infl) {
+            infl = g_new0(Qcow2InFlightRefcount, 1);
+            g_hash_table_insert(s->inflight_writes_counters,
+                                g_memdup(&cluster_index, sizeof(cluster_index)),
+                                infl);
+        }
+
+        if (decrease) {
+            assert(infl->inflight_writes_cnt >= 1);
+            infl->inflight_writes_cnt--;
+        } else {
+            infl->inflight_writes_cnt++;
+        }
+
+        if (infl->inflight_writes_cnt == 0) {
+            bool refcount_zero = infl->refcount_zero;
+            enum qcow2_discard_type type = infl->type;
+
+            g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
+
+            if (refcount_zero) {
+                /*
+                 * Slow path. We must reset normal refcount to actually release
+                 * the cluster.
+                 */
+                int ret;
+
+                if (!locked) {
+                    qemu_co_mutex_lock(&s->lock);
+                }
+                ret = qcow2_update_cluster_refcount(bs, cluster_index, 0,
+                                                    true, type);
+                if (!locked) {
+                    qemu_co_mutex_unlock(&s->lock);
+                }
+
+                if (ret < 0) {
+                    return ret;
+                }
+            }
+        }
+
+    }
+
+    return 0;
+}
+
+int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
+                              int64_t length)
+{
+    return update_inflight_write_cnt(bs, offset, length, false, false);
+}
+
+/*
+ * Called with s->lock not locked by caller. Will take s->lock only if need to
+ * release the cluster (refcount is 0 and inflight-write-cnt becomes zero).
+ */
+int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+                              int64_t length)
+{
+    return update_inflight_write_cnt(bs, offset, length, true, false);
+}
+
+/* Called with s->lock locked. */
+int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
+                                     int64_t length)
+{
+    return update_inflight_write_cnt(bs, offset, length, true, true);
+}
+
 /* XXX: cache several refcount block clusters ? */
 /* @addend is the absolute value of the addend; if @decrease is set, @addend
  * will be subtracted from the current refcount, otherwise it will be added */
@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
 
         if (refcount == 0) {
             void *table;
+            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }
 
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                 offset);
@@ -983,7 +1124,7 @@ retry:
 
         if (ret < 0) {
             return ret;
-        } else if (refcount != 0) {
+        } else if (refcount != 0 || has_infl_wr(s, next_cluster_index)) {
             goto retry;
         }
     }
@@ -1046,7 +1187,7 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
             ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
             if (ret < 0) {
                 return ret;
-            } else if (refcount != 0) {
+            } else if (refcount != 0 || has_infl_wr(s, cluster_index)) {
                 break;
             }
         }
@@ -2294,7 +2435,9 @@ static int64_t alloc_clusters_imrt(BlockDriverState *bs,
          contiguous_free_clusters < cluster_count;
          cluster++)
     {
-        if (!s->get_refcount(*refcount_table, cluster)) {
+        if (!s->get_refcount(*refcount_table, cluster) &&
+            !has_infl_wr(s, cluster))
+        {
             contiguous_free_clusters++;
             if (first_gap) {
                 /* If this is the first free cluster found, update
diff --git a/block/qcow2.c b/block/qcow2.c
index d9f49a52e7..6ee94421e0 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1831,6 +1831,8 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
 #endif
 
     qemu_co_queue_init(&s->thread_task_queue);
+    s->inflight_writes_counters =
+        g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
 
     return ret;
 
@@ -2546,6 +2548,9 @@ out_unlocked:
 
 out_locked:
     qcow2_handle_l2meta(bs, &l2meta, false);
+
+    qcow2_inflight_writes_dec_locked(bs, host_offset, bytes);
+
     qemu_co_mutex_unlock(&s->lock);
 
     qemu_vfree(crypt_buf);
@@ -2605,6 +2610,8 @@ static coroutine_fn int qcow2_co_pwritev_part(
             goto out_locked;
         }
 
+        qcow2_inflight_writes_inc(bs, host_offset, cur_bytes);
+
         qemu_co_mutex_unlock(&s->lock);
 
         if (!aio && cur_bytes != bytes) {
@@ -2707,6 +2714,9 @@ static void qcow2_close(BlockDriverState *bs)
     g_free(s->image_backing_file);
     g_free(s->image_backing_format);
 
+    assert(g_hash_table_size(s->inflight_writes_counters) == 0);
+    g_hash_table_unref(s->inflight_writes_counters);
+
     if (has_data_file(bs)) {
         bdrv_unref_child(bs, s->data_file);
         s->data_file = NULL;
@@ -4097,10 +4107,17 @@ qcow2_co_copy_range_to(BlockDriverState *bs,
             goto fail;
         }
 
+        qcow2_inflight_writes_inc(bs, host_offset, cur_bytes);
+
         qemu_co_mutex_unlock(&s->lock);
+
         ret = bdrv_co_copy_range_to(src, src_offset, s->data_file, host_offset,
                                     cur_bytes, read_flags, write_flags);
+
         qemu_co_mutex_lock(&s->lock);
+
+        qcow2_inflight_writes_dec_locked(bs, host_offset, cur_bytes);
+
         if (ret < 0) {
             goto fail;
         }
@@ -4536,13 +4553,20 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
     }
 
     ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len, true);
-    qemu_co_mutex_unlock(&s->lock);
     if (ret < 0) {
+        qemu_co_mutex_unlock(&s->lock);
         goto fail;
     }
 
+    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
+
+    qemu_co_mutex_unlock(&s->lock);
+
     BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
     ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
+
+    qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
+
     if (ret < 0) {
         goto fail;
     }
diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
index 7f0d8a107a..2e2e0d2cb0 100755
--- a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# group: quick disabled
+# group: quick
 #
 # Test discarding (and reusing) host cluster during writing data to it.
 #
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 4/6] util: implement seqcache
  2021-03-05 17:35 [PATCH v3 0/6] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2021-03-05 17:35 ` [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard Vladimir Sementsov-Ogievskiy
@ 2021-03-05 17:35 ` Vladimir Sementsov-Ogievskiy
  2021-03-12 13:41   ` Max Reitz
  2021-06-04 14:31   ` Vladimir Sementsov-Ogievskiy
  2021-03-05 17:35 ` [PATCH v3 5/6] block-coroutine-wrapper: allow non bdrv_ prefix Vladimir Sementsov-Ogievskiy
  2021-03-05 17:35 ` [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes Vladimir Sementsov-Ogievskiy
  5 siblings, 2 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-05 17:35 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, crosa, ehabkost, mreitz, kwolf, vsementsov, jsnow

Implement cache for small sequential unaligned writes, so that they may
be cached until we get a complete cluster and then write it.

The cache is intended to be used for backup to qcow2 compressed target
opened in O_DIRECT mode, but can be reused for any similar (even not
block-layer related) task.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/qemu/seqcache.h |  42 +++++
 util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS             |   6 +
 util/meson.build        |   1 +
 4 files changed, 410 insertions(+)
 create mode 100644 include/qemu/seqcache.h
 create mode 100644 util/seqcache.c

diff --git a/include/qemu/seqcache.h b/include/qemu/seqcache.h
new file mode 100644
index 0000000000..a308d54b01
--- /dev/null
+++ b/include/qemu/seqcache.h
@@ -0,0 +1,42 @@
+/*
+ * Cache for small sequential write requests.
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#ifndef QEMU_SEQCACHE_H
+#define QEMU_SEQCACHE_H
+
+typedef struct SeqCache SeqCache;
+
+SeqCache *seqcache_new(int64_t cluster_size);
+void seqcache_free(SeqCache *s);
+
+void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf);
+int64_t seqcache_read(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf);
+
+bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
+                             uint8_t **buf, bool *unfinished);
+void seqcache_discard_cluster(SeqCache *s, int64_t offset);
+
+uint64_t seqcache_nb_clusters(SeqCache *s);
+
+#endif /* QEMU_SEQCACHE_H */
diff --git a/util/seqcache.c b/util/seqcache.c
new file mode 100644
index 0000000000..d923560eab
--- /dev/null
+++ b/util/seqcache.c
@@ -0,0 +1,361 @@
+/*
+ * Cache for small sequential write requests.
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ *
+ * = Description =
+ *
+ * SeqCache is an abbreviation for Sequential Cache.
+ *
+ * The Cache is intended to improve performance of small unaligned sequential
+ * writes. Cache has a cluster_size parameter and the unit of caching is aligned
+ * cluster.  Cache has a list of cached clusters, several "finished" ones and at
+ * most one "unfinished". "unfinished" cluster is a cluster where last byte of
+ * last write operation is cached and there is a free place after that byte to
+ * the end of cluster. "finished" clusters are just stored in cache to be read
+ * or flushed and don't allow any writes to them.
+ *
+ * If write to the cache intersects cluster bounds, it's splat into several
+ * requests by cluster bounds. So, consider a write that doesn't intersect
+ * cluster bounds to describe the whole picture:
+ *
+ * There are two cases allowed:
+ *
+ * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential
+ *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
+ *    filled up to the cluster bound it becomes "finished".
+ *
+ * 2. Write to new cluster, not existing in the cache. It may be sequential to
+ *    previous write or not. Current "unfinshed" cluster (if exists) becomes
+ *    "finished" and new "unfinished" cluster is started. Note also that offset
+ *    of the write to new cluster is not required to be aligned.
+ *
+ *    Any other write operation (non-sequential write to "unfinished" cluster
+ *    or write to any of "finished" clusters) will crash.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qemu/queue.h"
+#include "qemu/seqcache.h"
+
+/*
+ * Cluster
+ *
+ * Representation of one cached cluster, aligned to SeqCache::cluster_size.
+ * Caches only one subregion of the cluster, started at @offset (may be
+ * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
+ * The whole subregion always lay in one aligned cluster:
+ *
+ *      QEMU_ALIGN_DOWN(offset, cluster_size) ==
+ *          QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
+ *
+ * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
+ * allocated @buf length is at least:
+ *
+ *      cluster_size - offset % cluster_size
+ */
+typedef struct Cluster {
+   int64_t offset;
+   int64_t bytes;
+   uint8_t *buf;
+   QSIMPLEQ_ENTRY(Cluster) entry;
+} Cluster;
+
+/*
+ * SeqCache caches small sequential writes into "unfinished" @cur_write cluster,
+ * until entire cluster (of @cluster_size bytes) is filled by seqcache_write()
+ * calls.
+ *
+ * @cur_write->offset may be unaligned to @cluster_size if first write offset is
+ * not aligned (for example, if there was a flush request and cache was flushed,
+ * then we continue from the middle of the cluster with an empty cache).
+ *
+ * @cur_write may be NULL, which means we don't have current cluster and next
+ * seqcache_write() will start a new one.
+ *
+ * @all is a list of all clusters cached in the cache, some "finished" and one
+ * "unfinished" @cur_write (if exists). If @cur_write is not NULL it is a last
+ * one in the list.
+ *
+ * @nb_clusters is number of elements in @all list.
+ *
+ * @next_flush is an iterator for flushing. SeqCache knows nothing about how
+ * data should be flushing, so for flush user calls seqcache_get_next_flush()
+ * (which moves @next_flush iterator) and when data is flushed somehow and cache
+ * is not needed anymore, user can call seqcache_discard_cluster().
+ */
+typedef struct SeqCache {
+    int64_t cluster_size;
+    int64_t nb_clusters;
+    Cluster *cur_write;
+    Cluster *next_flush;
+    QSIMPLEQ_HEAD(, Cluster) all;
+} SeqCache;
+
+static void cluster_free(Cluster *req)
+{
+    if (req) {
+        g_free(req->buf);
+        g_free(req);
+    }
+}
+
+SeqCache *seqcache_new(int64_t cluster_size)
+{
+    SeqCache *s = g_new(SeqCache, 1);
+
+    *s = (SeqCache) {
+        .cluster_size = cluster_size,
+    };
+
+    QSIMPLEQ_INIT(&s->all);
+
+    return s;
+}
+
+/*
+ * User should clean the cache calling seqcache_get_next_flush() and
+ * seqcache_discard_cluster() sequentially before freeing it.
+ */
+void seqcache_free(SeqCache *s)
+{
+    if (s) {
+        assert(QSIMPLEQ_EMPTY(&s->all));
+        g_free(s);
+    }
+}
+
+/* End of cached region inside one cluster */
+static int64_t cached_end(Cluster *cl)
+{
+    return cl->offset + cl->bytes;
+}
+
+/* Start of aligned cluster containing the @offset */
+static int64_t cluster_start(SeqCache *s, int64_t offset)
+{
+    return QEMU_ALIGN_DOWN(offset, s->cluster_size);
+}
+
+/* End of aligned cluster containing the @offset */
+static int64_t cluster_end(SeqCache *s, int64_t offset)
+{
+    return cluster_start(s, offset) + s->cluster_size;
+}
+
+/* Align down @offset to s->cluster_size and search for corresponding cluster */
+static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
+{
+    Cluster *cl;
+    int64_t cl_start = cluster_start(s, offset);
+
+    QSIMPLEQ_FOREACH(cl, &s->all, entry) {
+        if (cluster_start(s, cl->offset) == cl_start) {
+            return cl;
+        }
+    }
+
+    return NULL;
+}
+
+/* Makes current "unfinished" cluster the "finished" one. */
+static void seqcache_finalize_current_cluster(SeqCache *s)
+{
+    if (s->cur_write && !s->next_flush) {
+        s->next_flush = s->cur_write;
+    }
+    s->cur_write = NULL;
+}
+
+/*
+ * Write an @offset, @bytes, @buf request into the cache. The requests MUST not
+ * intersect cluster bounds.
+ */
+static void seqcache_write_one(SeqCache *s, int64_t offset, int64_t bytes,
+                               uint8_t *buf)
+{
+    assert(bytes > 0);
+    assert(cluster_start(s, offset) == cluster_start(s, offset + bytes - 1));
+
+    if (s->cur_write && offset == cached_end(s->cur_write)) {
+        /* Continue sequential process */
+        memcpy(s->cur_write->buf + s->cur_write->bytes, buf, bytes);
+        s->cur_write->bytes += bytes;
+
+        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
+            seqcache_finalize_current_cluster(s);
+        }
+
+        return;
+    }
+
+    /* We are starting a new cluster. Check that it's really new */
+    assert(!seqcache_find_cluster(s, offset));
+
+    seqcache_finalize_current_cluster(s);
+
+    s->cur_write = g_new(Cluster, 1);
+    *s->cur_write = (Cluster) {
+        .offset = offset,
+        .bytes = bytes,
+        .buf = g_malloc(s->cluster_size),
+    };
+
+    memcpy(s->cur_write->buf, buf, bytes);
+    QSIMPLEQ_INSERT_TAIL(&s->all, s->cur_write, entry);
+    s->nb_clusters++;
+}
+
+/* Write an @offset, @bytes, @buf request into the cache. */
+void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)
+{
+    while (bytes) {
+        int64_t cl_end = cluster_end(s, offset);
+        int64_t chunk = MIN(bytes, cl_end - offset);
+
+        seqcache_write_one(s, offset, chunk, buf);
+        offset += chunk;
+        bytes -= chunk;
+        buf += chunk;
+    }
+}
+
+/*
+ * Read from the cache.
+ *
+ * If @offset misses the cache, return 0.
+ *
+ * If @offset is inside the cache, copy corresponding available data to @buf
+ * (may be less than required @bytes if hole reached earlier) and return number
+ * of bytes copied.
+ */
+int64_t seqcache_read(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)
+{
+    uint8_t *pos = buf;
+
+    while (bytes) {
+        Cluster *cl = seqcache_find_cluster(s, offset);
+        int64_t chunk;
+
+        if (!cl || offset < cl->offset || offset >= cached_end(cl)) {
+            break;
+        }
+
+        chunk = MIN(bytes, cached_end(cl) - offset);
+        memcpy(pos, cl->buf + offset - cl->offset, chunk);
+        offset += chunk;
+        bytes -= chunk;
+        pos += chunk;
+
+        if (!QEMU_IS_ALIGNED(offset, s->cluster_size)) {
+            break;
+        }
+    }
+
+    return pos - buf;
+}
+
+/*
+ * Get next region for flushing. @offset, @bytes and @buf are out-parameters
+ * to return the region.
+ *
+ * @unfinished is in-out argument which means that user is interested in
+ * flushing unfinished cluster too:
+ *
+ * If there are "finished" clusters, "finished" cluster is returned and
+ * *@unfinished is set to false, independently of its original value.
+ *
+ * If there are no "finished" clusters but "unfinished" exists (i.e.
+ * s->cur_write != NULL and it is the only element of s->all), then *@unfinished
+ * value remains the same and the following logic works:
+ *
+ *    If *@unfinished:
+ *       return s->cur_write unfinished cluster for flushing
+ *    Else
+ *       return nothing
+ *
+ *
+ * Returns true and set @offset, @bytes, @buf and @unfinished if there is
+ * something to flush (accordingly to @unfinished value), returns false
+ * otherwise.
+ *
+ * Nothing is removed from the cache.
+ */
+bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
+                             uint8_t **buf, bool *unfinished)
+{
+    Cluster *req = s->next_flush;
+
+    if (s->next_flush) {
+        *unfinished = false;
+        req = s->next_flush;
+        s->next_flush = QSIMPLEQ_NEXT(req, entry);
+        if (s->next_flush == s->cur_write) {
+            s->next_flush = NULL;
+        }
+    } else if (s->cur_write && *unfinished) {
+        req = s->cur_write;
+    } else {
+        return false;
+    }
+
+    *offset = req->offset;
+    *bytes = req->bytes;
+    *buf = req->buf;
+
+    return true;
+}
+
+/*
+ * Find corresponding cluster and drop it. No matter does requested @offset is
+ * cached itself or not.
+ */
+void seqcache_discard_cluster(SeqCache *s, int64_t offset)
+{
+    Cluster *cl = seqcache_find_cluster(s, offset);
+
+    if (!cl) {
+        return;
+    }
+
+    if (cl == s->cur_write) {
+        assert(cl != s->next_flush);
+        s->cur_write = NULL;
+    } else if (cl == s->next_flush) {
+        assert(cl != s->cur_write);
+        s->next_flush = QSIMPLEQ_NEXT(s->next_flush, entry);
+        if (s->next_flush == s->cur_write) {
+            s->next_flush = NULL;
+        }
+    }
+
+    QSIMPLEQ_REMOVE(&s->all, cl, Cluster, entry);
+    cluster_free(cl);
+    s->nb_clusters--;
+}
+
+/* Returns number of cached clusters including unfinished */
+uint64_t seqcache_nb_clusters(SeqCache *s)
+{
+    return s->nb_clusters;
+}
diff --git a/MAINTAINERS b/MAINTAINERS
index 9b2aa18e1f..5e263c3343 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3347,3 +3347,9 @@ Performance Tools and Tests
 M: Ahmed Karaman <ahmedkhaledkaraman@gmail.com>
 S: Maintained
 F: scripts/performance/
+
+Sequential Cache
+M: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
+S: Supported
+F: util/seqcache.c
+F: include/qemu/seqcache.h
diff --git a/util/meson.build b/util/meson.build
index 984fba965f..6682e463ac 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -75,6 +75,7 @@ if have_block
   util_ss.add(files('block-helpers.c'))
   util_ss.add(files('qemu-coroutine-sleep.c'))
   util_ss.add(files('qemu-co-shared-resource.c'))
+  util_ss.add(files('seqcache.c'))
   util_ss.add(files('thread-pool.c', 'qemu-timer.c'))
   util_ss.add(files('readline.c'))
   util_ss.add(files('throttle.c'))
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 5/6] block-coroutine-wrapper: allow non bdrv_ prefix
  2021-03-05 17:35 [PATCH v3 0/6] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2021-03-05 17:35 ` [PATCH v3 4/6] util: implement seqcache Vladimir Sementsov-Ogievskiy
@ 2021-03-05 17:35 ` Vladimir Sementsov-Ogievskiy
  2021-03-12 16:53   ` Max Reitz
  2021-03-05 17:35 ` [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes Vladimir Sementsov-Ogievskiy
  5 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-05 17:35 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, crosa, ehabkost, mreitz, kwolf, vsementsov, jsnow

We are going to reuse the script to generate a qcow2_ function in
further commit. Prepare the script now.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 scripts/block-coroutine-wrapper.py | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/scripts/block-coroutine-wrapper.py b/scripts/block-coroutine-wrapper.py
index 0461fd1c45..85dbeb9ecf 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -98,12 +98,13 @@ def snake_to_camel(func_name: str) -> str:
 
 
 def gen_wrapper(func: FuncDecl) -> str:
-    assert func.name.startswith('bdrv_')
-    assert not func.name.startswith('bdrv_co_')
+    assert not '_co_' in func.name
     assert func.return_type == 'int'
     assert func.args[0].type in ['BlockDriverState *', 'BdrvChild *']
 
-    name = 'bdrv_co_' + func.name[5:]
+    subsystem, subname = func.name.split('_', 1)
+
+    name = f'{subsystem}_co_{subname}'
     bs = 'bs' if func.args[0].type == 'BlockDriverState *' else 'child->bs'
     struct_name = snake_to_camel(name)
 
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
  2021-03-05 17:35 [PATCH v3 0/6] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
                   ` (4 preceding siblings ...)
  2021-03-05 17:35 ` [PATCH v3 5/6] block-coroutine-wrapper: allow non bdrv_ prefix Vladimir Sementsov-Ogievskiy
@ 2021-03-05 17:35 ` Vladimir Sementsov-Ogievskiy
  2021-03-12 18:15   ` Max Reitz
  5 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-05 17:35 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, crosa, ehabkost, mreitz, kwolf, vsementsov, jsnow

Compressed writes are unaligned to 512, which works very slow in
O_DIRECT mode. Let's use the cache.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/coroutines.h     |   3 +
 block/qcow2.h          |   4 ++
 block/qcow2-refcount.c |  10 +++
 block/qcow2.c          | 158 ++++++++++++++++++++++++++++++++++++++---
 4 files changed, 164 insertions(+), 11 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 4cfb4946e6..cb8e05830e 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -66,4 +66,7 @@ int coroutine_fn bdrv_co_readv_vmstate(BlockDriverState *bs,
 int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState *bs,
                                         QEMUIOVector *qiov, int64_t pos);
 
+int coroutine_fn qcow2_co_flush_compressed_cache(BlockDriverState *bs);
+int generated_co_wrapper qcow2_flush_compressed_cache(BlockDriverState *bs);
+
 #endif /* BLOCK_COROUTINES_INT_H */
diff --git a/block/qcow2.h b/block/qcow2.h
index e18d8dfbae..8b8fed0950 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -28,6 +28,7 @@
 #include "crypto/block.h"
 #include "qemu/coroutine.h"
 #include "qemu/units.h"
+#include "qemu/seqcache.h"
 #include "block/block_int.h"
 
 //#define DEBUG_ALLOC
@@ -422,6 +423,9 @@ typedef struct BDRVQcow2State {
     Qcow2CompressionType compression_type;
 
     GHashTable *inflight_writes_counters;
+
+    SeqCache *compressed_cache;
+    int compressed_flush_ret;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..218917308e 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -29,6 +29,7 @@
 #include "qemu/bswap.h"
 #include "qemu/cutils.h"
 #include "trace.h"
+#include "block/coroutines.h"
 
 static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
                                     uint64_t max);
@@ -1040,6 +1041,10 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
 
+            if (s->compressed_cache) {
+                seqcache_discard_cluster(s->compressed_cache, cluster_offset);
+            }
+
             if (s->discard_passthrough[type]) {
                 update_refcount_discard(bs, cluster_offset, s->cluster_size);
             }
@@ -1349,6 +1354,11 @@ int coroutine_fn qcow2_write_caches(BlockDriverState *bs)
     BDRVQcow2State *s = bs->opaque;
     int ret;
 
+    ret = qcow2_flush_compressed_cache(bs);
+    if (ret < 0) {
+        return ret;
+    }
+
     ret = qcow2_cache_write(bs, s->l2_table_cache);
     if (ret < 0) {
         return ret;
diff --git a/block/qcow2.c b/block/qcow2.c
index 6ee94421e0..5f3713cd6f 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -42,6 +42,7 @@
 #include "qapi/qapi-visit-block-core.h"
 #include "crypto.h"
 #include "block/aio_task.h"
+#include "block/coroutines.h"
 
 /*
   Differences with QCOW:
@@ -1834,6 +1835,10 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
     s->inflight_writes_counters =
         g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
 
+    if (!has_data_file(bs) && (bs->file->bs->open_flags & BDRV_O_NOCACHE)) {
+        s->compressed_cache = seqcache_new(s->cluster_size);
+    }
+
     return ret;
 
  fail:
@@ -2653,6 +2658,91 @@ fail_nometa:
     return ret;
 }
 
+/*
+ * Flush one cluster of compressed cache if exists. If @unfinished is true and
+ * there is no finished cluster to flush, flush the unfinished one. Otherwise
+ * flush only finished clusters.
+ *
+ * Return 0 if nothing to flush, 1 if one cluster successfully flushed and <0 on
+ * error.
+ */
+static int coroutine_fn
+qcow2_co_compressed_flush_one(BlockDriverState *bs, bool unfinished)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int ret;
+    int64_t align = 1;
+    int64_t offset, bytes;
+    uint8_t *buf;
+
+    if (!s->compressed_cache) {
+        return 0;
+    }
+
+    if (!seqcache_get_next_flush(s->compressed_cache, &offset, &bytes, &buf,
+                                 &unfinished))
+    {
+        return 0;
+    }
+
+    qcow2_inflight_writes_inc(bs, offset, bytes);
+
+    /*
+     * Don't try to align-up unfinished cached cluster, parallel write to it is
+     * possible! For finished host clusters data in the tail of the cluster will
+     * be never used. So, take some good alignment for speed.
+     *
+     * Note also, that seqcache guarantees that allocated size of @buf is enough
+     * to fill the cluster up to the end, so we are safe to align up with
+     * align <= cluster_size.
+     */
+    if (!unfinished) {
+        align = MIN(s->cluster_size,
+                    MAX(s->data_file->bs->bl.request_alignment,
+                        bdrv_opt_mem_align(bs)));
+    }
+
+    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
+    ret = bdrv_co_pwrite(s->data_file, offset,
+                         QEMU_ALIGN_UP(offset + bytes, align) - offset,
+                         buf, 0);
+    if (ret < 0 && s->compressed_flush_ret == 0) {
+        /*
+         * The data that was "written" earlier is lost. We don't want to
+         * care with storing it somehow to retry flushing later. Also, how much
+         * data will we have to store, if not drop it after failed flush?
+         * After this point qcow2_co_flush_compressed_cache() (and
+         * qcow2_flush_compressed_cache()) never return success.
+         */
+        s->compressed_flush_ret = ret;
+    }
+
+    seqcache_discard_cluster(s->compressed_cache, offset);
+
+    qcow2_inflight_writes_dec(bs, offset, bytes);
+
+    return ret < 0 ? ret : 1;
+}
+
+int coroutine_fn qcow2_co_flush_compressed_cache(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+
+    if (s->compressed_cache) {
+        uint64_t nb_clusters = seqcache_nb_clusters(s->compressed_cache);
+
+        /*
+         * If parallel writes are possible we don't want to loop forever. So
+         * flush only clusters available at start of flush operation.
+         */
+        while (nb_clusters--) {
+            qcow2_co_compressed_flush_one(bs, true);
+        }
+    }
+
+    return s->compressed_flush_ret;
+}
+
 static int qcow2_inactivate(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
@@ -2667,6 +2757,13 @@ static int qcow2_inactivate(BlockDriverState *bs)
                           bdrv_get_device_or_node_name(bs));
     }
 
+    ret = qcow2_flush_compressed_cache(bs);
+    if (ret) {
+        result = ret;
+        error_report("Failed to flush the compressed write cache: %s",
+                     strerror(-ret));
+    }
+
     ret = qcow2_cache_flush(bs, s->l2_table_cache);
     if (ret) {
         result = ret;
@@ -2699,6 +2796,12 @@ static void qcow2_close(BlockDriverState *bs)
         qcow2_inactivate(bs);
     }
 
+    /*
+     * Cache should be flushed in qcow2_inactivate() and should be empty in
+     * inactive mode. So we are safe to free it.
+     */
+    seqcache_free(s->compressed_cache);
+
     cache_clean_timer_del(bs);
     qcow2_cache_destroy(s->l2_table_cache);
     qcow2_cache_destroy(s->refcount_block_cache);
@@ -4558,18 +4661,42 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
         goto fail;
     }
 
-    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
+    if (s->compressed_cache) {
+        /*
+         * It's important to do seqcache_write() in the same critical section
+         * (by s->lock) as qcow2_alloc_compressed_cluster_offset(), so that the
+         * cache is filled sequentially.
+         */
+        seqcache_write(s->compressed_cache, cluster_offset, out_len, out_buf);
 
-    qemu_co_mutex_unlock(&s->lock);
+        qemu_co_mutex_unlock(&s->lock);
 
-    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
-    ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
+        ret = qcow2_co_compressed_flush_one(bs, false);
+        if (ret < 0) {
+            /*
+             * if ret < 0 it probably not this request which failed, but some
+             * previous one that was cached. Moreover, when we write to the
+             * cache we probably may always report success, because
+             * seqcache_write() never fails. Still, it's better to inform
+             * compressed backup now then wait until final flush.
+             */
+            goto fail;
+        }
+    } else {
+        qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
 
-    qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
+        qemu_co_mutex_unlock(&s->lock);
 
-    if (ret < 0) {
-        goto fail;
+        BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
+        ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
+
+        qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
+
+        if (ret < 0) {
+            goto fail;
+        }
     }
+
 success:
     ret = 0;
 fail:
@@ -4681,10 +4808,19 @@ qcow2_co_preadv_compressed(BlockDriverState *bs,
 
     out_buf = qemu_blockalign(bs, s->cluster_size);
 
-    BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
-    ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
-    if (ret < 0) {
-        goto fail;
+    /*
+     * seqcache_read may return less bytes than csize, as csize may exceed
+     * actual compressed data size. So we are OK if seqcache_read returns
+     * something > 0.
+     */
+    if (!s->compressed_cache ||
+        seqcache_read(s->compressed_cache, coffset, csize, buf) <= 0)
+    {
+        BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
+        ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
+        if (ret < 0) {
+            goto fail;
+        }
     }
 
     if (qcow2_co_decompress(bs, out_buf, s->cluster_size, buf, csize) < 0) {
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 1/6] block-jobs: flush target at the end of .run()
  2021-03-05 17:35 ` [PATCH v3 1/6] block-jobs: flush target at the end of .run() Vladimir Sementsov-Ogievskiy
@ 2021-03-11 16:57   ` Max Reitz
  0 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-03-11 16:57 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
> We are going to implement compressed write cache to improve performance
> of compressed backup when target is opened in O_DIRECT mode. We
> definitely want to flush the cache at backup finish, and if flush fails
> it should be reported as block-job failure, not simply ignored in
> bdrv_close(). So, teach all block-jobs to flush their targets at the
> end.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   include/block/blockjob_int.h | 18 ++++++++++++++++++
>   block/backup.c               |  8 +++++---
>   block/commit.c               |  2 ++
>   block/mirror.c               |  2 ++
>   block/stream.c               |  2 ++
>   blockjob.c                   | 16 ++++++++++++++++
>   6 files changed, 45 insertions(+), 3 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

Just a nit on the function’s description.

> diff --git a/include/block/blockjob_int.h b/include/block/blockjob_int.h
> index 6633d83da2..6ef3123120 100644
> --- a/include/block/blockjob_int.h
> +++ b/include/block/blockjob_int.h
> @@ -119,4 +119,22 @@ int64_t block_job_ratelimit_get_delay(BlockJob *job, uint64_t n);
>   BlockErrorAction block_job_error_action(BlockJob *job, BlockdevOnError on_err,
>                                           int is_read, int error);
>   
> +/**
> + * block_job_final_target_flush:
> + * @job: The job to signal an error for if flush failed.
> + * @target_bs: The bs to flush.
> + * @ret: Will be updated (to return code of bdrv_flush()) only if it is zero
> + *       now. This is a bit unusual interface but all callers are comfortable
> + *       with it.
> + *
> + * The function is intended to be called at the end of .run() for any data
> + * copying job.
> + *
> + * There are may be some internal caches in format layers of target,
-are, *in the format layers of the target

> + * like compressed_cache in qcow2 format. So we should call flush to
> + * be sure that all data reached the destination protocol layer.
> + */
> +void block_job_final_target_flush(BlockJob *job, BlockDriverState *target_bs,
> +                                  int *ret);
> +
>   #endif



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-05 17:35 ` [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard Vladimir Sementsov-Ogievskiy
@ 2021-03-11 19:58   ` Max Reitz
  2021-03-12  9:09     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Max Reitz @ 2021-03-11 19:58 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
> There is a bug in qcow2: host cluster can be discarded (refcount
> becomes 0) and reused during data write. In this case data write may
> pollute another cluster (recently allocated) or even metadata.

I was about to ask whether we couldn’t somehow let discard requests 
check overlapping I/O with the tracked_request infrastructure from 
block/io.c (or perhaps even just let them be serializing).

But I suppose there may be other reasons for clusters being discarded 
other than explicit discard requests, so we have to have something in 
qcow2...

> To fix the issue let's track inflight writes to host cluster in the
> hash table and consider new counter when discarding and reusing host
> clusters.

This didn’t really explain anything that would help me understand what’s 
going on in this patch.  The patch itself doesn’t really help me with 
comments either.  It’s possible that I’m just daft, but honestly I have 
a hard time really understanding what’s supposed to be going on.

Coming back from jumping all over this patch, I also wonder why this 
isn’t split in multiple patches, where the first introduces the 
infrastructure and explains exactly what’s going on, and the next 
patches make use of it so I could understand what each check etc. is 
supposed to represent and do.


To serve as an example, after reading through the patch, I still have no 
idea how it prevents that discard problem you’re trying to fix.  Maybe 
I’m lazy, but I would have liked exactly that to be explained by this 
commit message.  Actually, I don’t even understand the problem just from 
this commit message alone, but I could resort to HEAD^ and some 
thinking.  Not sure if that’s ideal, though.

So the commit message says that “host cluster can be discarded and 
reused during data write”, which to me just seems like the exact 
behavior we want.  The second sentence describes a problem, but doesn’t 
say how reclaiming discarded clusters leads there.

I suppose what leads to the problem is that there first needs to be a 
background write even before the discard that is settled only after the 
re-claiming write (which isn’t mentioned).

As far as I understand, this patch addresses that problem by preventing 
the discarded clusters from being allocated while said background write 
is not settled yet.  This is done by keeping track of all host clusters 
that are currently the target of some write operation, and preventing 
them from being allocated.  OK, makes sense.
(This latter part does roughly correspond to the second paragraph of 
this commit message, but I couldn’t make sense of it without 
understanding why we’d do it.  What’s notably missing from my 
explanation is the part that you hinted at with “when discarding”, but 
my problem is that that I just don’t understand what that means at all. 
  Perhaps it has to do with how I don’t understand the change to 
update_refcount(), and consequently the whole “slow path” in 
update_inflight_write_cnt().)


Side note: Intuitively, this hash table feels like an unnecessarily big 
hammer to me, but I can’t think of a better solution either, so.  (I 
mainly don’t like how every write request will result in one allocation 
and hash table entry per cluster it touches, even when no data cluster 
is ever discarded.)

> Enable qcow2-discard-during-rewrite as it is fixed.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/qcow2.h                                 |   9 ++
>   block/qcow2-refcount.c                        | 149 +++++++++++++++++-
>   block/qcow2.c                                 |  26 ++-
>   .../tests/qcow2-discard-during-rewrite        |   2 +-
>   4 files changed, 181 insertions(+), 5 deletions(-)
> 
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 0678073b74..e18d8dfbae 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
>        * is to convert the image with the desired compression type set.
>        */
>       Qcow2CompressionType compression_type;
> +
> +    GHashTable *inflight_writes_counters;
>   } BDRVQcow2State;
>   
>   typedef struct Qcow2COWRegion {
> @@ -896,6 +898,13 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
>   int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
>   int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
>   
> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
> +                              int64_t length);
> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
> +                              int64_t length);
> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
> +                                     int64_t length);
> +
>   /* qcow2-cluster.c functions */
>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>                           bool exact_size);
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 8e649b008e..464d133368 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -799,6 +799,140 @@ found:
>       }
>   }
>   
> +/*
> + * Qcow2InFlightRefcount is a type for values of s->inflight_writes_counters
> + * hasm map. And it's keys are cluster indices.

*hash, *its

I’d rather document this for s->inflight_writes_counters, though (like 
“/* cluster index (int64_t) -> Qcow2InFlightRefcount */”)

> + */
> +typedef struct Qcow2InFlightRefcount {
> +    /*
> +     * Number of in-flight writes to the cluster, always > 0, as when becomes

s/when becomes/when it becomes/

> +     * 0 the entry is removed from s->inflight_writes_counters.
> +     */
> +    uint64_t inflight_writes_cnt;
> +
> +    /* Cluster refcount is known to be zero */
> +    bool refcount_zero;
> +
> +    /* Cluster refcount was made zero with this discard-type */
> +    enum qcow2_discard_type type;
> +} Qcow2InFlightRefcount;
> +
> +static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
> +                                           int64_t cluster_index)
> +{
> +    Qcow2InFlightRefcount *infl;
> +
> +    if (!s->inflight_writes_counters) {
> +        return NULL;
> +    }
> +
> +    infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
> +
> +    if (infl) {
> +        assert(infl->inflight_writes_cnt > 0);
> +    }
> +
> +    return infl;
> +}
> +
> +/*
> + * Returns true if there are any in-flight writes to the cluster blocking
> + * its reallocation.
> + */
> +static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
> +{
> +    return !!find_infl_wr(s, cluster_index);

I wonder if g_hash_table_contains() would be quicker. *shrug*

> +}
> +
> +static int update_inflight_write_cnt(BlockDriverState *bs,
> +                                     int64_t offset, int64_t length,
> +                                     bool decrease, bool locked)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int64_t start, last, cluster_offset;
> +
> +    if (locked) {
> +        qemu_co_mutex_assert_locked(&s->lock);
> +    }
> +
> +    start = start_of_cluster(s, offset);
> +    last = start_of_cluster(s, offset + length - 1);
> +    for (cluster_offset = start; cluster_offset <= last;
> +         cluster_offset += s->cluster_size)
> +    {
> +        int64_t cluster_index = cluster_offset >> s->cluster_bits;
> +        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
> +
> +        if (!infl) {
> +            infl = g_new0(Qcow2InFlightRefcount, 1);
> +            g_hash_table_insert(s->inflight_writes_counters,
> +                                g_memdup(&cluster_index, sizeof(cluster_index)),
> +                                infl);

I suppose we could save one allocation by putting the cluster index into 
Qcow2InFlightRefcount and then giving the key as &infl->cluster_index. 
Not sure if that’s too nasty, though.

> +        }
> +
> +        if (decrease) {
> +            assert(infl->inflight_writes_cnt >= 1);
> +            infl->inflight_writes_cnt--;
> +        } else {
> +            infl->inflight_writes_cnt++;
> +        }
> +
> +        if (infl->inflight_writes_cnt == 0) {
> +            bool refcount_zero = infl->refcount_zero;
> +            enum qcow2_discard_type type = infl->type;
> +
> +            g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
> +
> +            if (refcount_zero) {
> +                /*
> +                 * Slow path. We must reset normal refcount to actually release

I don’t understand what “slow path” means in this context.  (Nor do I 
really understand the rest, but more on that below.)

> +                 * the cluster.
> +                 */
> +                int ret;
> +
> +                if (!locked) {
> +                    qemu_co_mutex_lock(&s->lock);
> +                }
> +                ret = qcow2_update_cluster_refcount(bs, cluster_index, 0,
> +                                                    true, type);

I don’t understand this, but then again I’m writing this very early in 
my review, and at this point I don’t understand anything, really.  (The 
comment doesn’t explain to me what’s happening here.)

Revisiting later, refcount_zero is set by update_refcount() when the 
refcount drops to zero, so invoking this function here will finalize the 
change.  I don’t quite understand what’s going on in update_refcount(), 
though.

(And even after finding out why, I don’t understand why the comment 
doesn’t explain why we must invoke qcow2_update_cluster_refcount() with 
addend=0.)

> +                if (!locked) {
> +                    qemu_co_mutex_unlock(&s->lock);
> +                }
> +
> +                if (ret < 0) {
> +                    return ret;
> +                }
> +            }
> +        }
> +
> +    }
> +
> +    return 0;
> +}
> +
> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
> +                              int64_t length)
> +{
> +    return update_inflight_write_cnt(bs, offset, length, false, false);
> +}
> +
> +/*
> + * Called with s->lock not locked by caller. Will take s->lock only if need to
> + * release the cluster (refcount is 0 and inflight-write-cnt becomes zero).
> + */

Why doesn’t qcow2_inflight_writes_inc() get the same comment?

...Oh, because @locked doesn’t have an influence there.  Why isn’t it 
worth a comment that *_inc() works both with a locked and an unlocked mutex?

> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
> +                              int64_t length)
> +{
> +    return update_inflight_write_cnt(bs, offset, length, true, false);
> +}
> +
> +/* Called with s->lock locked. */
> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
> +                                     int64_t length)
> +{
> +    return update_inflight_write_cnt(bs, offset, length, true, true);
> +}
> +
>   /* XXX: cache several refcount block clusters ? */
>   /* @addend is the absolute value of the addend; if @decrease is set, @addend
>    * will be subtracted from the current refcount, otherwise it will be added */
> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>   
>           if (refcount == 0) {
>               void *table;
> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
> +
> +            if (infl) {
> +                infl->refcount_zero = true;	
> +                infl->type = type;
> +                continue;
> +            }

I don’t understand what this is supposed to do exactly.  It seems like 
it wants to keep metadata structures in the cache that are still in use 
(because dropping them from the caches is what happens next), but users 
of metadata structures won’t set in-flight counters for those metadata 
structures, will they?

(I also wondered why the continue wasn’t put before the 
s->set_refcount() call, but I suppose the same effect is had by the 
has_infl_wr() check in alloc_clusters_noref() and 
qcow2_alloc_clusters_at().)

>   
>               table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>                                                   offset);
> @@ -983,7 +1124,7 @@ retry:
>   
>           if (ret < 0) {
>               return ret;
> -        } else if (refcount != 0) {
> +        } else if (refcount != 0 || has_infl_wr(s, next_cluster_index)) {
>               goto retry;
>           }
>       }
> @@ -1046,7 +1187,7 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
>               ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
>               if (ret < 0) {
>                   return ret;
> -            } else if (refcount != 0) {
> +            } else if (refcount != 0 || has_infl_wr(s, cluster_index)) {
>                   break;
>               }
>           }
> @@ -2294,7 +2435,9 @@ static int64_t alloc_clusters_imrt(BlockDriverState *bs,
>            contiguous_free_clusters < cluster_count;
>            cluster++)
>       {
> -        if (!s->get_refcount(*refcount_table, cluster)) {
> +        if (!s->get_refcount(*refcount_table, cluster) &&
> +            !has_infl_wr(s, cluster))

Is this really necessary?

> +        {
>               contiguous_free_clusters++;
>               if (first_gap) {
>                   /* If this is the first free cluster found, update
> diff --git a/block/qcow2.c b/block/qcow2.c
> index d9f49a52e7..6ee94421e0 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c

[...]

> @@ -4536,13 +4553,20 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>       }
>   
>       ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len, true);
> -    qemu_co_mutex_unlock(&s->lock);
>       if (ret < 0) {
> +        qemu_co_mutex_unlock(&s->lock);
>           goto fail;
>       }
>   
> +    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);

It was my impression that this function could be called either with or 
without a lock, that it wouldn’t really matter.  Well, at least that’s 
what I figured out for lack of a comment regarding the contract how it 
is to be invoked.

Max

> +
> +    qemu_co_mutex_unlock(&s->lock);
> +
>       BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>       ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
> +
> +    qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
> +
>       if (ret < 0) {
>           goto fail;
>       }



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-11 19:58   ` Max Reitz
@ 2021-03-12  9:09     ` Vladimir Sementsov-Ogievskiy
  2021-03-12 11:17       ` Max Reitz
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-12  9:09 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

11.03.2021 22:58, Max Reitz wrote:
> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>> There is a bug in qcow2: host cluster can be discarded (refcount
>> becomes 0) and reused during data write. In this case data write may
>> pollute another cluster (recently allocated) or even metadata.
> 
> I was about to ask whether we couldn’t somehow let discard requests check overlapping I/O with the tracked_request infrastructure from block/io.c (or perhaps even just let them be serializing).
> 
> But I suppose there may be other reasons for clusters being discarded other than explicit discard requests, so we have to have something in qcow2...

For example, host cluster which contain exactly one compressed cluster may be discarded when corresponding guest cluster is rewritten by non-compressed write.

Also, (not very good argument for fixing existing bug but still) serializing will not help with compressed-cache, because with compressed cache we'll have data writes (cache flushes) not related to current in-flight request. And that leads us to something like serializing of operations with .file child of qcow2 node. But we can't do it, as "discarding" a host cluster is operation with metadata, which may do no operation with underlying .file child.

> 
>> To fix the issue let's track inflight writes to host cluster in the
>> hash table and consider new counter when discarding and reusing host
>> clusters.
> 
> This didn’t really explain anything that would help me understand what’s going on in this patch.  The patch itself doesn’t really help me with comments either.  It’s possible that I’m just daft, but honestly I have a hard time really understanding what’s supposed to be going on.

Sorry for this. I believe into your experience of understanding my patches, so that shows that something is wrong with the patch itself)

> 
> Coming back from jumping all over this patch, I also wonder why this isn’t split in multiple patches, where the first introduces the infrastructure and explains exactly what’s going on, and the next patches make use of it so I could understand what each check etc. is supposed to represent and do.

OK

> 
> 
> To serve as an example, after reading through the patch, I still have no idea how it prevents that discard problem you’re trying to fix.  Maybe I’m lazy, but I would have liked exactly that to be explained by this commit message.  Actually, I don’t even understand the problem just from this commit message alone, but I could resort to HEAD^ and some thinking.  Not sure if that’s ideal, though.

The problem:

1. Start write to qcow2. Assume guest cluster G and corresponding host cluster is H.

2. The write requests come to the point of data writing to .file. The write to .file is started and qcow2 mutex is unlocked.

3. At this time refcount of H becomes 0. For example, it may be due to discard operation on qcow2 node, or rewriting compressed data by normal write, or some operation with snapshots..

4. Next, some operations occurs and leads to allocation of H for some other needs: it may be another write-to-qcow2-node operation, or allocation of L2 table or some other data or metadata cluster allocation.

5. So, at this point H is used for something other. Assume, L2 table is written into H.

6. And now, our write from [2] finishes. And pollutes L2 table in H. That's a bug.


The problem becomes more possible with compressed write cache, because any flush operation may trigger deferred data writes, so problem can be reproduced like:

1. Write to guest cluster G2, which triggers flushing of data to host cluster H, which is unrelated to G2.

2. Discard cluster G in parallel

[...]

So, problem is simply triggered by parallel write and discard to different guest clusters. That's why I should do something with this old (and most probably never triggered) problem in qcow2 driver prior to implementing compressed cache.


Hmm, probably we can avoid all these difficulties by implementing compressed cache as a filter driver between qcow2 and its .file nodes.. I tried to do it, encountered some problems, but I don't remember what exactly. Probably I should return to this try. Some obvious problems:

1. Metadata write will go through cache, and we should handle it somehow to not break the sequence of unaligned chunks. Simplest way is to add BDRV_REQ_COLORED flag and do compressed data writes to .file with this flag. And cache filter would track this flag for sequential caching.

2. We can't enable caching by default this way, so user will have to explicitly add a filter.. And actually, it's a real problem of qcow2 driver that in O_DIRECT mode it writes compressed data by small unaligned chunks. So, good to fix it inside qcow2..

With filter we avoid the problem of parallel reuse of host cluster, as all writes go through the cache and flushes will be serialized. So, the problem in qcow2 driver is not enlarged by the cache and we can leave it unfixed..

> 
> So the commit message says that “host cluster can be discarded and reused during data write”, which to me just seems like the exact behavior we want.  The second sentence describes a problem, but doesn’t say how reclaiming discarded clusters leads there.
> 
> I suppose what leads to the problem is that there first needs to be a background write even before the discard that is settled only after the re-claiming write (which isn’t mentioned).
> 
> As far as I understand, this patch addresses that problem by preventing the discarded clusters from being allocated while said background write is not settled yet.  This is done by keeping track of all host clusters that are currently the target of some write operation, and preventing them from being allocated.  OK, makes sense.
> (This latter part does roughly correspond to the second paragraph of this commit message, but I couldn’t make sense of it without understanding why we’d do it.  What’s notably missing from my explanation is the part that you hinted at with “when discarding”, but my problem is that that I just don’t understand what that means at all.  Perhaps it has to do with how I don’t understand the change to update_refcount(), and consequently the whole “slow path” in update_inflight_write_cnt().)


Let me explain:

What is the problem? User uses dynamic object, and the object disappear during the usage. Use-after-free follows. This means that we probably want reference counting for the object, so that user will keep the reference during usage, and object is freed when refcount becomes 0. But for qcow2 host clusters we already have reference counting! Still, we don't want to simply reuse qcow2 refcounts as normal reference counters for two reasons:

1. qcow2 refcounts is part of qcow2 metadata and strictly defined. Qcow2 refcounts shows only usage of the host cluster by other objects in qcow2 metadata. But what we want is usage by the process of writing data.
2. we don't want to do additional read/write of refcounts from/to physical drive

So, what comes to my mind is and additional "dynamic" in-RAM reference counter for host cluster. So that actual full reference count of host cluster is qcow2-refcount + inflight-write-cnt. So, what we need for this:

1. calculate these new inflight-write-cnt, increase it before data write and decrease after.

2. update the core thing of reference counting: freeing the object. Currently the object is "freed" when qcow2-refcoutn becomes 0, that's the code in update_recounts(). But we want to postpone freeing until full reference count becomes zero, i.e. both qcow2-refcount is zero and inflight-write-cnt is zero. So, if qcow2-refcount becomes zero but inflight-write-cnt is nonzero, we postpone discarding the cluster, storing needed information in Qcow2InFlightRefcount. And when inflight-write-cnt becomes zero (and we see this information) we call update_refcount(addend=0) to trigger actual discarding of the cluster.

3. we should consider full reference count when search for free host clusters.

> 
> 
> Side note: Intuitively, this hash table feels like an unnecessarily big hammer to me, but I can’t think of a better solution either, so.  (I mainly don’t like how every write request will result in one allocation and hash table entry per cluster it touches, even when no data cluster is ever discarded.)

We can have a list of inflight-data-writes instead. I afraid, we can't reuse tracked requests of underlying .file node, because we should add inflight-write to the list before releasing the qcow2 mutex.

> 
>> Enable qcow2-discard-during-rewrite as it is fixed.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/qcow2.h                                 |   9 ++
>>   block/qcow2-refcount.c                        | 149 +++++++++++++++++-
>>   block/qcow2.c                                 |  26 ++-
>>   .../tests/qcow2-discard-during-rewrite        |   2 +-
>>   4 files changed, 181 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index 0678073b74..e18d8dfbae 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
>>        * is to convert the image with the desired compression type set.
>>        */
>>       Qcow2CompressionType compression_type;
>> +
>> +    GHashTable *inflight_writes_counters;
>>   } BDRVQcow2State;
>>   typedef struct Qcow2COWRegion {
>> @@ -896,6 +898,13 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
>>   int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
>>   int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
>> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
>> +                              int64_t length);
>> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
>> +                              int64_t length);
>> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
>> +                                     int64_t length);
>> +
>>   /* qcow2-cluster.c functions */
>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>                           bool exact_size);
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 8e649b008e..464d133368 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -799,6 +799,140 @@ found:
>>       }
>>   }
>> +/*
>> + * Qcow2InFlightRefcount is a type for values of s->inflight_writes_counters
>> + * hasm map. And it's keys are cluster indices.
> 
> *hash, *its
> 
> I’d rather document this for s->inflight_writes_counters, though (like “/* cluster index (int64_t) -> Qcow2InFlightRefcount */”)
> 
>> + */
>> +typedef struct Qcow2InFlightRefcount {
>> +    /*
>> +     * Number of in-flight writes to the cluster, always > 0, as when becomes
> 
> s/when becomes/when it becomes/
> 
>> +     * 0 the entry is removed from s->inflight_writes_counters.
>> +     */
>> +    uint64_t inflight_writes_cnt;
>> +
>> +    /* Cluster refcount is known to be zero */
>> +    bool refcount_zero;
>> +
>> +    /* Cluster refcount was made zero with this discard-type */
>> +    enum qcow2_discard_type type;
>> +} Qcow2InFlightRefcount;
>> +
>> +static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
>> +                                           int64_t cluster_index)
>> +{
>> +    Qcow2InFlightRefcount *infl;
>> +
>> +    if (!s->inflight_writes_counters) {
>> +        return NULL;
>> +    }
>> +
>> +    infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
>> +
>> +    if (infl) {
>> +        assert(infl->inflight_writes_cnt > 0);
>> +    }
>> +
>> +    return infl;
>> +}
>> +
>> +/*
>> + * Returns true if there are any in-flight writes to the cluster blocking
>> + * its reallocation.
>> + */
>> +static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
>> +{
>> +    return !!find_infl_wr(s, cluster_index);
> 
> I wonder if g_hash_table_contains() would be quicker. *shrug*
> 
>> +}
>> +
>> +static int update_inflight_write_cnt(BlockDriverState *bs,
>> +                                     int64_t offset, int64_t length,
>> +                                     bool decrease, bool locked)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    int64_t start, last, cluster_offset;
>> +
>> +    if (locked) {
>> +        qemu_co_mutex_assert_locked(&s->lock);
>> +    }
>> +
>> +    start = start_of_cluster(s, offset);
>> +    last = start_of_cluster(s, offset + length - 1);
>> +    for (cluster_offset = start; cluster_offset <= last;
>> +         cluster_offset += s->cluster_size)
>> +    {
>> +        int64_t cluster_index = cluster_offset >> s->cluster_bits;
>> +        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>> +
>> +        if (!infl) {
>> +            infl = g_new0(Qcow2InFlightRefcount, 1);
>> +            g_hash_table_insert(s->inflight_writes_counters,
>> +                                g_memdup(&cluster_index, sizeof(cluster_index)),
>> +                                infl);
> 
> I suppose we could save one allocation by putting the cluster index into Qcow2InFlightRefcount and then giving the key as &infl->cluster_index. Not sure if that’s too nasty, though.

Allocations are a lot faster than IO.. So, I don't think we'll see the difference in normal cases.

> 
>> +        }
>> +
>> +        if (decrease) {
>> +            assert(infl->inflight_writes_cnt >= 1);
>> +            infl->inflight_writes_cnt--;
>> +        } else {
>> +            infl->inflight_writes_cnt++;
>> +        }
>> +
>> +        if (infl->inflight_writes_cnt == 0) {
>> +            bool refcount_zero = infl->refcount_zero;
>> +            enum qcow2_discard_type type = infl->type;
>> +
>> +            g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
>> +
>> +            if (refcount_zero) {
>> +                /*
>> +                 * Slow path. We must reset normal refcount to actually release
> 
> I don’t understand what “slow path” means in this context.  (Nor do I really understand the rest, but more on that below.)


In most cases inc/dev of inflight-writes-cnt is fast: it only update in-memory structure. But when full reference count of the cluster becomes zero we call real update_refcount to trigger discarding of the cluster.. This may be slower. Probably the needed code should be moved from update_refcount to separate function like host_cluster_free(), to not cheat with addend=0.


> 
>> +                 * the cluster.
>> +                 */
>> +                int ret;
>> +
>> +                if (!locked) {
>> +                    qemu_co_mutex_lock(&s->lock);
>> +                }
>> +                ret = qcow2_update_cluster_refcount(bs, cluster_index, 0,
>> +                                                    true, type);
> 
> I don’t understand this, but then again I’m writing this very early in my review, and at this point I don’t understand anything, really.  (The comment doesn’t explain to me what’s happening here.)
> 
> Revisiting later, refcount_zero is set by update_refcount() when the refcount drops to zero, so invoking this function here will finalize the change.  I don’t quite understand what’s going on in update_refcount(), though.
> 
> (And even after finding out why, I don’t understand why the comment doesn’t explain why we must invoke qcow2_update_cluster_refcount() with addend=0.)

Sorry for the mess. Here we want to trigger the code in update_refcount() that is freeing host cluster.. I.e. the code that runs when refcount becomes zero.

> 
>> +                if (!locked) {
>> +                    qemu_co_mutex_unlock(&s->lock);
>> +                }
>> +
>> +                if (ret < 0) {
>> +                    return ret;
>> +                }
>> +            }
>> +        }
>> +
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
>> +                              int64_t length)
>> +{
>> +    return update_inflight_write_cnt(bs, offset, length, false, false);
>> +}
>> +
>> +/*
>> + * Called with s->lock not locked by caller. Will take s->lock only if need to
>> + * release the cluster (refcount is 0 and inflight-write-cnt becomes zero).
>> + */
> 
> Why doesn’t qcow2_inflight_writes_inc() get the same comment?
> 
> ...Oh, because @locked doesn’t have an influence there.  Why isn’t it worth a comment that *_inc() works both with a locked and an unlocked mutex?

OK, it worth

> 
>> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
>> +                              int64_t length)
>> +{
>> +    return update_inflight_write_cnt(bs, offset, length, true, false);
>> +}
>> +
>> +/* Called with s->lock locked. */
>> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
>> +                                     int64_t length)
>> +{
>> +    return update_inflight_write_cnt(bs, offset, length, true, true);
>> +}
>> +
>>   /* XXX: cache several refcount block clusters ? */
>>   /* @addend is the absolute value of the addend; if @decrease is set, @addend
>>    * will be subtracted from the current refcount, otherwise it will be added */
>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>           if (refcount == 0) {
>>               void *table;
>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>> +
>> +            if (infl) {
>> +                infl->refcount_zero = true;
>> +                infl->type = type;
>> +                continue;
>> +            }
> 
> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?

Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().

> 
> (I also wondered why the continue wasn’t put before the s->set_refcount() call, but I suppose the same effect is had by the has_infl_wr() check in alloc_clusters_noref() and qcow2_alloc_clusters_at().)

The only change of update_refcount() logic is postponing of freeing the cluster. So, handling of qcow2-refacount is kept as is. qcow2-refcount becomes zero, and s->set_refcount() is called. inflight-write-cnt is a separate thing, not the cache for qcow2-refcount.

> 
>>               table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>>                                                   offset);
>> @@ -983,7 +1124,7 @@ retry:
>>           if (ret < 0) {
>>               return ret;
>> -        } else if (refcount != 0) {
>> +        } else if (refcount != 0 || has_infl_wr(s, next_cluster_index)) {
>>               goto retry;
>>           }
>>       }
>> @@ -1046,7 +1187,7 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
>>               ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
>>               if (ret < 0) {
>>                   return ret;
>> -            } else if (refcount != 0) {
>> +            } else if (refcount != 0 || has_infl_wr(s, cluster_index)) {
>>                   break;
>>               }
>>           }
>> @@ -2294,7 +2435,9 @@ static int64_t alloc_clusters_imrt(BlockDriverState *bs,
>>            contiguous_free_clusters < cluster_count;
>>            cluster++)
>>       {
>> -        if (!s->get_refcount(*refcount_table, cluster)) {
>> +        if (!s->get_refcount(*refcount_table, cluster) &&
>> +            !has_infl_wr(s, cluster))
> 
> Is this really necessary?

Yes. Everywhere when we want free cluster, we should check that full reference count is zero, which is qcow2-refcount + inflight-write-cnt.. Which is equal to check that both qcow2-refcount and inflight-write-cnt are zero. And for for zero inflight-write-cnt it must just be absent in the hash.

> 
>> +        {
>>               contiguous_free_clusters++;
>>               if (first_gap) {
>>                   /* If this is the first free cluster found, update
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index d9f49a52e7..6ee94421e0 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
> 
> [...]
> 
>> @@ -4536,13 +4553,20 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>>       }
>>       ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len, true);
>> -    qemu_co_mutex_unlock(&s->lock);
>>       if (ret < 0) {
>> +        qemu_co_mutex_unlock(&s->lock);
>>           goto fail;
>>       }
>> +    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
> 
> It was my impression that this function could be called either with or without a lock, that it wouldn’t really matter.  Well, at least that’s what I figured out for lack of a comment regarding the contract how it is to be invoked.
> 

I'm sorry:(


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12  9:09     ` Vladimir Sementsov-Ogievskiy
@ 2021-03-12 11:17       ` Max Reitz
  2021-03-12 12:32         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Max Reitz @ 2021-03-12 11:17 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
> 11.03.2021 22:58, Max Reitz wrote:
>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>> becomes 0) and reused during data write. In this case data write may
>>> pollute another cluster (recently allocated) or even metadata.
>>
>> I was about to ask whether we couldn’t somehow let discard requests 
>> check overlapping I/O with the tracked_request infrastructure from 
>> block/io.c (or perhaps even just let them be serializing).
>>
>> But I suppose there may be other reasons for clusters being discarded 
>> other than explicit discard requests, so we have to have something in 
>> qcow2...
> 
> For example, host cluster which contain exactly one compressed cluster 
> may be discarded when corresponding guest cluster is rewritten by 
> non-compressed write.
> 
> Also, (not very good argument for fixing existing bug but still) 
> serializing will not help with compressed-cache, because with compressed 
> cache we'll have data writes (cache flushes) not related to current 
> in-flight request. And that leads us to something like serializing of 
> operations with .file child of qcow2 node. But we can't do it, as 
> "discarding" a host cluster is operation with metadata, which may do no 
> operation with underlying .file child.

Yes, makes sense.  I think it is a good argument, because the point is 
that qcow2 host cluster discards are not necessarily related to guest 
discards.  The compressed cache is just one new example for that.

>>> To fix the issue let's track inflight writes to host cluster in the
>>> hash table and consider new counter when discarding and reusing host
>>> clusters.
>>
>> This didn’t really explain anything that would help me understand 
>> what’s going on in this patch.  The patch itself doesn’t really help 
>> me with comments either.  It’s possible that I’m just daft, but 
>> honestly I have a hard time really understanding what’s supposed to be 
>> going on.
> 
> Sorry for this. I believe into your experience of understanding my 
> patches, so that shows that something is wrong with the patch itself)
> 
>>
>> Coming back from jumping all over this patch, I also wonder why this 
>> isn’t split in multiple patches, where the first introduces the 
>> infrastructure and explains exactly what’s going on, and the next 
>> patches make use of it so I could understand what each check etc. is 
>> supposed to represent and do.
> 
> OK
> 
>>
>>
>> To serve as an example, after reading through the patch, I still have 
>> no idea how it prevents that discard problem you’re trying to fix.  
>> Maybe I’m lazy, but I would have liked exactly that to be explained by 
>> this commit message.  Actually, I don’t even understand the problem 
>> just from this commit message alone, but I could resort to HEAD^ and 
>> some thinking.  Not sure if that’s ideal, though.
> 
> The problem:
> 
> 1. Start write to qcow2. Assume guest cluster G and corresponding host 
> cluster is H.
> 
> 2. The write requests come to the point of data writing to .file. The 
> write to .file is started and qcow2 mutex is unlocked.
> 
> 3. At this time refcount of H becomes 0. For example, it may be due to 
> discard operation on qcow2 node, or rewriting compressed data by normal 
> write, or some operation with snapshots..
> 
> 4. Next, some operations occurs and leads to allocation of H for some 
> other needs: it may be another write-to-qcow2-node operation, or 
> allocation of L2 table or some other data or metadata cluster allocation.
> 
> 5. So, at this point H is used for something other. Assume, L2 table is 
> written into H.
> 
> 6. And now, our write from [2] finishes. And pollutes L2 table in H. 
> That's a bug.

Understood.

> The problem becomes more possible with compressed write cache, because 
> any flush operation may trigger deferred data writes, so problem can be 
> reproduced like:
> 
> 1. Write to guest cluster G2, which triggers flushing of data to host 
> cluster H, which is unrelated to G2.
> 
> 2. Discard cluster G in parallel
> 
> [...]
> 
> So, problem is simply triggered by parallel write and discard to 
> different guest clusters. That's why I should do something with this old 
> (and most probably never triggered) problem in qcow2 driver prior to 
> implementing compressed cache.

OK!  That makes sense.

> Hmm, probably we can avoid all these difficulties by implementing 
> compressed cache as a filter driver between qcow2 and its .file nodes..

As I see it, you’ve now opened the can of worms, and I’m not sure we can 
just close it and say we haven’t seen any problem. :)

So, regardless of the compressed cache, I think this should be fixed.

> I tried to do it, encountered some problems, but I don't remember what 
> exactly. Probably I should return to this try. Some obvious problems:
> 
> 1. Metadata write will go through cache, and we should handle it somehow 
> to not break the sequence of unaligned chunks. Simplest way is to add 
> BDRV_REQ_COLORED flag and do compressed data writes to .file with this 
> flag. And cache filter would track this flag for sequential caching.
> 
> 2. We can't enable caching by default this way, so user will have to 
> explicitly add a filter.. And actually, it's a real problem of qcow2 
> driver that in O_DIRECT mode it writes compressed data by small 
> unaligned chunks. So, good to fix it inside qcow2..

Agreed.

> With filter we avoid the problem of parallel reuse of host cluster, as 
> all writes go through the cache and flushes will be serialized. So, the 
> problem in qcow2 driver is not enlarged by the cache and we can leave it 
> unfixed..

Not sure I agree. ;)

>> So the commit message says that “host cluster can be discarded and 
>> reused during data write”, which to me just seems like the exact 
>> behavior we want.  The second sentence describes a problem, but 
>> doesn’t say how reclaiming discarded clusters leads there.
>>
>> I suppose what leads to the problem is that there first needs to be a 
>> background write even before the discard that is settled only after 
>> the re-claiming write (which isn’t mentioned).
>>
>> As far as I understand, this patch addresses that problem by 
>> preventing the discarded clusters from being allocated while said 
>> background write is not settled yet.  This is done by keeping track of 
>> all host clusters that are currently the target of some write 
>> operation, and preventing them from being allocated.  OK, makes sense.
>> (This latter part does roughly correspond to the second paragraph of 
>> this commit message, but I couldn’t make sense of it without 
>> understanding why we’d do it.  What’s notably missing from my 
>> explanation is the part that you hinted at with “when discarding”, but 
>> my problem is that that I just don’t understand what that means at 
>> all.  Perhaps it has to do with how I don’t understand the change to 
>> update_refcount(), and consequently the whole “slow path” in 
>> update_inflight_write_cnt().)
> 
> 
> Let me explain:
> 
> What is the problem? User uses dynamic object, and the object disappear 
> during the usage. Use-after-free follows. This means that we probably 
> want reference counting for the object, so that user will keep the 
> reference during usage, and object is freed when refcount becomes 0. But 
> for qcow2 host clusters we already have reference counting!

Makes sense.

> Still, we 
> don't want to simply reuse qcow2 refcounts as normal reference counters 
> for two reasons:
> 
> 1. qcow2 refcounts is part of qcow2 metadata and strictly defined. Qcow2 
> refcounts shows only usage of the host cluster by other objects in qcow2 
> metadata. But what we want is usage by the process of writing data.

Indeed.

> 2. we don't want to do additional read/write of refcounts from/to 
> physical drive

Right.

> So, what comes to my mind is and additional "dynamic" in-RAM reference 
> counter for host cluster. So that actual full reference count of host 
> cluster is qcow2-refcount + inflight-write-cnt.

OK, yes, that makes a lot of sense.

> So, what we need for this:
> 
> 1. calculate these new inflight-write-cnt, increase it before data write 
> and decrease after.
> 
> 2. update the core thing of reference counting: freeing the object.

Makes sense, but I don’t see how your change to update_refcount() does 
that, though.  But OTOH I do see how you prevent use-after-frees (by 
preventing allocations  of clusters with refcount == 0 but with active 
writes on them), which works just as well.

And if freeing the object (clusters) were postponed until all writes are 
settled, I don’t think we’d need the checks in the cluster allocation 
functions.

> Currently the object is "freed" when qcow2-refcoutn becomes 0, that's 
> the code in update_recounts(). But we want to postpone freeing until 
> full reference count becomes zero, i.e. both qcow2-refcount is zero and 
> inflight-write-cnt is zero. So, if qcow2-refcount becomes zero but 
> inflight-write-cnt is nonzero, we postpone discarding the cluster, 
> storing needed information in Qcow2InFlightRefcount. And when 
> inflight-write-cnt becomes zero (and we see this information) we call 
> update_refcount(addend=0) to trigger actual discarding of the cluster.
> 
> 3. we should consider full reference count when search for free host 
> clusters.
> 
>>
>>
>> Side note: Intuitively, this hash table feels like an unnecessarily 
>> big hammer to me, but I can’t think of a better solution either, so.  
>> (I mainly don’t like how every write request will result in one 
>> allocation and hash table entry per cluster it touches, even when no 
>> data cluster is ever discarded.)
> 
> We can have a list of inflight-data-writes instead.

Yes.  I’m not sure which would actually be better in the end.

I feel such a list would be better for the normal case (just writes, no 
allocations or discards), but I don’t know how much worse they’d be when 
something needs to be looked up (i.e. on allocations or discards).

> I afraid, we can't 
> reuse tracked requests of underlying .file node, because we should add 
> inflight-write to the list before releasing the qcow2 mutex.

Yes.

>>> Enable qcow2-discard-during-rewrite as it is fixed.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/qcow2.h                                 |   9 ++
>>>   block/qcow2-refcount.c                        | 149 +++++++++++++++++-
>>>   block/qcow2.c                                 |  26 ++-
>>>   .../tests/qcow2-discard-during-rewrite        |   2 +-
>>>   4 files changed, 181 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/block/qcow2.h b/block/qcow2.h
>>> index 0678073b74..e18d8dfbae 100644
>>> --- a/block/qcow2.h
>>> +++ b/block/qcow2.h
>>> @@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
>>>        * is to convert the image with the desired compression type set.
>>>        */
>>>       Qcow2CompressionType compression_type;
>>> +
>>> +    GHashTable *inflight_writes_counters;
>>>   } BDRVQcow2State;
>>>   typedef struct Qcow2COWRegion {
>>> @@ -896,6 +898,13 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
>>>   int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
>>>   int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
>>> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
>>> +                              int64_t length);
>>> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
>>> +                              int64_t length);
>>> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t 
>>> offset,
>>> +                                     int64_t length);
>>> +
>>>   /* qcow2-cluster.c functions */
>>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>>                           bool exact_size);
>>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>>> index 8e649b008e..464d133368 100644
>>> --- a/block/qcow2-refcount.c
>>> +++ b/block/qcow2-refcount.c
>>> @@ -799,6 +799,140 @@ found:
>>>       }
>>>   }
>>> +/*
>>> + * Qcow2InFlightRefcount is a type for values of 
>>> s->inflight_writes_counters
>>> + * hasm map. And it's keys are cluster indices.
>>
>> *hash, *its
>>
>> I’d rather document this for s->inflight_writes_counters, though (like 
>> “/* cluster index (int64_t) -> Qcow2InFlightRefcount */”)
>>
>>> + */
>>> +typedef struct Qcow2InFlightRefcount {
>>> +    /*
>>> +     * Number of in-flight writes to the cluster, always > 0, as 
>>> when becomes
>>
>> s/when becomes/when it becomes/
>>
>>> +     * 0 the entry is removed from s->inflight_writes_counters.
>>> +     */
>>> +    uint64_t inflight_writes_cnt;
>>> +
>>> +    /* Cluster refcount is known to be zero */
>>> +    bool refcount_zero;
>>> +
>>> +    /* Cluster refcount was made zero with this discard-type */
>>> +    enum qcow2_discard_type type;
>>> +} Qcow2InFlightRefcount;
>>> +
>>> +static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
>>> +                                           int64_t cluster_index)
>>> +{
>>> +    Qcow2InFlightRefcount *infl;
>>> +
>>> +    if (!s->inflight_writes_counters) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    infl = g_hash_table_lookup(s->inflight_writes_counters, 
>>> &cluster_index);
>>> +
>>> +    if (infl) {
>>> +        assert(infl->inflight_writes_cnt > 0);
>>> +    }
>>> +
>>> +    return infl;
>>> +}
>>> +
>>> +/*
>>> + * Returns true if there are any in-flight writes to the cluster 
>>> blocking
>>> + * its reallocation.
>>> + */
>>> +static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
>>> +{
>>> +    return !!find_infl_wr(s, cluster_index);
>>
>> I wonder if g_hash_table_contains() would be quicker. *shrug*
>>
>>> +}
>>> +
>>> +static int update_inflight_write_cnt(BlockDriverState *bs,
>>> +                                     int64_t offset, int64_t length,
>>> +                                     bool decrease, bool locked)
>>> +{
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    int64_t start, last, cluster_offset;
>>> +
>>> +    if (locked) {
>>> +        qemu_co_mutex_assert_locked(&s->lock);
>>> +    }
>>> +
>>> +    start = start_of_cluster(s, offset);
>>> +    last = start_of_cluster(s, offset + length - 1);
>>> +    for (cluster_offset = start; cluster_offset <= last;
>>> +         cluster_offset += s->cluster_size)
>>> +    {
>>> +        int64_t cluster_index = cluster_offset >> s->cluster_bits;
>>> +        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>> +
>>> +        if (!infl) {
>>> +            infl = g_new0(Qcow2InFlightRefcount, 1);
>>> +            g_hash_table_insert(s->inflight_writes_counters,
>>> +                                g_memdup(&cluster_index, 
>>> sizeof(cluster_index)),
>>> +                                infl);
>>
>> I suppose we could save one allocation by putting the cluster index 
>> into Qcow2InFlightRefcount and then giving the key as 
>> &infl->cluster_index. Not sure if that’s too nasty, though.
> 
> Allocations are a lot faster than IO.. So, I don't think we'll see the 
> difference in normal cases.

Sure.

>>> +        }
>>> +
>>> +        if (decrease) {
>>> +            assert(infl->inflight_writes_cnt >= 1);
>>> +            infl->inflight_writes_cnt--;
>>> +        } else {
>>> +            infl->inflight_writes_cnt++;
>>> +        }
>>> +
>>> +        if (infl->inflight_writes_cnt == 0) {
>>> +            bool refcount_zero = infl->refcount_zero;
>>> +            enum qcow2_discard_type type = infl->type;
>>> +
>>> +            g_hash_table_remove(s->inflight_writes_counters, 
>>> &cluster_index);
>>> +
>>> +            if (refcount_zero) {
>>> +                /*
>>> +                 * Slow path. We must reset normal refcount to 
>>> actually release
>>
>> I don’t understand what “slow path” means in this context.  (Nor do I 
>> really understand the rest, but more on that below.)
> 
> 
> In most cases inc/dev of inflight-writes-cnt is fast: it only update 
> in-memory structure. But when full reference count of the cluster 
> becomes zero we call real update_refcount to trigger discarding of the 
> cluster.. This may be slower. Probably the needed code should be moved 
> from update_refcount to separate function like host_cluster_free(), to 
> not cheat with addend=0.

OK.

>>> +                 * the cluster.
>>> +                 */
>>> +                int ret;
>>> +
>>> +                if (!locked) {
>>> +                    qemu_co_mutex_lock(&s->lock);
>>> +                }
>>> +                ret = qcow2_update_cluster_refcount(bs, 
>>> cluster_index, 0,
>>> +                                                    true, type);
>>
>> I don’t understand this, but then again I’m writing this very early in 
>> my review, and at this point I don’t understand anything, really.  
>> (The comment doesn’t explain to me what’s happening here.)
>>
>> Revisiting later, refcount_zero is set by update_refcount() when the 
>> refcount drops to zero, so invoking this function here will finalize 
>> the change.  I don’t quite understand what’s going on in 
>> update_refcount(), though.
>>
>> (And even after finding out why, I don’t understand why the comment 
>> doesn’t explain why we must invoke qcow2_update_cluster_refcount() 
>> with addend=0.)
> 
> Sorry for the mess. Here we want to trigger the code in 
> update_refcount() that is freeing host cluster.. I.e. the code that runs 
> when refcount becomes zero.
> 
>>
>>> +                if (!locked) {
>>> +                    qemu_co_mutex_unlock(&s->lock);
>>> +                }
>>> +
>>> +                if (ret < 0) {
>>> +                    return ret;
>>> +                }
>>> +            }
>>> +        }
>>> +
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
>>> +                              int64_t length)
>>> +{
>>> +    return update_inflight_write_cnt(bs, offset, length, false, false);
>>> +}
>>> +
>>> +/*
>>> + * Called with s->lock not locked by caller. Will take s->lock only 
>>> if need to
>>> + * release the cluster (refcount is 0 and inflight-write-cnt becomes 
>>> zero).
>>> + */
>>
>> Why doesn’t qcow2_inflight_writes_inc() get the same comment?
>>
>> ...Oh, because @locked doesn’t have an influence there.  Why isn’t it 
>> worth a comment that *_inc() works both with a locked and an unlocked 
>> mutex?
> 
> OK, it worth
> 
>>
>>> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
>>> +                              int64_t length)
>>> +{
>>> +    return update_inflight_write_cnt(bs, offset, length, true, false);
>>> +}
>>> +
>>> +/* Called with s->lock locked. */
>>> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t 
>>> offset,
>>> +                                     int64_t length)
>>> +{
>>> +    return update_inflight_write_cnt(bs, offset, length, true, true);
>>> +}
>>> +
>>>   /* XXX: cache several refcount block clusters ? */
>>>   /* @addend is the absolute value of the addend; if @decrease is 
>>> set, @addend
>>>    * will be subtracted from the current refcount, otherwise it will 
>>> be added */
>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
>>> update_refcount(BlockDriverState *bs,
>>>           if (refcount == 0) {
>>>               void *table;
>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>> cluster_index);
>>> +
>>> +            if (infl) {
>>> +                infl->refcount_zero = true;
>>> +                infl->type = type;
>>> +                continue;
>>> +            }
>>
>> I don’t understand what this is supposed to do exactly.  It seems like 
>> it wants to keep metadata structures in the cache that are still in 
>> use (because dropping them from the caches is what happens next), but 
>> users of metadata structures won’t set in-flight counters for those 
>> metadata structures, will they?
> 
> Don't follow.
> 
> We want the code in "if (refcount == 0)" to be triggered only when full 
> reference count of the host cluster becomes 0, including 
> inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we 
> postpone freeing the host cluster, it will be done later from "slow 
> path" in update_inflight_write_cnt().

But the code under “if (refcount == 0)” doesn’t free anything, does it? 
  All I can see is code to remove metadata structures from the metadata 
caches (if the discarded cluster was an L2 table or a refblock), and 
finally the discard on the underlying file.  I don’t see how that 
protocol-level discard has anything to do with our problem, though.

As far as I understand, the freeing happens immediately above the “if 
(refcount == 0)” block by s->set_refcount() setting the refcount to 0. 
(including updating s->free_cluster_index if the refcount is 0).

>> (I also wondered why the continue wasn’t put before the 
>> s->set_refcount() call, but I suppose the same effect is had by the 
>> has_infl_wr() check in alloc_clusters_noref() and 
>> qcow2_alloc_clusters_at().)
> 
> The only change of update_refcount() logic is postponing of freeing the 
> cluster. So, handling of qcow2-refacount is kept as is. qcow2-refcount 
> becomes zero, and s->set_refcount() is called. inflight-write-cnt is a 
> separate thing, not the cache for qcow2-refcount.
> 
>>
>>>               table = 
>>> qcow2_cache_is_table_offset(s->refcount_block_cache,
>>>                                                   offset);
>>> @@ -983,7 +1124,7 @@ retry:
>>>           if (ret < 0) {
>>>               return ret;
>>> -        } else if (refcount != 0) {
>>> +        } else if (refcount != 0 || has_infl_wr(s, 
>>> next_cluster_index)) {
>>>               goto retry;
>>>           }
>>>       }
>>> @@ -1046,7 +1187,7 @@ int64_t 
>>> qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
>>>               ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
>>>               if (ret < 0) {
>>>                   return ret;
>>> -            } else if (refcount != 0) {
>>> +            } else if (refcount != 0 || has_infl_wr(s, 
>>> cluster_index)) {
>>>                   break;
>>>               }
>>>           }
>>> @@ -2294,7 +2435,9 @@ static int64_t 
>>> alloc_clusters_imrt(BlockDriverState *bs,
>>>            contiguous_free_clusters < cluster_count;
>>>            cluster++)
>>>       {
>>> -        if (!s->get_refcount(*refcount_table, cluster)) {
>>> +        if (!s->get_refcount(*refcount_table, cluster) &&
>>> +            !has_infl_wr(s, cluster))
>>
>> Is this really necessary?
> 
> Yes. Everywhere when we want free cluster, we should check that full 
> reference count is zero, which is qcow2-refcount + inflight-write-cnt.. 
> Which is equal to check that both qcow2-refcount and inflight-write-cnt 
> are zero. And for for zero inflight-write-cnt it must just be absent in 
> the hash.

I understand the sentiment, but I don’t think that reasoning applies 
here.  The in-memory refcount table (IMRT) used for qemu-img check will 
not be updated on discards anyway.  If there were any I/O to the image 
concurrently to check, it would completely ignore the IMRT, so the check 
feels hypocritical.

>>> +        {
>>>               contiguous_free_clusters++;
>>>               if (first_gap) {
>>>                   /* If this is the first free cluster found, update
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index d9f49a52e7..6ee94421e0 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>
>> [...]
>>
>>> @@ -4536,13 +4553,20 @@ 
>>> qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>>>       }
>>>       ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, 
>>> out_len, true);
>>> -    qemu_co_mutex_unlock(&s->lock);
>>>       if (ret < 0) {
>>> +        qemu_co_mutex_unlock(&s->lock);
>>>           goto fail;
>>>       }
>>> +    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
>>
>> It was my impression that this function could be called either with or 
>> without a lock, that it wouldn’t really matter.  Well, at least that’s 
>> what I figured out for lack of a comment regarding the contract how it 
>> is to be invoked.
>>
> 
> I'm sorry:(

Oh, geez, now I feel bad, too.  Thanks a lot for your explanations, it 
all makes much more sense to me now! :)

Max



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12 11:17       ` Max Reitz
@ 2021-03-12 12:32         ` Vladimir Sementsov-Ogievskiy
  2021-03-12 12:42           ` Vladimir Sementsov-Ogievskiy
                             ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-12 12:32 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

12.03.2021 14:17, Max Reitz wrote:
> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>> 11.03.2021 22:58, Max Reitz wrote:
>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>> becomes 0) and reused during data write. In this case data write may

[..]

>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>           if (refcount == 0) {
>>>>               void *table;
>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>> +
>>>> +            if (infl) {
>>>> +                infl->refcount_zero = true;
>>>> +                infl->type = type;
>>>> +                continue;
>>>> +            }
>>>
>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>
>> Don't follow.
>>
>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
> 
> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.

Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.

On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt

> 
> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).

Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.

And about s->set_refcount(): it only update a refcount itself, and don't free anything.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12 12:32         ` Vladimir Sementsov-Ogievskiy
@ 2021-03-12 12:42           ` Vladimir Sementsov-Ogievskiy
  2021-03-12 15:01             ` Max Reitz
  2021-03-12 12:46           ` Vladimir Sementsov-Ogievskiy
  2021-03-12 14:58           ` Max Reitz
  2 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-12 12:42 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 14:17, Max Reitz wrote:
>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>> 11.03.2021 22:58, Max Reitz wrote:
>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>> becomes 0) and reused during data write. In this case data write may
> 
> [..]
> 
>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>>           if (refcount == 0) {
>>>>>               void *table;
>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>> +
>>>>> +            if (infl) {
>>>>> +                infl->refcount_zero = true;
>>>>> +                infl->type = type;
>>>>> +                continue;
>>>>> +            }
>>>>
>>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>>
>>> Don't follow.
>>>
>>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
>>
>> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.
> 
> Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.
> 
> On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt
> 
>>
>> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).
> 
> Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.
> 
> And about s->set_refcount(): it only update a refcount itself, and don't free anything.
> 
> 


Looking now at this place:


         if (refcount == 0 && cluster_index < s->free_cluster_index) {
             s->free_cluster_index = cluster_index;
         }
         s->set_refcount(refcount_block, block_index, refcount);
                                                                                  
         if (refcount == 0) {
             void *table;
             Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
                                                                                  
             if (infl) {
                 infl->refcount_zero = true;
                 infl->type = type;
                 continue;
             }
                                                                                  
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                 offset);
             if (table != NULL) {
                 qcow2_cache_put(s->refcount_block_cache, &refcount_block);
                 old_table_index = -1;
                 qcow2_cache_discard(s->refcount_block_cache, table);
             }
                                                                                  
             table = qcow2_cache_is_table_offset(s->l2_table_cache, offset);
             if (table != NULL) {
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
                                                                                  
             if (s->compressed_cache) {
                 seqcache_discard_cluster(s->compressed_cache, cluster_offset);
             }
                                                                                  
             if (s->discard_passthrough[type]) {
                 update_refcount_discard(bs, cluster_offset, s->cluster_size);
             }
         }


Hmm. Is it OK that we use "offset" to discard qcow2 metadata caches? offset is the start of the whole loop, and is not updated on iterations. Isn't it more correct to use cluster_offset here? Or we are sure that refcount and l2 metadata is always discarded by exactly one cluster? Than we are OK, but still worth an assertion that offset == cluster_offset.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12 12:32         ` Vladimir Sementsov-Ogievskiy
  2021-03-12 12:42           ` Vladimir Sementsov-Ogievskiy
@ 2021-03-12 12:46           ` Vladimir Sementsov-Ogievskiy
  2021-03-12 15:10             ` Max Reitz
  2021-03-12 14:58           ` Max Reitz
  2 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-12 12:46 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 14:17, Max Reitz wrote:
>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>> 11.03.2021 22:58, Max Reitz wrote:
>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>> becomes 0) and reused during data write. In this case data write may
> 
> [..]
> 
>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>>           if (refcount == 0) {
>>>>>               void *table;
>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>> +
>>>>> +            if (infl) {
>>>>> +                infl->refcount_zero = true;
>>>>> +                infl->type = type;
>>>>> +                continue;
>>>>> +            }
>>>>
>>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>>
>>> Don't follow.
>>>
>>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
>>
>> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.
> 
> Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.
> 
> On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt
> 
>>
>> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).
> 
> Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.
> 
> And about s->set_refcount(): it only update a refcount itself, and don't free anything.
> 
> 

So, it is more correct like this:

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..1da282446d 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
          } else {
              refcount += addend;
          }
-        if (refcount == 0 && cluster_index < s->free_cluster_index) {
-            s->free_cluster_index = cluster_index;
-        }
          s->set_refcount(refcount_block, block_index, refcount);
  
          if (refcount == 0) {
              void *table;
              Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
  
-            if (infl) {
-                infl->refcount_zero = true;
-                infl->type = type;
-                continue;
-            }
-
              table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                  offset);
              if (table != NULL) {
@@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                  qcow2_cache_discard(s->l2_table_cache, table);
              }
  
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }
+
+            if (cluster_index < s->free_cluster_index) {
+                s->free_cluster_index = cluster_index;
+            }
+
              if (s->discard_passthrough[type]) {
                  update_refcount_discard(bs, cluster_offset, s->cluster_size);
              }



-- 
Best regards,
Vladimir


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/6] util: implement seqcache
  2021-03-05 17:35 ` [PATCH v3 4/6] util: implement seqcache Vladimir Sementsov-Ogievskiy
@ 2021-03-12 13:41   ` Max Reitz
  2021-03-12 14:37     ` Vladimir Sementsov-Ogievskiy
  2021-06-04 14:31   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 35+ messages in thread
From: Max Reitz @ 2021-03-12 13:41 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
> Implement cache for small sequential unaligned writes, so that they may
> be cached until we get a complete cluster and then write it.
> 
> The cache is intended to be used for backup to qcow2 compressed target
> opened in O_DIRECT mode, but can be reused for any similar (even not
> block-layer related) task.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   include/qemu/seqcache.h |  42 +++++
>   util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
>   MAINTAINERS             |   6 +
>   util/meson.build        |   1 +
>   4 files changed, 410 insertions(+)
>   create mode 100644 include/qemu/seqcache.h
>   create mode 100644 util/seqcache.c

Looks quite good to me, thanks.  Nice explanations, too. :)

The only design question I have is whether there’s a reason you’re using 
a list again instead of a hash table.  I suppose we do need the list 
anyway because of the next_flush iterator, so using a hash table would 
only complicate the implementation, but still.

[...]

> diff --git a/util/seqcache.c b/util/seqcache.c
> new file mode 100644
> index 0000000000..d923560eab
> --- /dev/null
> +++ b/util/seqcache.c
> @@ -0,0 +1,361 @@
> +/*
> + * Cache for small sequential write requests.
> + *
> + * Copyright (c) 2021 Virtuozzo International GmbH.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + *
> + *
> + * = Description =
> + *
> + * SeqCache is an abbreviation for Sequential Cache.
> + *
> + * The Cache is intended to improve performance of small unaligned sequential
> + * writes. Cache has a cluster_size parameter and the unit of caching is aligned
> + * cluster.  Cache has a list of cached clusters, several "finished" ones and at
> + * most one "unfinished". "unfinished" cluster is a cluster where last byte of
> + * last write operation is cached and there is a free place after that byte to
> + * the end of cluster. "finished" clusters are just stored in cache to be read
> + * or flushed and don't allow any writes to them.
> + *
> + * If write to the cache intersects cluster bounds, it's splat into several

*split

(Though I like “splat”.  Sounds like a wet blotch.)

> + * requests by cluster bounds. So, consider a write that doesn't intersect
> + * cluster bounds to describe the whole picture:
> + *
> + * There are two cases allowed:
> + *
> + * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential
> + *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
> + *    filled up to the cluster bound it becomes "finished".
> + *
> + * 2. Write to new cluster, not existing in the cache. It may be sequential to
> + *    previous write or not. Current "unfinshed" cluster (if exists) becomes

*unfinished

> + *    "finished" and new "unfinished" cluster is started. Note also that offset
> + *    of the write to new cluster is not required to be aligned.
> + *
> + *    Any other write operation (non-sequential write to "unfinished" cluster
> + *    or write to any of "finished" clusters) will crash.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qemu/queue.h"
> +#include "qemu/seqcache.h"
> +
> +/*
> + * Cluster
> + *
> + * Representation of one cached cluster, aligned to SeqCache::cluster_size.
> + * Caches only one subregion of the cluster, started at @offset (may be
> + * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
> + * The whole subregion always lay in one aligned cluster:
> + *
> + *      QEMU_ALIGN_DOWN(offset, cluster_size) ==
> + *          QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
> + *
> + * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
> + * allocated @buf length is at least:
> + *
> + *      cluster_size - offset % cluster_size
> + */
> +typedef struct Cluster {
> +   int64_t offset;
> +   int64_t bytes;
> +   uint8_t *buf;
> +   QSIMPLEQ_ENTRY(Cluster) entry;
> +} Cluster;
> +
> +/*
> + * SeqCache caches small sequential writes into "unfinished" @cur_write cluster,
> + * until entire cluster (of @cluster_size bytes) is filled by seqcache_write()
> + * calls.
> + *
> + * @cur_write->offset may be unaligned to @cluster_size if first write offset is
> + * not aligned (for example, if there was a flush request and cache was flushed,
> + * then we continue from the middle of the cluster with an empty cache).
> + *
> + * @cur_write may be NULL, which means we don't have current cluster and next
> + * seqcache_write() will start a new one.
> + *
> + * @all is a list of all clusters cached in the cache, some "finished" and one
> + * "unfinished" @cur_write (if exists). If @cur_write is not NULL it is a last
> + * one in the list.
> + *
> + * @nb_clusters is number of elements in @all list.
> + *
> + * @next_flush is an iterator for flushing. SeqCache knows nothing about how
> + * data should be flushing, so for flush user calls seqcache_get_next_flush()

s/flushing/flushed/

> + * (which moves @next_flush iterator) and when data is flushed somehow and cache
> + * is not needed anymore, user can call seqcache_discard_cluster().

AFAIU, next_flush always points to the first finished cluster that has 
not yet been returned by seqcache_get_next_flush(), is that correct? 
(Yes, at least the latter part is implied by this comment, I’m just 
asking for clarity, because I want to be sure the simple

   s->next_flush = QSIMPLEQ_NEXT(s->next_flush, entry);

in seqcache_get_next_flush() does what I think it does, which is never 
to let s->next_flush be NULL even though there are still flushable 
clusters somewhere.)

> + */
> +typedef struct SeqCache {
> +    int64_t cluster_size;
> +    int64_t nb_clusters;
> +    Cluster *cur_write;
> +    Cluster *next_flush;
> +    QSIMPLEQ_HEAD(, Cluster) all;
> +} SeqCache;

[...]

> +/* Align down @offset to s->cluster_size and search for corresponding cluster */
> +static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
> +{
> +    Cluster *cl;
> +    int64_t cl_start = cluster_start(s, offset);
> +
> +    QSIMPLEQ_FOREACH(cl, &s->all, entry) {
> +        if (cluster_start(s, cl->offset) == cl_start) {
> +            return cl;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +/* Makes current "unfinished" cluster the "finished" one. */

This sounds a bit like there’d be only a single finished cluster, so I’d 
rather write it as “Mark the current "unfinished" cluster as "finished".”

> +static void seqcache_finalize_current_cluster(SeqCache *s)
> +{
> +    if (s->cur_write && !s->next_flush) {
> +        s->next_flush = s->cur_write;
> +    }
> +    s->cur_write = NULL;
> +}
> +
> +/*
> + * Write an @offset, @bytes, @buf request into the cache. The requests MUST not

s/requests/request/

> + * intersect cluster bounds.
> + */
> +static void seqcache_write_one(SeqCache *s, int64_t offset, int64_t bytes,
> +                               uint8_t *buf)

Could use a const, though not a must.

> +{
> +    assert(bytes > 0);
> +    assert(cluster_start(s, offset) == cluster_start(s, offset + bytes - 1));
> +
> +    if (s->cur_write && offset == cached_end(s->cur_write)) {
> +        /* Continue sequential process */
> +        memcpy(s->cur_write->buf + s->cur_write->bytes, buf, bytes);
> +        s->cur_write->bytes += bytes;
> +
> +        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
> +            seqcache_finalize_current_cluster(s);
> +        }
> +
> +        return;
> +    }
> +
> +    /* We are starting a new cluster. Check that it's really new */
> +    assert(!seqcache_find_cluster(s, offset));
> +
> +    seqcache_finalize_current_cluster(s);
> +
> +    s->cur_write = g_new(Cluster, 1);
> +    *s->cur_write = (Cluster) {
> +        .offset = offset,
> +        .bytes = bytes,
> +        .buf = g_malloc(s->cluster_size),

I have to ask: Why not s->cluster_size - offset % s->cluster_size as 
advertised by the comment describing Cluster?

More practical question: Should we use qemu_memalign() (aligning either 
at the cluster size or at the block alignment, which would need to be 
passed to seqcache_new()) when offset is aligned to the cluster size? 
(Or with a custom alignment, if it is aligned to that.)

I feel that for O_DIRECT images it might be kind of important to align 
the buffer to the host block size.

> +    };
> +
> +    memcpy(s->cur_write->buf, buf, bytes);
> +    QSIMPLEQ_INSERT_TAIL(&s->all, s->cur_write, entry);
> +    s->nb_clusters++;
> +}
> +
> +/* Write an @offset, @bytes, @buf request into the cache. */
> +void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)

“const” might again find its place here.

> +{
> +    while (bytes) {
> +        int64_t cl_end = cluster_end(s, offset);
> +        int64_t chunk = MIN(bytes, cl_end - offset);
> +
> +        seqcache_write_one(s, offset, chunk, buf);
> +        offset += chunk;
> +        bytes -= chunk;
> +        buf += chunk;
> +    }
> +}

[...]

> +/*
> + * Get next region for flushing. @offset, @bytes and @buf are out-parameters
> + * to return the region.
> + *
> + * @unfinished is in-out argument which means that user is interested in
> + * flushing unfinished cluster too:
> + *
> + * If there are "finished" clusters, "finished" cluster is returned and
> + * *@unfinished is set to false, independently of its original value.
> + *
> + * If there are no "finished" clusters but "unfinished" exists (i.e.
> + * s->cur_write != NULL and it is the only element of s->all), then *@unfinished
> + * value remains the same and the following logic works:
> + *
> + *    If *@unfinished:
> + *       return s->cur_write unfinished cluster for flushing
> + *    Else
> + *       return nothing
> + *
> + *
> + * Returns true and set @offset, @bytes, @buf and @unfinished if there is
> + * something to flush (accordingly to @unfinished value), returns false
> + * otherwise.
> + *
> + * Nothing is removed from the cache.

Out of curiosity, mainly, is that because the returned *buf is only 
valid as long as the entry is still in the cache, or is there a 
different reason that I’m missing?

(Hm, perhaps the fact that the user may want to keep it available for 
reading through seqcache_read()?)

> + */
> +bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
> +                             uint8_t **buf, bool *unfinished)

Could be “uint8_t *const *buf”, I suppose.  Don’t know how much the 
callers would hate that, though.

> +{
> +    Cluster *req = s->next_flush;
> +
> +    if (s->next_flush) {
> +        *unfinished = false;
> +        req = s->next_flush;
> +        s->next_flush = QSIMPLEQ_NEXT(req, entry);
> +        if (s->next_flush == s->cur_write) {
> +            s->next_flush = NULL;
> +        }
> +    } else if (s->cur_write && *unfinished) {
> +        req = s->cur_write;

I was wondering whether flushing an unfinished cluster wouldn’t kind of 
finalize it, but I suppose the problem with that would be that you can’t 
add data to a finished cluster, which wouldn’t be that great if you’re 
just flushing the cache without wanting to drop it all.

(The problem I see is that flushing it later will mean all the data that 
already has been written here will have to be rewritten.  Not that bad, 
I suppose.)

> +    } else {
> +        return false;
> +    }
> +
> +    *offset = req->offset;
> +    *bytes = req->bytes;
> +    *buf = req->buf;
> +
> +    return true;
> +}
> +
> +/*
> + * Find corresponding cluster and drop it. No matter does requested @offset is
> + * cached itself or not.

The second sentence sounds strange grammatically, if I understand 
correctly, I’d change this to something like “Find the cluster 
corresponding to @offset and drop it.  It does not matter whether 
@offset itself is actually within that cluster’s cached range or not.”

Max

> + */



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/6] util: implement seqcache
  2021-03-12 13:41   ` Max Reitz
@ 2021-03-12 14:37     ` Vladimir Sementsov-Ogievskiy
  2021-03-12 15:13       ` Max Reitz
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-12 14:37 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

12.03.2021 16:41, Max Reitz wrote:
> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>> Implement cache for small sequential unaligned writes, so that they may
>> be cached until we get a complete cluster and then write it.
>>
>> The cache is intended to be used for backup to qcow2 compressed target
>> opened in O_DIRECT mode, but can be reused for any similar (even not
>> block-layer related) task.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   include/qemu/seqcache.h |  42 +++++
>>   util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
>>   MAINTAINERS             |   6 +
>>   util/meson.build        |   1 +
>>   4 files changed, 410 insertions(+)
>>   create mode 100644 include/qemu/seqcache.h
>>   create mode 100644 util/seqcache.c
> 
> Looks quite good to me, thanks.  Nice explanations, too. :)
> 
> The only design question I have is whether there’s a reason you’re using a list again instead of a hash table.  I suppose we do need the list anyway because of the next_flush iterator, so using a hash table would only complicate the implementation, but still.

Yes, it seems correct for flush iterator go in same order as writes comes, so we need a list. We can add a hash table, it will only help on read.. But for compressed cache in qcow2 we try to flush often enough, so there should not be many clusters in the cache. So I think addition of hash table may be done later if needed.

> 
> [...]
> 
>> diff --git a/util/seqcache.c b/util/seqcache.c
>> new file mode 100644
>> index 0000000000..d923560eab
>> --- /dev/null
>> +++ b/util/seqcache.c
>> @@ -0,0 +1,361 @@
>> +/*
>> + * Cache for small sequential write requests.
>> + *
>> + * Copyright (c) 2021 Virtuozzo International GmbH.
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>> + * THE SOFTWARE.
>> + *
>> + *
>> + * = Description =
>> + *
>> + * SeqCache is an abbreviation for Sequential Cache.
>> + *
>> + * The Cache is intended to improve performance of small unaligned sequential
>> + * writes. Cache has a cluster_size parameter and the unit of caching is aligned
>> + * cluster.  Cache has a list of cached clusters, several "finished" ones and at
>> + * most one "unfinished". "unfinished" cluster is a cluster where last byte of
>> + * last write operation is cached and there is a free place after that byte to
>> + * the end of cluster. "finished" clusters are just stored in cache to be read
>> + * or flushed and don't allow any writes to them.
>> + *
>> + * If write to the cache intersects cluster bounds, it's splat into several
> 
> *split
> 
> (Though I like “splat”.  Sounds like a wet blotch.)
> 
>> + * requests by cluster bounds. So, consider a write that doesn't intersect
>> + * cluster bounds to describe the whole picture:
>> + *
>> + * There are two cases allowed:
>> + *
>> + * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential
>> + *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
>> + *    filled up to the cluster bound it becomes "finished".
>> + *
>> + * 2. Write to new cluster, not existing in the cache. It may be sequential to
>> + *    previous write or not. Current "unfinshed" cluster (if exists) becomes
> 
> *unfinished
> 
>> + *    "finished" and new "unfinished" cluster is started. Note also that offset
>> + *    of the write to new cluster is not required to be aligned.
>> + *
>> + *    Any other write operation (non-sequential write to "unfinished" cluster
>> + *    or write to any of "finished" clusters) will crash.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +
>> +#include "qemu/queue.h"
>> +#include "qemu/seqcache.h"
>> +
>> +/*
>> + * Cluster
>> + *
>> + * Representation of one cached cluster, aligned to SeqCache::cluster_size.
>> + * Caches only one subregion of the cluster, started at @offset (may be
>> + * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
>> + * The whole subregion always lay in one aligned cluster:
>> + *
>> + *      QEMU_ALIGN_DOWN(offset, cluster_size) ==
>> + *          QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
>> + *
>> + * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
>> + * allocated @buf length is at least:
>> + *
>> + *      cluster_size - offset % cluster_size
>> + */
>> +typedef struct Cluster {
>> +   int64_t offset;
>> +   int64_t bytes;
>> +   uint8_t *buf;
>> +   QSIMPLEQ_ENTRY(Cluster) entry;
>> +} Cluster;
>> +
>> +/*
>> + * SeqCache caches small sequential writes into "unfinished" @cur_write cluster,
>> + * until entire cluster (of @cluster_size bytes) is filled by seqcache_write()
>> + * calls.
>> + *
>> + * @cur_write->offset may be unaligned to @cluster_size if first write offset is
>> + * not aligned (for example, if there was a flush request and cache was flushed,
>> + * then we continue from the middle of the cluster with an empty cache).
>> + *
>> + * @cur_write may be NULL, which means we don't have current cluster and next
>> + * seqcache_write() will start a new one.
>> + *
>> + * @all is a list of all clusters cached in the cache, some "finished" and one
>> + * "unfinished" @cur_write (if exists). If @cur_write is not NULL it is a last
>> + * one in the list.
>> + *
>> + * @nb_clusters is number of elements in @all list.
>> + *
>> + * @next_flush is an iterator for flushing. SeqCache knows nothing about how
>> + * data should be flushing, so for flush user calls seqcache_get_next_flush()
> 
> s/flushing/flushed/
> 
>> + * (which moves @next_flush iterator) and when data is flushed somehow and cache
>> + * is not needed anymore, user can call seqcache_discard_cluster().
> 
> AFAIU, next_flush always points to the first finished cluster that has not yet been returned by seqcache_get_next_flush(), is that correct?

Yes, right.

  (Yes, at least the latter part is implied by this comment, I’m just asking for clarity, because I want to be sure the simple
> 
>    s->next_flush = QSIMPLEQ_NEXT(s->next_flush, entry);
> 
> in seqcache_get_next_flush() does what I think it does, which is never to let s->next_flush be NULL even though there are still flushable clusters somewhere.)
> 
>> + */
>> +typedef struct SeqCache {
>> +    int64_t cluster_size;
>> +    int64_t nb_clusters;
>> +    Cluster *cur_write;
>> +    Cluster *next_flush;
>> +    QSIMPLEQ_HEAD(, Cluster) all;
>> +} SeqCache;
> 
> [...]
> 
>> +/* Align down @offset to s->cluster_size and search for corresponding cluster */
>> +static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
>> +{
>> +    Cluster *cl;
>> +    int64_t cl_start = cluster_start(s, offset);
>> +
>> +    QSIMPLEQ_FOREACH(cl, &s->all, entry) {
>> +        if (cluster_start(s, cl->offset) == cl_start) {
>> +            return cl;
>> +        }
>> +    }
>> +
>> +    return NULL;
>> +}
>> +
>> +/* Makes current "unfinished" cluster the "finished" one. */
> 
> This sounds a bit like there’d be only a single finished cluster, so I’d rather write it as “Mark the current "unfinished" cluster as "finished".”
> 
>> +static void seqcache_finalize_current_cluster(SeqCache *s)
>> +{
>> +    if (s->cur_write && !s->next_flush) {
>> +        s->next_flush = s->cur_write;
>> +    }
>> +    s->cur_write = NULL;
>> +}
>> +
>> +/*
>> + * Write an @offset, @bytes, @buf request into the cache. The requests MUST not
> 
> s/requests/request/
> 
>> + * intersect cluster bounds.
>> + */
>> +static void seqcache_write_one(SeqCache *s, int64_t offset, int64_t bytes,
>> +                               uint8_t *buf)
> 
> Could use a const, though not a must.
> 
>> +{
>> +    assert(bytes > 0);
>> +    assert(cluster_start(s, offset) == cluster_start(s, offset + bytes - 1));
>> +
>> +    if (s->cur_write && offset == cached_end(s->cur_write)) {
>> +        /* Continue sequential process */
>> +        memcpy(s->cur_write->buf + s->cur_write->bytes, buf, bytes);
>> +        s->cur_write->bytes += bytes;
>> +
>> +        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
>> +            seqcache_finalize_current_cluster(s);
>> +        }
>> +
>> +        return;
>> +    }
>> +
>> +    /* We are starting a new cluster. Check that it's really new */
>> +    assert(!seqcache_find_cluster(s, offset));
>> +
>> +    seqcache_finalize_current_cluster(s);
>> +
>> +    s->cur_write = g_new(Cluster, 1);
>> +    *s->cur_write = (Cluster) {
>> +        .offset = offset,
>> +        .bytes = bytes,
>> +        .buf = g_malloc(s->cluster_size),
> 
> I have to ask: Why not s->cluster_size - offset % s->cluster_size as advertised by the comment describing Cluster?

I just didn't care, it should be rare case. But for generic code better be precise. I'll update it.

> 
> More practical question: Should we use qemu_memalign() (aligning either at the cluster size or at the block alignment, which would need to be passed to seqcache_new()) when offset is aligned to the cluster size? (Or with a custom alignment, if it is aligned to that.)
> 
> I feel that for O_DIRECT images it might be kind of important to align the buffer to the host block size.

Will do

> 
>> +    };
>> +
>> +    memcpy(s->cur_write->buf, buf, bytes);
>> +    QSIMPLEQ_INSERT_TAIL(&s->all, s->cur_write, entry);
>> +    s->nb_clusters++;
>> +}
>> +
>> +/* Write an @offset, @bytes, @buf request into the cache. */
>> +void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)
> 
> “const” might again find its place here.
> 
>> +{
>> +    while (bytes) {
>> +        int64_t cl_end = cluster_end(s, offset);
>> +        int64_t chunk = MIN(bytes, cl_end - offset);
>> +
>> +        seqcache_write_one(s, offset, chunk, buf);
>> +        offset += chunk;
>> +        bytes -= chunk;
>> +        buf += chunk;
>> +    }
>> +}
> 
> [...]
> 
>> +/*
>> + * Get next region for flushing. @offset, @bytes and @buf are out-parameters
>> + * to return the region.
>> + *
>> + * @unfinished is in-out argument which means that user is interested in
>> + * flushing unfinished cluster too:
>> + *
>> + * If there are "finished" clusters, "finished" cluster is returned and
>> + * *@unfinished is set to false, independently of its original value.
>> + *
>> + * If there are no "finished" clusters but "unfinished" exists (i.e.
>> + * s->cur_write != NULL and it is the only element of s->all), then *@unfinished
>> + * value remains the same and the following logic works:
>> + *
>> + *    If *@unfinished:
>> + *       return s->cur_write unfinished cluster for flushing
>> + *    Else
>> + *       return nothing
>> + *
>> + *
>> + * Returns true and set @offset, @bytes, @buf and @unfinished if there is
>> + * something to flush (accordingly to @unfinished value), returns false
>> + * otherwise.
>> + *
>> + * Nothing is removed from the cache.
> 
> Out of curiosity, mainly, is that because the returned *buf is only valid as long as the entry is still in the cache, or is there a different reason that I’m missing?
> 
> (Hm, perhaps the fact that the user may want to keep it available for reading through seqcache_read()?)

Yes, that's for read. This way reads are not blocked by in-flight cache flush. We just read from cache if data in the cache, and if not read from the image. We don't have to wait for intersecting in-flight write to finish.

> 
>> + */
>> +bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
>> +                             uint8_t **buf, bool *unfinished)
> 
> Could be “uint8_t *const *buf”, I suppose.  Don’t know how much the callers would hate that, though.

Will do. And actually I wrote quite big explanation but missed the fact that caller don't get ownership on buf, it should be mentioned.

> 
>> +{
>> +    Cluster *req = s->next_flush;
>> +
>> +    if (s->next_flush) {
>> +        *unfinished = false;
>> +        req = s->next_flush;
>> +        s->next_flush = QSIMPLEQ_NEXT(req, entry);
>> +        if (s->next_flush == s->cur_write) {
>> +            s->next_flush = NULL;
>> +        }
>> +    } else if (s->cur_write && *unfinished) {
>> +        req = s->cur_write;
> 
> I was wondering whether flushing an unfinished cluster wouldn’t kind of finalize it, but I suppose the problem with that would be that you can’t add data to a finished cluster, which wouldn’t be that great if you’re just flushing the cache without wanting to drop it all.
> 
> (The problem I see is that flushing it later will mean all the data that already has been written here will have to be rewritten.  Not that bad, I suppose.)

Yes that's all correct. Also there is additional strong reason: qcow2 depends on the fact that clusters become "finished" by defined rules: only when it really finished up the the end or when qcow2 starts writing another cluster.

For "finished" clusters with unaligned end we can safely align this end up to some good alignment writing a bit more data than needed. It's safe because tail of the cluster is never used. And we'll perform better with aligned write avoiding RMW.

But when flushing "unfinished" cluster, we should write exactly what we have in the cache, as there may happen parallel write to the same cluster, which will continue the sequential process.

> 
>> +    } else {
>> +        return false;
>> +    }
>> +
>> +    *offset = req->offset;
>> +    *bytes = req->bytes;
>> +    *buf = req->buf;
>> +
>> +    return true;
>> +}
>> +
>> +/*
>> + * Find corresponding cluster and drop it. No matter does requested @offset is
>> + * cached itself or not.
> 
> The second sentence sounds strange grammatically, if I understand correctly, I’d change this to something like “Find the cluster corresponding to @offset and drop it.  It does not matter whether @offset itself is actually within that cluster’s cached range or not.”
> 

Right, thanks for good rewording. I'm afraid, I often use russian language constructions with english words.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12 12:32         ` Vladimir Sementsov-Ogievskiy
  2021-03-12 12:42           ` Vladimir Sementsov-Ogievskiy
  2021-03-12 12:46           ` Vladimir Sementsov-Ogievskiy
@ 2021-03-12 14:58           ` Max Reitz
  2021-03-12 15:39             ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 35+ messages in thread
From: Max Reitz @ 2021-03-12 14:58 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 12.03.21 13:32, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 14:17, Max Reitz wrote:
>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>> 11.03.2021 22:58, Max Reitz wrote:
>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>> becomes 0) and reused during data write. In this case data write may
> 
> [..]
> 
>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
>>>>> update_refcount(BlockDriverState *bs,
>>>>>           if (refcount == 0) {
>>>>>               void *table;
>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>>>> cluster_index);
>>>>> +
>>>>> +            if (infl) {
>>>>> +                infl->refcount_zero = true;
>>>>> +                infl->type = type;
>>>>> +                continue;
>>>>> +            }
>>>>
>>>> I don’t understand what this is supposed to do exactly.  It seems 
>>>> like it wants to keep metadata structures in the cache that are 
>>>> still in use (because dropping them from the caches is what happens 
>>>> next), but users of metadata structures won’t set in-flight counters 
>>>> for those metadata structures, will they?
>>>
>>> Don't follow.
>>>
>>> We want the code in "if (refcount == 0)" to be triggered only when 
>>> full reference count of the host cluster becomes 0, including 
>>> inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, 
>>> we postpone freeing the host cluster, it will be done later from 
>>> "slow path" in update_inflight_write_cnt().
>>
>> But the code under “if (refcount == 0)” doesn’t free anything, does 
>> it?  All I can see is code to remove metadata structures from the 
>> metadata caches (if the discarded cluster was an L2 table or a 
>> refblock), and finally the discard on the underlying file.  I don’t 
>> see how that protocol-level discard has anything to do with our 
>> problem, though.
> 
> Hmm. Still, if we do this discard, and then our in-flight write, we'll 
> have data instead of a hole. Not a big deal, but seems better to 
> postpone discard.
> 
> On the other hand, clearing caches is OK, as its related only to 
> qcow2-refcount, not to inflight-write-cnt
> 
>>
>> As far as I understand, the freeing happens immediately above the “if 
>> (refcount == 0)” block by s->set_refcount() setting the refcount to 0. 
>> (including updating s->free_cluster_index if the refcount is 0).
> 
> Hmm.. And that (setting s->free_cluster_index) what I should actually 
> prevent until total reference count becomes zero.
> 
> And about s->set_refcount(): it only update a refcount itself, and don't 
> free anything.

That is what freeing is, though.  I consider something to be free when 
allocation functions will allocate it.  The allocation functions look at 
the refcount, so once a cluster’s refcount is 0, it is free.

If that isn’t what freeing is, nothing in update_refcount() frees 
anything (when looking at how data clusters are handled).  Passing the 
discard through to the protocol layer isn’t “freeing”, because it’s 
independent of qcow2.

Now, your patch adds an additional check to the allocation functions 
(whether there are ongoing writes on the cluster), so it’s indeed 
possible that a cluster can have a refcount of 0 but still won’t be used 
by allocation functions.

But that means you’ve just changed the definition of what a free cluster 
is.  In fact, that means that nothing in update_refcount() can free a 
cluster that has active writes to it, because now a cluster is only free 
if there are no such writes.  It follows that you needn’t change 
update_refcount() to prevent clusters with such writes from being freed, 
because with this new definition of what a free cluster is, it’s 
impossible for update_refcount() to free them.

(Yes, you’re right that it would be nice to postpone the protocol-level 
discard still, but not doing so wouldn’t be a catastrophe – which shows 
that it has little to do with actually freeing something, as far as 
qcow2 is concerned.

If it’s just about postponing the discard, we can do exactly that: Let 
update_refcount() skip discarding for clusters that are still in use, 
and then let update_inflight_write_cnt() only do that discard instead of 
invoking all of qcow2_update_cluster_refcount().)

Alternatively, we could also not change the definition of what a free 
cluster is, which means we wouldn’t need to change the allocation 
functions, but instead postpone the refcount update that 
update_refcount() does.  That would mean we’d actually need to really 
drop the refcount in update_inflight_write_cnt() instead of doing a -0.

Max



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12 12:42           ` Vladimir Sementsov-Ogievskiy
@ 2021-03-12 15:01             ` Max Reitz
  0 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-03-12 15:01 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 12.03.21 13:42, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 14:17, Max Reitz wrote:
>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>> becomes 0) and reused during data write. In this case data write may
>>
>> [..]
>>
>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
>>>>>> update_refcount(BlockDriverState *bs,
>>>>>>           if (refcount == 0) {
>>>>>>               void *table;
>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>>>>> cluster_index);
>>>>>> +
>>>>>> +            if (infl) {
>>>>>> +                infl->refcount_zero = true;
>>>>>> +                infl->type = type;
>>>>>> +                continue;
>>>>>> +            }
>>>>>
>>>>> I don’t understand what this is supposed to do exactly.  It seems 
>>>>> like it wants to keep metadata structures in the cache that are 
>>>>> still in use (because dropping them from the caches is what happens 
>>>>> next), but users of metadata structures won’t set in-flight 
>>>>> counters for those metadata structures, will they?
>>>>
>>>> Don't follow.
>>>>
>>>> We want the code in "if (refcount == 0)" to be triggered only when 
>>>> full reference count of the host cluster becomes 0, including 
>>>> inflight-write-cnt. So, if at this point inflight-write-cnt is not 
>>>> 0, we postpone freeing the host cluster, it will be done later from 
>>>> "slow path" in update_inflight_write_cnt().
>>>
>>> But the code under “if (refcount == 0)” doesn’t free anything, does 
>>> it?  All I can see is code to remove metadata structures from the 
>>> metadata caches (if the discarded cluster was an L2 table or a 
>>> refblock), and finally the discard on the underlying file.  I don’t 
>>> see how that protocol-level discard has anything to do with our 
>>> problem, though.
>>
>> Hmm. Still, if we do this discard, and then our in-flight write, we'll 
>> have data instead of a hole. Not a big deal, but seems better to 
>> postpone discard.
>>
>> On the other hand, clearing caches is OK, as its related only to 
>> qcow2-refcount, not to inflight-write-cnt
>>
>>>
>>> As far as I understand, the freeing happens immediately above the “if 
>>> (refcount == 0)” block by s->set_refcount() setting the refcount to 
>>> 0. (including updating s->free_cluster_index if the refcount is 0).
>>
>> Hmm.. And that (setting s->free_cluster_index) what I should actually 
>> prevent until total reference count becomes zero.
>>
>> And about s->set_refcount(): it only update a refcount itself, and 
>> don't free anything.
>>
>>
> 
> 
> Looking now at this place:
> 
> 
>          if (refcount == 0 && cluster_index < s->free_cluster_index) {
>              s->free_cluster_index = cluster_index;
>          }
>          s->set_refcount(refcount_block, block_index, refcount);
>          if (refcount == 0) {
>              void *table;
>              Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>              if (infl) {
>                  infl->refcount_zero = true;
>                  infl->type = type;
>                  continue;
>              }
>              table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>                                                  offset);
>              if (table != NULL) {
>                  qcow2_cache_put(s->refcount_block_cache, &refcount_block);
>                  old_table_index = -1;
>                  qcow2_cache_discard(s->refcount_block_cache, table);
>              }
>              table = qcow2_cache_is_table_offset(s->l2_table_cache, 
> offset);
>              if (table != NULL) {
>                  qcow2_cache_discard(s->l2_table_cache, table);
>              }
>              if (s->compressed_cache) {
>                  seqcache_discard_cluster(s->compressed_cache, 
> cluster_offset);
>              }
>              if (s->discard_passthrough[type]) {
>                  update_refcount_discard(bs, cluster_offset, 
> s->cluster_size);
>              }
>          }
> 
> 
> Hmm. Is it OK that we use "offset" to discard qcow2 metadata caches? 
> offset is the start of the whole loop, and is not updated on iterations. 
> Isn't it more correct to use cluster_offset here? Or we are sure that 
> refcount and l2 metadata is always discarded by exactly one cluster? 
> Than we are OK, but still worth an assertion that offset == cluster_offset.

Spontaneously, I think it’s a bug that hasn’t made any problems yet, 
because I suppose L2 tables and refblocks are indeed always discarded 
one by one (i.e., cluster by cluster).  It definitely looks like this 
should be cluster_offset, yes.

Max



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12 12:46           ` Vladimir Sementsov-Ogievskiy
@ 2021-03-12 15:10             ` Max Reitz
  2021-03-12 15:24               ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Max Reitz @ 2021-03-12 15:10 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 14:17, Max Reitz wrote:
>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>> becomes 0) and reused during data write. In this case data write may
>>
>> [..]
>>
>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
>>>>>> update_refcount(BlockDriverState *bs,
>>>>>>           if (refcount == 0) {
>>>>>>               void *table;
>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>>>>> cluster_index);
>>>>>> +
>>>>>> +            if (infl) {
>>>>>> +                infl->refcount_zero = true;
>>>>>> +                infl->type = type;
>>>>>> +                continue;
>>>>>> +            }
>>>>>
>>>>> I don’t understand what this is supposed to do exactly.  It seems 
>>>>> like it wants to keep metadata structures in the cache that are 
>>>>> still in use (because dropping them from the caches is what happens 
>>>>> next), but users of metadata structures won’t set in-flight 
>>>>> counters for those metadata structures, will they?
>>>>
>>>> Don't follow.
>>>>
>>>> We want the code in "if (refcount == 0)" to be triggered only when 
>>>> full reference count of the host cluster becomes 0, including 
>>>> inflight-write-cnt. So, if at this point inflight-write-cnt is not 
>>>> 0, we postpone freeing the host cluster, it will be done later from 
>>>> "slow path" in update_inflight_write_cnt().
>>>
>>> But the code under “if (refcount == 0)” doesn’t free anything, does 
>>> it?  All I can see is code to remove metadata structures from the 
>>> metadata caches (if the discarded cluster was an L2 table or a 
>>> refblock), and finally the discard on the underlying file.  I don’t 
>>> see how that protocol-level discard has anything to do with our 
>>> problem, though.
>>
>> Hmm. Still, if we do this discard, and then our in-flight write, we'll 
>> have data instead of a hole. Not a big deal, but seems better to 
>> postpone discard.
>>
>> On the other hand, clearing caches is OK, as its related only to 
>> qcow2-refcount, not to inflight-write-cnt
>>
>>>
>>> As far as I understand, the freeing happens immediately above the “if 
>>> (refcount == 0)” block by s->set_refcount() setting the refcount to 
>>> 0. (including updating s->free_cluster_index if the refcount is 0).
>>
>> Hmm.. And that (setting s->free_cluster_index) what I should actually 
>> prevent until total reference count becomes zero.
>>
>> And about s->set_refcount(): it only update a refcount itself, and 
>> don't free anything.
>>
>>
> 
> So, it is more correct like this:
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 464d133368..1da282446d 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT 
> update_refcount(BlockDriverState *bs,
>           } else {
>               refcount += addend;
>           }
> -        if (refcount == 0 && cluster_index < s->free_cluster_index) {
> -            s->free_cluster_index = cluster_index;
> -        }
>           s->set_refcount(refcount_block, block_index, refcount);
> 
>           if (refcount == 0) {
>               void *table;
>               Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
> 
> -            if (infl) {
> -                infl->refcount_zero = true;
> -                infl->type = type;
> -                continue;
> -            }
> -
>               table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>                                                   offset);
>               if (table != NULL) {
> @@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT 
> update_refcount(BlockDriverState *bs,
>                   qcow2_cache_discard(s->l2_table_cache, table);
>               }
> 
> +            if (infl) {
> +                infl->refcount_zero = true;
> +                infl->type = type;
> +                continue;
> +            }
> +
> +            if (cluster_index < s->free_cluster_index) {
> +                s->free_cluster_index = cluster_index;
> +            }
> +
>               if (s->discard_passthrough[type]) {
>                   update_refcount_discard(bs, cluster_offset, 
> s->cluster_size);
>               }

I don’t think I like using s->free_cluster_index as a protection against 
allocating something before it.

First, it comes back the problem I just described in my mail from 15:58 
GMT+1, which is that you’re changing the definition of what a free 
cluster is.  With this proposal, you’re proposing yet a new definition: 
A free cluster is anything with refcount == 0 after free_cluster_index.

Now looking only at the allocation functions, it may look like that kind 
of is the definition already.  But I don’t think that was the intention 
when free_cluster_index was introduced, so we’d have to check every 
place that sets free_cluster_index, to see whether it adheres to this 
definition.

And I think it’s clear that there is a place that won’t adhere to this 
definition, and that is this very place here, in update_refcount().  Say 
free_cluster_index is 42.  Then you free cluster 39, but there is a 
write to it, so free_cluster_index isn’t update.  Then you free cluster 
38, and there are writes to that cluster, so free_cluster_index is 
updated to 38.  Suddenly, 39 is free to be allocated, too.

(The precise problem is that with this new definition decreasing 
free_cluster_index suddenly has the power to free any cluster between 
its new and all value.  With the old definition, changing 
free_cluster_index would never free any cluster.  So when you decrease 
free_cluster_index, you suddenly have to be sure that all clusters 
between the new and old value that have refcount 0 are indeed to be 
considered free.)

Max



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/6] util: implement seqcache
  2021-03-12 14:37     ` Vladimir Sementsov-Ogievskiy
@ 2021-03-12 15:13       ` Max Reitz
  0 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-03-12 15:13 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 12.03.21 15:37, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 16:41, Max Reitz wrote:
>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>> Implement cache for small sequential unaligned writes, so that they may
>>> be cached until we get a complete cluster and then write it.
>>>
>>> The cache is intended to be used for backup to qcow2 compressed target
>>> opened in O_DIRECT mode, but can be reused for any similar (even not
>>> block-layer related) task.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   include/qemu/seqcache.h |  42 +++++
>>>   util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
>>>   MAINTAINERS             |   6 +
>>>   util/meson.build        |   1 +
>>>   4 files changed, 410 insertions(+)
>>>   create mode 100644 include/qemu/seqcache.h
>>>   create mode 100644 util/seqcache.c
>>
>> Looks quite good to me, thanks.  Nice explanations, too. :)
>>
>> The only design question I have is whether there’s a reason you’re 
>> using a list again instead of a hash table.  I suppose we do need the 
>> list anyway because of the next_flush iterator, so using a hash table 
>> would only complicate the implementation, but still.
> 
> Yes, it seems correct for flush iterator go in same order as writes 
> comes, so we need a list. We can add a hash table, it will only help on 
> read.. But for compressed cache in qcow2 we try to flush often enough, 
> so there should not be many clusters in the cache. So I think addition 
> of hash table may be done later if needed.

Sure.  The problem I see is that we’ll probably never reach the point of 
it really being needed. O:)

So I think it’s a question of now or never.

[...]

>>> + */
>>> +bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t 
>>> *bytes,
>>> +                             uint8_t **buf, bool *unfinished)
>>
>> Could be “uint8_t *const *buf”, I suppose.  Don’t know how much the 
>> callers would hate that, though.
> 
> Will do. And actually I wrote quite big explanation but missed the fact 
> that caller don't get ownership on buf, it should be mentioned.

Great, thanks.

>>> +{
>>> +    Cluster *req = s->next_flush;
>>> +
>>> +    if (s->next_flush) {
>>> +        *unfinished = false;
>>> +        req = s->next_flush;
>>> +        s->next_flush = QSIMPLEQ_NEXT(req, entry);
>>> +        if (s->next_flush == s->cur_write) {
>>> +            s->next_flush = NULL;
>>> +        }
>>> +    } else if (s->cur_write && *unfinished) {
>>> +        req = s->cur_write;
>>
>> I was wondering whether flushing an unfinished cluster wouldn’t kind 
>> of finalize it, but I suppose the problem with that would be that you 
>> can’t add data to a finished cluster, which wouldn’t be that great if 
>> you’re just flushing the cache without wanting to drop it all.
>>
>> (The problem I see is that flushing it later will mean all the data 
>> that already has been written here will have to be rewritten.  Not 
>> that bad, I suppose.)
> 
> Yes that's all correct. Also there is additional strong reason: qcow2 
> depends on the fact that clusters become "finished" by defined rules: 
> only when it really finished up the the end or when qcow2 starts writing 
> another cluster.
> 
> For "finished" clusters with unaligned end we can safely align this end 
> up to some good alignment writing a bit more data than needed. It's safe 
> because tail of the cluster is never used. And we'll perform better with 
> aligned write avoiding RMW.
> 
> But when flushing "unfinished" cluster, we should write exactly what we 
> have in the cache, as there may happen parallel write to the same 
> cluster, which will continue the sequential process.

OK, thanks for the explanation.

Max



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12 15:10             ` Max Reitz
@ 2021-03-12 15:24               ` Vladimir Sementsov-Ogievskiy
  2021-03-12 15:52                 ` Max Reitz
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-12 15:24 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

12.03.2021 18:10, Max Reitz wrote:
> On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
>>> 12.03.2021 14:17, Max Reitz wrote:
>>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>>> becomes 0) and reused during data write. In this case data write may
>>>
>>> [..]
>>>
>>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>>>>           if (refcount == 0) {
>>>>>>>               void *table;
>>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>>>> +
>>>>>>> +            if (infl) {
>>>>>>> +                infl->refcount_zero = true;
>>>>>>> +                infl->type = type;
>>>>>>> +                continue;
>>>>>>> +            }
>>>>>>
>>>>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>>>>
>>>>> Don't follow.
>>>>>
>>>>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
>>>>
>>>> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.
>>>
>>> Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.
>>>
>>> On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt
>>>
>>>>
>>>> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).
>>>
>>> Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.
>>>
>>> And about s->set_refcount(): it only update a refcount itself, and don't free anything.
>>>
>>>
>>
>> So, it is more correct like this:
>>
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 464d133368..1da282446d 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>           } else {
>>               refcount += addend;
>>           }
>> -        if (refcount == 0 && cluster_index < s->free_cluster_index) {
>> -            s->free_cluster_index = cluster_index;
>> -        }
>>           s->set_refcount(refcount_block, block_index, refcount);
>>
>>           if (refcount == 0) {
>>               void *table;
>>               Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>
>> -            if (infl) {
>> -                infl->refcount_zero = true;
>> -                infl->type = type;
>> -                continue;
>> -            }
>> -
>>               table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>>                                                   offset);
>>               if (table != NULL) {
>> @@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>                   qcow2_cache_discard(s->l2_table_cache, table);
>>               }
>>
>> +            if (infl) {
>> +                infl->refcount_zero = true;
>> +                infl->type = type;
>> +                continue;
>> +            }
>> +
>> +            if (cluster_index < s->free_cluster_index) {
>> +                s->free_cluster_index = cluster_index;
>> +            }
>> +
>>               if (s->discard_passthrough[type]) {
>>                   update_refcount_discard(bs, cluster_offset, s->cluster_size);
>>               }
> 
> I don’t think I like using s->free_cluster_index as a protection against allocating something before it.

Hmm, I just propose not to update it, if refcount reached 0 but we still have inflight writes.


> 
> First, it comes back the problem I just described in my mail from 15:58 GMT+1, which is that you’re changing the definition of what a free cluster is.  With this proposal, you’re proposing yet a new definition: A free cluster is anything with refcount == 0 after free_cluster_index.

I think that free cluster is anything with refcount = 0 and inflight-write-cnt = 0. And free_cluster_index is a hint where start to search for such cluster.

> 
> Now looking only at the allocation functions, it may look like that kind of is the definition already.  But I don’t think that was the intention when free_cluster_index was introduced, so we’d have to check every place that sets free_cluster_index, to see whether it adheres to this definition.
> 
> And I think it’s clear that there is a place that won’t adhere to this definition, and that is this very place here, in update_refcount().  Say free_cluster_index is 42.  Then you free cluster 39, but there is a write to it, so free_cluster_index isn’t update.  Then you free cluster 38, and there are writes to that cluster, so free_cluster_index is updated to 38.  Suddenly, 39 is free to be allocated, too.

Why? 39 is protected by inflight-cnt, and we do has_infl_wr() check together with refcount==0 check when allocate clusters.

> 
> (The precise problem is that with this new definition decreasing free_cluster_index suddenly has the power to free any cluster between its new and all value.  With the old definition, changing free_cluster_index would never free any cluster.  So when you decrease free_cluster_index, you suddenly have to be sure that all clusters between the new and old value that have refcount 0 are indeed to be considered free.)
> 
> Max
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12 14:58           ` Max Reitz
@ 2021-03-12 15:39             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-12 15:39 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

12.03.2021 17:58, Max Reitz wrote:
> On 12.03.21 13:32, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 14:17, Max Reitz wrote:
>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>> becomes 0) and reused during data write. In this case data write may
>>
>> [..]
>>
>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>>>           if (refcount == 0) {
>>>>>>               void *table;
>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>>> +
>>>>>> +            if (infl) {
>>>>>> +                infl->refcount_zero = true;
>>>>>> +                infl->type = type;
>>>>>> +                continue;
>>>>>> +            }
>>>>>
>>>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>>>
>>>> Don't follow.
>>>>
>>>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
>>>
>>> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.
>>
>> Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.
>>
>> On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt
>>
>>>
>>> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).
>>
>> Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.
>>
>> And about s->set_refcount(): it only update a refcount itself, and don't free anything.
> 
> That is what freeing is, though.  I consider something to be free when allocation functions will allocate it.  The allocation functions look at the refcount, so once a cluster’s refcount is 0, it is free.

And with this patch I try to update allocation function to look also at inflight-write-counters. If I missed something its a bug in the patch.

> 
> If that isn’t what freeing is, nothing in update_refcount() frees anything (when looking at how data clusters are handled).  Passing the discard through to the protocol layer isn’t “freeing”, because it’s independent of qcow2.
> 
> Now, your patch adds an additional check to the allocation functions (whether there are ongoing writes on the cluster), so it’s indeed possible that a cluster can have a refcount of 0 but still won’t be used by allocation functions.
> 
> But that means you’ve just changed the definition of what a free cluster is.  In fact, that means that nothing in update_refcount() can free a cluster that has active writes to it, because now a cluster is only free if there are no such writes.  It follows that you needn’t change update_refcount() to prevent clusters with such writes from being freed, because with this new definition of what a free cluster is, it’s impossible for update_refcount() to free them.

But as I noted somewhere else, update_refcount should not discard the host cluster in parallel with inflight write. It's not completely wrong, but it's inefficient.

> 
> (Yes, you’re right that it would be nice to postpone the protocol-level discard still, but not doing so wouldn’t be a catastrophe – which shows that it has little to do with actually freeing something, as far as qcow2 is concerned.
> 
> If it’s just about postponing the discard, we can do exactly that: Let update_refcount() skip discarding for clusters that are still in use, and then let update_inflight_write_cnt() only do that discard instead of invoking all of qcow2_update_cluster_refcount().)

Agree, yes.

> 
> Alternatively, we could also not change the definition of what a free cluster is, which means we wouldn’t need to change the allocation functions, but instead postpone the refcount update that update_refcount() does.  That would mean we’d actually need to really drop the refcount in update_inflight_write_cnt() instead of doing a -0.
> 

Hmm, that should work too. Do you think it is better? With first approach meaning of zero refcount is changed (it's not a free cluster now, keep in mind inflight-write-cnt too). So we should update functions interested in zero refcount. With second approach refcount=1 changes the meaning (it my be actually referenced by inflight-write-cnt object, not by some qcow2 metadata object).. Shouldn't we update some functions that are interested in refcount=1? Intuitively it seems safe enough. Nothing dangerous is in refcount=1 for cluster which is actually unused at all.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12 15:24               ` Vladimir Sementsov-Ogievskiy
@ 2021-03-12 15:52                 ` Max Reitz
  2021-03-12 16:03                   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Max Reitz @ 2021-03-12 15:52 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 12.03.21 16:24, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 18:10, Max Reitz wrote:
>> On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:
>>> 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
>>>> 12.03.2021 14:17, Max Reitz wrote:
>>>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>>>> becomes 0) and reused during data write. In this case data write 
>>>>>>>> may
>>>>
>>>> [..]
>>>>
>>>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
>>>>>>>> update_refcount(BlockDriverState *bs,
>>>>>>>>           if (refcount == 0) {
>>>>>>>>               void *table;
>>>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>>>>>>> cluster_index);
>>>>>>>> +
>>>>>>>> +            if (infl) {
>>>>>>>> +                infl->refcount_zero = true;
>>>>>>>> +                infl->type = type;
>>>>>>>> +                continue;
>>>>>>>> +            }
>>>>>>>
>>>>>>> I don’t understand what this is supposed to do exactly.  It seems 
>>>>>>> like it wants to keep metadata structures in the cache that are 
>>>>>>> still in use (because dropping them from the caches is what 
>>>>>>> happens next), but users of metadata structures won’t set 
>>>>>>> in-flight counters for those metadata structures, will they?
>>>>>>
>>>>>> Don't follow.
>>>>>>
>>>>>> We want the code in "if (refcount == 0)" to be triggered only when 
>>>>>> full reference count of the host cluster becomes 0, including 
>>>>>> inflight-write-cnt. So, if at this point inflight-write-cnt is not 
>>>>>> 0, we postpone freeing the host cluster, it will be done later 
>>>>>> from "slow path" in update_inflight_write_cnt().
>>>>>
>>>>> But the code under “if (refcount == 0)” doesn’t free anything, does 
>>>>> it?  All I can see is code to remove metadata structures from the 
>>>>> metadata caches (if the discarded cluster was an L2 table or a 
>>>>> refblock), and finally the discard on the underlying file.  I don’t 
>>>>> see how that protocol-level discard has anything to do with our 
>>>>> problem, though.
>>>>
>>>> Hmm. Still, if we do this discard, and then our in-flight write, 
>>>> we'll have data instead of a hole. Not a big deal, but seems better 
>>>> to postpone discard.
>>>>
>>>> On the other hand, clearing caches is OK, as its related only to 
>>>> qcow2-refcount, not to inflight-write-cnt
>>>>
>>>>>
>>>>> As far as I understand, the freeing happens immediately above the 
>>>>> “if (refcount == 0)” block by s->set_refcount() setting the 
>>>>> refcount to 0. (including updating s->free_cluster_index if the 
>>>>> refcount is 0).
>>>>
>>>> Hmm.. And that (setting s->free_cluster_index) what I should 
>>>> actually prevent until total reference count becomes zero.
>>>>
>>>> And about s->set_refcount(): it only update a refcount itself, and 
>>>> don't free anything.
>>>>
>>>>
>>>
>>> So, it is more correct like this:
>>>
>>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>>> index 464d133368..1da282446d 100644
>>> --- a/block/qcow2-refcount.c
>>> +++ b/block/qcow2-refcount.c
>>> @@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT 
>>> update_refcount(BlockDriverState *bs,
>>>           } else {
>>>               refcount += addend;
>>>           }
>>> -        if (refcount == 0 && cluster_index < s->free_cluster_index) {
>>> -            s->free_cluster_index = cluster_index;
>>> -        }
>>>           s->set_refcount(refcount_block, block_index, refcount);
>>>
>>>           if (refcount == 0) {
>>>               void *table;
>>>               Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>> cluster_index);
>>>
>>> -            if (infl) {
>>> -                infl->refcount_zero = true;
>>> -                infl->type = type;
>>> -                continue;
>>> -            }
>>> -
>>>               table = 
>>> qcow2_cache_is_table_offset(s->refcount_block_cache,
>>>                                                   offset);
>>>               if (table != NULL) {
>>> @@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT 
>>> update_refcount(BlockDriverState *bs,
>>>                   qcow2_cache_discard(s->l2_table_cache, table);
>>>               }
>>>
>>> +            if (infl) {
>>> +                infl->refcount_zero = true;
>>> +                infl->type = type;
>>> +                continue;
>>> +            }
>>> +
>>> +            if (cluster_index < s->free_cluster_index) {
>>> +                s->free_cluster_index = cluster_index;
>>> +            }
>>> +
>>>               if (s->discard_passthrough[type]) {
>>>                   update_refcount_discard(bs, cluster_offset, 
>>> s->cluster_size);
>>>               }
>>
>> I don’t think I like using s->free_cluster_index as a protection 
>> against allocating something before it.
> 
> Hmm, I just propose not to update it, if refcount reached 0 but we still 
> have inflight writes.
> 
> 
>>
>> First, it comes back the problem I just described in my mail from 
>> 15:58 GMT+1, which is that you’re changing the definition of what a 
>> free cluster is.  With this proposal, you’re proposing yet a new 
>> definition: A free cluster is anything with refcount == 0 after 
>> free_cluster_index.
> 
> I think that free cluster is anything with refcount = 0 and 
> inflight-write-cnt = 0.

Then, as I said in my other mail, update_refcount() just cannot free any 
cluster.  So changes to that function can’t be justified by preventing 
it from freeing clusters.

You need to clearly define what it is that update_refcount() should or 
shouldn’t do, and then we have to think about whether when all writes 
have settled, we really have to invoke qcow2_update_cluster_refcount() 
or whether we should do the small outstanding changes just directly in 
update_inflight_write_cnt().

I think this needs to be more formalized, or it doesn’t make sense.

For example, say we do define a free cluster to be refcount (RC) = 0 and 
inflight-write-cnt (IFWC) = 0.  Then everything that is done to a 
cluster because it is considered being freed right now because its RC 
drops to 0 must probably be changed to only be done if also its IFWC is 
0.  For example, we should only discard host clusters on the protocol 
layer if a cluster becomes free.  update_refcount() will no longer be 
able to free clusters with IFWC > 0, so it must never issue a 
protocol-level discard for them.  And, yes, it also shouldn’t adjust 
first_free_cluster_index, as you propose here.  (But you didn’t explain 
why, and it seems like it was just intuition to you instead of looking 
at it more formally.)

Instead, for clusters with RC = 0 and IFWC > 0, 
update_inflight_write_cnt() will take on the role of freeing them.  So 
now that function must adjust first_free_cluster_index and issue the 
protocol-level discard for such clusters.

I suppose in practice we could invoke qcow2_update_cluster_refcount() 
with -0, as you do, because now the cluster has RC = 0 and IFWC = 0, so 
now that function will be capable of freeing it.  But to me, that just 
looks like a bit of abuse.


I suppose we could create a new function qcow2_cluster_freed() where we 
collect everything that needs to be done once a cluster is considered 
freed (which so far was whenever its RC dropped to 0, which only happens 
in update_refcount(); and then will be whenever its RC and its IFWC drop 
to 0, which can happen in either update_refcount() or 
update_inflight_write_cnt()).  What would belong in there is discarding 
the cluster on the protocol level, and adjusting 
first_free_cluster_index.  (Perhaps more, I don’t know.)  With such a 
function, it would seem clear to me that there is no need to invoke 
qcow2_update_cluster_refcount() just to get precisely that effect.


(The alternative would be to keep RC == 0 the definition of a freed 
cluster.  Then we’d have to postpone the s->set_refcount() in 
update_refcount(), and update the refcount again in 
update_inflight_write_cnt(), but invoking 
qcow2_update_cluster_refcount().  We wouldn’t need to change the 
allocation functions.

I’m not saying that alternative is better – I don’t think it is, I think 
you’re right that the definition of a freed cluster should be changed. 
I’m just presenting it in contrast, to show when it would make sense to 
call qcow2_update_cluster_refcount().)

> And free_cluster_index is a hint where start to 
> search for such cluster.
> 
>>
>> Now looking only at the allocation functions, it may look like that 
>> kind of is the definition already.  But I don’t think that was the 
>> intention when free_cluster_index was introduced, so we’d have to 
>> check every place that sets free_cluster_index, to see whether it 
>> adheres to this definition.
>>
>> And I think it’s clear that there is a place that won’t adhere to this 
>> definition, and that is this very place here, in update_refcount().  
>> Say free_cluster_index is 42.  Then you free cluster 39, but there is 
>> a write to it, so free_cluster_index isn’t update.  Then you free 
>> cluster 38, and there are writes to that cluster, so 
>> free_cluster_index is updated to 38.  Suddenly, 39 is free to be 
>> allocated, too.
> 
> Why? 39 is protected by inflight-cnt, and we do has_infl_wr() check 
> together with refcount==0 check when allocate clusters.

I was (wrongly) assuming that with this change you’d drop the check in 
the allocation functions.

Max

>> (The precise problem is that with this new definition decreasing 
>> free_cluster_index suddenly has the power to free any cluster between 
>> its new and all value.  With the old definition, changing 
>> free_cluster_index would never free any cluster.  So when you decrease 
>> free_cluster_index, you suddenly have to be sure that all clusters 
>> between the new and old value that have refcount 0 are indeed to be 
>> considered free.)
>>
>> Max
>>
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard
  2021-03-12 15:52                 ` Max Reitz
@ 2021-03-12 16:03                   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-12 16:03 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

12.03.2021 18:52, Max Reitz wrote:
> On 12.03.21 16:24, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 18:10, Max Reitz wrote:
>>> On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:
>>>> 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 12.03.2021 14:17, Max Reitz wrote:
>>>>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>>>>> becomes 0) and reused during data write. In this case data write may
>>>>>
>>>>> [..]
>>>>>
>>>>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>>>>>>           if (refcount == 0) {
>>>>>>>>>               void *table;
>>>>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>>>>>> +
>>>>>>>>> +            if (infl) {
>>>>>>>>> +                infl->refcount_zero = true;
>>>>>>>>> +                infl->type = type;
>>>>>>>>> +                continue;
>>>>>>>>> +            }
>>>>>>>>
>>>>>>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>>>>>>
>>>>>>> Don't follow.
>>>>>>>
>>>>>>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
>>>>>>
>>>>>> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.
>>>>>
>>>>> Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.
>>>>>
>>>>> On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt
>>>>>
>>>>>>
>>>>>> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).
>>>>>
>>>>> Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.
>>>>>
>>>>> And about s->set_refcount(): it only update a refcount itself, and don't free anything.
>>>>>
>>>>>
>>>>
>>>> So, it is more correct like this:
>>>>
>>>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>>>> index 464d133368..1da282446d 100644
>>>> --- a/block/qcow2-refcount.c
>>>> +++ b/block/qcow2-refcount.c
>>>> @@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>           } else {
>>>>               refcount += addend;
>>>>           }
>>>> -        if (refcount == 0 && cluster_index < s->free_cluster_index) {
>>>> -            s->free_cluster_index = cluster_index;
>>>> -        }
>>>>           s->set_refcount(refcount_block, block_index, refcount);
>>>>
>>>>           if (refcount == 0) {
>>>>               void *table;
>>>>               Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>
>>>> -            if (infl) {
>>>> -                infl->refcount_zero = true;
>>>> -                infl->type = type;
>>>> -                continue;
>>>> -            }
>>>> -
>>>>               table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>>>>                                                   offset);
>>>>               if (table != NULL) {
>>>> @@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>                   qcow2_cache_discard(s->l2_table_cache, table);
>>>>               }
>>>>
>>>> +            if (infl) {
>>>> +                infl->refcount_zero = true;
>>>> +                infl->type = type;
>>>> +                continue;
>>>> +            }
>>>> +
>>>> +            if (cluster_index < s->free_cluster_index) {
>>>> +                s->free_cluster_index = cluster_index;
>>>> +            }
>>>> +
>>>>               if (s->discard_passthrough[type]) {
>>>>                   update_refcount_discard(bs, cluster_offset, s->cluster_size);
>>>>               }
>>>
>>> I don’t think I like using s->free_cluster_index as a protection against allocating something before it.
>>
>> Hmm, I just propose not to update it, if refcount reached 0 but we still have inflight writes.
>>
>>
>>>
>>> First, it comes back the problem I just described in my mail from 15:58 GMT+1, which is that you’re changing the definition of what a free cluster is.  With this proposal, you’re proposing yet a new definition: A free cluster is anything with refcount == 0 after free_cluster_index.
>>
>> I think that free cluster is anything with refcount = 0 and inflight-write-cnt = 0.
> 
> Then, as I said in my other mail, update_refcount() just cannot free any cluster.  So changes to that function can’t be justified by preventing it from freeing clusters.
> 
> You need to clearly define what it is that update_refcount() should or shouldn’t do, and then we have to think about whether when all writes have settled, we really have to invoke qcow2_update_cluster_refcount() or whether we should do the small outstanding changes just directly in update_inflight_write_cnt().
> 
> I think this needs to be more formalized, or it doesn’t make sense.
> 
> For example, say we do define a free cluster to be refcount (RC) = 0 and inflight-write-cnt (IFWC) = 0.  Then everything that is done to a cluster because it is considered being freed right now because its RC drops to 0 must probably be changed to only be done if also its IFWC is 0.  For example, we should only discard host clusters on the protocol layer if a cluster becomes free.  update_refcount() will no longer be able to free clusters with IFWC > 0, so it must never issue a protocol-level discard for them.  And, yes, it also shouldn’t adjust first_free_cluster_index, as you propose here.  (But you didn’t explain why, and it seems like it was just intuition to you instead of looking at it more formally.)
> 
> Instead, for clusters with RC = 0 and IFWC > 0, update_inflight_write_cnt() will take on the role of freeing them.  So now that function must adjust first_free_cluster_index and issue the protocol-level discard for such clusters.

Yes, agree.

> 
> I suppose in practice we could invoke qcow2_update_cluster_refcount() with -0, as you do, because now the cluster has RC = 0 and IFWC = 0, so now that function will be capable of freeing it.  But to me, that just looks like a bit of abuse.

agree

> 
> 
> I suppose we could create a new function qcow2_cluster_freed() where we collect everything that needs to be done once a cluster is considered freed (which so far was whenever its RC dropped to 0, which only happens in update_refcount(); and then will be whenever its RC and its IFWC drop to 0, which can happen in either update_refcount() or update_inflight_write_cnt()).  What would belong in there is discarding the cluster on the protocol level, and adjusting first_free_cluster_index.  (Perhaps more, I don’t know.)  With such a function, it would seem clear to me that there is no need to invoke qcow2_update_cluster_refcount() just to get precisely that effect.

yes

> 
> 
> (The alternative would be to keep RC == 0 the definition of a freed cluster.  Then we’d have to postpone the s->set_refcount() in update_refcount(), and update the refcount again in update_inflight_write_cnt(), but invoking qcow2_update_cluster_refcount().  We wouldn’t need to change the allocation functions.
> 
> I’m not saying that alternative is better – I don’t think it is, I think you’re right that the definition of a freed cluster should be changed. I’m just presenting it in contrast, to show when it would make sense to call qcow2_update_cluster_refcount().)

OK

In the meanwhile Kevin dispelled my "big problems" in "[PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard", so probably next step would be to retry CoRwLock-based approach.

> 
>> And free_cluster_index is a hint where start to search for such cluster.
>>
>>>
>>> Now looking only at the allocation functions, it may look like that kind of is the definition already.  But I don’t think that was the intention when free_cluster_index was introduced, so we’d have to check every place that sets free_cluster_index, to see whether it adheres to this definition.
>>>
>>> And I think it’s clear that there is a place that won’t adhere to this definition, and that is this very place here, in update_refcount(). Say free_cluster_index is 42.  Then you free cluster 39, but there is a write to it, so free_cluster_index isn’t update.  Then you free cluster 38, and there are writes to that cluster, so free_cluster_index is updated to 38.  Suddenly, 39 is free to be allocated, too.
>>
>> Why? 39 is protected by inflight-cnt, and we do has_infl_wr() check together with refcount==0 check when allocate clusters.
> 
> I was (wrongly) assuming that with this change you’d drop the check in the allocation functions.
> 
> Max
> 
>>> (The precise problem is that with this new definition decreasing free_cluster_index suddenly has the power to free any cluster between its new and all value.  With the old definition, changing free_cluster_index would never free any cluster.  So when you decrease free_cluster_index, you suddenly have to be sure that all clusters between the new and old value that have refcount 0 are indeed to be considered free.)
>>>
>>> Max
>>>
>>
>>
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 5/6] block-coroutine-wrapper: allow non bdrv_ prefix
  2021-03-05 17:35 ` [PATCH v3 5/6] block-coroutine-wrapper: allow non bdrv_ prefix Vladimir Sementsov-Ogievskiy
@ 2021-03-12 16:53   ` Max Reitz
  0 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-03-12 16:53 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
> We are going to reuse the script to generate a qcow2_ function in
> further commit. Prepare the script now.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   scripts/block-coroutine-wrapper.py | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
  2021-03-05 17:35 ` [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes Vladimir Sementsov-Ogievskiy
@ 2021-03-12 18:15   ` Max Reitz
  2021-03-12 18:43     ` Vladimir Sementsov-Ogievskiy
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Max Reitz @ 2021-03-12 18:15 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
> Compressed writes are unaligned to 512, which works very slow in
> O_DIRECT mode. Let's use the cache.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/coroutines.h     |   3 +
>   block/qcow2.h          |   4 ++
>   block/qcow2-refcount.c |  10 +++
>   block/qcow2.c          | 158 ++++++++++++++++++++++++++++++++++++++---
>   4 files changed, 164 insertions(+), 11 deletions(-)
> 
> diff --git a/block/coroutines.h b/block/coroutines.h
> index 4cfb4946e6..cb8e05830e 100644
> --- a/block/coroutines.h
> +++ b/block/coroutines.h
> @@ -66,4 +66,7 @@ int coroutine_fn bdrv_co_readv_vmstate(BlockDriverState *bs,
>   int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState *bs,
>                                           QEMUIOVector *qiov, int64_t pos);
>   
> +int coroutine_fn qcow2_co_flush_compressed_cache(BlockDriverState *bs);
> +int generated_co_wrapper qcow2_flush_compressed_cache(BlockDriverState *bs);
> +
>   #endif /* BLOCK_COROUTINES_INT_H */
> diff --git a/block/qcow2.h b/block/qcow2.h
> index e18d8dfbae..8b8fed0950 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -28,6 +28,7 @@
>   #include "crypto/block.h"
>   #include "qemu/coroutine.h"
>   #include "qemu/units.h"
> +#include "qemu/seqcache.h"
>   #include "block/block_int.h"
>   
>   //#define DEBUG_ALLOC
> @@ -422,6 +423,9 @@ typedef struct BDRVQcow2State {
>       Qcow2CompressionType compression_type;
>   
>       GHashTable *inflight_writes_counters;
> +
> +    SeqCache *compressed_cache;
> +    int compressed_flush_ret;
>   } BDRVQcow2State;
>   
>   typedef struct Qcow2COWRegion {
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 464d133368..218917308e 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -29,6 +29,7 @@
>   #include "qemu/bswap.h"
>   #include "qemu/cutils.h"
>   #include "trace.h"
> +#include "block/coroutines.h"
>   
>   static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
>                                       uint64_t max);
> @@ -1040,6 +1041,10 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>                   qcow2_cache_discard(s->l2_table_cache, table);
>               }
>   
> +            if (s->compressed_cache) {
> +                seqcache_discard_cluster(s->compressed_cache, cluster_offset);
> +            }
> +
>               if (s->discard_passthrough[type]) {
>                   update_refcount_discard(bs, cluster_offset, s->cluster_size);
>               }
> @@ -1349,6 +1354,11 @@ int coroutine_fn qcow2_write_caches(BlockDriverState *bs)
>       BDRVQcow2State *s = bs->opaque;
>       int ret;
>   
> +    ret = qcow2_flush_compressed_cache(bs);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +

I wonder a bit why this function doesn’t work on a best-effort basis, 
but that isn’t your problem.

>       ret = qcow2_cache_write(bs, s->l2_table_cache);
>       if (ret < 0) {
>           return ret;
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 6ee94421e0..5f3713cd6f 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -42,6 +42,7 @@
>   #include "qapi/qapi-visit-block-core.h"
>   #include "crypto.h"
>   #include "block/aio_task.h"
> +#include "block/coroutines.h"
>   
>   /*
>     Differences with QCOW:
> @@ -1834,6 +1835,10 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
>       s->inflight_writes_counters =
>           g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
>   
> +    if (!has_data_file(bs) && (bs->file->bs->open_flags & BDRV_O_NOCACHE)) {

Looks a bit like a layering violation, but I have no better proposal and 
you gave your reasoning, so, OK.

I wonder whether this should be re-checked during reopen, though.

> +        s->compressed_cache = seqcache_new(s->cluster_size);
> +    }
> +
>       return ret;
>   
>    fail:
> @@ -2653,6 +2658,91 @@ fail_nometa:
>       return ret;
>   }
>   
> +/*
> + * Flush one cluster of compressed cache if exists. If @unfinished is true and
> + * there is no finished cluster to flush, flush the unfinished one. Otherwise
> + * flush only finished clusters.
> + *
> + * Return 0 if nothing to flush, 1 if one cluster successfully flushed and <0 on
> + * error.
> + */
> +static int coroutine_fn
> +qcow2_co_compressed_flush_one(BlockDriverState *bs, bool unfinished)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int ret;
> +    int64_t align = 1;
> +    int64_t offset, bytes;
> +    uint8_t *buf;
> +
> +    if (!s->compressed_cache) {
> +        return 0;
> +    }
> +
> +    if (!seqcache_get_next_flush(s->compressed_cache, &offset, &bytes, &buf,
> +                                 &unfinished))
> +    {
> +        return 0;
> +    }
> +
> +    qcow2_inflight_writes_inc(bs, offset, bytes);
> +
> +    /*
> +     * Don't try to align-up unfinished cached cluster, parallel write to it is
> +     * possible! For finished host clusters data in the tail of the cluster will
> +     * be never used. So, take some good alignment for speed.
> +     *
> +     * Note also, that seqcache guarantees that allocated size of @buf is enough
> +     * to fill the cluster up to the end, so we are safe to align up with
> +     * align <= cluster_size.
> +     */
> +    if (!unfinished) {
> +        align = MIN(s->cluster_size,
> +                    MAX(s->data_file->bs->bl.request_alignment,
> +                        bdrv_opt_mem_align(bs)));
> +    }
> +

I’d move the pre-write overlap check here, because its purpose is to 
prevent writes to metadata structures as they are happening, not as they 
are queued into a cache.  (I’d say.)

> +    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
> +    ret = bdrv_co_pwrite(s->data_file, offset,
> +                         QEMU_ALIGN_UP(offset + bytes, align) - offset,

I remember you said you wanted to describe more of the properties of the 
buffer returned by seqcache_get_next_flush().  That would be nice here, 
because intuitively one would assume that the buffer is @bytes bytes 
long, which aligning here will exceed.  (It’s fine, but it would be 
nicer if there was a comment that assured that it’s fine.)

> +                         buf, 0);
> +    if (ret < 0 && s->compressed_flush_ret == 0) {
> +        /*
> +         * The data that was "written" earlier is lost. We don't want to
> +         * care with storing it somehow to retry flushing later.

Yeah, there is little point in trying something like writing it again 
and again in the hope that maybe at some point it’ll work.

It is a shame that the error isn’t returned by the original compressed 
write, but what can you do.

>                                                                   Also, how much
> +         * data will we have to store, if not drop it after failed flush?
> +         * After this point qcow2_co_flush_compressed_cache() (and
> +         * qcow2_flush_compressed_cache()) never return success.
> +         */
> +        s->compressed_flush_ret = ret;
> +    }
> +
> +    seqcache_discard_cluster(s->compressed_cache, offset);
> +
> +    qcow2_inflight_writes_dec(bs, offset, bytes);
> +
> +    return ret < 0 ? ret : 1;
> +}
> +
> +int coroutine_fn qcow2_co_flush_compressed_cache(BlockDriverState *bs)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +
> +    if (s->compressed_cache) {
> +        uint64_t nb_clusters = seqcache_nb_clusters(s->compressed_cache);
> +
> +        /*
> +         * If parallel writes are possible we don't want to loop forever. So
> +         * flush only clusters available at start of flush operation.
> +         */
> +        while (nb_clusters--) {
> +            qcow2_co_compressed_flush_one(bs, true);
> +        }
> +    }
> +
> +    return s->compressed_flush_ret;
> +}
> +
>   static int qcow2_inactivate(BlockDriverState *bs)
>   {
>       BDRVQcow2State *s = bs->opaque;
> @@ -2667,6 +2757,13 @@ static int qcow2_inactivate(BlockDriverState *bs)
>                             bdrv_get_device_or_node_name(bs));
>       }
>   
> +    ret = qcow2_flush_compressed_cache(bs);
> +    if (ret) {
> +        result = ret;
> +        error_report("Failed to flush the compressed write cache: %s",
> +                     strerror(-ret));
> +    }
> +
>       ret = qcow2_cache_flush(bs, s->l2_table_cache);
>       if (ret) {
>           result = ret;
> @@ -2699,6 +2796,12 @@ static void qcow2_close(BlockDriverState *bs)
>           qcow2_inactivate(bs);
>       }
>   
> +    /*
> +     * Cache should be flushed in qcow2_inactivate() and should be empty in
> +     * inactive mode. So we are safe to free it.
> +     */
> +    seqcache_free(s->compressed_cache);
> +
>       cache_clean_timer_del(bs);
>       qcow2_cache_destroy(s->l2_table_cache);
>       qcow2_cache_destroy(s->refcount_block_cache);
> @@ -4558,18 +4661,42 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>           goto fail;
>       }
>   
> -    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
> +    if (s->compressed_cache) {

Why is this conditional?

> +        /*
> +         * It's important to do seqcache_write() in the same critical section
> +         * (by s->lock) as qcow2_alloc_compressed_cluster_offset(), so that the
> +         * cache is filled sequentially.
> +         */

Yes.

> +        seqcache_write(s->compressed_cache, cluster_offset, out_len, out_buf);
>   
> -    qemu_co_mutex_unlock(&s->lock);
> +        qemu_co_mutex_unlock(&s->lock);
>   
> -    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
> -    ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
> +        ret = qcow2_co_compressed_flush_one(bs, false);

The qcow2 doc says a compressed cluster can span multiple host clusters. 
  I don’t know whether that can happen with this driver, but if it does, 
wouldn’t that mean we’d need to flush two clusters here?  Oh, no, never 
mind.  Only the first one would be finished and thus flushed, not the 
second one.

I could have now removed the above paragraph, but it made me think, so I 
kept it:

Hm.  Actually, if we unconditionally flush here, doesn’t that mean that 
we’ll never have a finished cluster in the cache for longer than the 
span between the seqcache_write() and this 
qcow2_co_compressed_flush_one()?  I.e., the 
qcow2_co_flush_compressed_cache() is supposed to never flush any 
finished cluster, but only the currently active unfinished cluster (if 
there is one), right?

> +        if (ret < 0) {
> +            /*
> +             * if ret < 0 it probably not this request which failed, but some
> +             * previous one that was cached. Moreover, when we write to the
> +             * cache we probably may always report success, because
> +             * seqcache_write() never fails. Still, it's better to inform
> +             * compressed backup now then wait until final flush.
> +             */

Yup.

> +            goto fail;
> +        }
> +    } else {
> +        qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
>   
> -    qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
> +        qemu_co_mutex_unlock(&s->lock);
>   
> -    if (ret < 0) {
> -        goto fail;
> +        BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
> +        ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
> +
> +        qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
> +
> +        if (ret < 0) {
> +            goto fail;
> +        }
>       }
> +
>   success:
>       ret = 0;
>   fail:
> @@ -4681,10 +4808,19 @@ qcow2_co_preadv_compressed(BlockDriverState *bs,
>   
>       out_buf = qemu_blockalign(bs, s->cluster_size);
>   
> -    BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
> -    ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
> -    if (ret < 0) {
> -        goto fail;
> +    /*
> +     * seqcache_read may return less bytes than csize, as csize may exceed
> +     * actual compressed data size. So we are OK if seqcache_read returns
> +     * something > 0.

I was about to ask what happens when a compressed cluster spans two host 
clusters (I could have imagined that in theory the second one could have 
been discarded, but not the first one, so reading from the cache would 
really be short -- we would have needed to check that we only fell short 
in the range of 512 bytes, not more).

But then I realized that in this version of the series, all finished 
clusters are immediately discarded and only the current unfinished one 
is kept.  Does it even make sense to try seqcache_read() here, then?

> +     */
> +    if (!s->compressed_cache ||

(As for writing, I don’t think this can ever occur.)

Max

> +        seqcache_read(s->compressed_cache, coffset, csize, buf) <= 0)
> +    {
> +        BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
> +        ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
> +        if (ret < 0) {
> +            goto fail;
> +        }
>       }
>   
>       if (qcow2_co_decompress(bs, out_buf, s->cluster_size, buf, csize) < 0) {
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
  2021-03-12 18:15   ` Max Reitz
@ 2021-03-12 18:43     ` Vladimir Sementsov-Ogievskiy
  2021-03-15  9:58       ` Max Reitz
  2021-03-12 18:45     ` Vladimir Sementsov-Ogievskiy
  2021-03-29 20:18     ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-12 18:43 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

12.03.2021 21:15, Max Reitz wrote:
> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>> Compressed writes are unaligned to 512, which works very slow in
>> O_DIRECT mode. Let's use the cache.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/coroutines.h     |   3 +
>>   block/qcow2.h          |   4 ++
>>   block/qcow2-refcount.c |  10 +++
>>   block/qcow2.c          | 158 ++++++++++++++++++++++++++++++++++++++---
>>   4 files changed, 164 insertions(+), 11 deletions(-)
>>
>> diff --git a/block/coroutines.h b/block/coroutines.h
>> index 4cfb4946e6..cb8e05830e 100644
>> --- a/block/coroutines.h
>> +++ b/block/coroutines.h
>> @@ -66,4 +66,7 @@ int coroutine_fn bdrv_co_readv_vmstate(BlockDriverState *bs,
>>   int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState *bs,
>>                                           QEMUIOVector *qiov, int64_t pos);
>> +int coroutine_fn qcow2_co_flush_compressed_cache(BlockDriverState *bs);
>> +int generated_co_wrapper qcow2_flush_compressed_cache(BlockDriverState *bs);
>> +
>>   #endif /* BLOCK_COROUTINES_INT_H */
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index e18d8dfbae..8b8fed0950 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -28,6 +28,7 @@
>>   #include "crypto/block.h"
>>   #include "qemu/coroutine.h"
>>   #include "qemu/units.h"
>> +#include "qemu/seqcache.h"
>>   #include "block/block_int.h"
>>   //#define DEBUG_ALLOC
>> @@ -422,6 +423,9 @@ typedef struct BDRVQcow2State {
>>       Qcow2CompressionType compression_type;
>>       GHashTable *inflight_writes_counters;
>> +
>> +    SeqCache *compressed_cache;
>> +    int compressed_flush_ret;
>>   } BDRVQcow2State;
>>   typedef struct Qcow2COWRegion {
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 464d133368..218917308e 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -29,6 +29,7 @@
>>   #include "qemu/bswap.h"
>>   #include "qemu/cutils.h"
>>   #include "trace.h"
>> +#include "block/coroutines.h"
>>   static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
>>                                       uint64_t max);
>> @@ -1040,6 +1041,10 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>                   qcow2_cache_discard(s->l2_table_cache, table);
>>               }
>> +            if (s->compressed_cache) {
>> +                seqcache_discard_cluster(s->compressed_cache, cluster_offset);
>> +            }
>> +
>>               if (s->discard_passthrough[type]) {
>>                   update_refcount_discard(bs, cluster_offset, s->cluster_size);
>>               }
>> @@ -1349,6 +1354,11 @@ int coroutine_fn qcow2_write_caches(BlockDriverState *bs)
>>       BDRVQcow2State *s = bs->opaque;
>>       int ret;
>> +    ret = qcow2_flush_compressed_cache(bs);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
> 
> I wonder a bit why this function doesn’t work on a best-effort basis, but that isn’t your problem.
> 
>>       ret = qcow2_cache_write(bs, s->l2_table_cache);
>>       if (ret < 0) {
>>           return ret;
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 6ee94421e0..5f3713cd6f 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -42,6 +42,7 @@
>>   #include "qapi/qapi-visit-block-core.h"
>>   #include "crypto.h"
>>   #include "block/aio_task.h"
>> +#include "block/coroutines.h"
>>   /*
>>     Differences with QCOW:
>> @@ -1834,6 +1835,10 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
>>       s->inflight_writes_counters =
>>           g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
>> +    if (!has_data_file(bs) && (bs->file->bs->open_flags & BDRV_O_NOCACHE)) {
> 
> Looks a bit like a layering violation, but I have no better proposal and you gave your reasoning, so, OK.
> 
> I wonder whether this should be re-checked during reopen, though.

Hmm yes. Somehow I thought qcow2_do_open() will be called again, but it's about inactivation/invalidation, not about reopen.

> 
>> +        s->compressed_cache = seqcache_new(s->cluster_size);
>> +    }
>> +
>>       return ret;
>>    fail:
>> @@ -2653,6 +2658,91 @@ fail_nometa:
>>       return ret;
>>   }
>> +/*
>> + * Flush one cluster of compressed cache if exists. If @unfinished is true and
>> + * there is no finished cluster to flush, flush the unfinished one. Otherwise
>> + * flush only finished clusters.
>> + *
>> + * Return 0 if nothing to flush, 1 if one cluster successfully flushed and <0 on
>> + * error.
>> + */
>> +static int coroutine_fn
>> +qcow2_co_compressed_flush_one(BlockDriverState *bs, bool unfinished)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    int ret;
>> +    int64_t align = 1;
>> +    int64_t offset, bytes;
>> +    uint8_t *buf;
>> +
>> +    if (!s->compressed_cache) {
>> +        return 0;
>> +    }
>> +
>> +    if (!seqcache_get_next_flush(s->compressed_cache, &offset, &bytes, &buf,
>> +                                 &unfinished))
>> +    {
>> +        return 0;
>> +    }
>> +
>> +    qcow2_inflight_writes_inc(bs, offset, bytes);
>> +
>> +    /*
>> +     * Don't try to align-up unfinished cached cluster, parallel write to it is
>> +     * possible! For finished host clusters data in the tail of the cluster will
>> +     * be never used. So, take some good alignment for speed.
>> +     *
>> +     * Note also, that seqcache guarantees that allocated size of @buf is enough
>> +     * to fill the cluster up to the end, so we are safe to align up with
>> +     * align <= cluster_size.
>> +     */
>> +    if (!unfinished) {
>> +        align = MIN(s->cluster_size,
>> +                    MAX(s->data_file->bs->bl.request_alignment,
>> +                        bdrv_opt_mem_align(bs)));
>> +    }
>> +
> 
> I’d move the pre-write overlap check here, because its purpose is to prevent writes to metadata structures as they are happening, not as they are queued into a cache.  (I’d say.)

Agree

> 
>> +    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>> +    ret = bdrv_co_pwrite(s->data_file, offset,
>> +                         QEMU_ALIGN_UP(offset + bytes, align) - offset,
> 
> I remember you said you wanted to describe more of the properties of the buffer returned by seqcache_get_next_flush().  That would be nice here, because intuitively one would assume that the buffer is @bytes bytes long, which aligning here will exceed.  (It’s fine, but it would be nicer if there was a comment that assured that it’s fine.)
> 
>> +                         buf, 0);
>> +    if (ret < 0 && s->compressed_flush_ret == 0) {
>> +        /*
>> +         * The data that was "written" earlier is lost. We don't want to
>> +         * care with storing it somehow to retry flushing later.
> 
> Yeah, there is little point in trying something like writing it again and again in the hope that maybe at some point it’ll work.
> 
> It is a shame that the error isn’t returned by the original compressed write, but what can you do.
> 
>>                                                                   Also, how much
>> +         * data will we have to store, if not drop it after failed flush?
>> +         * After this point qcow2_co_flush_compressed_cache() (and
>> +         * qcow2_flush_compressed_cache()) never return success.
>> +         */
>> +        s->compressed_flush_ret = ret;
>> +    }
>> +
>> +    seqcache_discard_cluster(s->compressed_cache, offset);
>> +
>> +    qcow2_inflight_writes_dec(bs, offset, bytes);
>> +
>> +    return ret < 0 ? ret : 1;
>> +}
>> +
>> +int coroutine_fn qcow2_co_flush_compressed_cache(BlockDriverState *bs)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +
>> +    if (s->compressed_cache) {
>> +        uint64_t nb_clusters = seqcache_nb_clusters(s->compressed_cache);
>> +
>> +        /*
>> +         * If parallel writes are possible we don't want to loop forever. So
>> +         * flush only clusters available at start of flush operation.
>> +         */
>> +        while (nb_clusters--) {
>> +            qcow2_co_compressed_flush_one(bs, true);
>> +        }
>> +    }
>> +
>> +    return s->compressed_flush_ret;
>> +}
>> +
>>   static int qcow2_inactivate(BlockDriverState *bs)
>>   {
>>       BDRVQcow2State *s = bs->opaque;
>> @@ -2667,6 +2757,13 @@ static int qcow2_inactivate(BlockDriverState *bs)
>>                             bdrv_get_device_or_node_name(bs));
>>       }
>> +    ret = qcow2_flush_compressed_cache(bs);
>> +    if (ret) {
>> +        result = ret;
>> +        error_report("Failed to flush the compressed write cache: %s",
>> +                     strerror(-ret));
>> +    }
>> +
>>       ret = qcow2_cache_flush(bs, s->l2_table_cache);
>>       if (ret) {
>>           result = ret;
>> @@ -2699,6 +2796,12 @@ static void qcow2_close(BlockDriverState *bs)
>>           qcow2_inactivate(bs);
>>       }
>> +    /*
>> +     * Cache should be flushed in qcow2_inactivate() and should be empty in
>> +     * inactive mode. So we are safe to free it.
>> +     */
>> +    seqcache_free(s->compressed_cache);
>> +
>>       cache_clean_timer_del(bs);
>>       qcow2_cache_destroy(s->l2_table_cache);
>>       qcow2_cache_destroy(s->refcount_block_cache);
>> @@ -4558,18 +4661,42 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>>           goto fail;
>>       }
>> -    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
>> +    if (s->compressed_cache) {
> 
> Why is this conditional?

We don't have compressed_cache for non o_direct.

> 
>> +        /*
>> +         * It's important to do seqcache_write() in the same critical section
>> +         * (by s->lock) as qcow2_alloc_compressed_cluster_offset(), so that the
>> +         * cache is filled sequentially.
>> +         */
> 
> Yes.
> 
>> +        seqcache_write(s->compressed_cache, cluster_offset, out_len, out_buf);
>> -    qemu_co_mutex_unlock(&s->lock);
>> +        qemu_co_mutex_unlock(&s->lock);
>> -    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>> -    ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
>> +        ret = qcow2_co_compressed_flush_one(bs, false);
> 
> The qcow2 doc says a compressed cluster can span multiple host clusters.  I don’t know whether that can happen with this driver, but if it does, wouldn’t that mean we’d need to flush two clusters here?  Oh, no, never mind.  Only the first one would be finished and thus flushed, not the second one.
> 
> I could have now removed the above paragraph, but it made me think, so I kept it:
> 
> Hm.  Actually, if we unconditionally flush here, doesn’t that mean that we’ll never have a finished cluster in the cache for longer than the span between the seqcache_write() and this qcow2_co_compressed_flush_one()?  I.e., the qcow2_co_flush_compressed_cache() is supposed to never flush any finished cluster, but only the currently active unfinished cluster (if there is one), right?

Hmm. Maybe if we have parallel write and flush requests, it's a kind of race condition: may be flush will flush both finished and unfinished cluster, maybe write will flush the finished cluster and flush will flush only unfinished one.. Moreover we may have several parallel requests, so they make several finished clusters, and sudden flush will flush them all.

> 
>> +        if (ret < 0) {
>> +            /*
>> +             * if ret < 0 it probably not this request which failed, but some
>> +             * previous one that was cached. Moreover, when we write to the
>> +             * cache we probably may always report success, because
>> +             * seqcache_write() never fails. Still, it's better to inform
>> +             * compressed backup now then wait until final flush.
>> +             */
> 
> Yup.
> 
>> +            goto fail;
>> +        }
>> +    } else {
>> +        qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
>> -    qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
>> +        qemu_co_mutex_unlock(&s->lock);
>> -    if (ret < 0) {
>> -        goto fail;
>> +        BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>> +        ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
>> +
>> +        qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
>> +
>> +        if (ret < 0) {
>> +            goto fail;
>> +        }
>>       }
>> +
>>   success:
>>       ret = 0;
>>   fail:
>> @@ -4681,10 +4808,19 @@ qcow2_co_preadv_compressed(BlockDriverState *bs,
>>       out_buf = qemu_blockalign(bs, s->cluster_size);
>> -    BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
>> -    ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
>> -    if (ret < 0) {
>> -        goto fail;
>> +    /*
>> +     * seqcache_read may return less bytes than csize, as csize may exceed
>> +     * actual compressed data size. So we are OK if seqcache_read returns
>> +     * something > 0.
> 
> I was about to ask what happens when a compressed cluster spans two host clusters (I could have imagined that in theory the second one could have been discarded, but not the first one, so reading from the cache would really be short -- we would have needed to check that we only fell short in the range of 512 bytes, not more).
> 
> But then I realized that in this version of the series, all finished clusters are immediately discarded and only the current unfinished one is kept.  Does it even make sense to try seqcache_read() here, then?

Hmm. Not immediately, but after flush. An flush is not under mutex. So in theory at some moment we may have several finished clusters "in-flight". And your question make sense. The cache supports reading from consequitive clusters. But we also should support here reading one part of data from disk and another from the cache to be safe.

> 
>> +     */
>> +    if (!s->compressed_cache ||
> 
> (As for writing, I don’t think this can ever occur.)
> 
> Max
> 
>> +        seqcache_read(s->compressed_cache, coffset, csize, buf) <= 0)
>> +    {
>> +        BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
>> +        ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
>> +        if (ret < 0) {
>> +            goto fail;
>> +        }
>>       }
>>       if (qcow2_co_decompress(bs, out_buf, s->cluster_size, buf, csize) < 0) {
>>
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
  2021-03-12 18:15   ` Max Reitz
  2021-03-12 18:43     ` Vladimir Sementsov-Ogievskiy
@ 2021-03-12 18:45     ` Vladimir Sementsov-Ogievskiy
  2021-03-29 20:18     ` Vladimir Sementsov-Ogievskiy
  2 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-12 18:45 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

12.03.2021 21:15, Max Reitz wrote:
>> @@ -1834,6 +1835,10 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
>>       s->inflight_writes_counters =
>>           g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
>> +    if (!has_data_file(bs) && (bs->file->bs->open_flags & BDRV_O_NOCACHE)) {
> 
> Looks a bit like a layering violation, but I have no better proposal and you gave your reasoning, so, OK.

Probably better is check request_alignment of bs->file->bs. If it > 1 then enable the cache.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
  2021-03-12 18:43     ` Vladimir Sementsov-Ogievskiy
@ 2021-03-15  9:58       ` Max Reitz
  2021-03-15 14:40         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Max Reitz @ 2021-03-15  9:58 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 12.03.21 19:43, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 21:15, Max Reitz wrote:
>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>> Compressed writes are unaligned to 512, which works very slow in
>>> O_DIRECT mode. Let's use the cache.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/coroutines.h     |   3 +
>>>   block/qcow2.h          |   4 ++
>>>   block/qcow2-refcount.c |  10 +++
>>>   block/qcow2.c          | 158 ++++++++++++++++++++++++++++++++++++++---
>>>   4 files changed, 164 insertions(+), 11 deletions(-)

[...]

>>> @@ -2699,6 +2796,12 @@ static void qcow2_close(BlockDriverState *bs)
>>>           qcow2_inactivate(bs);
>>>       }
>>> +    /*
>>> +     * Cache should be flushed in qcow2_inactivate() and should be 
>>> empty in
>>> +     * inactive mode. So we are safe to free it.
>>> +     */
>>> +    seqcache_free(s->compressed_cache);
>>> +
>>>       cache_clean_timer_del(bs);
>>>       qcow2_cache_destroy(s->l2_table_cache);
>>>       qcow2_cache_destroy(s->refcount_block_cache);
>>> @@ -4558,18 +4661,42 @@ 
>>> qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>>>           goto fail;
>>>       }
>>> -    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
>>> +    if (s->compressed_cache) {
>>
>> Why is this conditional?
> 
> We don't have compressed_cache for non o_direct.

Oh right.

>>> +        /*
>>> +         * It's important to do seqcache_write() in the same 
>>> critical section
>>> +         * (by s->lock) as qcow2_alloc_compressed_cluster_offset(), 
>>> so that the
>>> +         * cache is filled sequentially.
>>> +         */
>>
>> Yes.
>>
>>> +        seqcache_write(s->compressed_cache, cluster_offset, out_len, 
>>> out_buf);
>>> -    qemu_co_mutex_unlock(&s->lock);
>>> +        qemu_co_mutex_unlock(&s->lock);
>>> -    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>>> -    ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, 
>>> out_buf, 0);
>>> +        ret = qcow2_co_compressed_flush_one(bs, false);
>>
>> The qcow2 doc says a compressed cluster can span multiple host 
>> clusters.  I don’t know whether that can happen with this driver, but 
>> if it does, wouldn’t that mean we’d need to flush two clusters here?  
>> Oh, no, never mind.  Only the first one would be finished and thus 
>> flushed, not the second one.
>>
>> I could have now removed the above paragraph, but it made me think, so 
>> I kept it:
>>
>> Hm.  Actually, if we unconditionally flush here, doesn’t that mean 
>> that we’ll never have a finished cluster in the cache for longer than 
>> the span between the seqcache_write() and this 
>> qcow2_co_compressed_flush_one()?  I.e., the 
>> qcow2_co_flush_compressed_cache() is supposed to never flush any 
>> finished cluster, but only the currently active unfinished cluster (if 
>> there is one), right?
> 
> Hmm. Maybe if we have parallel write and flush requests, it's a kind of 
> race condition: may be flush will flush both finished and unfinished 
> cluster, maybe write will flush the finished cluster and flush will 
> flush only unfinished one.. Moreover we may have several parallel 
> requests, so they make several finished clusters, and sudden flush will 
> flush them all.

OK.  I was mostly asking because I was wondering how much you expect the 
cache to be filled, i.e., how much you expect the read cache to help.

[...]

>>> @@ -4681,10 +4808,19 @@ qcow2_co_preadv_compressed(BlockDriverState *bs,
>>>       out_buf = qemu_blockalign(bs, s->cluster_size);
>>> -    BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
>>> -    ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
>>> -    if (ret < 0) {
>>> -        goto fail;
>>> +    /*
>>> +     * seqcache_read may return less bytes than csize, as csize may 
>>> exceed
>>> +     * actual compressed data size. So we are OK if seqcache_read 
>>> returns
>>> +     * something > 0.
>>
>> I was about to ask what happens when a compressed cluster spans two 
>> host clusters (I could have imagined that in theory the second one 
>> could have been discarded, but not the first one, so reading from the 
>> cache would really be short -- we would have needed to check that we 
>> only fell short in the range of 512 bytes, not more).
>>
>> But then I realized that in this version of the series, all finished 
>> clusters are immediately discarded and only the current unfinished one 
>> is kept.  Does it even make sense to try seqcache_read() here, then?
> 
> Hmm. Not immediately, but after flush. An flush is not under mutex. So 
> in theory at some moment we may have several finished clusters 
> "in-flight". And your question make sense. The cache supports reading 
> from consequitive clusters. But we also should support here reading one 
> part of data from disk and another from the cache to be safe.

The question is whether it really makes sense to even have a 
seqcache_read() path when in reality it’s probably never accessed.  I 
mean, besides the fact that it seems based purely on chance whether a 
read might fetch something from the cache even while we’re writing, in 
practice I don’t know any case where we’d write to and read from a 
compressed qcow2 image at the same time.  (I don’t know what you’re 
doing with the 'compress' filter, though.)

Max



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
  2021-03-15  9:58       ` Max Reitz
@ 2021-03-15 14:40         ` Vladimir Sementsov-Ogievskiy
  2021-03-16 12:25           ` Max Reitz
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-15 14:40 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

15.03.2021 12:58, Max Reitz wrote:
> On 12.03.21 19:43, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 21:15, Max Reitz wrote:
>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>> Compressed writes are unaligned to 512, which works very slow in
>>>> O_DIRECT mode. Let's use the cache.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> ---
>>>>   block/coroutines.h     |   3 +
>>>>   block/qcow2.h          |   4 ++
>>>>   block/qcow2-refcount.c |  10 +++
>>>>   block/qcow2.c          | 158 ++++++++++++++++++++++++++++++++++++++---
>>>>   4 files changed, 164 insertions(+), 11 deletions(-)
> 
> [...]
> 
>>>> @@ -2699,6 +2796,12 @@ static void qcow2_close(BlockDriverState *bs)
>>>>           qcow2_inactivate(bs);
>>>>       }
>>>> +    /*
>>>> +     * Cache should be flushed in qcow2_inactivate() and should be empty in
>>>> +     * inactive mode. So we are safe to free it.
>>>> +     */
>>>> +    seqcache_free(s->compressed_cache);
>>>> +
>>>>       cache_clean_timer_del(bs);
>>>>       qcow2_cache_destroy(s->l2_table_cache);
>>>>       qcow2_cache_destroy(s->refcount_block_cache);
>>>> @@ -4558,18 +4661,42 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>>>>           goto fail;
>>>>       }
>>>> -    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
>>>> +    if (s->compressed_cache) {
>>>
>>> Why is this conditional?
>>
>> We don't have compressed_cache for non o_direct.
> 
> Oh right.
> 
>>>> +        /*
>>>> +         * It's important to do seqcache_write() in the same critical section
>>>> +         * (by s->lock) as qcow2_alloc_compressed_cluster_offset(), so that the
>>>> +         * cache is filled sequentially.
>>>> +         */
>>>
>>> Yes.
>>>
>>>> +        seqcache_write(s->compressed_cache, cluster_offset, out_len, out_buf);
>>>> -    qemu_co_mutex_unlock(&s->lock);
>>>> +        qemu_co_mutex_unlock(&s->lock);
>>>> -    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>>>> -    ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
>>>> +        ret = qcow2_co_compressed_flush_one(bs, false);
>>>
>>> The qcow2 doc says a compressed cluster can span multiple host clusters.  I don’t know whether that can happen with this driver, but if it does, wouldn’t that mean we’d need to flush two clusters here? Oh, no, never mind.  Only the first one would be finished and thus flushed, not the second one.
>>>
>>> I could have now removed the above paragraph, but it made me think, so I kept it:
>>>
>>> Hm.  Actually, if we unconditionally flush here, doesn’t that mean that we’ll never have a finished cluster in the cache for longer than the span between the seqcache_write() and this qcow2_co_compressed_flush_one()?  I.e., the qcow2_co_flush_compressed_cache() is supposed to never flush any finished cluster, but only the currently active unfinished cluster (if there is one), right?
>>
>> Hmm. Maybe if we have parallel write and flush requests, it's a kind of race condition: may be flush will flush both finished and unfinished cluster, maybe write will flush the finished cluster and flush will flush only unfinished one.. Moreover we may have several parallel requests, so they make several finished clusters, and sudden flush will flush them all.
> 
> OK.  I was mostly asking because I was wondering how much you expect the cache to be filled, i.e., how much you expect the read cache to help.
> 

It should depend on how many parallel write requests allowed.. (and that's an actual limitation for cache size). We can add a "bottom-limit" for the cache, i.e. don't flush if there less than X clusters in the cache. Still I don't think there actual scenarios where both compressed reads and writes done. And anyway, to accelerate compressed read we'd better do some simple read-ahead.


> 
>>>> @@ -4681,10 +4808,19 @@ qcow2_co_preadv_compressed(BlockDriverState *bs,
>>>>       out_buf = qemu_blockalign(bs, s->cluster_size);
>>>> -    BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
>>>> -    ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
>>>> -    if (ret < 0) {
>>>> -        goto fail;
>>>> +    /*
>>>> +     * seqcache_read may return less bytes than csize, as csize may exceed
>>>> +     * actual compressed data size. So we are OK if seqcache_read returns
>>>> +     * something > 0.
>>>
>>> I was about to ask what happens when a compressed cluster spans two host clusters (I could have imagined that in theory the second one could have been discarded, but not the first one, so reading from the cache would really be short -- we would have needed to check that we only fell short in the range of 512 bytes, not more).
>>>
>>> But then I realized that in this version of the series, all finished clusters are immediately discarded and only the current unfinished one is kept.  Does it even make sense to try seqcache_read() here, then?
>>
>> Hmm. Not immediately, but after flush. An flush is not under mutex. So in theory at some moment we may have several finished clusters "in-flight". And your question make sense. The cache supports reading from consequitive clusters. But we also should support here reading one part of data from disk and another from the cache to be safe.
> 
> The question is whether it really makes sense to even have a seqcache_read() path when in reality it’s probably never accessed.  I mean, besides the fact that it seems based purely on chance whether a read might fetch something from the cache even while we’re writing, in practice I don’t know any case where we’d write to and read from a compressed qcow2 image at the same time.  (I don’t know what you’re doing with the 'compress' filter, though.)
> 

Note, that for user that's not a parallel write and read to the same cluster:

1. user writes cluster A, request succeeded, data is in the cache

2. user writes some other clusters, cache filled, flush started

3. in parallel to [2] user reads cluster A. From the POV of user, cluster A is written already, and should be read successfully

And seqcache_read() gives a simple non-blocking way to support read operation.

I have two additional thoughts in mind:

- After introducing compress filter we actually support parallel compressed reads and writes.. Even if nobody uses it, it works, and it should continue working correctly

- The only think that breaks using compressed device (with help of compressed filter) is that we can't rewrite compressed cluster (by compressed cluster I mean). But it's not hard to implement.. So, at some moment the limitation will go away.. Hmm. still I can't imagine a usecase :)

I can image that both reads and writes may be used on temporary image in backup scheme with reverse delta, or for external backup.

But rewriting compressed clusters is sensible only when we run real guest on compressed image.. Can it be helpful? Maybe for scenarios with low disk usage ratio..


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
  2021-03-15 14:40         ` Vladimir Sementsov-Ogievskiy
@ 2021-03-16 12:25           ` Max Reitz
  2021-03-16 17:48             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Max Reitz @ 2021-03-16 12:25 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 15.03.21 15:40, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 12:58, Max Reitz wrote:

[...]

>> The question is whether it really makes sense to even have a 
>> seqcache_read() path when in reality it’s probably never accessed.  I 
>> mean, besides the fact that it seems based purely on chance whether a 
>> read might fetch something from the cache even while we’re writing, in 
>> practice I don’t know any case where we’d write to and read from a 
>> compressed qcow2 image at the same time.  (I don’t know what you’re 
>> doing with the 'compress' filter, though.)
>>
> 
> Note, that for user that's not a parallel write and read to the same 
> cluster:
> 
> 1. user writes cluster A, request succeeded, data is in the cache
> 
> 2. user writes some other clusters, cache filled, flush started
> 
> 3. in parallel to [2] user reads cluster A. From the POV of user, 
> cluster A is written already, and should be read successfully

Yes, but when would that happen?

> And seqcache_read() gives a simple non-blocking way to support read 
> operation.

OK, that makes sense.  We’d need to flush the cache before we can read 
anything from the disk, so we should have a read-from-cache branch here.

> But rewriting compressed clusters is sensible only when we run real 
> guest on compressed image.. Can it be helpful? Maybe for scenarios with 
> low disk usage ratio..

I’m not sure, but the point is that rewrites are currently not 
supported.  The whole compression implementation is mainly tailored 
towards just writing a complete image (e.g. by qemu-img convert or the 
backup job), so that’s where my question is coming from: It’s difficult 
for me to see a currently working use case where you’d read from and 
write to a compressed image at the same time.

Max



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
  2021-03-16 12:25           ` Max Reitz
@ 2021-03-16 17:48             ` Vladimir Sementsov-Ogievskiy
  2021-03-17  8:09               ` Max Reitz
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-16 17:48 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

16.03.2021 15:25, Max Reitz wrote:
> On 15.03.21 15:40, Vladimir Sementsov-Ogievskiy wrote:
>> 15.03.2021 12:58, Max Reitz wrote:
> 
> [...]
> 
>>> The question is whether it really makes sense to even have a seqcache_read() path when in reality it’s probably never accessed.  I mean, besides the fact that it seems based purely on chance whether a read might fetch something from the cache even while we’re writing, in practice I don’t know any case where we’d write to and read from a compressed qcow2 image at the same time.  (I don’t know what you’re doing with the 'compress' filter, though.)
>>>
>>
>> Note, that for user that's not a parallel write and read to the same cluster:
>>
>> 1. user writes cluster A, request succeeded, data is in the cache
>>
>> 2. user writes some other clusters, cache filled, flush started
>>
>> 3. in parallel to [2] user reads cluster A. From the POV of user, cluster A is written already, and should be read successfully
> 
> Yes, but when would that happen?
> 
>> And seqcache_read() gives a simple non-blocking way to support read operation.
> 
> OK, that makes sense.  We’d need to flush the cache before we can read anything from the disk, so we should have a read-from-cache branch here.
> 
>> But rewriting compressed clusters is sensible only when we run real guest on compressed image.. Can it be helpful? Maybe for scenarios with low disk usage ratio..
> 
> I’m not sure, but the point is that rewrites are currently not supported.  The whole compression implementation is mainly tailored towards just writing a complete image (e.g. by qemu-img convert or the backup job), so that’s where my question is coming from: It’s difficult for me to see a currently working use case where you’d read from and write to a compressed image at the same time.
> 

External backup works like the following:

  - start backup sync=none from active disk to temporary disk
  - export temporary disk through nbd
  - external tool reads from nbd export

For this scheme it may make sense to use compression, and we get a use case where compressed reads and writes are used in the same time. Moreover this is possible just now, and no reason to not support it.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
  2021-03-16 17:48             ` Vladimir Sementsov-Ogievskiy
@ 2021-03-17  8:09               ` Max Reitz
  0 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-03-17  8:09 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, jsnow, qemu-devel, ehabkost, crosa

On 16.03.21 18:48, Vladimir Sementsov-Ogievskiy wrote:
> 16.03.2021 15:25, Max Reitz wrote:
>> On 15.03.21 15:40, Vladimir Sementsov-Ogievskiy wrote:
>>> 15.03.2021 12:58, Max Reitz wrote:
>>
>> [...]
>>
>>>> The question is whether it really makes sense to even have a 
>>>> seqcache_read() path when in reality it’s probably never accessed.  
>>>> I mean, besides the fact that it seems based purely on chance 
>>>> whether a read might fetch something from the cache even while we’re 
>>>> writing, in practice I don’t know any case where we’d write to and 
>>>> read from a compressed qcow2 image at the same time.  (I don’t know 
>>>> what you’re doing with the 'compress' filter, though.)
>>>>
>>>
>>> Note, that for user that's not a parallel write and read to the same 
>>> cluster:
>>>
>>> 1. user writes cluster A, request succeeded, data is in the cache
>>>
>>> 2. user writes some other clusters, cache filled, flush started
>>>
>>> 3. in parallel to [2] user reads cluster A. From the POV of user, 
>>> cluster A is written already, and should be read successfully
>>
>> Yes, but when would that happen?
>>
>>> And seqcache_read() gives a simple non-blocking way to support read 
>>> operation.
>>
>> OK, that makes sense.  We’d need to flush the cache before we can read 
>> anything from the disk, so we should have a read-from-cache branch here.
>>
>>> But rewriting compressed clusters is sensible only when we run real 
>>> guest on compressed image.. Can it be helpful? Maybe for scenarios 
>>> with low disk usage ratio..
>>
>> I’m not sure, but the point is that rewrites are currently not 
>> supported.  The whole compression implementation is mainly tailored 
>> towards just writing a complete image (e.g. by qemu-img convert or the 
>> backup job), so that’s where my question is coming from: It’s 
>> difficult for me to see a currently working use case where you’d read 
>> from and write to a compressed image at the same time.
>>
> 
> External backup works like the following:
> 
>   - start backup sync=none from active disk to temporary disk
>   - export temporary disk through nbd
>   - external tool reads from nbd export
> 
> For this scheme it may make sense to use compression,

Not sure whether it really does, because it’s my impression the 
temporary disk is deleted after the backup, but no matter: You’re right, 
that’s indeed a valid case.

Max



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes
  2021-03-12 18:15   ` Max Reitz
  2021-03-12 18:43     ` Vladimir Sementsov-Ogievskiy
  2021-03-12 18:45     ` Vladimir Sementsov-Ogievskiy
@ 2021-03-29 20:18     ` Vladimir Sementsov-Ogievskiy
  2 siblings, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-29 20:18 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, crosa, ehabkost, kwolf, jsnow

12.03.2021 21:15, Max Reitz wrote:
> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>> Compressed writes are unaligned to 512, which works very slow in
>> O_DIRECT mode. Let's use the cache.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/coroutines.h     |   3 +
>>   block/qcow2.h          |   4 ++
>>   block/qcow2-refcount.c |  10 +++
>>   block/qcow2.c          | 158 ++++++++++++++++++++++++++++++++++++++---
>>   4 files changed, 164 insertions(+), 11 deletions(-)
>>

[..]

>> +static int coroutine_fn
>> +qcow2_co_compressed_flush_one(BlockDriverState *bs, bool unfinished)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    int ret;
>> +    int64_t align = 1;
>> +    int64_t offset, bytes;
>> +    uint8_t *buf;
>> +
>> +    if (!s->compressed_cache) {
>> +        return 0;
>> +    }
>> +
>> +    if (!seqcache_get_next_flush(s->compressed_cache, &offset, &bytes, &buf,
>> +                                 &unfinished))
>> +    {
>> +        return 0;
>> +    }
>> +
>> +    qcow2_inflight_writes_inc(bs, offset, bytes);
>> +
>> +    /*
>> +     * Don't try to align-up unfinished cached cluster, parallel write to it is
>> +     * possible! For finished host clusters data in the tail of the cluster will
>> +     * be never used. So, take some good alignment for speed.
>> +     *
>> +     * Note also, that seqcache guarantees that allocated size of @buf is enough
>> +     * to fill the cluster up to the end, so we are safe to align up with
>> +     * align <= cluster_size.
>> +     */
>> +    if (!unfinished) {
>> +        align = MIN(s->cluster_size,
>> +                    MAX(s->data_file->bs->bl.request_alignment,
>> +                        bdrv_opt_mem_align(bs)));
>> +    }
>> +
> 
> I’d move the pre-write overlap check here, because its purpose is to prevent writes to metadata structures as they are happening, not as they are queued into a cache.  (I’d say.)

Hmm. pre-write overlap check usually done under mutex.. Should I add an additional critical section to do overlap check? I'm not sure.

> 
>> +    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>> +    ret = bdrv_co_pwrite(s->data_file, offset,
>> +                         QEMU_ALIGN_UP(offset + bytes, align) - offset,
> 
> I remember you said you wanted to describe more of the properties of the buffer returned by seqcache_get_next_flush().  That would be nice here, because intuitively one would assume that the buffer is @bytes bytes long, which aligning here will exceed.  (It’s fine, but it would be nicer if there was a comment that assured that it’s fine.)
> 

It's here, read several lines above: "Note also, that..."

>> +                         buf, 0);
>> +    if (ret < 0 && s->compressed_flush_ret == 0) {
>> +        /*
>> +         * The data that was "written" earlier is lost. We don't want to
>> +         * care with storing it somehow to retry flushing later.



-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/6] util: implement seqcache
  2021-03-05 17:35 ` [PATCH v3 4/6] util: implement seqcache Vladimir Sementsov-Ogievskiy
  2021-03-12 13:41   ` Max Reitz
@ 2021-06-04 14:31   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-06-04 14:31 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, crosa, ehabkost, mreitz, kwolf, jsnow

05.03.2021 20:35, Vladimir Sementsov-Ogievskiy wrote:
> Implement cache for small sequential unaligned writes, so that they may
> be cached until we get a complete cluster and then write it.
> 
> The cache is intended to be used for backup to qcow2 compressed target
> opened in O_DIRECT mode, but can be reused for any similar (even not
> block-layer related) task.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   include/qemu/seqcache.h |  42 +++++
>   util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
>   MAINTAINERS             |   6 +
>   util/meson.build        |   1 +
>   4 files changed, 410 insertions(+)
>   create mode 100644 include/qemu/seqcache.h
>   create mode 100644 util/seqcache.c
> 
> diff --git a/include/qemu/seqcache.h b/include/qemu/seqcache.h
> new file mode 100644
> index 0000000000..a308d54b01
> --- /dev/null
> +++ b/include/qemu/seqcache.h
> @@ -0,0 +1,42 @@
> +/*
> + * Cache for small sequential write requests.
> + *
> + * Copyright (c) 2021 Virtuozzo International GmbH.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#ifndef QEMU_SEQCACHE_H
> +#define QEMU_SEQCACHE_H
> +
> +typedef struct SeqCache SeqCache;
> +
> +SeqCache *seqcache_new(int64_t cluster_size);
> +void seqcache_free(SeqCache *s);
> +
> +void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf);
> +int64_t seqcache_read(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf);
> +
> +bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
> +                             uint8_t **buf, bool *unfinished);
> +void seqcache_discard_cluster(SeqCache *s, int64_t offset);
> +
> +uint64_t seqcache_nb_clusters(SeqCache *s);
> +
> +#endif /* QEMU_SEQCACHE_H */
> diff --git a/util/seqcache.c b/util/seqcache.c
> new file mode 100644
> index 0000000000..d923560eab
> --- /dev/null
> +++ b/util/seqcache.c
> @@ -0,0 +1,361 @@
> +/*
> + * Cache for small sequential write requests.
> + *
> + * Copyright (c) 2021 Virtuozzo International GmbH.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + *
> + *
> + * = Description =
> + *
> + * SeqCache is an abbreviation for Sequential Cache.
> + *
> + * The Cache is intended to improve performance of small unaligned sequential
> + * writes. Cache has a cluster_size parameter and the unit of caching is aligned
> + * cluster.  Cache has a list of cached clusters, several "finished" ones and at
> + * most one "unfinished". "unfinished" cluster is a cluster where last byte of
> + * last write operation is cached and there is a free place after that byte to
> + * the end of cluster. "finished" clusters are just stored in cache to be read
> + * or flushed and don't allow any writes to them.
> + *
> + * If write to the cache intersects cluster bounds, it's splat into several
> + * requests by cluster bounds. So, consider a write that doesn't intersect
> + * cluster bounds to describe the whole picture:
> + *
> + * There are two cases allowed:
> + *
> + * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential
> + *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
> + *    filled up to the cluster bound it becomes "finished".
> + *
> + * 2. Write to new cluster, not existing in the cache. It may be sequential to
> + *    previous write or not. Current "unfinshed" cluster (if exists) becomes
> + *    "finished" and new "unfinished" cluster is started. Note also that offset
> + *    of the write to new cluster is not required to be aligned.
> + *
> + *    Any other write operation (non-sequential write to "unfinished" cluster
> + *    or write to any of "finished" clusters) will crash.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qemu/queue.h"
> +#include "qemu/seqcache.h"
> +
> +/*
> + * Cluster
> + *
> + * Representation of one cached cluster, aligned to SeqCache::cluster_size.
> + * Caches only one subregion of the cluster, started at @offset (may be
> + * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
> + * The whole subregion always lay in one aligned cluster:
> + *
> + *      QEMU_ALIGN_DOWN(offset, cluster_size) ==
> + *          QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
> + *
> + * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
> + * allocated @buf length is at least:
> + *
> + *      cluster_size - offset % cluster_size
> + */
> +typedef struct Cluster {
> +   int64_t offset;
> +   int64_t bytes;
> +   uint8_t *buf;
> +   QSIMPLEQ_ENTRY(Cluster) entry;
> +} Cluster;
> +
> +/*
> + * SeqCache caches small sequential writes into "unfinished" @cur_write cluster,
> + * until entire cluster (of @cluster_size bytes) is filled by seqcache_write()
> + * calls.
> + *
> + * @cur_write->offset may be unaligned to @cluster_size if first write offset is
> + * not aligned (for example, if there was a flush request and cache was flushed,
> + * then we continue from the middle of the cluster with an empty cache).
> + *
> + * @cur_write may be NULL, which means we don't have current cluster and next
> + * seqcache_write() will start a new one.
> + *
> + * @all is a list of all clusters cached in the cache, some "finished" and one
> + * "unfinished" @cur_write (if exists). If @cur_write is not NULL it is a last
> + * one in the list.
> + *
> + * @nb_clusters is number of elements in @all list.
> + *
> + * @next_flush is an iterator for flushing. SeqCache knows nothing about how
> + * data should be flushing, so for flush user calls seqcache_get_next_flush()
> + * (which moves @next_flush iterator) and when data is flushed somehow and cache
> + * is not needed anymore, user can call seqcache_discard_cluster().
> + */
> +typedef struct SeqCache {
> +    int64_t cluster_size;
> +    int64_t nb_clusters;
> +    Cluster *cur_write;
> +    Cluster *next_flush;
> +    QSIMPLEQ_HEAD(, Cluster) all;
> +} SeqCache;
> +
> +static void cluster_free(Cluster *req)
> +{
> +    if (req) {
> +        g_free(req->buf);
> +        g_free(req);
> +    }
> +}
> +
> +SeqCache *seqcache_new(int64_t cluster_size)
> +{
> +    SeqCache *s = g_new(SeqCache, 1);
> +
> +    *s = (SeqCache) {
> +        .cluster_size = cluster_size,
> +    };
> +
> +    QSIMPLEQ_INIT(&s->all);
> +
> +    return s;
> +}
> +
> +/*
> + * User should clean the cache calling seqcache_get_next_flush() and
> + * seqcache_discard_cluster() sequentially before freeing it.
> + */
> +void seqcache_free(SeqCache *s)
> +{
> +    if (s) {
> +        assert(QSIMPLEQ_EMPTY(&s->all));
> +        g_free(s);
> +    }
> +}
> +
> +/* End of cached region inside one cluster */
> +static int64_t cached_end(Cluster *cl)
> +{
> +    return cl->offset + cl->bytes;
> +}
> +
> +/* Start of aligned cluster containing the @offset */
> +static int64_t cluster_start(SeqCache *s, int64_t offset)
> +{
> +    return QEMU_ALIGN_DOWN(offset, s->cluster_size);
> +}
> +
> +/* End of aligned cluster containing the @offset */
> +static int64_t cluster_end(SeqCache *s, int64_t offset)
> +{
> +    return cluster_start(s, offset) + s->cluster_size;
> +}
> +
> +/* Align down @offset to s->cluster_size and search for corresponding cluster */
> +static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
> +{
> +    Cluster *cl;
> +    int64_t cl_start = cluster_start(s, offset);
> +
> +    QSIMPLEQ_FOREACH(cl, &s->all, entry) {
> +        if (cluster_start(s, cl->offset) == cl_start) {
> +            return cl;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +/* Makes current "unfinished" cluster the "finished" one. */
> +static void seqcache_finalize_current_cluster(SeqCache *s)
> +{
> +    if (s->cur_write && !s->next_flush) {
> +        s->next_flush = s->cur_write;
> +    }
> +    s->cur_write = NULL;
> +}
> +
> +/*
> + * Write an @offset, @bytes, @buf request into the cache. The requests MUST not
> + * intersect cluster bounds.
> + */
> +static void seqcache_write_one(SeqCache *s, int64_t offset, int64_t bytes,
> +                               uint8_t *buf)
> +{
> +    assert(bytes > 0);
> +    assert(cluster_start(s, offset) == cluster_start(s, offset + bytes - 1));
> +
> +    if (s->cur_write && offset == cached_end(s->cur_write)) {
> +        /* Continue sequential process */
> +        memcpy(s->cur_write->buf + s->cur_write->bytes, buf, bytes);
> +        s->cur_write->bytes += bytes;
> +
> +        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
> +            seqcache_finalize_current_cluster(s);
> +        }
> +
> +        return;
> +    }
> +
> +    /* We are starting a new cluster. Check that it's really new */
> +    assert(!seqcache_find_cluster(s, offset));
> +
> +    seqcache_finalize_current_cluster(s);
> +
> +    s->cur_write = g_new(Cluster, 1);
> +    *s->cur_write = (Cluster) {
> +        .offset = offset,
> +        .bytes = bytes,
> +        .buf = g_malloc(s->cluster_size),
> +    };
> +
> +    memcpy(s->cur_write->buf, buf, bytes);
> +    QSIMPLEQ_INSERT_TAIL(&s->all, s->cur_write, entry);
> +    s->nb_clusters++;

Here, must also do

> +        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
> +            seqcache_finalize_current_cluster(s);
> +        }


otherwise we'll try to continue that new filled buffer by next cluster on next seqcache_write_one(), and buffer is overflowed [faced this bug in our downstream release].


> +}
> +
> +/* Write an @offset, @bytes, @buf request into the cache. */
> +void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)
> +{
> +    while (bytes) {
> +        int64_t cl_end = cluster_end(s, offset);
> +        int64_t chunk = MIN(bytes, cl_end - offset);
> +
> +        seqcache_write_one(s, offset, chunk, buf);
> +        offset += chunk;
> +        bytes -= chunk;
> +        buf += chunk;
> +    }
> +}
> +
> +/*
> + * Read from the cache.
> + *
> + * If @offset misses the cache, return 0.
> + *
> + * If @offset is inside the cache, copy corresponding available data to @buf
> + * (may be less than required @bytes if hole reached earlier) and return number
> + * of bytes copied.
> + */
> +int64_t seqcache_read(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)
> +{
> +    uint8_t *pos = buf;
> +
> +    while (bytes) {
> +        Cluster *cl = seqcache_find_cluster(s, offset);
> +        int64_t chunk;
> +
> +        if (!cl || offset < cl->offset || offset >= cached_end(cl)) {
> +            break;
> +        }
> +
> +        chunk = MIN(bytes, cached_end(cl) - offset);
> +        memcpy(pos, cl->buf + offset - cl->offset, chunk);
> +        offset += chunk;
> +        bytes -= chunk;
> +        pos += chunk;
> +
> +        if (!QEMU_IS_ALIGNED(offset, s->cluster_size)) {
> +            break;
> +        }
> +    }
> +
> +    return pos - buf;
> +}
> +
> +/*
> + * Get next region for flushing. @offset, @bytes and @buf are out-parameters
> + * to return the region.
> + *
> + * @unfinished is in-out argument which means that user is interested in
> + * flushing unfinished cluster too:
> + *
> + * If there are "finished" clusters, "finished" cluster is returned and
> + * *@unfinished is set to false, independently of its original value.
> + *
> + * If there are no "finished" clusters but "unfinished" exists (i.e.
> + * s->cur_write != NULL and it is the only element of s->all), then *@unfinished
> + * value remains the same and the following logic works:
> + *
> + *    If *@unfinished:
> + *       return s->cur_write unfinished cluster for flushing
> + *    Else
> + *       return nothing
> + *
> + *
> + * Returns true and set @offset, @bytes, @buf and @unfinished if there is
> + * something to flush (accordingly to @unfinished value), returns false
> + * otherwise.
> + *
> + * Nothing is removed from the cache.
> + */
> +bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
> +                             uint8_t **buf, bool *unfinished)
> +{
> +    Cluster *req = s->next_flush;
> +
> +    if (s->next_flush) {
> +        *unfinished = false;
> +        req = s->next_flush;
> +        s->next_flush = QSIMPLEQ_NEXT(req, entry);
> +        if (s->next_flush == s->cur_write) {
> +            s->next_flush = NULL;
> +        }
> +    } else if (s->cur_write && *unfinished) {
> +        req = s->cur_write;
> +    } else {
> +        return false;
> +    }
> +
> +    *offset = req->offset;
> +    *bytes = req->bytes;
> +    *buf = req->buf;
> +
> +    return true;
> +}
> +
> +/*
> + * Find corresponding cluster and drop it. No matter does requested @offset is
> + * cached itself or not.
> + */
> +void seqcache_discard_cluster(SeqCache *s, int64_t offset)
> +{
> +    Cluster *cl = seqcache_find_cluster(s, offset);
> +
> +    if (!cl) {
> +        return;
> +    }
> +
> +    if (cl == s->cur_write) {
> +        assert(cl != s->next_flush);
> +        s->cur_write = NULL;
> +    } else if (cl == s->next_flush) {
> +        assert(cl != s->cur_write);
> +        s->next_flush = QSIMPLEQ_NEXT(s->next_flush, entry);
> +        if (s->next_flush == s->cur_write) {
> +            s->next_flush = NULL;
> +        }
> +    }
> +
> +    QSIMPLEQ_REMOVE(&s->all, cl, Cluster, entry);
> +    cluster_free(cl);
> +    s->nb_clusters--;
> +}
> +
> +/* Returns number of cached clusters including unfinished */
> +uint64_t seqcache_nb_clusters(SeqCache *s)
> +{
> +    return s->nb_clusters;
> +}
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9b2aa18e1f..5e263c3343 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3347,3 +3347,9 @@ Performance Tools and Tests
>   M: Ahmed Karaman <ahmedkhaledkaraman@gmail.com>
>   S: Maintained
>   F: scripts/performance/
> +
> +Sequential Cache
> +M: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> +S: Supported
> +F: util/seqcache.c
> +F: include/qemu/seqcache.h
> diff --git a/util/meson.build b/util/meson.build
> index 984fba965f..6682e463ac 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -75,6 +75,7 @@ if have_block
>     util_ss.add(files('block-helpers.c'))
>     util_ss.add(files('qemu-coroutine-sleep.c'))
>     util_ss.add(files('qemu-co-shared-resource.c'))
> +  util_ss.add(files('seqcache.c'))
>     util_ss.add(files('thread-pool.c', 'qemu-timer.c'))
>     util_ss.add(files('readline.c'))
>     util_ss.add(files('throttle.c'))
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2021-06-04 14:32 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-05 17:35 [PATCH v3 0/6] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
2021-03-05 17:35 ` [PATCH v3 1/6] block-jobs: flush target at the end of .run() Vladimir Sementsov-Ogievskiy
2021-03-11 16:57   ` Max Reitz
2021-03-05 17:35 ` [PATCH v3 2/6] iotests: add qcow2-discard-during-rewrite Vladimir Sementsov-Ogievskiy
2021-03-05 17:35 ` [PATCH v3 3/6] block/qcow2: introduce inflight writes counters: fix discard Vladimir Sementsov-Ogievskiy
2021-03-11 19:58   ` Max Reitz
2021-03-12  9:09     ` Vladimir Sementsov-Ogievskiy
2021-03-12 11:17       ` Max Reitz
2021-03-12 12:32         ` Vladimir Sementsov-Ogievskiy
2021-03-12 12:42           ` Vladimir Sementsov-Ogievskiy
2021-03-12 15:01             ` Max Reitz
2021-03-12 12:46           ` Vladimir Sementsov-Ogievskiy
2021-03-12 15:10             ` Max Reitz
2021-03-12 15:24               ` Vladimir Sementsov-Ogievskiy
2021-03-12 15:52                 ` Max Reitz
2021-03-12 16:03                   ` Vladimir Sementsov-Ogievskiy
2021-03-12 14:58           ` Max Reitz
2021-03-12 15:39             ` Vladimir Sementsov-Ogievskiy
2021-03-05 17:35 ` [PATCH v3 4/6] util: implement seqcache Vladimir Sementsov-Ogievskiy
2021-03-12 13:41   ` Max Reitz
2021-03-12 14:37     ` Vladimir Sementsov-Ogievskiy
2021-03-12 15:13       ` Max Reitz
2021-06-04 14:31   ` Vladimir Sementsov-Ogievskiy
2021-03-05 17:35 ` [PATCH v3 5/6] block-coroutine-wrapper: allow non bdrv_ prefix Vladimir Sementsov-Ogievskiy
2021-03-12 16:53   ` Max Reitz
2021-03-05 17:35 ` [PATCH v3 6/6] block/qcow2: use seqcache for compressed writes Vladimir Sementsov-Ogievskiy
2021-03-12 18:15   ` Max Reitz
2021-03-12 18:43     ` Vladimir Sementsov-Ogievskiy
2021-03-15  9:58       ` Max Reitz
2021-03-15 14:40         ` Vladimir Sementsov-Ogievskiy
2021-03-16 12:25           ` Max Reitz
2021-03-16 17:48             ` Vladimir Sementsov-Ogievskiy
2021-03-17  8:09               ` Max Reitz
2021-03-12 18:45     ` Vladimir Sementsov-Ogievskiy
2021-03-29 20:18     ` Vladimir Sementsov-Ogievskiy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.