qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless)
@ 2021-03-26 20:00 Vladimir Sementsov-Ogievskiy
  2021-03-26 20:00 ` [PATCH v5 1/6] iotests: add qcow2-discard-during-rewrite Vladimir Sementsov-Ogievskiy
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-26 20:00 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

Hi all!

It's an alternative lock-less solution to
  [PATCH v4 0/3] qcow2: fix parallel rewrite and discard (rw-lock)

It's an updated version of what was don in
  [PATCH v3 0/6] qcow2: compressed write cache

I've split the logic into several patches, add a lot more comments and
documentation, I hope it's all now a lot more clean. 

Main change from v3 is careful handling "freeing" the cluster. The
concept of "free cluster" is established. And good thing: now updating
inflight-counters it lock-less: we don't need the lock for discard
operation, instead we consider this discard as the last one inflight
write!

The solution is still more complicated then co-rw-lock based one. But it
is cool and I hope someone like it.

RFC: should we protect reads as well of intersecting with discarding
host clusters? I think yes. Reads are not as critical, as they can't
corrupt the metadata or data (due to use-after-free like writes), but
with reads it's possible that guest reads some metadata cluster. It's a
kind of security violation I think.

Vladimir Sementsov-Ogievskiy (6):
  iotests: add qcow2-discard-during-rewrite
  qcow2: fix cache discarding in update_refcount()
  qcow2: introduce is_cluster_free() helper
  qcow2: introduce inflight-write-counters
  qcow2: consider in-flight-writes when freeing clusters
  qcow2: do not discard host clusters during in-flight writes

 block/qcow2.h                                 |  16 ++
 block/qcow2-refcount.c                        | 184 +++++++++++++++++-
 block/qcow2.c                                 |  26 ++-
 .../tests/qcow2-discard-during-rewrite        |  72 +++++++
 .../tests/qcow2-discard-during-rewrite.out    |  21 ++
 5 files changed, 308 insertions(+), 11 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

-- 
2.29.2



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v5 1/6] iotests: add qcow2-discard-during-rewrite
  2021-03-26 20:00 [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless) Vladimir Sementsov-Ogievskiy
@ 2021-03-26 20:00 ` Vladimir Sementsov-Ogievskiy
  2021-03-26 20:00 ` [PATCH v5 2/6] qcow2: fix cache discarding in update_refcount() Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-26 20:00 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

Simple test:
 - start writing to allocated cluster A
 - discard this cluster
 - write to another unallocated cluster B (it's allocated in same place
   where A was allocated)
 - continue writing to A

For now last action pollutes cluster B which is a bug fixed by the
following commit.

For now, add test to "disabled" group, so that it doesn't run
automatically.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 .../tests/qcow2-discard-during-rewrite        | 72 +++++++++++++++++++
 .../tests/qcow2-discard-during-rewrite.out    | 21 ++++++
 2 files changed, 93 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
new file mode 100755
index 0000000000..7f0d8a107a
--- /dev/null
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
@@ -0,0 +1,72 @@
+#!/usr/bin/env bash
+# group: quick disabled
+#
+# Test discarding (and reusing) host cluster during writing data to it.
+#
+# Copyright (c) 2021 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=vsementsov@virtuozzo.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+status=1	# failure is the default!
+
+_cleanup()
+{
+    _cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./../common.rc
+. ./../common.filter
+
+_supported_fmt qcow2
+_supported_proto file fuse
+_supported_os Linux
+
+size=1M
+_make_test_img $size
+
+(
+cat <<EOF
+write -P 1 0 64K
+
+break pwritev A
+aio_write -P 2 0 64K
+wait_break A
+
+discard 0 64K
+write -P 3 128K 64K
+read -P 3 128K 64K
+
+break pwritev_done B
+resume A
+wait_break B
+resume B
+
+read -P 0 0 64K
+read -P 3 128K 64K
+EOF
+) | $QEMU_IO blkdebug::$TEST_IMG | _filter_qemu_io
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out
new file mode 100644
index 0000000000..8e75b2fbff
--- /dev/null
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out
@@ -0,0 +1,21 @@
+QA output created by qcow2-discard-during-rewrite
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Suspended request 'A'
+discard 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Resuming request 'A'
+blkdebug: Suspended request 'B'
+blkdebug: Resuming request 'B'
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+*** done
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 2/6] qcow2: fix cache discarding in update_refcount()
  2021-03-26 20:00 [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless) Vladimir Sementsov-Ogievskiy
  2021-03-26 20:00 ` [PATCH v5 1/6] iotests: add qcow2-discard-during-rewrite Vladimir Sementsov-Ogievskiy
@ 2021-03-26 20:00 ` Vladimir Sementsov-Ogievskiy
  2021-04-07 16:34   ` Alberto Garcia
  2021-03-26 20:00 ` [PATCH v5 3/6] qcow2: introduce is_cluster_free() helper Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-26 20:00 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

Here refcount of cluster at @cluster_offset reached 0, so we "free"
that cluster. Not a cluster at @offset. The thing that save us from the
bug is that L2 tables and refblocks are discarded one by one. Still,
let's be precise.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2-refcount.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8e649b008e..543fcf289c 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -887,14 +887,15 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
             void *table;
 
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
-                                                offset);
+                                                cluster_offset);
             if (table != NULL) {
                 qcow2_cache_put(s->refcount_block_cache, &refcount_block);
                 old_table_index = -1;
                 qcow2_cache_discard(s->refcount_block_cache, table);
             }
 
-            table = qcow2_cache_is_table_offset(s->l2_table_cache, offset);
+            table = qcow2_cache_is_table_offset(s->l2_table_cache,
+                                                cluster_offset);
             if (table != NULL) {
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 3/6] qcow2: introduce is_cluster_free() helper
  2021-03-26 20:00 [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless) Vladimir Sementsov-Ogievskiy
  2021-03-26 20:00 ` [PATCH v5 1/6] iotests: add qcow2-discard-during-rewrite Vladimir Sementsov-Ogievskiy
  2021-03-26 20:00 ` [PATCH v5 2/6] qcow2: fix cache discarding in update_refcount() Vladimir Sementsov-Ogievskiy
@ 2021-03-26 20:00 ` Vladimir Sementsov-Ogievskiy
  2021-03-26 20:00 ` [PATCH v5 4/6] qcow2: introduce inflight-write-counters Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-26 20:00 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

We are going to change the concept of "free host cluster", so let's
clarify it now and add a helper, which we will modify later.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2-refcount.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 543fcf289c..1369724b41 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -962,13 +962,32 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,
 /* cluster allocation functions */
 
 
+/*
+ * Cluster is free when its refcount is 0
+ *
+ * Return < 0 if failed to get refcount
+ *          0 if cluster is not free
+ *          1 if cluster is free
+ */
+static int is_cluster_free(BlockDriverState *bs, int64_t cluster_index)
+{
+    int ret;
+    uint64_t refcount;
+
+    ret = qcow2_get_refcount(bs, cluster_index, &refcount);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return refcount == 0;
+}
 
 /* return < 0 if error */
 static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
                                     uint64_t max)
 {
     BDRVQcow2State *s = bs->opaque;
-    uint64_t i, nb_clusters, refcount;
+    uint64_t i, nb_clusters;
     int ret;
 
     /* We can't allocate clusters if they may still be queued for discard. */
@@ -980,11 +999,11 @@ static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
 retry:
     for(i = 0; i < nb_clusters; i++) {
         uint64_t next_cluster_index = s->free_cluster_index++;
-        ret = qcow2_get_refcount(bs, next_cluster_index, &refcount);
 
+        ret = is_cluster_free(bs, next_cluster_index);
         if (ret < 0) {
             return ret;
-        } else if (refcount != 0) {
+        } else if (!ret) {
             goto retry;
         }
     }
@@ -1031,7 +1050,7 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
                                 int64_t nb_clusters)
 {
     BDRVQcow2State *s = bs->opaque;
-    uint64_t cluster_index, refcount;
+    uint64_t cluster_index;
     uint64_t i;
     int ret;
 
@@ -1044,10 +1063,10 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
         /* Check how many clusters there are free */
         cluster_index = offset >> s->cluster_bits;
         for(i = 0; i < nb_clusters; i++) {
-            ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
+            ret = is_cluster_free(bs, cluster_index++);
             if (ret < 0) {
                 return ret;
-            } else if (refcount != 0) {
+            } else if (!ret) {
                 break;
             }
         }
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 4/6] qcow2: introduce inflight-write-counters
  2021-03-26 20:00 [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless) Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2021-03-26 20:00 ` [PATCH v5 3/6] qcow2: introduce is_cluster_free() helper Vladimir Sementsov-Ogievskiy
@ 2021-03-26 20:00 ` Vladimir Sementsov-Ogievskiy
  2021-03-30 12:05   ` Vladimir Sementsov-Ogievskiy
  2021-03-26 20:00 ` [PATCH v5 5/6] qcow2: consider in-flight-writes when freeing clusters Vladimir Sementsov-Ogievskiy
  2021-03-26 20:00 ` [PATCH v5 6/6] qcow2: do not discard host clusters during in-flight writes Vladimir Sementsov-Ogievskiy
  5 siblings, 1 reply; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-26 20:00 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

We have a bug in qcow2: assume we've started data write into host
cluster A. s->lock is unlocked. During the write the refcount of
cluster A may become zero, cluster may be reallocated for other needs,
and our in-flight write become a use-after-free. More details will be
in the further commit which actually fixes the bug.

For now, let's prepare infrastructure for the following fix. We are
going to track these in-flight data writes. So, we create a hash map

  cluster_index -> Qcow2InFlightRefcount

For now, add only basic structure and simple counting logic. No guest
write is actually counted, we only add infrastructure.
Qcow2InFlightRefcount will be expanded in the following commit, that's
why we need a structure.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.h          | 16 ++++++++
 block/qcow2-refcount.c | 86 ++++++++++++++++++++++++++++++++++++++++++
 block/qcow2.c          |  5 +++
 3 files changed, 107 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index 0fe5f74ed3..b25ef06111 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -420,6 +420,17 @@ typedef struct BDRVQcow2State {
      * is to convert the image with the desired compression type set.
      */
     Qcow2CompressionType compression_type;
+
+    /*
+     * inflight_writes_counters:
+     *   Map cluster index (int64_t) -> Qcow2InFlightWriteCounter
+     *
+     * The map contains entries only for clusters that have in-flight data
+     * (not-metadata) writes. So Qcow2InFlightWriteCounter::inflight_writes_cnt
+     * is always (except for when being removed in update_inflight_write_cnt())
+     * >= 1 for stored elements.
+     */
+    GHashTable *inflight_writes_counters;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
@@ -896,6 +907,11 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
 int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
 int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
 
+void qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
+                               int64_t length);
+void qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+                               int64_t length);
+
 /* qcow2-cluster.c functions */
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                         bool exact_size);
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 1369724b41..eedc83ea4a 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -799,6 +799,92 @@ found:
     }
 }
 
+typedef struct Qcow2InFlightWriteCounter {
+    /*
+     * Number of in-flight writes to the cluster, always > 0, as when it becomes
+     * 0 the entry is removed from s->inflight_writes_counters.
+     */
+    uint64_t inflight_writes_cnt;
+} Qcow2InFlightWriteCounter;
+
+/* Find Qcow2InFlightWriteCounter corresponding to @cluster_index */
+static Qcow2InFlightWriteCounter *find_infl_wr(BDRVQcow2State *s,
+                                               int64_t cluster_index)
+{
+    Qcow2InFlightWriteCounter *infl;
+
+    if (!s->inflight_writes_counters) {
+        return NULL;
+    }
+
+    infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
+
+    if (infl) {
+        assert(infl->inflight_writes_cnt > 0);
+    }
+
+    return infl;
+}
+
+/*
+ * The function is intended to be called with decrease=false before writing
+ * guest data and with decrease=true after write finish.
+ */
+static void coroutine_fn
+update_inflight_write_cnt(BlockDriverState *bs, int64_t offset, int64_t length,
+                          bool decrease)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int64_t start, last, cluster_index;
+
+    start = start_of_cluster(s, offset) >> s->cluster_bits;
+    last = start_of_cluster(s, offset + length - 1) >> s->cluster_bits;
+    for (cluster_index = start; cluster_index <= last; cluster_index++) {
+        Qcow2InFlightWriteCounter *infl = find_infl_wr(s, cluster_index);
+
+        if (!decrease) {
+            if (!infl) {
+                infl = g_new0(Qcow2InFlightWriteCounter, 1);
+                g_hash_table_insert(s->inflight_writes_counters,
+                                    g_memdup(&cluster_index,
+                                             sizeof(cluster_index)), infl);
+            }
+            infl->inflight_writes_cnt++;
+            continue;
+        }
+
+        /* decrease */
+        assert(infl);
+        assert(infl->inflight_writes_cnt >= 1);
+
+        infl->inflight_writes_cnt--;
+
+        if (infl->inflight_writes_cnt == 0) {
+            g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
+        }
+    }
+}
+
+/*
+ * Works both with locked and unlocked s->lock. It just doesn't touch s->lock in
+ * contrast to qcow2_inflight_writes_dec()
+ */
+void qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
+                               int64_t length)
+{
+    update_inflight_write_cnt(bs, offset, length, false);
+}
+
+/*
+ * Called with s->lock not locked by caller. Will take s->lock only if need to
+ * actually discard some clusters.
+ */
+void qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+                               int64_t length)
+{
+    update_inflight_write_cnt(bs, offset, length, true);
+}
+
 /* XXX: cache several refcount block clusters ? */
 /* @addend is the absolute value of the addend; if @decrease is set, @addend
  * will be subtracted from the current refcount, otherwise it will be added */
diff --git a/block/qcow2.c b/block/qcow2.c
index 0db1227ac9..0a5bd4ea4e 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1833,6 +1833,8 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
 #endif
 
     qemu_co_queue_init(&s->thread_task_queue);
+    s->inflight_writes_counters =
+        g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
 
     return ret;
 
@@ -2709,6 +2711,9 @@ static void qcow2_close(BlockDriverState *bs)
     g_free(s->image_backing_file);
     g_free(s->image_backing_format);
 
+    assert(g_hash_table_size(s->inflight_writes_counters) == 0);
+    g_hash_table_unref(s->inflight_writes_counters);
+
     if (has_data_file(bs)) {
         bdrv_unref_child(bs, s->data_file);
         s->data_file = NULL;
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 5/6] qcow2: consider in-flight-writes when freeing clusters
  2021-03-26 20:00 [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless) Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2021-03-26 20:00 ` [PATCH v5 4/6] qcow2: introduce inflight-write-counters Vladimir Sementsov-Ogievskiy
@ 2021-03-26 20:00 ` Vladimir Sementsov-Ogievskiy
  2021-03-26 20:00 ` [PATCH v5 6/6] qcow2: do not discard host clusters during in-flight writes Vladimir Sementsov-Ogievskiy
  5 siblings, 0 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-26 20:00 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

We have a bug in qcow2: assume we've started data write into host
cluster A. s->lock is unlocked. During the write the refcount of
cluster A may become zero, cluster may be reallocated for other needs,
and our in-flight write become a use-after-free.

To fix the bug let's do the following. Or better, let's start from what
we have now:

Now we consider cluster "free", when its refcount is 0. When cluster
becomes "free" we also update s->free_cluster_index and optionally
discard it on bs->file level. These two operations are done in same
s->lock critical section where refcount becomes 0 (and this all is in
update_refcount()). Calling update_refcount() is wrong. It's ofcourse
done sometimes, as not everything is moved to coroutine for now..
Still, it's out of our topic.

Later, we can reallocate "free" cluster in alloc_clusters_noref() and
qcow2_alloc_clusters_at(), where is_cluster_free() is used.

OK, to correctly handle in-flight writes, let's modify a concept of
"free" cluster, so that cluster is "free" when its refcount is 0 and
there no inflight writes.

So, we discard the cluster at bs->file level, update
s->free_cluster_index and allow reallocation only when both refcount
and inflight-write-cnt becomes both zero. It may happen either in
update_refcount() or in update_inflight_write_cnt().

Consider update_refcount() first. Here we update refcounts metadata we
must be under s->lock. So, if we catch a situation when refcount
becomes 0 but there are in-flight writes we change a behavior so that
we don't update s->free_cluster_index, and do not discard the cluster.
This will be done by update_inflight_write_cnt() later. So, we save
needed information into Qcow2InFlightWriteCounter structure, so that
further update_inflight_write_cnt() do not need to load the refcount.

Now what about update_inflight_write_cnt()? Here things become more
interesting, because we can avoid s->lock. If
inflight-write-cnt + refcount is still positive, we don't have any
yield point, just update inflight write counter and we are done.
If inflight-write-cnt becomes 0 and refcount is already 0, we just need
to keep inflight-write-counter=1 during the discard operation (if it
is needed) and then drop the counter and update s->free_cluster_index.

Note, that at this point we only implement the whole infrastructure for
in-flight writes handling. Nobody now call qcow2_inflight_writes_inc()
or qcow2_inflight_writes_dec(). It's a deal of the following patch.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2-refcount.c | 70 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 64 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index eedc83ea4a..9a0d6570a5 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -805,6 +805,15 @@ typedef struct Qcow2InFlightWriteCounter {
      * 0 the entry is removed from s->inflight_writes_counters.
      */
     uint64_t inflight_writes_cnt;
+
+    /* For convenience, keep cluster_index here */
+    int64_t cluster_index;
+
+    /* Cluster refcount is known to be zero */
+    bool refcount_zero;
+
+    /* Cluster refcount was made zero with this discard-type */
+    enum qcow2_discard_type last_discard_type;
 } Qcow2InFlightWriteCounter;
 
 /* Find Qcow2InFlightWriteCounter corresponding to @cluster_index */
@@ -845,6 +854,7 @@ update_inflight_write_cnt(BlockDriverState *bs, int64_t offset, int64_t length,
         if (!decrease) {
             if (!infl) {
                 infl = g_new0(Qcow2InFlightWriteCounter, 1);
+                infl->cluster_index = cluster_index;
                 g_hash_table_insert(s->inflight_writes_counters,
                                     g_memdup(&cluster_index,
                                              sizeof(cluster_index)), infl);
@@ -857,10 +867,40 @@ update_inflight_write_cnt(BlockDriverState *bs, int64_t offset, int64_t length,
         assert(infl);
         assert(infl->inflight_writes_cnt >= 1);
 
-        infl->inflight_writes_cnt--;
+        if (infl->inflight_writes_cnt > 1) {
+            infl->inflight_writes_cnt--;
+            continue;
+        }
 
-        if (infl->inflight_writes_cnt == 0) {
+        if (!infl->refcount_zero) {
+            /*
+             * All in-flight writes are finished, but cluster is still in use,
+             * nothing to do now.
+             */
             g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
+            continue;
+        }
+
+        /*
+         * OK. At this point there no more refcounts and no more in-flight
+         * writes. Cluster is almost free. But we probably want to do one more
+         * in-flight operation: discard. So we keep inflight_writes_cnt = 1
+         * during the discard.
+         */
+        if (s->discard_passthrough[infl->last_discard_type]) {
+            int64_t cluster_offset = cluster_index << s->cluster_bits;
+            if (s->cache_discards) {
+                update_refcount_discard(bs, cluster_offset, s->cluster_size);
+            } else {
+                /* Discard is optional, ignore the return value */
+                bdrv_pdiscard(bs->file, cluster_offset, s->cluster_size);
+            }
+        }
+
+        g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
+
+        if (cluster_index < s->free_cluster_index) {
+            s->free_cluster_index = cluster_index;
         }
     }
 }
@@ -970,6 +1010,7 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         s->set_refcount(refcount_block, block_index, refcount);
 
         if (refcount == 0) {
+            Qcow2InFlightWriteCounter *infl;
             void *table;
 
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
@@ -986,8 +1027,24 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
 
-            if (s->discard_passthrough[type]) {
-                update_refcount_discard(bs, cluster_offset, s->cluster_size);
+            infl = find_infl_wr(s, cluster_index);
+            if (infl) {
+                /*
+                 * Refcount becomes zero, but there are still in-flight writes
+                 * to the cluster. update_inflight_write_cnt() will take care
+                 * of discarding the cluster and updating s->free_cluster_index.
+                 */
+                infl->refcount_zero = true;
+                infl->last_discard_type = type;
+            } else {
+                /* Refcount is zero and no in-fligth writes. Cluster is free. */
+                if (cluster_index < s->free_cluster_index) {
+                    s->free_cluster_index = cluster_index;
+                }
+                if (s->discard_passthrough[type]) {
+                    update_refcount_discard(bs, cluster_offset,
+                                            s->cluster_size);
+                }
             }
         }
     }
@@ -1049,7 +1106,7 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,
 
 
 /*
- * Cluster is free when its refcount is 0
+ * Cluster is free when its refcount is 0 and there is no in-flight writes
  *
  * Return < 0 if failed to get refcount
  *          0 if cluster is not free
@@ -1057,6 +1114,7 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,
  */
 static int is_cluster_free(BlockDriverState *bs, int64_t cluster_index)
 {
+    BDRVQcow2State *s = bs->opaque;
     int ret;
     uint64_t refcount;
 
@@ -1065,7 +1123,7 @@ static int is_cluster_free(BlockDriverState *bs, int64_t cluster_index)
         return ret;
     }
 
-    return refcount == 0;
+    return refcount == 0 && !find_infl_wr(s, cluster_index);
 }
 
 /* return < 0 if error */
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 6/6] qcow2: do not discard host clusters during in-flight writes
  2021-03-26 20:00 [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless) Vladimir Sementsov-Ogievskiy
                   ` (4 preceding siblings ...)
  2021-03-26 20:00 ` [PATCH v5 5/6] qcow2: consider in-flight-writes when freeing clusters Vladimir Sementsov-Ogievskiy
@ 2021-03-26 20:00 ` Vladimir Sementsov-Ogievskiy
  5 siblings, 0 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-26 20:00 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, vsementsov, den

Finally, after all the preparations, when we have the whole
infrastructure of inflight-write-counters prepared in previous commits,
let's fix the following bug:

1. Start write to qcow2. Assume guest cluster G and corresponding host
   cluster is H.

2. The write requests come to the point of data writing to .file. The
   write to .file is started and qcow2 mutex is unlocked.

3. At this time refcount of H becomes 0. For example, it may be due to
   discard operation on qcow2 node, or rewriting compressed data by
   normal write, or some operation with snapshots..

4. Next, some operations occurs and leads to allocation of H for some
   other needs: it may be another write-to-qcow2-node operation, or
   allocation of L2 table or some other data or metadata cluster
   allocation.

5. So, at this point H is used for something other. Assume, L2 table is
   written into H.

6. And now, our write from [2] finishes. And pollutes L2 table in H.
   That's a bug.

To fix the bug we now have inflight-write-counters, which work in a
way that cluster is not "free" (and therefore will not be reused
and we don't fall into use-after-free described above) until both
refcount and inflight-writes-counter are zero for this cluster.

So, we have the only remaining action: actually call
qcow2_inflight_writes_inc() and qcow2_inflight_writes_dec() in
corresponding places.

Note: both functions don't rely on s->lock being held or not. But we
should call qcow2_inflight_writes_inc() in the same s->lock critical
section where we allocated host cluster (or get an offset of existing
cluster from metadata). Otherwise we risk that discard will interrupt
at s->lock unlocking (it should only schedule other coroutines to be
entered on next our yield, but let's be absolutely safe).

Iotest qcow2-discard-during-rewrite is enabled, as it works now.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.c                                 | 21 ++++++++++++++++++-
 .../tests/qcow2-discard-during-rewrite        |  2 +-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 0a5bd4ea4e..d021c9d3ac 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2550,6 +2550,9 @@ out_unlocked:
 
 out_locked:
     qcow2_handle_l2meta(bs, &l2meta, false);
+
+    qcow2_inflight_writes_dec(bs, host_offset, bytes);
+
     qemu_co_mutex_unlock(&s->lock);
 
     qemu_vfree(crypt_buf);
@@ -2609,6 +2612,8 @@ static coroutine_fn int qcow2_co_pwritev_part(
             goto out_locked;
         }
 
+        qcow2_inflight_writes_inc(bs, host_offset, cur_bytes);
+
         qemu_co_mutex_unlock(&s->lock);
 
         if (!aio && cur_bytes != bytes) {
@@ -4099,10 +4104,17 @@ qcow2_co_copy_range_to(BlockDriverState *bs,
             goto fail;
         }
 
+        qcow2_inflight_writes_inc(bs, host_offset, cur_bytes);
+
         qemu_co_mutex_unlock(&s->lock);
+
         ret = bdrv_co_copy_range_to(src, src_offset, s->data_file, host_offset,
                                     cur_bytes, read_flags, write_flags);
+
         qemu_co_mutex_lock(&s->lock);
+
+        qcow2_inflight_writes_dec(bs, host_offset, cur_bytes);
+
         if (ret < 0) {
             goto fail;
         }
@@ -4538,13 +4550,20 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
     }
 
     ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len, true);
-    qemu_co_mutex_unlock(&s->lock);
     if (ret < 0) {
+        qemu_co_mutex_unlock(&s->lock);
         goto fail;
     }
 
+    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
+
+    qemu_co_mutex_unlock(&s->lock);
+
     BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
     ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
+
+    qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
+
     if (ret < 0) {
         goto fail;
     }
diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
index 7f0d8a107a..2e2e0d2cb0 100755
--- a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# group: quick disabled
+# group: quick
 #
 # Test discarding (and reusing) host cluster during writing data to it.
 #
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 4/6] qcow2: introduce inflight-write-counters
  2021-03-26 20:00 ` [PATCH v5 4/6] qcow2: introduce inflight-write-counters Vladimir Sementsov-Ogievskiy
@ 2021-03-30 12:05   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-03-30 12:05 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, mreitz, kwolf, den

26.03.2021 23:00, Vladimir Sementsov-Ogievskiy wrote:
> We have a bug in qcow2: assume we've started data write into host
> cluster A. s->lock is unlocked. During the write the refcount of
> cluster A may become zero, cluster may be reallocated for other needs,
> and our in-flight write become a use-after-free. More details will be
> in the further commit which actually fixes the bug.
> 
> For now, let's prepare infrastructure for the following fix. We are
> going to track these in-flight data writes. So, we create a hash map
> 
>    cluster_index -> Qcow2InFlightRefcount
> 
> For now, add only basic structure and simple counting logic. No guest
> write is actually counted, we only add infrastructure.
> Qcow2InFlightRefcount will be expanded in the following commit, that's
> why we need a structure.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/qcow2.h          | 16 ++++++++
>   block/qcow2-refcount.c | 86 ++++++++++++++++++++++++++++++++++++++++++
>   block/qcow2.c          |  5 +++
>   3 files changed, 107 insertions(+)
> 
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 0fe5f74ed3..b25ef06111 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -420,6 +420,17 @@ typedef struct BDRVQcow2State {
>        * is to convert the image with the desired compression type set.
>        */
>       Qcow2CompressionType compression_type;
> +
> +    /*
> +     * inflight_writes_counters:
> +     *   Map cluster index (int64_t) -> Qcow2InFlightWriteCounter
> +     *
> +     * The map contains entries only for clusters that have in-flight data
> +     * (not-metadata) writes. So Qcow2InFlightWriteCounter::inflight_writes_cnt
> +     * is always (except for when being removed in update_inflight_write_cnt())
> +     * >= 1 for stored elements.
> +     */
> +    GHashTable *inflight_writes_counters;
>   } BDRVQcow2State;
>   
>   typedef struct Qcow2COWRegion {
> @@ -896,6 +907,11 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
>   int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
>   int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
>   
> +void qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
> +                               int64_t length);
> +void qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
> +                               int64_t length);
> +
>   /* qcow2-cluster.c functions */
>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>                           bool exact_size);
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 1369724b41..eedc83ea4a 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -799,6 +799,92 @@ found:
>       }
>   }
>   
> +typedef struct Qcow2InFlightWriteCounter {
> +    /*
> +     * Number of in-flight writes to the cluster, always > 0, as when it becomes
> +     * 0 the entry is removed from s->inflight_writes_counters.
> +     */
> +    uint64_t inflight_writes_cnt;
> +} Qcow2InFlightWriteCounter;
> +
> +/* Find Qcow2InFlightWriteCounter corresponding to @cluster_index */
> +static Qcow2InFlightWriteCounter *find_infl_wr(BDRVQcow2State *s,
> +                                               int64_t cluster_index)
> +{
> +    Qcow2InFlightWriteCounter *infl;
> +
> +    if (!s->inflight_writes_counters) {
> +        return NULL;
> +    }
> +
> +    infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
> +
> +    if (infl) {
> +        assert(infl->inflight_writes_cnt > 0);
> +    }
> +
> +    return infl;
> +}
> +
> +/*
> + * The function is intended to be called with decrease=false before writing
> + * guest data and with decrease=true after write finish.
> + */
> +static void coroutine_fn
> +update_inflight_write_cnt(BlockDriverState *bs, int64_t offset, int64_t length,
> +                          bool decrease)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int64_t start, last, cluster_index;
> +
> +    start = start_of_cluster(s, offset) >> s->cluster_bits;
> +    last = start_of_cluster(s, offset + length - 1) >> s->cluster_bits;
> +    for (cluster_index = start; cluster_index <= last; cluster_index++) {
> +        Qcow2InFlightWriteCounter *infl = find_infl_wr(s, cluster_index);
> +
> +        if (!decrease) {
> +            if (!infl) {
> +                infl = g_new0(Qcow2InFlightWriteCounter, 1);
> +                g_hash_table_insert(s->inflight_writes_counters,
> +                                    g_memdup(&cluster_index,
> +                                             sizeof(cluster_index)), infl);
> +            }
> +            infl->inflight_writes_cnt++;
> +            continue;
> +        }
> +
> +        /* decrease */
> +        assert(infl);
> +        assert(infl->inflight_writes_cnt >= 1);
> +
> +        infl->inflight_writes_cnt--;
> +
> +        if (infl->inflight_writes_cnt == 0) {
> +            g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
> +        }
> +    }
> +}
> +
> +/*
> + * Works both with locked and unlocked s->lock. It just doesn't touch s->lock in
> + * contrast to qcow2_inflight_writes_dec()
> + */
> +void qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
> +                               int64_t length)
> +{
> +    update_inflight_write_cnt(bs, offset, length, false);
> +}
> +
> +/*
> + * Called with s->lock not locked by caller. Will take s->lock only if need to
> + * actually discard some clusters.

outdated comment. actually we don't take s->lock.

> + */
> +void qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
> +                               int64_t length)
> +{
> +    update_inflight_write_cnt(bs, offset, length, true);
> +}
> +
>   /* XXX: cache several refcount block clusters ? */
>   /* @addend is the absolute value of the addend; if @decrease is set, @addend
>    * will be subtracted from the current refcount, otherwise it will be added */
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 0db1227ac9..0a5bd4ea4e 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1833,6 +1833,8 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
>   #endif
>   
>       qemu_co_queue_init(&s->thread_task_queue);
> +    s->inflight_writes_counters =
> +        g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
>   
>       return ret;
>   
> @@ -2709,6 +2711,9 @@ static void qcow2_close(BlockDriverState *bs)
>       g_free(s->image_backing_file);
>       g_free(s->image_backing_format);
>   
> +    assert(g_hash_table_size(s->inflight_writes_counters) == 0);
> +    g_hash_table_unref(s->inflight_writes_counters);
> +
>       if (has_data_file(bs)) {
>           bdrv_unref_child(bs, s->data_file);
>           s->data_file = NULL;
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 2/6] qcow2: fix cache discarding in update_refcount()
  2021-03-26 20:00 ` [PATCH v5 2/6] qcow2: fix cache discarding in update_refcount() Vladimir Sementsov-Ogievskiy
@ 2021-04-07 16:34   ` Alberto Garcia
  0 siblings, 0 replies; 9+ messages in thread
From: Alberto Garcia @ 2021-04-07 16:34 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, den, vsementsov, qemu-devel, mreitz

On Fri 26 Mar 2021 09:00:41 PM CET, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> Here refcount of cluster at @cluster_offset reached 0, so we "free"
> that cluster. Not a cluster at @offset. The thing that save us from the
> bug is that L2 tables and refblocks are discarded one by one. Still,
> let's be precise.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-04-07 16:44 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-26 20:00 [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless) Vladimir Sementsov-Ogievskiy
2021-03-26 20:00 ` [PATCH v5 1/6] iotests: add qcow2-discard-during-rewrite Vladimir Sementsov-Ogievskiy
2021-03-26 20:00 ` [PATCH v5 2/6] qcow2: fix cache discarding in update_refcount() Vladimir Sementsov-Ogievskiy
2021-04-07 16:34   ` Alberto Garcia
2021-03-26 20:00 ` [PATCH v5 3/6] qcow2: introduce is_cluster_free() helper Vladimir Sementsov-Ogievskiy
2021-03-26 20:00 ` [PATCH v5 4/6] qcow2: introduce inflight-write-counters Vladimir Sementsov-Ogievskiy
2021-03-30 12:05   ` Vladimir Sementsov-Ogievskiy
2021-03-26 20:00 ` [PATCH v5 5/6] qcow2: consider in-flight-writes when freeing clusters Vladimir Sementsov-Ogievskiy
2021-03-26 20:00 ` [PATCH v5 6/6] qcow2: do not discard host clusters during in-flight writes Vladimir Sementsov-Ogievskiy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).