qemu-devel.nongnu.org archive mirror
* [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
@ 2019-11-01 10:00 Max Reitz
  2019-11-01 10:00 ` [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas" Max Reitz
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 10:00 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Max Reitz,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi

Hi,

This series builds on the previous RFC.  The workaround is now applied
unconditionally, regardless of AIO mode and filesystem, because we don’t know those
things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
has been moved to block/io.c.

Applying the workaround unconditionally is fine from a performance
standpoint, because it should actually be dead code, thanks to patch 1
(the elephant in the room).  As far as I know, qcow2’s
handle_alloc_space() was the only place where a block driver submitted
zero writes as part of normal I/O, so that they could occur concurrently
with other write requests.  It still makes sense to keep the workaround
in file-posix, because we cannot really prevent some other block driver
from submitting zero writes as part of normal I/O in the future.

Anyway, let’s get to the elephant.

From input by XFS developers
(https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
that c8bb23cbdbe causes fundamental performance problems on XFS with
aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
performance or we wouldn’t have it.

In general, avoiding performance regressions is more important than
improving performance, unless the regressions are just a minor corner
case or insignificant when compared to the improvement.  The XFS
regression is no minor corner case, and it isn’t insignificant.  Laurent
Vivier has found performance to decrease by as much as 88 % (on ppc64le,
fio in a guest with 4k blocks, iodepth=8: from 13.9 MB/s down to 1662 kB/s).

Thus, I believe we should revert the commit for now (and most
importantly for 4.1.1).  We can think about reintroducing it for 5.0,
but that would require more extensive benchmarks on a variety of
systems, and we must see how subclusters change the picture.


I would have liked to do benchmarks myself before making this decision,
but as far as I’m informed, patches for 4.1.1 are to be collected on
Monday, so we need to be quick.


git-backport-diff against the RFC:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/4:[down] 'Revert "qcow2: skip writing zero buffers to empty COW areas"'
002/4:[----] [-C] 'block: Make wait/mark serialising requests public'
003/4:[down] 'block: Add bdrv_co_get_self_request()'
004/4:[0036] [FC] 'block/file-posix: Let post-EOF fallocate serialize'


Max Reitz (4):
  Revert "qcow2: skip writing zero buffers to empty COW areas"
  block: Make wait/mark serialising requests public
  block: Add bdrv_co_get_self_request()
  block/file-posix: Let post-EOF fallocate serialize

 qapi/block-core.json       |  4 +-
 block/qcow2.h              |  6 ---
 include/block/block_int.h  |  4 ++
 block/file-posix.c         | 38 +++++++++++++++++
 block/io.c                 | 42 +++++++++++++------
 block/qcow2-cluster.c      |  2 +-
 block/qcow2.c              | 86 --------------------------------------
 block/trace-events         |  1 -
 tests/qemu-iotests/060     |  7 +---
 tests/qemu-iotests/060.out |  5 +--
 10 files changed, 76 insertions(+), 119 deletions(-)

-- 
2.21.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas"
  2019-11-01 10:00 [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS Max Reitz
@ 2019-11-01 10:00 ` Max Reitz
  2019-11-01 10:22   ` Vladimir Sementsov-Ogievskiy
  2019-11-01 12:40   ` Eric Blake
  2019-11-01 10:00 ` [PATCH for-4.2 2/4] block: Make wait/mark serialising requests public Max Reitz
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 10:00 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Max Reitz,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi

This reverts commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a.

This commit causes fundamental performance problems on XFS (because
fallocate() stalls the AIO pipeline), and as such it is not clear that
we should unconditionally enable this behavior.

We expect subclusters to alleviate the performance penalty of small
writes to newly allocated clusters, so when we get them, the originally
intended performance gain may actually no longer be significant.

If we want to reintroduce something similar to c8bb23cbdbe, it will
require extensive benchmarking on various systems with subclusters
enabled.

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 qapi/block-core.json       |  4 +-
 block/qcow2.h              |  6 ---
 block/qcow2-cluster.c      |  2 +-
 block/qcow2.c              | 86 --------------------------------------
 block/trace-events         |  1 -
 tests/qemu-iotests/060     |  7 +---
 tests/qemu-iotests/060.out |  5 +--
 7 files changed, 4 insertions(+), 107 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index aa97ee2641..f053f15431 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3304,8 +3304,6 @@
 #
 # @cor_write: a write due to copy-on-read (since 2.11)
 #
-# @cluster_alloc_space: an allocation of file space for a cluster (since 4.1)
-#
 # @none: triggers once at creation of the blkdebug node (since 4.1)
 #
 # Since: 2.9
@@ -3326,7 +3324,7 @@
             'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
             'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
             'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
-            'cor_write', 'cluster_alloc_space', 'none'] }
+            'cor_write', 'none'] }
 
 ##
 # @BlkdebugIOType:
diff --git a/block/qcow2.h b/block/qcow2.h
index 601c2e4c82..8166f6e311 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -418,12 +418,6 @@ typedef struct QCowL2Meta
      */
     Qcow2COWRegion cow_end;
 
-    /*
-     * Indicates that COW regions are already handled and do not require
-     * any more processing.
-     */
-    bool skip_cow;
-
     /**
      * The I/O vector with the data from the actual guest write request.
      * If non-NULL, this is meant to be merged together with the data
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 8982b7b762..fbfea8c817 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -809,7 +809,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
     assert(start->nb_bytes + end->nb_bytes <= UINT_MAX - data_bytes);
     assert(start->offset + start->nb_bytes <= end->offset);
 
-    if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
+    if (start->nb_bytes == 0 && end->nb_bytes == 0) {
         return 0;
     }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 7c18721741..17555cb0a1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2274,11 +2274,6 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
             continue;
         }
 
-        /* If COW regions are handled already, skip this too */
-        if (m->skip_cow) {
-            continue;
-        }
-
         /* The data (middle) region must be immediately after the
          * start region */
         if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
@@ -2305,81 +2300,6 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
     return false;
 }
 
-static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
-{
-    int64_t nr;
-    return !bytes ||
-        (!bdrv_is_allocated_above(bs, NULL, false, offset, bytes, &nr) &&
-         nr == bytes);
-}
-
-static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
-{
-    /*
-     * This check is designed for optimization shortcut so it must be
-     * efficient.
-     * Instead of is_zero(), use is_unallocated() as it is faster (but not
-     * as accurate and can result in false negatives).
-     */
-    return is_unallocated(bs, m->offset + m->cow_start.offset,
-                          m->cow_start.nb_bytes) &&
-           is_unallocated(bs, m->offset + m->cow_end.offset,
-                          m->cow_end.nb_bytes);
-}
-
-static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
-{
-    BDRVQcow2State *s = bs->opaque;
-    QCowL2Meta *m;
-
-    if (!(s->data_file->bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK)) {
-        return 0;
-    }
-
-    if (bs->encrypted) {
-        return 0;
-    }
-
-    for (m = l2meta; m != NULL; m = m->next) {
-        int ret;
-
-        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
-            continue;
-        }
-
-        if (!is_zero_cow(bs, m)) {
-            continue;
-        }
-
-        /*
-         * instead of writing zero COW buffers,
-         * efficiently zero out the whole clusters
-         */
-
-        ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
-                                            m->nb_clusters * s->cluster_size,
-                                            true);
-        if (ret < 0) {
-            return ret;
-        }
-
-        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
-        ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
-                                    m->nb_clusters * s->cluster_size,
-                                    BDRV_REQ_NO_FALLBACK);
-        if (ret < 0) {
-            if (ret != -ENOTSUP && ret != -EAGAIN) {
-                return ret;
-            }
-            continue;
-        }
-
-        trace_qcow2_skip_cow(qemu_coroutine_self(), m->offset, m->nb_clusters);
-        m->skip_cow = true;
-    }
-    return 0;
-}
-
 /*
  * qcow2_co_pwritev_task
  * Called with s->lock unlocked
@@ -2421,12 +2341,6 @@ static coroutine_fn int qcow2_co_pwritev_task(BlockDriverState *bs,
         qiov_offset = 0;
     }
 
-    /* Try to efficiently initialize the physical space with zeroes */
-    ret = handle_alloc_space(bs, l2meta);
-    if (ret < 0) {
-        goto out_unlocked;
-    }
-
     /*
      * If we need to do COW, check if it's possible to merge the
      * writing of the guest data together with that of the COW regions.
diff --git a/block/trace-events b/block/trace-events
index 6ba86decca..c615b26d71 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -72,7 +72,6 @@ qcow2_writev_done_part(void *co, int cur_bytes) "co %p cur_bytes %d"
 qcow2_writev_data(void *co, uint64_t offset) "co %p offset 0x%" PRIx64
 qcow2_pwrite_zeroes_start_req(void *co, int64_t offset, int count) "co %p offset 0x%" PRIx64 " count %d"
 qcow2_pwrite_zeroes(void *co, int64_t offset, int count) "co %p offset 0x%" PRIx64 " count %d"
-qcow2_skip_cow(void *co, uint64_t offset, int nb_clusters) "co %p offset 0x%" PRIx64 " nb_clusters %d"
 
 # qcow2-cluster.c
 qcow2_alloc_clusters_offset(void *co, uint64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
index b91d8321bb..89e911400c 100755
--- a/tests/qemu-iotests/060
+++ b/tests/qemu-iotests/060
@@ -150,15 +150,10 @@ $QEMU_IO -c "$OPEN_RO" -c "read -P 1 0 512" | _filter_qemu_io
 echo
 echo "=== Testing overlap while COW is in flight ==="
 echo
-BACKING_IMG=$TEST_IMG.base
-TEST_IMG=$BACKING_IMG _make_test_img 1G
-
-$QEMU_IO -c 'write 0k 64k' "$BACKING_IMG" | _filter_qemu_io
-
 # compat=0.10 is required in order to make the following discard actually
 # unallocate the sector rather than make it a zero sector - we want COW, after
 # all.
-IMGOPTS='compat=0.10' _make_test_img -b "$BACKING_IMG" 1G
+IMGOPTS='compat=0.10' _make_test_img 1G
 # Write two clusters, the second one enforces creation of an L2 table after
 # the first data cluster.
 $QEMU_IO -c 'write 0k 64k' -c 'write 512M 64k' "$TEST_IMG" | _filter_qemu_io
diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index 0f6b0658a1..e42bf8c5a9 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -97,10 +97,7 @@ read 512/512 bytes at offset 0
 
 === Testing overlap while COW is in flight ===
 
-Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=1073741824
-wrote 65536/65536 bytes at offset 0
-64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 backing_file=TEST_DIR/t.IMGFMT.base
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 wrote 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 65536/65536 bytes at offset 536870912
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH for-4.2 2/4] block: Make wait/mark serialising requests public
  2019-11-01 10:00 [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS Max Reitz
  2019-11-01 10:00 ` [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas" Max Reitz
@ 2019-11-01 10:00 ` Max Reitz
  2019-11-01 10:00 ` [PATCH for-4.2 3/4] block: Add bdrv_co_get_self_request() Max Reitz
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 10:00 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Max Reitz,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi

Make both bdrv_mark_request_serialising() and
bdrv_wait_serialising_requests() public so they can be used from block
drivers.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h |  3 +++
 block/io.c                | 24 ++++++++++++------------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 02dc0034a2..32fa323b63 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -999,6 +999,9 @@ extern unsigned int bdrv_drain_all_count;
 void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
 void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
 
+bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self);
+void bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align);
+
 int get_tmp_filename(char *filename, int size);
 BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size,
                             const char *filename);
diff --git a/block/io.c b/block/io.c
index 02659f994d..039c0d49c9 100644
--- a/block/io.c
+++ b/block/io.c
@@ -715,7 +715,7 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
     qemu_co_mutex_unlock(&bs->reqs_lock);
 }
 
-static void mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
+void bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
 {
     int64_t overlap_offset = req->offset & ~(align - 1);
     uint64_t overlap_bytes = ROUND_UP(req->offset + req->bytes, align)
@@ -805,7 +805,7 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
     bdrv_wakeup(bs);
 }
 
-static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self)
 {
     BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
@@ -1437,14 +1437,14 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
          * with each other for the same cluster.  For example, in copy-on-read
          * it ensures that the CoR read and write operations are atomic and
          * guest writes cannot interleave between them. */
-        mark_request_serialising(req, bdrv_get_cluster_size(bs));
+        bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
     }
 
     /* BDRV_REQ_SERIALISING is only for write operation */
     assert(!(flags & BDRV_REQ_SERIALISING));
 
     if (!(flags & BDRV_REQ_NO_SERIALISING)) {
-        wait_serialising_requests(req);
+        bdrv_wait_serialising_requests(req);
     }
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
@@ -1841,10 +1841,10 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
     assert(!(flags & ~BDRV_REQ_MASK));
 
     if (flags & BDRV_REQ_SERIALISING) {
-        mark_request_serialising(req, bdrv_get_cluster_size(bs));
+        bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
     }
 
-    waited = wait_serialising_requests(req);
+    waited = bdrv_wait_serialising_requests(req);
 
     assert(!waited || !req->serialising ||
            is_request_serialising_and_aligned(req));
@@ -2008,8 +2008,8 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
 
     padding = bdrv_init_padding(bs, offset, bytes, &pad);
     if (padding) {
-        mark_request_serialising(req, align);
-        wait_serialising_requests(req);
+        bdrv_mark_request_serialising(req, align);
+        bdrv_wait_serialising_requests(req);
 
         bdrv_padding_rmw_read(child, req, &pad, true);
 
@@ -2111,8 +2111,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
     }
 
     if (bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad)) {
-        mark_request_serialising(&req, align);
-        wait_serialising_requests(&req);
+        bdrv_mark_request_serialising(&req, align);
+        bdrv_wait_serialising_requests(&req);
         bdrv_padding_rmw_read(child, &req, &pad, false);
     }
 
@@ -3205,7 +3205,7 @@ static int coroutine_fn bdrv_co_copy_range_internal(
         /* BDRV_REQ_SERIALISING is only for write operation */
         assert(!(read_flags & BDRV_REQ_SERIALISING));
         if (!(read_flags & BDRV_REQ_NO_SERIALISING)) {
-            wait_serialising_requests(&req);
+            bdrv_wait_serialising_requests(&req);
         }
 
         ret = src->bs->drv->bdrv_co_copy_range_from(src->bs,
@@ -3336,7 +3336,7 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
      * new area, we need to make sure that no write requests are made to it
      * concurrently or they might be overwritten by preallocation. */
     if (new_bytes) {
-        mark_request_serialising(&req, 1);
+        bdrv_mark_request_serialising(&req, 1);
     }
     if (bs->read_only) {
         error_setg(errp, "Image is read-only");
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH for-4.2 3/4] block: Add bdrv_co_get_self_request()
  2019-11-01 10:00 [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS Max Reitz
  2019-11-01 10:00 ` [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas" Max Reitz
  2019-11-01 10:00 ` [PATCH for-4.2 2/4] block: Make wait/mark serialising requests public Max Reitz
@ 2019-11-01 10:00 ` Max Reitz
  2019-11-01 10:00 ` [PATCH for-4.2 4/4] block/file-posix: Let post-EOF fallocate serialize Max Reitz
  2019-11-01 10:20 ` [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS Max Reitz
  4 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 10:00 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Max Reitz,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h |  1 +
 block/io.c                | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 32fa323b63..4fc531f9b2 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1001,6 +1001,7 @@ void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
 
 bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self);
 void bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align);
+BdrvTrackedRequest *bdrv_co_get_self_request(BlockDriverState *bs);
 
 int get_tmp_filename(char *filename, int size);
 BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size,
diff --git a/block/io.c b/block/io.c
index 039c0d49c9..e9205994a8 100644
--- a/block/io.c
+++ b/block/io.c
@@ -742,6 +742,24 @@ static bool is_request_serialising_and_aligned(BdrvTrackedRequest *req)
            (req->bytes == req->overlap_bytes);
 }
 
+/**
+ * Return the tracked request on @bs for the current coroutine, or
+ * NULL if there is none.
+ */
+BdrvTrackedRequest *bdrv_co_get_self_request(BlockDriverState *bs)
+{
+    BdrvTrackedRequest *req;
+    Coroutine *self = qemu_coroutine_self();
+
+    QLIST_FOREACH(req, &bs->tracked_requests, list) {
+        if (req->co == self) {
+            return req;
+        }
+    }
+
+    return NULL;
+}
+
 /**
  * Round a region to cluster boundaries
  */
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH for-4.2 4/4] block/file-posix: Let post-EOF fallocate serialize
  2019-11-01 10:00 [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS Max Reitz
                   ` (2 preceding siblings ...)
  2019-11-01 10:00 ` [PATCH for-4.2 3/4] block: Add bdrv_co_get_self_request() Max Reitz
@ 2019-11-01 10:00 ` Max Reitz
  2019-11-01 10:20 ` [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS Max Reitz
  4 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 10:00 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Max Reitz,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi

The XFS kernel driver has a bug that may cause data loss when
fallocate() is used in the I/O path (i.e. for writing zeroes).  We can
work around it by treating post-EOF fallocate() calls as serializing up
to infinity (INT64_MAX in practice).

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/file-posix.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 0b7e904d48..d5460f3e45 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2721,6 +2721,44 @@ raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int bytes,
     RawPosixAIOData acb;
     ThreadPoolFunc *handler;
 
+#ifdef CONFIG_FALLOCATE
+    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
+        BdrvTrackedRequest *req;
+        uint64_t end;
+
+        /*
+         * This is a workaround for a bug in the Linux XFS driver,
+         * where writes submitted through the AIO interface will be
+         * discarded if they happen beyond a concurrently running
+         * fallocate() that increases the file length (i.e., both the
+         * write and the fallocate() happen beyond the EOF).
+         *
+         * To work around it, we extend the tracked request for this
+         * zero write until INT64_MAX (effectively infinity), and mark
+         * it as serializing.
+         *
+         * We have to enable this workaround for all filesystems and
+         * AIO modes (not just XFS with aio=native), because for
+         * remote filesystems we do not know the host configuration.
+         * However, this should have no effect as long as no block
+         * driver submits zero writes beyond the EOF.
+         */
+
+        req = bdrv_co_get_self_request(bs);
+        assert(req);
+        assert(req->type == BDRV_TRACKED_WRITE);
+        assert(req->offset <= offset);
+        assert(req->offset + req->bytes >= offset + bytes);
+
+        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
+        req->bytes = end - req->offset;
+        req->overlap_bytes = req->bytes;
+
+        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
+        bdrv_wait_serialising_requests(req);
+    }
+#endif
+
     acb = (RawPosixAIOData) {
         .bs             = bs,
         .aio_fildes     = s->fd,
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 10:00 [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS Max Reitz
                   ` (3 preceding siblings ...)
  2019-11-01 10:00 ` [PATCH for-4.2 4/4] block/file-posix: Let post-EOF fallocate serialize Max Reitz
@ 2019-11-01 10:20 ` Max Reitz
  2019-11-01 10:28   ` Vladimir Sementsov-Ogievskiy
  4 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2019-11-01 10:20 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Vladimir Sementsov-Ogievskiy,
	qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 3922 bytes --]

On 01.11.19 11:00, Max Reitz wrote:
> Hi,
> 
> This series builds on the previous RFC.  The workaround is now applied
> unconditionally of AIO mode and filesystem because we don’t know those
> things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
> has been moved to block/io.c.
> 
> Applying the workaround unconditionally is fine from a performance
> standpoint, because it should actually be dead code, thanks to patch 1
> (the elephant in the room).  As far as I know, there is no other block
> driver but qcow2 in handle_alloc_space() that would submit zero writes
> as part of normal I/O so it can occur concurrently to other write
> requests.  It still makes sense to take the workaround for file-posix
> because we can’t really prevent that any other block driver will submit
> zero writes as part of normal I/O in the future.
> 
> Anyway, let’s get to the elephant.
> 
> From input by XFS developers
> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
> that c8bb23cbdbe causes fundamental performance problems on XFS with
> aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
> performance or we wouldn’t have it.
> 
> In general, avoiding performance regressions is more important than
> improving performance, unless the regressions are just a minor corner
> case or insignificant when compared to the improvement.  The XFS
> regression is no minor corner case, and it isn’t insignificant.  Laurent
> Vivier has found performance to decrease by as much as 88 % (on ppc64le,
> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s).

Ah, crap.

I wanted to send this series as early today as possible to get as much
feedback as possible, so I’ve only started doing benchmarks now.

The obvious

$ qemu-img bench -t none -n -w -S 65536 test.qcow2

on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
c8bb23cbdbe reverted.  So now on to guest tests...
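
(Rough reproduction sketch, in case anyone wants to try this themselves;
the image name and size are placeholders, and the default 64k cluster
size is assumed so that -S 65536 hits each cluster exactly once:

$ qemu-img create -f qcow2 test.qcow2 16G
$ qemu-img bench -t none -n -w -S 65536 test.qcow2

That is, 4k writes with cache=none and aio=native, stepping forward one
cluster per request, so every write allocates a new cluster.)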

(Well, and the question is still where the performance regression on
ppc64 comes from.)

Max

> Thus, I believe we should revert the commit for now (and most
> importantly for 4.1.1).  We can think about reintroducing it for 5.0,
> but that would require more extensive benchmarks on a variety of
> systems, and we must see how subclusters change the picture.
> 
> 
> I would have liked to do benchmarks myself before making this decision,
> but as far as I’m informed, patches for 4.1.1 are to be collected on
> Monday, so we need to be quick.
> 
> 
> git-backport-diff against the RFC:
> 
> Key:
> [----] : patches are identical
> [####] : number of functional differences between upstream/downstream patch
> [down] : patch is downstream-only
> The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively
> 
> 001/4:[down] 'Revert "qcow2: skip writing zero buffers to empty COW areas"'
> 002/4:[----] [-C] 'block: Make wait/mark serialising requests public'
> 003/4:[down] 'block: Add bdrv_co_get_self_request()'
> 004/4:[0036] [FC] 'block/file-posix: Let post-EOF fallocate serialize'
> 
> 
> Max Reitz (4):
>   Revert "qcow2: skip writing zero buffers to empty COW areas"
>   block: Make wait/mark serialising requests public
>   block: Add bdrv_co_get_self_request()
>   block/file-posix: Let post-EOF fallocate serialize
> 
>  qapi/block-core.json       |  4 +-
>  block/qcow2.h              |  6 ---
>  include/block/block_int.h  |  4 ++
>  block/file-posix.c         | 38 +++++++++++++++++
>  block/io.c                 | 42 +++++++++++++------
>  block/qcow2-cluster.c      |  2 +-
>  block/qcow2.c              | 86 --------------------------------------
>  block/trace-events         |  1 -
>  tests/qemu-iotests/060     |  7 +---
>  tests/qemu-iotests/060.out |  5 +--
>  10 files changed, 76 insertions(+), 119 deletions(-)
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas"
  2019-11-01 10:00 ` [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas" Max Reitz
@ 2019-11-01 10:22   ` Vladimir Sementsov-Ogievskiy
  2019-11-01 12:40   ` Eric Blake
  1 sibling, 0 replies; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-01 10:22 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi

01.11.2019 13:00, Max Reitz wrote:
> This reverts commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a.
> 
> This commit causes fundamental performance problems on XFS (because
> fallocate() stalls the AIO pipeline), and as such it is not clear that
> we should unconditionally enable this behavior.
> 
> We expect subclusters to alleviate the performance penalty of small
> writes to newly allocated clusters, so when we get them, the originally
> intended performance gain may actually no longer be significant.
> 
> If we want to reintroduce something similar to c8bb23cbdbe, it will
> require extensive benchmarking on various systems with subclusters
> enabled.
> 
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Max Reitz <mreitz@redhat.com>

It's sad, but OK:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

> ---
>   qapi/block-core.json       |  4 +-
>   block/qcow2.h              |  6 ---
>   block/qcow2-cluster.c      |  2 +-
>   block/qcow2.c              | 86 --------------------------------------
>   block/trace-events         |  1 -
>   tests/qemu-iotests/060     |  7 +---
>   tests/qemu-iotests/060.out |  5 +--
>   7 files changed, 4 insertions(+), 107 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index aa97ee2641..f053f15431 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -3304,8 +3304,6 @@
>   #
>   # @cor_write: a write due to copy-on-read (since 2.11)
>   #
> -# @cluster_alloc_space: an allocation of file space for a cluster (since 4.1)
> -#
>   # @none: triggers once at creation of the blkdebug node (since 4.1)
>   #
>   # Since: 2.9
> @@ -3326,7 +3324,7 @@
>               'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
>               'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
>               'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
> -            'cor_write', 'cluster_alloc_space', 'none'] }
> +            'cor_write', 'none'] }
>   
>   ##
>   # @BlkdebugIOType:
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 601c2e4c82..8166f6e311 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -418,12 +418,6 @@ typedef struct QCowL2Meta
>        */
>       Qcow2COWRegion cow_end;
>   
> -    /*
> -     * Indicates that COW regions are already handled and do not require
> -     * any more processing.
> -     */
> -    bool skip_cow;
> -
>       /**
>        * The I/O vector with the data from the actual guest write request.
>        * If non-NULL, this is meant to be merged together with the data
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 8982b7b762..fbfea8c817 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -809,7 +809,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
>       assert(start->nb_bytes + end->nb_bytes <= UINT_MAX - data_bytes);
>       assert(start->offset + start->nb_bytes <= end->offset);
>   
> -    if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
> +    if (start->nb_bytes == 0 && end->nb_bytes == 0) {
>           return 0;
>       }
>   
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 7c18721741..17555cb0a1 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2274,11 +2274,6 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>               continue;
>           }
>   
> -        /* If COW regions are handled already, skip this too */
> -        if (m->skip_cow) {
> -            continue;
> -        }
> -
>           /* The data (middle) region must be immediately after the
>            * start region */
>           if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
> @@ -2305,81 +2300,6 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>       return false;
>   }
>   
> -static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
> -{
> -    int64_t nr;
> -    return !bytes ||
> -        (!bdrv_is_allocated_above(bs, NULL, false, offset, bytes, &nr) &&
> -         nr == bytes);
> -}
> -
> -static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
> -{
> -    /*
> -     * This check is designed for optimization shortcut so it must be
> -     * efficient.
> -     * Instead of is_zero(), use is_unallocated() as it is faster (but not
> -     * as accurate and can result in false negatives).
> -     */
> -    return is_unallocated(bs, m->offset + m->cow_start.offset,
> -                          m->cow_start.nb_bytes) &&
> -           is_unallocated(bs, m->offset + m->cow_end.offset,
> -                          m->cow_end.nb_bytes);
> -}
> -
> -static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> -{
> -    BDRVQcow2State *s = bs->opaque;
> -    QCowL2Meta *m;
> -
> -    if (!(s->data_file->bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK)) {
> -        return 0;
> -    }
> -
> -    if (bs->encrypted) {
> -        return 0;
> -    }
> -
> -    for (m = l2meta; m != NULL; m = m->next) {
> -        int ret;
> -
> -        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
> -            continue;
> -        }
> -
> -        if (!is_zero_cow(bs, m)) {
> -            continue;
> -        }
> -
> -        /*
> -         * instead of writing zero COW buffers,
> -         * efficiently zero out the whole clusters
> -         */
> -
> -        ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
> -                                            m->nb_clusters * s->cluster_size,
> -                                            true);
> -        if (ret < 0) {
> -            return ret;
> -        }
> -
> -        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
> -        ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
> -                                    m->nb_clusters * s->cluster_size,
> -                                    BDRV_REQ_NO_FALLBACK);
> -        if (ret < 0) {
> -            if (ret != -ENOTSUP && ret != -EAGAIN) {
> -                return ret;
> -            }
> -            continue;
> -        }
> -
> -        trace_qcow2_skip_cow(qemu_coroutine_self(), m->offset, m->nb_clusters);
> -        m->skip_cow = true;
> -    }
> -    return 0;
> -}
> -
>   /*
>    * qcow2_co_pwritev_task
>    * Called with s->lock unlocked
> @@ -2421,12 +2341,6 @@ static coroutine_fn int qcow2_co_pwritev_task(BlockDriverState *bs,
>           qiov_offset = 0;
>       }
>   
> -    /* Try to efficiently initialize the physical space with zeroes */
> -    ret = handle_alloc_space(bs, l2meta);
> -    if (ret < 0) {
> -        goto out_unlocked;
> -    }
> -
>       /*
>        * If we need to do COW, check if it's possible to merge the
>        * writing of the guest data together with that of the COW regions.
> diff --git a/block/trace-events b/block/trace-events
> index 6ba86decca..c615b26d71 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -72,7 +72,6 @@ qcow2_writev_done_part(void *co, int cur_bytes) "co %p cur_bytes %d"
>   qcow2_writev_data(void *co, uint64_t offset) "co %p offset 0x%" PRIx64
>   qcow2_pwrite_zeroes_start_req(void *co, int64_t offset, int count) "co %p offset 0x%" PRIx64 " count %d"
>   qcow2_pwrite_zeroes(void *co, int64_t offset, int count) "co %p offset 0x%" PRIx64 " count %d"
> -qcow2_skip_cow(void *co, uint64_t offset, int nb_clusters) "co %p offset 0x%" PRIx64 " nb_clusters %d"
>   
>   # qcow2-cluster.c
>   qcow2_alloc_clusters_offset(void *co, uint64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
> diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
> index b91d8321bb..89e911400c 100755
> --- a/tests/qemu-iotests/060
> +++ b/tests/qemu-iotests/060
> @@ -150,15 +150,10 @@ $QEMU_IO -c "$OPEN_RO" -c "read -P 1 0 512" | _filter_qemu_io
>   echo
>   echo "=== Testing overlap while COW is in flight ==="
>   echo
> -BACKING_IMG=$TEST_IMG.base
> -TEST_IMG=$BACKING_IMG _make_test_img 1G
> -
> -$QEMU_IO -c 'write 0k 64k' "$BACKING_IMG" | _filter_qemu_io
> -
>   # compat=0.10 is required in order to make the following discard actually
>   # unallocate the sector rather than make it a zero sector - we want COW, after
>   # all.
> -IMGOPTS='compat=0.10' _make_test_img -b "$BACKING_IMG" 1G
> +IMGOPTS='compat=0.10' _make_test_img 1G
>   # Write two clusters, the second one enforces creation of an L2 table after
>   # the first data cluster.
>   $QEMU_IO -c 'write 0k 64k' -c 'write 512M 64k' "$TEST_IMG" | _filter_qemu_io
> diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
> index 0f6b0658a1..e42bf8c5a9 100644
> --- a/tests/qemu-iotests/060.out
> +++ b/tests/qemu-iotests/060.out
> @@ -97,10 +97,7 @@ read 512/512 bytes at offset 0
>   
>   === Testing overlap while COW is in flight ===
>   
> -Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=1073741824
> -wrote 65536/65536 bytes at offset 0
> -64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824 backing_file=TEST_DIR/t.IMGFMT.base
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
>   wrote 65536/65536 bytes at offset 0
>   64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   wrote 65536/65536 bytes at offset 536870912
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 10:20 ` [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS Max Reitz
@ 2019-11-01 10:28   ` Vladimir Sementsov-Ogievskiy
  2019-11-01 11:12     ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-01 10:28 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi

01.11.2019 13:20, Max Reitz wrote:
> On 01.11.19 11:00, Max Reitz wrote:
>> Hi,
>>
>> This series builds on the previous RFC.  The workaround is now applied
>> unconditionally of AIO mode and filesystem because we don’t know those
>> things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
>> has been moved to block/io.c.
>>
>> Applying the workaround unconditionally is fine from a performance
>> standpoint, because it should actually be dead code, thanks to patch 1
>> (the elephant in the room).  As far as I know, there is no other block
>> driver but qcow2 in handle_alloc_space() that would submit zero writes
>> as part of normal I/O so it can occur concurrently to other write
>> requests.  It still makes sense to take the workaround for file-posix
>> because we can’t really prevent that any other block driver will submit
>> zero writes as part of normal I/O in the future.
>>
>> Anyway, let’s get to the elephant.
>>
>>  From input by XFS developers
>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
>> that c8bb23cbdbe causes fundamental performance problems on XFS with
>> aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
>> performance or we wouldn’t have it.
>>
>> In general, avoiding performance regressions is more important than
>> improving performance, unless the regressions are just a minor corner
>> case or insignificant when compared to the improvement.  The XFS
>> regression is no minor corner case, and it isn’t insignificant.  Laurent
>> Vivier has found performance to decrease by as much as 88 % (on ppc64le,
>> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s).
> 
> Ah, crap.
> 
> I wanted to send this series as early today as possible to get as much
> feedback as possible, so I’ve only started doing benchmarks now.
> 
> The obvious
> 
> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
> 
> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
> c8bb23cbdbe reverted.  So now on to guest tests...

Aha, that's very interesting :)  What about aio=native, which is the mode that
should be slowed down?  Can it be tested like this?

> 
> (Well, and the question is still where the performance regression on
> ppc64 comes from.)
> 
> Max
> 
>> Thus, I believe we should revert the commit for now (and most
>> importantly for 4.1.1).  We can think about reintroducing it for 5.0,
>> but that would require more extensive benchmarks on a variety of
>> systems, and we must see how subclusters change the picture.
>>
>>
>> I would have liked to do benchmarks myself before making this decision,
>> but as far as I’m informed, patches for 4.1.1 are to be collected on
>> Monday, so we need to be quick.
>>
>>
>> git-backport-diff against the RFC:
>>
>> Key:
>> [----] : patches are identical
>> [####] : number of functional differences between upstream/downstream patch
>> [down] : patch is downstream-only
>> The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively
>>
>> 001/4:[down] 'Revert "qcow2: skip writing zero buffers to empty COW areas"'
>> 002/4:[----] [-C] 'block: Make wait/mark serialising requests public'
>> 003/4:[down] 'block: Add bdrv_co_get_self_request()'
>> 004/4:[0036] [FC] 'block/file-posix: Let post-EOF fallocate serialize'
>>
>>
>> Max Reitz (4):
>>    Revert "qcow2: skip writing zero buffers to empty COW areas"
>>    block: Make wait/mark serialising requests public
>>    block: Add bdrv_co_get_self_request()
>>    block/file-posix: Let post-EOF fallocate serialize
>>
>>   qapi/block-core.json       |  4 +-
>>   block/qcow2.h              |  6 ---
>>   include/block/block_int.h  |  4 ++
>>   block/file-posix.c         | 38 +++++++++++++++++
>>   block/io.c                 | 42 +++++++++++++------
>>   block/qcow2-cluster.c      |  2 +-
>>   block/qcow2.c              | 86 --------------------------------------
>>   block/trace-events         |  1 -
>>   tests/qemu-iotests/060     |  7 +---
>>   tests/qemu-iotests/060.out |  5 +--
>>   10 files changed, 76 insertions(+), 119 deletions(-)
>>
> 
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 10:28   ` Vladimir Sementsov-Ogievskiy
@ 2019-11-01 11:12     ` Max Reitz
  2019-11-01 11:16       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2019-11-01 11:12 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 2711 bytes --]

On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
> 01.11.2019 13:20, Max Reitz wrote:
>> On 01.11.19 11:00, Max Reitz wrote:
>>> Hi,
>>>
>>> This series builds on the previous RFC.  The workaround is now applied
>>> unconditionally of AIO mode and filesystem because we don’t know those
>>> things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
>>> has been moved to block/io.c.
>>>
>>> Applying the workaround unconditionally is fine from a performance
>>> standpoint, because it should actually be dead code, thanks to patch 1
>>> (the elephant in the room).  As far as I know, there is no other block
>>> driver but qcow2 in handle_alloc_space() that would submit zero writes
>>> as part of normal I/O so it can occur concurrently to other write
>>> requests.  It still makes sense to take the workaround for file-posix
>>> because we can’t really prevent that any other block driver will submit
>>> zero writes as part of normal I/O in the future.
>>>
>>> Anyway, let’s get to the elephant.
>>>
>>>  From input by XFS developers
>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
>>> that c8bb23cbdbe causes fundamental performance problems on XFS with
>>> aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
>>> performance or we wouldn’t have it.
>>>
>>> In general, avoiding performance regressions is more important than
>>> improving performance, unless the regressions are just a minor corner
>>> case or insignificant when compared to the improvement.  The XFS
>>> regression is no minor corner case, and it isn’t insignificant.  Laurent
>>> Vivier has found performance to decrease by as much as 88 % (on ppc64le,
>>> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s).
>>
>> Ah, crap.
>>
>> I wanted to send this series as early today as possible to get as much
>> feedback as possible, so I’ve only started doing benchmarks now.
>>
>> The obvious
>>
>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>
>> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
>> c8bb23cbdbe reverted.  So now on to guest tests...
> 
> Aha, that's very interesting) What about aio-native which should be slowed down?
> Could it be tested like this?

That is aio=native (-n).

But so far I don’t see any significant difference in guest tests (i.e.,
fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
--ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
about ext4 still.)

(Reverting c8bb23cbdbe makes it like 1 to 2 % faster.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 11:12     ` Max Reitz
@ 2019-11-01 11:16       ` Vladimir Sementsov-Ogievskiy
  2019-11-01 11:20         ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-01 11:16 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi

01.11.2019 14:12, Max Reitz wrote:
> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
>> 01.11.2019 13:20, Max Reitz wrote:
>>> On 01.11.19 11:00, Max Reitz wrote:
>>>> Hi,
>>>>
>>>> This series builds on the previous RFC.  The workaround is now applied
>>>> unconditionally of AIO mode and filesystem because we don’t know those
>>>> things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
>>>> has been moved to block/io.c.
>>>>
>>>> Applying the workaround unconditionally is fine from a performance
>>>> standpoint, because it should actually be dead code, thanks to patch 1
>>>> (the elephant in the room).  As far as I know, there is no other block
>>>> driver but qcow2 in handle_alloc_space() that would submit zero writes
>>>> as part of normal I/O so it can occur concurrently to other write
>>>> requests.  It still makes sense to take the workaround for file-posix
>>>> because we can’t really prevent that any other block driver will submit
>>>> zero writes as part of normal I/O in the future.
>>>>
>>>> Anyway, let’s get to the elephant.
>>>>
>>>>   From input by XFS developers
>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
>>>> that c8bb23cbdbe causes fundamental performance problems on XFS with
>>>> aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
>>>> performance or we wouldn’t have it.
>>>>
>>>> In general, avoiding performance regressions is more important than
>>>> improving performance, unless the regressions are just a minor corner
>>>> case or insignificant when compared to the improvement.  The XFS
>>>> regression is no minor corner case, and it isn’t insignificant.  Laurent
>>>> Vivier has found performance to decrease by as much as 88 % (on ppc64le,
>>>> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s).
>>>
>>> Ah, crap.
>>>
>>> I wanted to send this series as early today as possible to get as much
>>> feedback as possible, so I’ve only started doing benchmarks now.
>>>
>>> The obvious
>>>
>>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>>
>>> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
>>> c8bb23cbdbe reverted.  So now on to guest tests...
>>
>> Aha, that's very interesting) What about aio-native which should be slowed down?
>> Could it be tested like this?
> 
> That is aio=native (-n).
> 
> But so far I don’t see any significant difference in guest tests (i.e.,
> fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
> --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
> with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
> about ext4 still.)

Hmm, this probably mostly tests writes to already allocated clusters.  Does fio
have an option to behave like qemu-img bench with -S 65536, i.e. to write into
each cluster only once?

> 
> (Reverting c8bb23cbdbe makes it like 1 to 2 % faster.)
> 
> Max
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 11:16       ` Vladimir Sementsov-Ogievskiy
@ 2019-11-01 11:20         ` Max Reitz
  2019-11-01 12:34           ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2019-11-01 11:20 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 3335 bytes --]

On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote:
> 01.11.2019 14:12, Max Reitz wrote:
>> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
>>> 01.11.2019 13:20, Max Reitz wrote:
>>>> On 01.11.19 11:00, Max Reitz wrote:
>>>>> Hi,
>>>>>
>>>>> This series builds on the previous RFC.  The workaround is now applied
>>>>> unconditionally of AIO mode and filesystem because we don’t know those
>>>>> things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
>>>>> has been moved to block/io.c.
>>>>>
>>>>> Applying the workaround unconditionally is fine from a performance
>>>>> standpoint, because it should actually be dead code, thanks to patch 1
>>>>> (the elephant in the room).  As far as I know, there is no other block
>>>>> driver but qcow2 in handle_alloc_space() that would submit zero writes
>>>>> as part of normal I/O so it can occur concurrently to other write
>>>>> requests.  It still makes sense to take the workaround for file-posix
>>>>> because we can’t really prevent that any other block driver will submit
>>>>> zero writes as part of normal I/O in the future.
>>>>>
>>>>> Anyway, let’s get to the elephant.
>>>>>
>>>>>   From input by XFS developers
>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
>>>>> that c8bb23cbdbe causes fundamental performance problems on XFS with
>>>>> aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
>>>>> performance or we wouldn’t have it.
>>>>>
>>>>> In general, avoiding performance regressions is more important than
>>>>> improving performance, unless the regressions are just a minor corner
>>>>> case or insignificant when compared to the improvement.  The XFS
>>>>> regression is no minor corner case, and it isn’t insignificant.  Laurent
>>>>> Vivier has found performance to decrease by as much as 88 % (on ppc64le,
>>>>> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s).
>>>>
>>>> Ah, crap.
>>>>
>>>> I wanted to send this series as early today as possible to get as much
>>>> feedback as possible, so I’ve only started doing benchmarks now.
>>>>
>>>> The obvious
>>>>
>>>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>>>
>>>> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
>>>> c8bb23cbdbe reverted.  So now on to guest tests...
>>>
>>> Aha, that's very interesting) What about aio-native which should be slowed down?
>>> Could it be tested like this?
>>
>> That is aio=native (-n).
>>
>> But so far I don’t see any significant difference in guest tests (i.e.,
>> fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
>> --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
>> with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
>> about ext4 still.)
> 
> hmm, this possibly mostly tests writes to already allocated clusters. Has fio
> an option to behave like qemu-img bench with -S 65536, i.e. write once into
> each cluster?

Maybe, but is that a realistic measure of whether this change is worth
it?  That is why I’m doing the guest test: to see whether the change
actually has much impact on the guest.

(The thing is that Laurent and our QE have seen a significant real-world
performance regression on ppc64.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 11:20         ` Max Reitz
@ 2019-11-01 12:34           ` Max Reitz
  2019-11-01 13:09             ` Vladimir Sementsov-Ogievskiy
  2019-11-01 13:30             ` Max Reitz
  0 siblings, 2 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 12:34 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 3853 bytes --]

On 01.11.19 12:20, Max Reitz wrote:
> On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote:
>> 01.11.2019 14:12, Max Reitz wrote:
>>> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
>>>> 01.11.2019 13:20, Max Reitz wrote:
>>>>> On 01.11.19 11:00, Max Reitz wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This series builds on the previous RFC.  The workaround is now applied
>>>>>> unconditionally of AIO mode and filesystem because we don’t know those
>>>>>> things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
>>>>>> has been moved to block/io.c.
>>>>>>
>>>>>> Applying the workaround unconditionally is fine from a performance
>>>>>> standpoint, because it should actually be dead code, thanks to patch 1
>>>>>> (the elephant in the room).  As far as I know, there is no other block
>>>>>> driver but qcow2 in handle_alloc_space() that would submit zero writes
>>>>>> as part of normal I/O so it can occur concurrently to other write
>>>>>> requests.  It still makes sense to take the workaround for file-posix
>>>>>> because we can’t really prevent that any other block driver will submit
>>>>>> zero writes as part of normal I/O in the future.
>>>>>>
>>>>>> Anyway, let’s get to the elephant.
>>>>>>
>>>>>>   From input by XFS developers
>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
>>>>>> that c8bb23cbdbe causes fundamental performance problems on XFS with
>>>>>> aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
>>>>>> performance or we wouldn’t have it.
>>>>>>
>>>>>> In general, avoiding performance regressions is more important than
>>>>>> improving performance, unless the regressions are just a minor corner
>>>>>> case or insignificant when compared to the improvement.  The XFS
>>>>>> regression is no minor corner case, and it isn’t insignificant.  Laurent
>>>>>> Vivier has found performance to decrease by as much as 88 % (on ppc64le,
>>>>>> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s).
>>>>>
>>>>> Ah, crap.
>>>>>
>>>>> I wanted to send this series as early today as possible to get as much
>>>>> feedback as possible, so I’ve only started doing benchmarks now.
>>>>>
>>>>> The obvious
>>>>>
>>>>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>>>>
>>>>> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
>>>>> c8bb23cbdbe reverted.  So now on to guest tests...
>>>>
>>>> Aha, that's very interesting) What about aio-native which should be slowed down?
>>>> Could it be tested like this?
>>>
>>> That is aio=native (-n).
>>>
>>> But so far I don’t see any significant difference in guest tests (i.e.,
>>> fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
>>> --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
>>> with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
>>> about ext4 still.)
>>
>> hmm, this possibly mostly tests writes to already allocated clusters. Has fio
>> an option to behave like qemu-img bench with -S 65536, i.e. write once into
>> each cluster?
> 
> Maybe, but is that a realistic depiction of whether this change is worth
> it?  That is why I’m doing the guest test, to see whether it actually
> has much impact on the guest.

I’ve changed the above fio invocation to use --rw=randwrite and added
--fallocate=none.  The performance went down, but it went down both with
and without c8bb23cbdbe.

So on my XFS system (XFS on luks on SSD), I see:
- with c8bb23cbdbe: 26.0 - 27.9 MB/s
- without c8bb23cbdbe: 25.6 - 27 MB/s

On my ext4 system (native on SSD), I see:
- with: 39.4 - 41.5 MB/s
- without: 39.4 - 42.0 MB/s

So basically no difference for XFS, and really no difference for ext4.
(I ran these tests with 2 MB clusters.)
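
(For concreteness, the modified job I’m running is essentially this; the
guest filename and job name are placeholders:

$ fio --rw=randwrite --bs=4k --iodepth=8 --runtime=1m --direct=1 \
      --ioengine=libaio --thread --numjobs=16 --size=2G --time_based \
      --fallocate=none --filename=/mnt/test --name=job1

--fallocate=none stops fio from preallocating the test file in the guest,
so writes can actually hit not-yet-allocated clusters in the image.)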

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas"
  2019-11-01 10:00 ` [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas" Max Reitz
  2019-11-01 10:22   ` Vladimir Sementsov-Ogievskiy
@ 2019-11-01 12:40   ` Eric Blake
  2019-11-01 14:01     ` Max Reitz
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Blake @ 2019-11-01 12:40 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Vladimir Sementsov-Ogievskiy,
	qemu-devel, Stefan Hajnoczi

On 11/1/19 11:00 AM, Max Reitz wrote:
> This reverts commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a.
> 
> This commit causes fundamental performance problems on XFS (because
> fallocate() stalls the AIO pipeline), and as such it is not clear that
> we should unconditionally enable this behavior.
> 
> We expect subclusters to alleviate the performance penalty of small
> writes to newly allocated clusters, so when we get them, the originally
> intended performance gain may actually no longer be significant.
> 
> If we want to reintroduce something similar to c8bb23cbdbe, it will
> require extensive benchmarking on various systems with subclusters
> enabled.
> 
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---

> +++ b/qapi/block-core.json
> @@ -3304,8 +3304,6 @@
>   #
>   # @cor_write: a write due to copy-on-read (since 2.11)
>   #
> -# @cluster_alloc_space: an allocation of file space for a cluster (since 4.1)
> -#
>   # @none: triggers once at creation of the blkdebug node (since 4.1)

Deleting released qapi is not backwards-compatible.  However, given that 
the only known user of this interface is debug testing via iotests, I'm 
not too concerned that we would be impacting any external users.


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 12:34           ` Max Reitz
@ 2019-11-01 13:09             ` Vladimir Sementsov-Ogievskiy
  2019-11-01 13:36               ` Denis Lunev
  2019-11-01 13:30             ` Max Reitz
  1 sibling, 1 reply; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-01 13:09 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi, Denis Lunev

01.11.2019 15:34, Max Reitz wrote:
> On 01.11.19 12:20, Max Reitz wrote:
>> On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote:
>>> 01.11.2019 14:12, Max Reitz wrote:
>>>> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 01.11.2019 13:20, Max Reitz wrote:
>>>>>> On 01.11.19 11:00, Max Reitz wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> This series builds on the previous RFC.  The workaround is now applied
>>>>>>> unconditionally of AIO mode and filesystem because we don’t know those
>>>>>>> things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
>>>>>>> has been moved to block/io.c.
>>>>>>>
>>>>>>> Applying the workaround unconditionally is fine from a performance
>>>>>>> standpoint, because it should actually be dead code, thanks to patch 1
>>>>>>> (the elephant in the room).  As far as I know, there is no other block
>>>>>>> driver but qcow2 in handle_alloc_space() that would submit zero writes
>>>>>>> as part of normal I/O so it can occur concurrently to other write
>>>>>>> requests.  It still makes sense to take the workaround for file-posix
>>>>>>> because we can’t really prevent that any other block driver will submit
>>>>>>> zero writes as part of normal I/O in the future.
>>>>>>>
>>>>>>> Anyway, let’s get to the elephant.
>>>>>>>
>>>>>>>    From input by XFS developers
>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
>>>>>>> that c8bb23cbdbe causes fundamental performance problems on XFS with
>>>>>>> aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
>>>>>>> performance or we wouldn’t have it.
>>>>>>>
>>>>>>> In general, avoiding performance regressions is more important than
>>>>>>> improving performance, unless the regressions are just a minor corner
>>>>>>> case or insignificant when compared to the improvement.  The XFS
>>>>>>> regression is no minor corner case, and it isn’t insignificant.  Laurent
>>>>>>> Vivier has found performance to decrease by as much as 88 % (on ppc64le,
>>>>>>> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s).
>>>>>>
>>>>>> Ah, crap.
>>>>>>
>>>>>> I wanted to send this series as early today as possible to get as much
>>>>>> feedback as possible, so I’ve only started doing benchmarks now.
>>>>>>
>>>>>> The obvious
>>>>>>
>>>>>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>>>>>
>>>>>> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
>>>>>> c8bb23cbdbe reverted.  So now on to guest tests...
>>>>>
>>>>> Aha, that's very interesting) What about aio-native which should be slowed down?
>>>>> Could it be tested like this?
>>>>
>>>> That is aio=native (-n).
>>>>
>>>> But so far I don’t see any significant difference in guest tests (i.e.,
>>>> fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
>>>> --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
>>>> with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
>>>> about ext4 still.)
>>>
>>> hmm, this possibly mostly tests writes to already allocated clusters. Has fio
>>> an option to behave like qemu-img bench with -S 65536, i.e. write once into
>>> each cluster?
>>
>> Maybe, but is that a realistic depiction of whether this change is worth
>> it?  That is why I’m doing the guest test, to see whether it actually
>> has much impact on the guest.
> 
> I’ve changed the above fio invocation to use --rw=randwrite and added
> --fallocate=none.  The performance went down, but it went down both with
> and without c8bb23cbdbe.
> 
> So on my XFS system (XFS on luks on SSD), I see:
> - with c8bb23cbdbe: 26.0 - 27.9 MB/s
> - without c8bb23cbdbe: 25.6 - 27 MB/s
> 
> On my ext4 system (native on SSD), I see:
> - with: 39.4 - 41.5 MB/s
> - without: 39.4 - 42.0 MB/s
> 
> So basically no difference for XFS, and really no difference for ext4.
> (I ran these tests with 2 MB clusters.)
> 

Hmm, I don't know. To me it seems obvious that zeroing a 2 MB cluster is slow, and
simple tests with qemu-img bench prove it: fallocate is faster than writing zeroes
over most of the cluster (see the sketch below).

So if some guest test doesn't show a difference, that only means that "small write
into a new cluster" is effectively a rare case in that test.  It doesn't prove that
it is always rare and insignificant.

I'm not sure we have a real-world example that proves the necessity of this
optimization, or whether there was some original bug report about low performance
that this optimization fixed.  Den, Anton, do we have something about it?
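
(The kind of test I mean, as a sketch -- image size, path and request count
are made up here:

$ qemu-img create -f qcow2 -o cluster_size=2M test.qcow2 16G
$ qemu-img bench -f qcow2 -w -t none -n -s 4096 -S 2097152 -c 4096 test.qcow2

Each 4 KiB request steps forward by one 2 MiB cluster, so every write
allocates a fresh cluster and the cost of handling the COW area -- zeroing
it versus fallocate'ing it -- dominates the run time.)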

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 12:34           ` Max Reitz
  2019-11-01 13:09             ` Vladimir Sementsov-Ogievskiy
@ 2019-11-01 13:30             ` Max Reitz
  2019-11-01 15:06               ` Max Reitz
  1 sibling, 1 reply; 21+ messages in thread
From: Max Reitz @ 2019-11-01 13:30 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 5024 bytes --]

On 01.11.19 13:34, Max Reitz wrote:
> On 01.11.19 12:20, Max Reitz wrote:
>> On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote:
>>> 01.11.2019 14:12, Max Reitz wrote:
>>>> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 01.11.2019 13:20, Max Reitz wrote:
>>>>>> On 01.11.19 11:00, Max Reitz wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> This series builds on the previous RFC.  The workaround is now applied
>>>>>>> unconditionally of AIO mode and filesystem because we don’t know those
>>>>>>> things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
>>>>>>> has been moved to block/io.c.
>>>>>>>
>>>>>>> Applying the workaround unconditionally is fine from a performance
>>>>>>> standpoint, because it should actually be dead code, thanks to patch 1
>>>>>>> (the elephant in the room).  As far as I know, there is no other block
>>>>>>> driver but qcow2 in handle_alloc_space() that would submit zero writes
>>>>>>> as part of normal I/O so it can occur concurrently to other write
>>>>>>> requests.  It still makes sense to take the workaround for file-posix
>>>>>>> because we can’t really prevent that any other block driver will submit
>>>>>>> zero writes as part of normal I/O in the future.
>>>>>>>
>>>>>>> Anyway, let’s get to the elephant.
>>>>>>>
>>>>>>>   From input by XFS developers
>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
>>>>>>> that c8bb23cbdbe causes fundamental performance problems on XFS with
>>>>>>> aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
>>>>>>> performance or we wouldn’t have it.
>>>>>>>
>>>>>>> In general, avoiding performance regressions is more important than
>>>>>>> improving performance, unless the regressions are just a minor corner
>>>>>>> case or insignificant when compared to the improvement.  The XFS
>>>>>>> regression is no minor corner case, and it isn’t insignificant.  Laurent
>>>>>>> Vivier has found performance to decrease by as much as 88 % (on ppc64le,
>>>>>>> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s).
>>>>>>
>>>>>> Ah, crap.
>>>>>>
>>>>>> I wanted to send this series as early today as possible to get as much
>>>>>> feedback as possible, so I’ve only started doing benchmarks now.
>>>>>>
>>>>>> The obvious
>>>>>>
>>>>>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>>>>>
>>>>>> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
>>>>>> c8bb23cbdbe reverted.  So now on to guest tests...
>>>>>
>>>>> Aha, that's very interesting) What about aio-native which should be slowed down?
>>>>> Could it be tested like this?
>>>>
>>>> That is aio=native (-n).
>>>>
>>>> But so far I don’t see any significant difference in guest tests (i.e.,
>>>> fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
>>>> --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
>>>> with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
>>>> about ext4 still.)
>>>
>>> hmm, this possibly mostly tests writes to already allocated clusters. Has fio
>>> an option to behave like qemu-img bench with -S 65536, i.e. write once into
>>> each cluster?
>>
>> Maybe, but is that a realistic depiction of whether this change is worth
>> it?  That is why I’m doing the guest test, to see whether it actually
>> has much impact on the guest.
> 
> I’ve changed the above fio invocation to use --rw=randwrite and added
> --fallocate=none.  The performance went down, but it went down both with
> and without c8bb23cbdbe.
> 
> So on my XFS system (XFS on luks on SSD), I see:
> - with c8bb23cbdbe: 26.0 - 27.9 MB/s
> - without c8bb23cbdbe: 25.6 - 27 MB/s
> 
> On my ext4 system (native on SSD), I see:
> - with: 39.4 - 41.5 MB/s
> - without: 39.4 - 42.0 MB/s
> 
> So basically no difference for XFS, and really no difference for ext4.
> (I ran these tests with 2 MB clusters.)

For 64 kB clusters, I have on XFS:
- with: 30.3 - 31.3 MB/s
- without: 27.4 - 27.9 MB/s

On ext4:
- with: 34.9 - 36.4 MB/s
- without: 33.8 - 38 MB/s

For good measure, 2 MB clusters with aio=threads (but still O_DIRECT) on
XFS:
- with: 25.7 MB/s - 27.0 MB/s
- without: 24.6 MB/s - 26.7 MB/s

On ext4:
- with: 16.7 MB/s - 19.3 MB/s
- without: 16.3 MB/s - 18.4 MB/s

(Note that the difference between ext4 and XFS is probably just because
these are two different systems with different SSDs.)
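
For reference, the 64 kB and 2 MB variants above only differ in the
cluster_size given at image creation time, along these lines (size and
path made up):

$ qemu-img create -f qcow2 -o cluster_size=65536 test-64k.qcow2 20G
$ qemu-img create -f qcow2 -o cluster_size=2M test-2M.qcow2 20G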

So there is little difference, if any of it is significant at all (each of
those results is based on just three test runs).  There does seem to be a
bit of a drift for aio=threads, but then again it’s also just much slower
than aio=native for ext4.

(Also note that I ran this on an unfixed kernel.  Maybe the XFS fix will
slow things down, but I haven’t tested that yet.)

So unless there are realistic guest benchmarks for ext4 that say we
should keep the patch, I’m going to queue the revert for now (“now” =
4.1.1 and 4.2.0).

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 13:09             ` Vladimir Sementsov-Ogievskiy
@ 2019-11-01 13:36               ` Denis Lunev
  2019-11-01 13:40                 ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Denis Lunev @ 2019-11-01 13:36 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi

On 11/1/19 4:09 PM, Vladimir Sementsov-Ogievskiy wrote:
> 01.11.2019 15:34, Max Reitz wrote:
>> On 01.11.19 12:20, Max Reitz wrote:
>>> On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote:
>>>> 01.11.2019 14:12, Max Reitz wrote:
>>>>> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 01.11.2019 13:20, Max Reitz wrote:
>>>>>>> On 01.11.19 11:00, Max Reitz wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This series builds on the previous RFC.  The workaround is now applied
>>>>>>>> unconditionally of AIO mode and filesystem because we don’t know those
>>>>>>>> things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
>>>>>>>> has been moved to block/io.c.
>>>>>>>>
>>>>>>>> Applying the workaround unconditionally is fine from a performance
>>>>>>>> standpoint, because it should actually be dead code, thanks to patch 1
>>>>>>>> (the elephant in the room).  As far as I know, there is no other block
>>>>>>>> driver but qcow2 in handle_alloc_space() that would submit zero writes
>>>>>>>> as part of normal I/O so it can occur concurrently to other write
>>>>>>>> requests.  It still makes sense to take the workaround for file-posix
>>>>>>>> because we can’t really prevent that any other block driver will submit
>>>>>>>> zero writes as part of normal I/O in the future.
>>>>>>>>
>>>>>>>> Anyway, let’s get to the elephant.
>>>>>>>>
>>>>>>>>    From input by XFS developers
>>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
>>>>>>>> that c8bb23cbdbe causes fundamental performance problems on XFS with
>>>>>>>> aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
>>>>>>>> performance or we wouldn’t have it.
>>>>>>>>
>>>>>>>> In general, avoiding performance regressions is more important than
>>>>>>>> improving performance, unless the regressions are just a minor corner
>>>>>>>> case or insignificant when compared to the improvement.  The XFS
>>>>>>>> regression is no minor corner case, and it isn’t insignificant.  Laurent
>>>>>>>> Vivier has found performance to decrease by as much as 88 % (on ppc64le,
>>>>>>>> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s).
>>>>>>> Ah, crap.
>>>>>>>
>>>>>>> I wanted to send this series as early today as possible to get as much
>>>>>>> feedback as possible, so I’ve only started doing benchmarks now.
>>>>>>>
>>>>>>> The obvious
>>>>>>>
>>>>>>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>>>>>>
>>>>>>> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
>>>>>>> c8bb23cbdbe reverted.  So now on to guest tests...
>>>>>> Aha, that's very interesting) What about aio-native which should be slowed down?
>>>>>> Could it be tested like this?
>>>>> That is aio=native (-n).
>>>>>
>>>>> But so far I don’t see any significant difference in guest tests (i.e.,
>>>>> fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
>>>>> --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
>>>>> with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
>>>>> about ext4 still.)
>>>> hmm, this possibly mostly tests writes to already allocated clusters. Has fio
>>>> an option to behave like qemu-img bench with -S 65536, i.e. write once into
>>>> each cluster?
>>> Maybe, but is that a realistic depiction of whether this change is worth
>>> it?  That is why I’m doing the guest test, to see whether it actually
>>> has much impact on the guest.
>> I’ve changed the above fio invocation to use --rw=randwrite and added
>> --fallocate=none.  The performance went down, but it went down both with
>> and without c8bb23cbdbe.
>>
>> So on my XFS system (XFS on luks on SSD), I see:
>> - with c8bb23cbdbe: 26.0 - 27.9 MB/s
>> - without c8bb23cbdbe: 25.6 - 27 MB/s
>>
>> On my ext4 system (native on SSD), I see:
>> - with: 39.4 - 41.5 MB/s
>> - without: 39.4 - 42.0 MB/s
>>
>> So basically no difference for XFS, and really no difference for ext4.
>> (I ran these tests with 2 MB clusters.)
>>
> Hmm. I don't know. For me it seems obvious that zeroing 2M cluster is slow, and this
> is proved by simple tests with qemu-img bench, that fallocate is faster than zeroing
> most of the cluster.
>
> So, if some guest test doesn't show the difference, this means that "small write into
> new cluster" is effectively rare case in this test.. And this doesn't prove that it's
> always rare and insignificant.
>
> I don't sure that we have a real-world example that proves necessity of this optimization,
> or was there some original bug about low-performance which was fixed by this optimization..
> Den, Anton, do we have something about it?
>
Sorry, I have missed the beginning of the thread.

Which driver is used for the virtual disk: is cached or non-cached I/O
used in QEMU?  We use non-cached by default, and that could make a
significant difference.
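
(By cached vs. non-cached I mean the -drive cache mode.  Illustrative
lines only, not anybody's actual config:

  -drive if=none,id=disk0,format=qcow2,file=test.qcow2,cache=none,aio=native
  -drive if=none,id=disk0,format=qcow2,file=test.qcow2,cache=writeback,aio=threads

The first bypasses the host page cache via O_DIRECT, the second goes
through it; aio=native is only valid together with direct I/O.)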

Max,

Can you please share the domain.xml of your guest config and the fio
job file for the guest?  I will recheck to be 120% sure.

Thank you in advance,
    Den

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 13:36               ` Denis Lunev
@ 2019-11-01 13:40                 ` Max Reitz
  0 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 13:40 UTC (permalink / raw)
  To: Denis Lunev, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 6103 bytes --]

On 01.11.19 14:36, Denis Lunev wrote:
> On 11/1/19 4:09 PM, Vladimir Sementsov-Ogievskiy wrote:
>> 01.11.2019 15:34, Max Reitz wrote:
>>> On 01.11.19 12:20, Max Reitz wrote:
>>>> On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 01.11.2019 14:12, Max Reitz wrote:
>>>>>> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 01.11.2019 13:20, Max Reitz wrote:
>>>>>>>> On 01.11.19 11:00, Max Reitz wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> This series builds on the previous RFC.  The workaround is now applied
>>>>>>>>> unconditionally of AIO mode and filesystem because we don’t know those
>>>>>>>>> things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
>>>>>>>>> has been moved to block/io.c.
>>>>>>>>>
>>>>>>>>> Applying the workaround unconditionally is fine from a performance
>>>>>>>>> standpoint, because it should actually be dead code, thanks to patch 1
>>>>>>>>> (the elephant in the room).  As far as I know, there is no other block
>>>>>>>>> driver but qcow2 in handle_alloc_space() that would submit zero writes
>>>>>>>>> as part of normal I/O so it can occur concurrently to other write
>>>>>>>>> requests.  It still makes sense to take the workaround for file-posix
>>>>>>>>> because we can’t really prevent that any other block driver will submit
>>>>>>>>> zero writes as part of normal I/O in the future.
>>>>>>>>>
>>>>>>>>> Anyway, let’s get to the elephant.
>>>>>>>>>
>>>>>>>>>    From input by XFS developers
>>>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
>>>>>>>>> that c8bb23cbdbe causes fundamental performance problems on XFS with
>>>>>>>>> aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
>>>>>>>>> performance or we wouldn’t have it.
>>>>>>>>>
>>>>>>>>> In general, avoiding performance regressions is more important than
>>>>>>>>> improving performance, unless the regressions are just a minor corner
>>>>>>>>> case or insignificant when compared to the improvement.  The XFS
>>>>>>>>> regression is no minor corner case, and it isn’t insignificant.  Laurent
>>>>>>>>> Vivier has found performance to decrease by as much as 88 % (on ppc64le,
>>>>>>>>> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s).
>>>>>>>> Ah, crap.
>>>>>>>>
>>>>>>>> I wanted to send this series as early today as possible to get as much
>>>>>>>> feedback as possible, so I’ve only started doing benchmarks now.
>>>>>>>>
>>>>>>>> The obvious
>>>>>>>>
>>>>>>>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>>>>>>>
>>>>>>>> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
>>>>>>>> c8bb23cbdbe reverted.  So now on to guest tests...
>>>>>>> Aha, that's very interesting) What about aio-native which should be slowed down?
>>>>>>> Could it be tested like this?
>>>>>> That is aio=native (-n).
>>>>>>
>>>>>> But so far I don’t see any significant difference in guest tests (i.e.,
>>>>>> fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
>>>>>> --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
>>>>>> with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
>>>>>> about ext4 still.)
>>>>> hmm, this possibly mostly tests writes to already allocated clusters. Has fio
>>>>> an option to behave like qemu-img bench with -S 65536, i.e. write once into
>>>>> each cluster?
>>>> Maybe, but is that a realistic depiction of whether this change is worth
>>>> it?  That is why I’m doing the guest test, to see whether it actually
>>>> has much impact on the guest.
>>> I’ve changed the above fio invocation to use --rw=randwrite and added
>>> --fallocate=none.  The performance went down, but it went down both with
>>> and without c8bb23cbdbe.
>>>
>>> So on my XFS system (XFS on luks on SSD), I see:
>>> - with c8bb23cbdbe: 26.0 - 27.9 MB/s
>>> - without c8bb23cbdbe: 25.6 - 27 MB/s
>>>
>>> On my ext4 system (native on SSD), I see:
>>> - with: 39.4 - 41.5 MB/s
>>> - without: 39.4 - 42.0 MB/s
>>>
>>> So basically no difference for XFS, and really no difference for ext4.
>>> (I ran these tests with 2 MB clusters.)
>>>
>> Hmm. I don't know. For me it seems obvious that zeroing 2M cluster is slow, and this
>> is proved by simple tests with qemu-img bench, that fallocate is faster than zeroing
>> most of the cluster.
>>
>> So, if some guest test doesn't show the difference, this means that "small write into
>> new cluster" is effectively rare case in this test.. And this doesn't prove that it's
>> always rare and insignificant.
>>
>> I don't sure that we have a real-world example that proves necessity of this optimization,
>> or was there some original bug about low-performance which was fixed by this optimization..
>> Den, Anton, do we have something about it?
>>
> sorry, I have missed the beginning of the thread.
> 
> Which driver is used for virtual disk - cached or non-cached IO
> is used in QEMU? We use non-cached by default and this could
> make a difference significantly.

I’m using no cache; the above tests were done with aio=native.  I’ve sent
another response with aio=threads numbers.

> Max,
> 
> can you pls share your domain.xml of the guest config and
> fio file for guest. I will recheck to be 120% sure.

I’m running qemu directly as follows:

x86_64-softmmu/qemu-system-x86_64 \
    -serial stdio \
    -cdrom ~/tmp/arch.iso \
    -m 4096 \
    -enable-kvm \
    -drive \
  if=none,id=t,format=qcow2,file=test/test.qcow2,cache=none,aio=native \
    -device virtio-scsi \
    -device scsi-hd,drive=t \
    -net user \
    -net nic,model=rtl8139 \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 \
    -cpu SandyBridge \
    -nodefaults \
    -nographic

The full FIO command line is:

fio --rw=randwrite --bs=4k --iodepth=8 --runtime=1m --direct=1 \
    --filename=/mnt/foo --name=job1 --ioengine=libaio --thread \
    --group_reporting --numjobs=16 --size=2G --time_based \
    --output=/tmp/fio_result --fallocate=none

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas"
  2019-11-01 12:40   ` Eric Blake
@ 2019-11-01 14:01     ` Max Reitz
  2019-11-01 15:42       ` Kevin Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2019-11-01 14:01 UTC (permalink / raw)
  To: Eric Blake, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Vladimir Sementsov-Ogievskiy,
	qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 1463 bytes --]

On 01.11.19 13:40, Eric Blake wrote:
> On 11/1/19 11:00 AM, Max Reitz wrote:
>> This reverts commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a.
>>
>> This commit causes fundamental performance problems on XFS (because
>> fallocate() stalls the AIO pipeline), and as such it is not clear that
>> we should unconditionally enable this behavior.
>>
>> We expect subclusters to alleviate the performance penalty of small
>> writes to newly allocated clusters, so when we get them, the originally
>> intended performance gain may actually no longer be significant.
>>
>> If we want to reintroduce something similar to c8bb23cbdbe, it will
>> require extensive benchmarking on various systems with subclusters
>> enabled.
>>
>> Cc: qemu-stable@nongnu.org
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
> 
>> +++ b/qapi/block-core.json
>> @@ -3304,8 +3304,6 @@
>>   #
>>   # @cor_write: a write due to copy-on-read (since 2.11)
>>   #
>> -# @cluster_alloc_space: an allocation of file space for a cluster
>> (since 4.1)
>> -#
>>   # @none: triggers once at creation of the blkdebug node (since 4.1)
> 
> Deleting released qapi is not backwards-compatible.

Right. :-/

I’ll just not make changes to the QAPI schema.  It doesn’t hurt to leave
this in.

Max

> However, given that
> the only known user of this interface is debug testing via iotests, I'm
> not too concerned that we would be impacting any external users.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS
  2019-11-01 13:30             ` Max Reitz
@ 2019-11-01 15:06               ` Max Reitz
  0 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 15:06 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 4105 bytes --]

On 01.11.19 14:30, Max Reitz wrote:

[...]

> So unless there are realistic guest benchmarks for ext4 that say we
> should keep the patch, I’m going to queue the revert for now (“now” =
> 4.1.1 and 4.2.0).

I found one case where the performance is significantly improved by
c8bb23cbdbe: in the cases so far I had XFS in the guest; now I used
ext4, and with aio=native (on the ext4 host with 2 MB clusters), the
performance goes from 63.9 - 65.0 MB/s to 75.7 - 76.4 MB/s (so +18%).

The difference is smaller for 64 kB clusters, but still there at +13%.
That’s probably the more important fact, because these are the default
settings, and this is probably about what would happen with 2 MB
clusters + subclusters.
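
(The subcluster proposal, as I understand it, splits each cluster into 32
subclusters, so 2 MiB / 32 = 64 KiB, i.e. the same allocation granularity
as today’s default 64 kB clusters.)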

(Patch 4 in this series doesn’t decrease the performance.)

This is a tough decision for me because some people tell me “Let’s just
revert it, there are problems with it and we don’t quite know what good
it does in practice”, while others say “We have (not really practical)
benchmarks that show it does something good for our specific case”.  And
all that while those two groups never seem to talk to each other
directly.

So I suppose I’m going to have to make a decision.  I now know a case
where c8bb23cbdbe is actually beneficial.  I myself have never seen
c8bb23cbdbe decrease performance, but I know Laurent has seen a drastic
performance degradation, and he’s used it to bisect the XFS problem to
that commit, so it’s really real.  But I haven’t seen it, and as far as
I know it really only happens on ppc64.


In light of this, I would prefer to revert c8bb23cbdbe for 4.1.1, and
keep it for 4.2.0.  But I don’t know whether we can do that, all I know
is that I’m not going to find out in time for 4.1.1.

If we keep c8bb23cbdbe in any way, we need patches 2 through 4, that
much is clear.

I believe we can think about the performance problem after 4.2.0.  I
would love to benchmark c8bb23cbdbe on a fixed kernel, but there just
isn’t time for that anymore.


I’m not a fan of keeping c8bb23cbdbe behind a configure switch.  If it’s
beneficial, it should be there for everyone.


OK.  Some may see this as a wrong decision, but someone needs to make
one now, so here goes: ext4 is the default Linux FS for most
distributions.  As far as I can tell from my own benchmarks, c8bb23cbdbe
brings a significant performance improvement for qcow2 images with the
default configuration on this default filesystem with aio=native and
doesn’t change much in any other case.

What happens on ppc64 is a problem, but that’s a RHEL problem because
it’s specific to XFS (and also ppc64).  It also won’t be a regression on
4.2 compared to 4.1.

Dave made a good argument that fallocate() and AIO cannot mix (at least
on XFS), but I couldn’t see any impact of that in my benchmarks (maybe
my benchmarks were just wrong).


So I think for upstream it’ll be best if I send a v2 which doesn’t touch
handle_alloc_space(), and instead just consists of patches 2 through 4.
 (And CC it all to stable.)

I think we still need to keep track of the XFS/ppc64 issue and do more
benchmarks especially with the fixed XFS driver.


tl;dr:
The main arguments for reverting c8bb23cbdbe were (AFAIU):
- a general uneasy feeling about it
- theoretical arguments that it must be bad on XFS
- actual problems on ppc64/XFS
- “what good does it do in practice?”
- that subclusters would make it obsolete anyway

What I could see is:
- no impact on XFS in practice
- significant practical benefit on ext4
- subclusters probably wouldn’t make it obsolete, because I can still
see a +13% improvement for 64 kB clusters (2 MB clusters + subclusters
gives you 64 kB subclusters)

In addition, it needs to be considered that ext4 is the default FS for
most Linux distributions.

As such, I personally am not convinced of reverting this patch.  Let’s
keep it, have patches 2 through 4 for 4.1.1 and 4.2.0, and think about
what to do for ppc64/XFS later.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas"
  2019-11-01 14:01     ` Max Reitz
@ 2019-11-01 15:42       ` Kevin Wolf
  2019-11-01 16:02         ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2019-11-01 15:42 UTC (permalink / raw)
  To: Max Reitz
  Cc: Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel,
	Anton Nefedov, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 1546 bytes --]

Am 01.11.2019 um 15:01 hat Max Reitz geschrieben:
> On 01.11.19 13:40, Eric Blake wrote:
> > On 11/1/19 11:00 AM, Max Reitz wrote:
> >> This reverts commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a.
> >>
> >> This commit causes fundamental performance problems on XFS (because
> >> fallocate() stalls the AIO pipeline), and as such it is not clear that
> >> we should unconditionally enable this behavior.
> >>
> >> We expect subclusters to alleviate the performance penalty of small
> >> writes to newly allocated clusters, so when we get them, the originally
> >> intended performance gain may actually no longer be significant.
> >>
> >> If we want to reintroduce something similar to c8bb23cbdbe, it will
> >> require extensive benchmarking on various systems with subclusters
> >> enabled.
> >>
> >> Cc: qemu-stable@nongnu.org
> >> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >> ---
> > 
> >> +++ b/qapi/block-core.json
> >> @@ -3304,8 +3304,6 @@
> >>   #
> >>   # @cor_write: a write due to copy-on-read (since 2.11)
> >>   #
> >> -# @cluster_alloc_space: an allocation of file space for a cluster
> >> (since 4.1)
> >> -#
> >>   # @none: triggers once at creation of the blkdebug node (since 4.1)
> > 
> > Deleting released qapi is not backwards-compatible.
> 
> Right. :-/
> 
> I’ll just not make changes to the QAPI schema.  It doesn’t hurt to leave
> this in.

Why would it be incompatible to drop an enum value that is only ever
used in output and that QEMU doesn't generate?

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas"
  2019-11-01 15:42       ` Kevin Wolf
@ 2019-11-01 16:02         ` Max Reitz
  0 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 16:02 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel,
	Anton Nefedov, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 1675 bytes --]

On 01.11.19 16:42, Kevin Wolf wrote:
> Am 01.11.2019 um 15:01 hat Max Reitz geschrieben:
>> On 01.11.19 13:40, Eric Blake wrote:
>>> On 11/1/19 11:00 AM, Max Reitz wrote:
>>>> This reverts commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a.
>>>>
>>>> This commit causes fundamental performance problems on XFS (because
>>>> fallocate() stalls the AIO pipeline), and as such it is not clear that
>>>> we should unconditionally enable this behavior.
>>>>
>>>> We expect subclusters to alleviate the performance penalty of small
>>>> writes to newly allocated clusters, so when we get them, the originally
>>>> intended performance gain may actually no longer be significant.
>>>>
>>>> If we want to reintroduce something similar to c8bb23cbdbe, it will
>>>> require extensive benchmarking on various systems with subclusters
>>>> enabled.
>>>>
>>>> Cc: qemu-stable@nongnu.org
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>
>>>> +++ b/qapi/block-core.json
>>>> @@ -3304,8 +3304,6 @@
>>>>   #
>>>>   # @cor_write: a write due to copy-on-read (since 2.11)
>>>>   #
>>>> -# @cluster_alloc_space: an allocation of file space for a cluster
>>>> (since 4.1)
>>>> -#
>>>>   # @none: triggers once at creation of the blkdebug node (since 4.1)
>>>
>>> Deleting released qapi is not backwards-compatible.
>>
>> Right. :-/
>>
>> I’ll just not make changes to the QAPI schema.  It doesn’t hurt to leave
>> this in.
> 
> Why would it be incompatible to drop an enum value that is only ever
> used in output and that QEMU doesn't generate?

This isn’t output at all, it’s input for blockdev-add for blkdebug nodes.
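
(For illustration, a sketch of such input, with made-up file names and in
the config-file form rather than blockdev-add:

  # blkdebug.conf
  [inject-error]
  event = "cluster_alloc_space"
  errno = "5"

  $ qemu-io -c 'write 0 64k' blkdebug:blkdebug.conf:test.qcow2

If we dropped the value from the schema, a rule like this, which is
accepted today, would start being rejected as an unknown event.)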

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-11-01 16:08 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-01 10:00 [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS Max Reitz
2019-11-01 10:00 ` [PATCH for-4.2 1/4] Revert "qcow2: skip writing zero buffers to empty COW areas" Max Reitz
2019-11-01 10:22   ` Vladimir Sementsov-Ogievskiy
2019-11-01 12:40   ` Eric Blake
2019-11-01 14:01     ` Max Reitz
2019-11-01 15:42       ` Kevin Wolf
2019-11-01 16:02         ` Max Reitz
2019-11-01 10:00 ` [PATCH for-4.2 2/4] block: Make wait/mark serialising requests public Max Reitz
2019-11-01 10:00 ` [PATCH for-4.2 3/4] block: Add bdrv_co_get_self_request() Max Reitz
2019-11-01 10:00 ` [PATCH for-4.2 4/4] block/file-posix: Let post-EOF fallocate serialize Max Reitz
2019-11-01 10:20 ` [PATCH for-4.2 0/4] qcow2: Fix data corruption on XFS Max Reitz
2019-11-01 10:28   ` Vladimir Sementsov-Ogievskiy
2019-11-01 11:12     ` Max Reitz
2019-11-01 11:16       ` Vladimir Sementsov-Ogievskiy
2019-11-01 11:20         ` Max Reitz
2019-11-01 12:34           ` Max Reitz
2019-11-01 13:09             ` Vladimir Sementsov-Ogievskiy
2019-11-01 13:36               ` Denis Lunev
2019-11-01 13:40                 ` Max Reitz
2019-11-01 13:30             ` Max Reitz
2019-11-01 15:06               ` Max Reitz
