qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS
@ 2019-11-01 15:25 Max Reitz
  2019-11-01 15:25 ` [PATCH for-4.2 v2 1/3] block: Make wait/mark serialising requests public Max Reitz
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 15:25 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-stable,
	qemu-devel, Anton Nefedov, Stefan Hajnoczi, Denis V . Lunev,
	Max Reitz

Hi,

As I reasoned here:
https://lists.nongnu.org/archive/html/qemu-block/2019-11/msg00027.html
I’m no longer convinced of reverting c8bb23cbdbe.  I could see a
significant performance improvement from it on ext4 with aio=native in a
guest that does random 4k writes, and as such I feel it would be a
regression to revert it for 4.2.

To work around the XFS corruption, we still need the other three patches
from the series, of course.  We cannot restrict the workaround to just
XFS, because maybe the file is on a remote filesystem and then we don’t
know anything about the host configuration.

The performance impact should still be minimal because this just
serializes post-EOF write-zeroes and data writes, and that just doesn’t
happen very often, even with handle_alloc_space() in qcow2.


I would love to have more time to make a decision, but there simply
isn’t any.  Patches for 4.1.1 are to be collected on Monday, AFAIU.


v2:
- Dropped patch 1
- Forgot a “coroutine_fn” in patch 2 (it isn’t really important,
  qemu_coroutine_self() works in non-coroutine functions, too, but
  calling bdrv_(co_)get_self_request() from a non-coroutine just doesn’t
  make any sense)
- Patch 3:
  - Reverted the commit message to the RFC state to reflect that this is
    specifically because of handle_alloc_space()
  - Dropped the two lines that said there’d be no performance impact at
    all because no driver would submit zero writes beyond the EOF
    (because qcow2 now still does)


git-backport-diff against v1:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/3:[----] [--] 'block: Make wait/mark serialising requests public'
002/3:[0004] [FC] 'block: Add bdrv_co_get_self_request()'
003/3:[0002] [FC] 'block/file-posix: Let post-EOF fallocate serialize'


Max Reitz (3):
  block: Make wait/mark serialising requests public
  block: Add bdrv_co_get_self_request()
  block/file-posix: Let post-EOF fallocate serialize

 include/block/block_int.h |  4 ++++
 block/file-posix.c        | 36 +++++++++++++++++++++++++++++++++
 block/io.c                | 42 ++++++++++++++++++++++++++++-----------
 3 files changed, 70 insertions(+), 12 deletions(-)

-- 
2.21.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH for-4.2 v2 1/3] block: Make wait/mark serialising requests public
  2019-11-01 15:25 [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS Max Reitz
@ 2019-11-01 15:25 ` Max Reitz
  2019-11-01 15:25 ` [PATCH for-4.2 v2 2/3] block: Add bdrv_co_get_self_request() Max Reitz
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 15:25 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-stable,
	qemu-devel, Anton Nefedov, Stefan Hajnoczi, Denis V . Lunev,
	Max Reitz

Make both bdrv_mark_request_serialising() and
bdrv_wait_serialising_requests() public so they can be used from block
drivers.

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h |  3 +++
 block/io.c                | 24 ++++++++++++------------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 02dc0034a2..32fa323b63 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -999,6 +999,9 @@ extern unsigned int bdrv_drain_all_count;
 void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
 void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
 
+bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self);
+void bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align);
+
 int get_tmp_filename(char *filename, int size);
 BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size,
                             const char *filename);
diff --git a/block/io.c b/block/io.c
index 02659f994d..039c0d49c9 100644
--- a/block/io.c
+++ b/block/io.c
@@ -715,7 +715,7 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
     qemu_co_mutex_unlock(&bs->reqs_lock);
 }
 
-static void mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
+void bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
 {
     int64_t overlap_offset = req->offset & ~(align - 1);
     uint64_t overlap_bytes = ROUND_UP(req->offset + req->bytes, align)
@@ -805,7 +805,7 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
     bdrv_wakeup(bs);
 }
 
-static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self)
 {
     BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
@@ -1437,14 +1437,14 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
          * with each other for the same cluster.  For example, in copy-on-read
          * it ensures that the CoR read and write operations are atomic and
          * guest writes cannot interleave between them. */
-        mark_request_serialising(req, bdrv_get_cluster_size(bs));
+        bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
     }
 
     /* BDRV_REQ_SERIALISING is only for write operation */
     assert(!(flags & BDRV_REQ_SERIALISING));
 
     if (!(flags & BDRV_REQ_NO_SERIALISING)) {
-        wait_serialising_requests(req);
+        bdrv_wait_serialising_requests(req);
     }
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
@@ -1841,10 +1841,10 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
     assert(!(flags & ~BDRV_REQ_MASK));
 
     if (flags & BDRV_REQ_SERIALISING) {
-        mark_request_serialising(req, bdrv_get_cluster_size(bs));
+        bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
     }
 
-    waited = wait_serialising_requests(req);
+    waited = bdrv_wait_serialising_requests(req);
 
     assert(!waited || !req->serialising ||
            is_request_serialising_and_aligned(req));
@@ -2008,8 +2008,8 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
 
     padding = bdrv_init_padding(bs, offset, bytes, &pad);
     if (padding) {
-        mark_request_serialising(req, align);
-        wait_serialising_requests(req);
+        bdrv_mark_request_serialising(req, align);
+        bdrv_wait_serialising_requests(req);
 
         bdrv_padding_rmw_read(child, req, &pad, true);
 
@@ -2111,8 +2111,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
     }
 
     if (bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad)) {
-        mark_request_serialising(&req, align);
-        wait_serialising_requests(&req);
+        bdrv_mark_request_serialising(&req, align);
+        bdrv_wait_serialising_requests(&req);
         bdrv_padding_rmw_read(child, &req, &pad, false);
     }
 
@@ -3205,7 +3205,7 @@ static int coroutine_fn bdrv_co_copy_range_internal(
         /* BDRV_REQ_SERIALISING is only for write operation */
         assert(!(read_flags & BDRV_REQ_SERIALISING));
         if (!(read_flags & BDRV_REQ_NO_SERIALISING)) {
-            wait_serialising_requests(&req);
+            bdrv_wait_serialising_requests(&req);
         }
 
         ret = src->bs->drv->bdrv_co_copy_range_from(src->bs,
@@ -3336,7 +3336,7 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
      * new area, we need to make sure that no write requests are made to it
      * concurrently or they might be overwritten by preallocation. */
     if (new_bytes) {
-        mark_request_serialising(&req, 1);
+        bdrv_mark_request_serialising(&req, 1);
     }
     if (bs->read_only) {
         error_setg(errp, "Image is read-only");
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH for-4.2 v2 2/3] block: Add bdrv_co_get_self_request()
  2019-11-01 15:25 [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS Max Reitz
  2019-11-01 15:25 ` [PATCH for-4.2 v2 1/3] block: Make wait/mark serialising requests public Max Reitz
@ 2019-11-01 15:25 ` Max Reitz
  2019-11-01 15:25 ` [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize Max Reitz
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 15:25 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-stable,
	qemu-devel, Anton Nefedov, Stefan Hajnoczi, Denis V . Lunev,
	Max Reitz

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h |  1 +
 block/io.c                | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 32fa323b63..dd033d0b37 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1001,6 +1001,7 @@ void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
 
 bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self);
 void bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align);
+BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState *bs);
 
 int get_tmp_filename(char *filename, int size);
 BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size,
diff --git a/block/io.c b/block/io.c
index 039c0d49c9..f75777f5ea 100644
--- a/block/io.c
+++ b/block/io.c
@@ -742,6 +742,24 @@ static bool is_request_serialising_and_aligned(BdrvTrackedRequest *req)
            (req->bytes == req->overlap_bytes);
 }
 
+/**
+ * Return the tracked request on @bs for the current coroutine, or
+ * NULL if there is none.
+ */
+BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState *bs)
+{
+    BdrvTrackedRequest *req;
+    Coroutine *self = qemu_coroutine_self();
+
+    QLIST_FOREACH(req, &bs->tracked_requests, list) {
+        if (req->co == self) {
+            return req;
+        }
+    }
+
+    return NULL;
+}
+
 /**
  * Round a region to cluster boundaries
  */
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2019-11-01 15:25 [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS Max Reitz
  2019-11-01 15:25 ` [PATCH for-4.2 v2 1/3] block: Make wait/mark serialising requests public Max Reitz
  2019-11-01 15:25 ` [PATCH for-4.2 v2 2/3] block: Add bdrv_co_get_self_request() Max Reitz
@ 2019-11-01 15:25 ` Max Reitz
  2019-11-14 16:27   ` Christoph Hellwig
                     ` (2 more replies)
  2019-11-01 15:48 ` [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS no-reply
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-01 15:25 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-stable,
	qemu-devel, Anton Nefedov, Stefan Hajnoczi, Denis V . Lunev,
	Max Reitz

The XFS kernel driver has a bug that may cause data corruption for qcow2
images as of qemu commit c8bb23cbdbe32f.  We can work around it by
treating post-EOF fallocates as serializing up until infinity (INT64_MAX
in practice).

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/file-posix.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 0b7e904d48..1f0f61a02b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2721,6 +2721,42 @@ raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int bytes,
     RawPosixAIOData acb;
     ThreadPoolFunc *handler;
 
+#ifdef CONFIG_FALLOCATE
+    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
+        BdrvTrackedRequest *req;
+        uint64_t end;
+
+        /*
+         * This is a workaround for a bug in the Linux XFS driver,
+         * where writes submitted through the AIO interface will be
+         * discarded if they happen beyond a concurrently running
+         * fallocate() that increases the file length (i.e., both the
+         * write and the fallocate() happen beyond the EOF).
+         *
+         * To work around it, we extend the tracked request for this
+         * zero write until INT64_MAX (effectively infinity), and mark
+         * it as serializing.
+         *
+         * We have to enable this workaround for all filesystems and
+         * AIO modes (not just XFS with aio=native), because for
+         * remote filesystems we do not know the host configuration.
+         */
+
+        req = bdrv_co_get_self_request(bs);
+        assert(req);
+        assert(req->type == BDRV_TRACKED_WRITE);
+        assert(req->offset <= offset);
+        assert(req->offset + req->bytes >= offset + bytes);
+
+        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
+        req->bytes = end - req->offset;
+        req->overlap_bytes = req->bytes;
+
+        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
+        bdrv_wait_serialising_requests(req);
+    }
+#endif
+
     acb = (RawPosixAIOData) {
         .bs             = bs,
         .aio_fildes     = s->fd,
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS
  2019-11-01 15:25 [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS Max Reitz
                   ` (2 preceding siblings ...)
  2019-11-01 15:25 ` [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize Max Reitz
@ 2019-11-01 15:48 ` no-reply
  2019-11-04  8:29   ` Max Reitz
  2019-11-04  9:09 ` Max Reitz
  2019-11-04  9:45 ` Stefan Hajnoczi
  5 siblings, 1 reply; 21+ messages in thread
From: no-reply @ 2019-11-01 15:48 UTC (permalink / raw)
  To: mreitz
  Cc: kwolf, anton.nefedov, qemu-block, qemu-stable, qemu-devel,
	vsementsov, stefanha, den, mreitz

Patchew URL: https://patchew.org/QEMU/20191101152510.11719-1-mreitz@redhat.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  TEST    check-qtest-x86_64: tests/ahci-test
  TEST    check-unit: tests/test-aio-multithread
test-aio-multithread: /tmp/qemu-test/src/util/async.c:283: aio_ctx_finalize: Assertion `!qemu_lockcnt_count(&ctx->list_lock)' failed.
ERROR - too few tests run (expected 6, got 2)
make: *** [check-unit] Error 1
make: *** Waiting for unfinished jobs....
  TEST    iotest-qcow2: 013
  TEST    iotest-qcow2: 017
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=bf88cbb2b1b4497f8f4fc2b674903b08', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-vhhub41q/src/docker-src.2019-11-01-11.37.23.29941:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=bf88cbb2b1b4497f8f4fc2b674903b08
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-vhhub41q/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    11m36.434s
user    0m9.145s


The full log is available at
http://patchew.org/logs/20191101152510.11719-1-mreitz@redhat.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS
  2019-11-01 15:48 ` [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS no-reply
@ 2019-11-04  8:29   ` Max Reitz
  0 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-04  8:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, anton.nefedov, qemu-block, qemu-stable, vsementsov, stefanha, den


[-- Attachment #1.1: Type: text/plain, Size: 1016 bytes --]

On 01.11.19 16:48, no-reply@patchew.org wrote:
> Patchew URL: https://patchew.org/QEMU/20191101152510.11719-1-mreitz@redhat.com/
> 
> 
> 
> Hi,
> 
> This series failed the docker-quick@centos7 build test. Please find the testing commands and
> their output below. If you have Docker installed, you can probably reproduce it
> locally.
> 
> === TEST SCRIPT BEGIN ===
> #!/bin/bash
> make docker-image-centos7 V=1 NETWORK=1
> time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
> === TEST SCRIPT END ===
> 
>   TEST    check-qtest-x86_64: tests/ahci-test
>   TEST    check-unit: tests/test-aio-multithread
> test-aio-multithread: /tmp/qemu-test/src/util/async.c:283: aio_ctx_finalize: Assertion `!qemu_lockcnt_count(&ctx->list_lock)' failed.
> ERROR - too few tests run (expected 6, got 2)
> make: *** [check-unit] Error 1
> make: *** Waiting for unfinished jobs....

This doesn’t seem related to this series and I can’t reproduce it
locally, so I’ll just ignore it.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS
  2019-11-01 15:25 [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS Max Reitz
                   ` (3 preceding siblings ...)
  2019-11-01 15:48 ` [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS no-reply
@ 2019-11-04  9:09 ` Max Reitz
  2019-11-04  9:45 ` Stefan Hajnoczi
  5 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-04  9:09 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel,
	qemu-stable, Anton Nefedov, Stefan Hajnoczi, Denis V . Lunev


[-- Attachment #1.1: Type: text/plain, Size: 1550 bytes --]

On 01.11.19 16:25, Max Reitz wrote:
> Hi,
> 
> As I reasoned here:
> https://lists.nongnu.org/archive/html/qemu-block/2019-11/msg00027.html
> I’m no longer convinced of reverting c8bb23cbdbe.  I could see a
> significant performance improvement from it on ext4 with aio=native in a
> guest that does random 4k writes, and as such I feel it would be a
> regression to revert it for 4.2.
> 
> To work around the XFS corruption, we still need the other three patches
> from the series, of course.  We cannot restrict the workaround to just
> XFS, because maybe the file is on a remote filesystem and then we don’t
> know anything about the host configuration.
> 
> The performance impact should still be minimal because this just
> serializes post-EOF write-zeroes and data writes, and that just doesn’t
> happen very often, even with handle_alloc_space() in qcow2.
> 
> 
> I would love to have more time to make a decision, but there simply
> isn’t any.  Patches for 4.1.1 are to be collected on Monday, AFAIU.

I would have liked some reviews on this series (so I waited over the
weekend, even though I didn’t expect any), but now I’ve applied it
anyway (and sent a pull request).  I understand it was difficult last
week because of KVM Forum.

AFAIU we need it today for stable, and there didn’t seem to be any
opposition on these patches here, just on the revert of c8bb23cbdbe.

I welcome you to still review the series and then shout “STOP” at the
pull request if you find it necessary.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS
  2019-11-01 15:25 [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS Max Reitz
                   ` (4 preceding siblings ...)
  2019-11-04  9:09 ` Max Reitz
@ 2019-11-04  9:45 ` Stefan Hajnoczi
  5 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2019-11-04  9:45 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel,
	qemu-stable, Anton Nefedov, Denis V . Lunev

[-- Attachment #1: Type: text/plain, Size: 2667 bytes --]

On Fri, Nov 01, 2019 at 04:25:07PM +0100, Max Reitz wrote:
> Hi,
> 
> As I reasoned here:
> https://lists.nongnu.org/archive/html/qemu-block/2019-11/msg00027.html
> I’m no longer convinced of reverting c8bb23cbdbe.  I could see a
> significant performance improvement from it on ext4 with aio=native in a
> guest that does random 4k writes, and as such I feel it would be a
> regression to revert it for 4.2.
> 
> To work around the XFS corruption, we still need the other three patches
> from the series, of course.  We cannot restrict the workaround to just
> XFS, because maybe the file is on a remote filesystem and then we don’t
> know anything about the host configuration.
> 
> The performance impact should still be minimal because this just
> serializes post-EOF write-zeroes and data writes, and that just doesn’t
> happen very often, even with handle_alloc_space() in qcow2.
> 
> 
> I would love to have more time to make a decision, but there simply
> isn’t any.  Patches for 4.1.1 are to be collected on Monday, AFAIU.
> 
> 
> v2:
> - Dropped patch 1
> - Forgot a “coroutine_fn” in patch 2 (it isn’t really important,
>   qemu_coroutine_self() works in non-coroutine functions, too, but
>   calling bdrv_(co_)get_self_request() from a non-coroutine just doesn’t
>   make any sense)
> - Patch 3:
>   - Reverted the commit message to the RFC state to reflect that this is
>     specifically because of handle_alloc_space()
>   - Dropped the two lines that said there’d be no performance impact at
>     all because no driver would submit zero writes beyond the EOF
>     (because qcow2 now still does)
> 
> 
> git-backport-diff against v1:
> 
> Key:
> [----] : patches are identical
> [####] : number of functional differences between upstream/downstream patch
> [down] : patch is downstream-only
> The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively
> 
> 001/3:[----] [--] 'block: Make wait/mark serialising requests public'
> 002/3:[0004] [FC] 'block: Add bdrv_co_get_self_request()'
> 003/3:[0002] [FC] 'block/file-posix: Let post-EOF fallocate serialize'
> 
> 
> Max Reitz (3):
>   block: Make wait/mark serialising requests public
>   block: Add bdrv_co_get_self_request()
>   block/file-posix: Let post-EOF fallocate serialize
> 
>  include/block/block_int.h |  4 ++++
>  block/file-posix.c        | 36 +++++++++++++++++++++++++++++++++
>  block/io.c                | 42 ++++++++++++++++++++++++++++-----------
>  3 files changed, 70 insertions(+), 12 deletions(-)
> 
> -- 
> 2.21.0
> 

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2019-11-01 15:25 ` [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize Max Reitz
@ 2019-11-14 16:27   ` Christoph Hellwig
  2019-11-14 17:15     ` Max Reitz
  2020-06-02 14:43   ` Vladimir Sementsov-Ogievskiy
  2020-08-22 17:03   ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2019-11-14 16:27 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Anton Nefedov, qemu-block, qemu-stable, qemu-devel,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi, Denis V . Lunev

On Fri, Nov 01, 2019 at 04:25:10PM +0100, Max Reitz wrote:
> The XFS kernel driver has a bug that may cause data corruption for qcow2
> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
> in practice).

This has been fixed in the kernel a while ago.  I don't think it makes
sense to work around it in qemu.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2019-11-14 16:27   ` Christoph Hellwig
@ 2019-11-14 17:15     ` Max Reitz
  2019-11-14 17:16       ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2019-11-14 17:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kevin Wolf, Anton Nefedov, qemu-block, qemu-stable, qemu-devel,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi, Denis V . Lunev


[-- Attachment #1.1: Type: text/plain, Size: 754 bytes --]

On 14.11.19 17:27, Christoph Hellwig wrote:
> On Fri, Nov 01, 2019 at 04:25:10PM +0100, Max Reitz wrote:
>> The XFS kernel driver has a bug that may cause data corruption for qcow2
>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
>> in practice).
> 
> This has been fixed in the kernel a while ago.  I don't think it makes
> sense to work around it in qemu.

Has it?  It was my understanding that this is fixed by
https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=249bd9087a5264d2b8a974081870e2e27671b4dcwhich
has been merged only very recently and is on track to be part of Linux
5.5, as far as I understand.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2019-11-14 17:15     ` Max Reitz
@ 2019-11-14 17:16       ` Max Reitz
  0 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2019-11-14 17:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kevin Wolf, Anton Nefedov, qemu-block, qemu-stable, qemu-devel,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi, Denis V . Lunev


[-- Attachment #1.1: Type: text/plain, Size: 992 bytes --]

On 14.11.19 18:15, Max Reitz wrote:
> On 14.11.19 17:27, Christoph Hellwig wrote:
>> On Fri, Nov 01, 2019 at 04:25:10PM +0100, Max Reitz wrote:
>>> The XFS kernel driver has a bug that may cause data corruption for qcow2
>>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>>> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
>>> in practice).
>>
>> This has been fixed in the kernel a while ago.  I don't think it makes
>> sense to work around it in qemu.
> 
> Has it?  It was my understanding that this is fixed by
> https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=249bd9087a5264d2b8a974081870e2e27671b4dcwhich

Sorry, broke the link.  Let me try again:

https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=249bd9087a5264d2b8a974081870e2e27671b4dc

Max

> has been merged only very recently and is on track to be part of Linux
> 5.5, as far as I understand.
> 
> Max
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2019-11-01 15:25 ` [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize Max Reitz
  2019-11-14 16:27   ` Christoph Hellwig
@ 2020-06-02 14:43   ` Vladimir Sementsov-Ogievskiy
  2020-06-02 15:46     ` Max Reitz
  2020-08-22 17:03   ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-02 14:43 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, qemu-stable,
	Stefan Hajnoczi, Denis V . Lunev

01.11.2019 18:25, Max Reitz wrote:

Sorry for being late, I have some comments

> The XFS kernel driver has a bug that may cause data corruption for qcow2
> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
> in practice).
> 
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block/file-posix.c | 36 ++++++++++++++++++++++++++++++++++++
>   1 file changed, 36 insertions(+)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 0b7e904d48..1f0f61a02b 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -2721,6 +2721,42 @@ raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int bytes,
>       RawPosixAIOData acb;
>       ThreadPoolFunc *handler;
>   
> +#ifdef CONFIG_FALLOCATE
> +    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
> +        BdrvTrackedRequest *req;
> +        uint64_t end;
> +
> +        /*
> +         * This is a workaround for a bug in the Linux XFS driver,
> +         * where writes submitted through the AIO interface will be
> +         * discarded if they happen beyond a concurrently running
> +         * fallocate() that increases the file length (i.e., both the
> +         * write and the fallocate() happen beyond the EOF).
> +         *
> +         * To work around it, we extend the tracked request for this
> +         * zero write until INT64_MAX (effectively infinity), and mark
> +         * it as serializing.
> +         *
> +         * We have to enable this workaround for all filesystems and
> +         * AIO modes (not just XFS with aio=native), because for
> +         * remote filesystems we do not know the host configuration.
> +         */
> +
> +        req = bdrv_co_get_self_request(bs);
> +        assert(req);
> +        assert(req->type == BDRV_TRACKED_WRITE);
> +        assert(req->offset <= offset);
> +        assert(req->offset + req->bytes >= offset + bytes);

Why these assertions? TrackedRequest offset and bytes fields correspond to the original request. When request is being expanded to satisfy request_alignment, these fields are not updated.
So, maybe, we should assert overlap_offset and overlap_bytes?

> +
> +        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
> +        req->bytes = end - req->offset;

And I doubt that we should update req->bytes. We never updated it in other places, it corresponds to original request. It's enough to update overlap_bytes to achieve corresponding serialising.

> +        req->overlap_bytes = req->bytes;
> +
> +        bdrv_mark_request_serialising(req, bs->bl.request_alignment);

Not sure, how much should we care about request_alignment here, I think, it's enough to just set req->overlap_bytes = INT64_MAX - req->overlap_offest, but it doesn't really matter.

> +        bdrv_wait_serialising_requests(req);
> +    }
> +#endif
> +
>       acb = (RawPosixAIOData) {
>           .bs             = bs,
>           .aio_fildes     = s->fd,
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2020-06-02 14:43   ` Vladimir Sementsov-Ogievskiy
@ 2020-06-02 15:46     ` Max Reitz
  2020-06-02 16:16       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2020-06-02 15:46 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, qemu-stable,
	Stefan Hajnoczi, Denis V . Lunev


[-- Attachment #1.1: Type: text/plain, Size: 4956 bytes --]

On 02.06.20 16:43, Vladimir Sementsov-Ogievskiy wrote:
> 01.11.2019 18:25, Max Reitz wrote:
> 
> Sorry for being late, I have some comments

Uh, well.  Reasonable, but I hope you don’t mind me having no longer
having this patch fresh on my mind.

>> The XFS kernel driver has a bug that may cause data corruption for qcow2
>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
>> in practice).
>>
>> Cc: qemu-stable@nongnu.org
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/file-posix.c | 36 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 36 insertions(+)
>>
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index 0b7e904d48..1f0f61a02b 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -2721,6 +2721,42 @@ raw_do_pwrite_zeroes(BlockDriverState *bs,
>> int64_t offset, int bytes,
>>       RawPosixAIOData acb;
>>       ThreadPoolFunc *handler;
>>   +#ifdef CONFIG_FALLOCATE
>> +    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
>> +        BdrvTrackedRequest *req;
>> +        uint64_t end;
>> +
>> +        /*
>> +         * This is a workaround for a bug in the Linux XFS driver,
>> +         * where writes submitted through the AIO interface will be
>> +         * discarded if they happen beyond a concurrently running
>> +         * fallocate() that increases the file length (i.e., both the
>> +         * write and the fallocate() happen beyond the EOF).
>> +         *
>> +         * To work around it, we extend the tracked request for this
>> +         * zero write until INT64_MAX (effectively infinity), and mark
>> +         * it as serializing.
>> +         *
>> +         * We have to enable this workaround for all filesystems and
>> +         * AIO modes (not just XFS with aio=native), because for
>> +         * remote filesystems we do not know the host configuration.
>> +         */
>> +
>> +        req = bdrv_co_get_self_request(bs);
>> +        assert(req);
>> +        assert(req->type == BDRV_TRACKED_WRITE);
>> +        assert(req->offset <= offset);
>> +        assert(req->offset + req->bytes >= offset + bytes);
> 
> Why these assertions?

Mostly to see that bdrv_co_get_self_request() (introduced by the same
series) actually got the right request.  (I suppose.)

> TrackedRequest offset and bytes fields correspond
> to the original request. When request is being expanded to satisfy
> request_alignment, these fields are not updated.

Well, shrunk in this case, but OK.

> So, maybe, we should assert overlap_offset and overlap_bytes?

Maybe, but would that have any benefit?  Especially after this patch
having been in qemu for over half a year?

(Also, intuitively off the top of my head I don’t see how it would make
more sense to check overlap_offset and overlap_bytes, if all the
assertions are for is to see that we got the right request.
overlap_offset and overlap_bytes may still not exactly match @offset or
@bytes, respectively.)

Your suggestion makes it sound a bit like you have a different purpose
in mind what these assertions might be useful for...?

>> +
>> +        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
>> +        req->bytes = end - req->offset;
> 
> And I doubt that we should update req->bytes. We never updated it in
> other places, it corresponds to original request. It's enough to update
> overlap_bytes to achieve corresponding serialising.

Does it hurt?  If so, would you send a patch?

I assume you reply to this patch instead of writing a patch because you
have the same feeling of “It probably doesn’t really matter, so let’s
have a discussion first”.

My stance is: I don’t think it matters and this whole piece of code is a
hack that shouldn’t exist, obviously.  So I don’t really care how it
fits into all of our other code.

I would like to say I wouldn’t mind a patch to drop the req->bytes
assignment, but OTOH it would mean I’d have to review it and verify that
it’s indeed sufficient to set overlap_bytes.

If it’s in any way inconvenient for you that req->bytes is adjusted,
then of course please send one.

>> +        req->overlap_bytes = req->bytes;
>> +
>> +        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
> 
> Not sure, how much should we care about request_alignment here, I think,
> it's enough to just set req->overlap_bytes = INT64_MAX -
> req->overlap_offest, but it doesn't really matter.

As long as req->bytes is adjusted, we have to care, or the overlap_bytes
calculation in bdrv_mark_request_serialising will overflow.

Well, one could argue that it doesn’t matter because the MAX() will
still do the right thing, but overflowing is never nice.

(Of course, it probably doesn’t matter at all if we just wouldn’t touch
req->bytes.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2020-06-02 15:46     ` Max Reitz
@ 2020-06-02 16:16       ` Vladimir Sementsov-Ogievskiy
  2020-06-02 16:38         ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-02 16:16 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, qemu-stable,
	Stefan Hajnoczi, Denis V . Lunev

02.06.2020 18:46, Max Reitz wrote:
> On 02.06.20 16:43, Vladimir Sementsov-Ogievskiy wrote:
>> 01.11.2019 18:25, Max Reitz wrote:
>>
>> Sorry for being late, I have some comments
> 
> Uh, well.  Reasonable, but I hope you don’t mind me having no longer
> having this patch fresh on my mind.
> 
>>> The XFS kernel driver has a bug that may cause data corruption for qcow2
>>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>>> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
>>> in practice).
>>>
>>> Cc: qemu-stable@nongnu.org
>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>    block/file-posix.c | 36 ++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 36 insertions(+)
>>>
>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>> index 0b7e904d48..1f0f61a02b 100644
>>> --- a/block/file-posix.c
>>> +++ b/block/file-posix.c
>>> @@ -2721,6 +2721,42 @@ raw_do_pwrite_zeroes(BlockDriverState *bs,
>>> int64_t offset, int bytes,
>>>        RawPosixAIOData acb;
>>>        ThreadPoolFunc *handler;
>>>    +#ifdef CONFIG_FALLOCATE
>>> +    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
>>> +        BdrvTrackedRequest *req;
>>> +        uint64_t end;
>>> +
>>> +        /*
>>> +         * This is a workaround for a bug in the Linux XFS driver,
>>> +         * where writes submitted through the AIO interface will be
>>> +         * discarded if they happen beyond a concurrently running
>>> +         * fallocate() that increases the file length (i.e., both the
>>> +         * write and the fallocate() happen beyond the EOF).
>>> +         *
>>> +         * To work around it, we extend the tracked request for this
>>> +         * zero write until INT64_MAX (effectively infinity), and mark
>>> +         * it as serializing.
>>> +         *
>>> +         * We have to enable this workaround for all filesystems and
>>> +         * AIO modes (not just XFS with aio=native), because for
>>> +         * remote filesystems we do not know the host configuration.
>>> +         */
>>> +
>>> +        req = bdrv_co_get_self_request(bs);
>>> +        assert(req);
>>> +        assert(req->type == BDRV_TRACKED_WRITE);
>>> +        assert(req->offset <= offset);
>>> +        assert(req->offset + req->bytes >= offset + bytes);
>>
>> Why these assertions?
> 
> Mostly to see that bdrv_co_get_self_request() (introduced by the same
> series) actually got the right request.  (I suppose.)
> 
>> TrackedRequest offset and bytes fields correspond
>> to the original request. When request is being expanded to satisfy
>> request_alignment, these fields are not updated.
> 
> Well, shrunk in this case, but OK.
> 
>> So, maybe, we should assert overlap_offset and overlap_bytes?
> 
> Maybe, but would that have any benefit?  Especially after this patch
> having been in qemu for over half a year?
> 
> (Also, intuitively off the top of my head I don’t see how it would make
> more sense to check overlap_offset and overlap_bytes, if all the
> assertions are for is to see that we got the right request.
> overlap_offset and overlap_bytes may still not exactly match @offset or
> @bytes, respectively.)
> 
> Your suggestion makes it sound a bit like you have a different purpose
> in mind what these assertions might be useful for...?

No I just think it may have false-positives, when actual request is larger
than original. So offset may be < req->offset and req->offset + req->bytes may be
less than offset + bytes. And we will crash. I should make a reproducer to
prove it, but it seems possible.

> 
>>> +
>>> +        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
>>> +        req->bytes = end - req->offset;
>>
>> And I doubt that we should update req->bytes. We never updated it in
>> other places, it corresponds to original request. It's enough to update
>> overlap_bytes to achieve corresponding serialising.
> 
> Does it hurt?  If so, would you send a patch?
> 
> I assume you reply to this patch instead of writing a patch because you
> have the same feeling of “It probably doesn’t really matter, so let’s
> have a discussion first”.

1. yes, and
2. I probably don't see the full picture around tracked requests

> 
> My stance is: I don’t think it matters and this whole piece of code is a
> hack that shouldn’t exist, obviously.  So I don’t really care how it
> fits into all of our other code.
> 
> I would like to say I wouldn’t mind a patch to drop the req->bytes
> assignment, but OTOH it would mean I’d have to review it and verify that
> it’s indeed sufficient to set overlap_bytes.
> 
> If it’s in any way inconvenient for you that req->bytes is adjusted,
> then of course please send one.
> 
>>> +        req->overlap_bytes = req->bytes;
>>> +
>>> +        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
>>
>> Not sure, how much should we care about request_alignment here, I think,
>> it's enough to just set req->overlap_bytes = INT64_MAX -
>> req->overlap_offest, but it doesn't really matter.
> 
> As long as req->bytes is adjusted, we have to care, or the overlap_bytes
> calculation in bdrv_mark_request_serialising will overflow.
> 
> Well, one could argue that it doesn’t matter because the MAX() will
> still do the right thing, but overflowing is never nice.

Hmm I think, if reduce it to just INT64_MAX, we should pass 1 as align to bdrv_mark_request_serialising.

> 
> (Of course, it probably doesn’t matter at all if we just wouldn’t touch
> req->bytes.)
> 

OK, thanks for the answer, I'll prepare a patch.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2020-06-02 16:16       ` Vladimir Sementsov-Ogievskiy
@ 2020-06-02 16:38         ` Max Reitz
  2020-06-02 17:01           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2020-06-02 16:38 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, qemu-stable,
	Stefan Hajnoczi, Denis V . Lunev


[-- Attachment #1.1: Type: text/plain, Size: 6813 bytes --]

On 02.06.20 18:16, Vladimir Sementsov-Ogievskiy wrote:
> 02.06.2020 18:46, Max Reitz wrote:
>> On 02.06.20 16:43, Vladimir Sementsov-Ogievskiy wrote:
>>> 01.11.2019 18:25, Max Reitz wrote:
>>>
>>> Sorry for being late, I have some comments
>>
>> Uh, well.  Reasonable, but I hope you don’t mind me having no longer
>> having this patch fresh on my mind.
>>
>>>> The XFS kernel driver has a bug that may cause data corruption for
>>>> qcow2
>>>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>>>> treating post-EOF fallocates as serializing up until infinity
>>>> (INT64_MAX
>>>> in practice).
>>>>
>>>> Cc: qemu-stable@nongnu.org
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>    block/file-posix.c | 36 ++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 36 insertions(+)
>>>>
>>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>>> index 0b7e904d48..1f0f61a02b 100644
>>>> --- a/block/file-posix.c
>>>> +++ b/block/file-posix.c
>>>> @@ -2721,6 +2721,42 @@ raw_do_pwrite_zeroes(BlockDriverState *bs,
>>>> int64_t offset, int bytes,
>>>>        RawPosixAIOData acb;
>>>>        ThreadPoolFunc *handler;
>>>>    +#ifdef CONFIG_FALLOCATE
>>>> +    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
>>>> +        BdrvTrackedRequest *req;
>>>> +        uint64_t end;
>>>> +
>>>> +        /*
>>>> +         * This is a workaround for a bug in the Linux XFS driver,
>>>> +         * where writes submitted through the AIO interface will be
>>>> +         * discarded if they happen beyond a concurrently running
>>>> +         * fallocate() that increases the file length (i.e., both the
>>>> +         * write and the fallocate() happen beyond the EOF).
>>>> +         *
>>>> +         * To work around it, we extend the tracked request for this
>>>> +         * zero write until INT64_MAX (effectively infinity), and mark
>>>> +         * it as serializing.
>>>> +         *
>>>> +         * We have to enable this workaround for all filesystems and
>>>> +         * AIO modes (not just XFS with aio=native), because for
>>>> +         * remote filesystems we do not know the host configuration.
>>>> +         */
>>>> +
>>>> +        req = bdrv_co_get_self_request(bs);
>>>> +        assert(req);
>>>> +        assert(req->type == BDRV_TRACKED_WRITE);
>>>> +        assert(req->offset <= offset);
>>>> +        assert(req->offset + req->bytes >= offset + bytes);
>>>
>>> Why these assertions?
>>
>> Mostly to see that bdrv_co_get_self_request() (introduced by the same
>> series) actually got the right request.  (I suppose.)
>>
>>> TrackedRequest offset and bytes fields correspond
>>> to the original request. When request is being expanded to satisfy
>>> request_alignment, these fields are not updated.
>>
>> Well, shrunk in this case, but OK.
>>
>>> So, maybe, we should assert overlap_offset and overlap_bytes?
>>
>> Maybe, but would that have any benefit?  Especially after this patch
>> having been in qemu for over half a year?
>>
>> (Also, intuitively off the top of my head I don’t see how it would make
>> more sense to check overlap_offset and overlap_bytes, if all the
>> assertions are for is to see that we got the right request.
>> overlap_offset and overlap_bytes may still not exactly match @offset or
>> @bytes, respectively.)
>>
>> Your suggestion makes it sound a bit like you have a different purpose
>> in mind what these assertions might be useful for...?
> 
> No I just think it may have false-positives, when actual request is larger
> than original.

Seems like a bug.  Why would we zero more than originally requested?

> So offset may be < req->offset and req->offset +
> req->bytes may be
> less than offset + bytes. And we will crash. I should make a reproducer to
> prove it, but it seems possible.

I’m definitely curious.

>>>> +
>>>> +        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
>>>> +        req->bytes = end - req->offset;
>>>
>>> And I doubt that we should update req->bytes. We never updated it in
>>> other places, it corresponds to original request. It's enough to update
>>> overlap_bytes to achieve corresponding serialising.
>>
>> Does it hurt?  If so, would you send a patch?
>>
>> I assume you reply to this patch instead of writing a patch because you
>> have the same feeling of “It probably doesn’t really matter, so let’s
>> have a discussion first”.
> 
> 1. yes, and
> 2. I probably don't see the full picture around tracked requests

Neither do I, that’s for sure.

>> My stance is: I don’t think it matters and this whole piece of code is a
>> hack that shouldn’t exist, obviously.  So I don’t really care how it
>> fits into all of our other code.
>>
>> I would like to say I wouldn’t mind a patch to drop the req->bytes
>> assignment, but OTOH it would mean I’d have to review it and verify that
>> it’s indeed sufficient to set overlap_bytes.
>>
>> If it’s in any way inconvenient for you that req->bytes is adjusted,
>> then of course please send one.
>>
>>>> +        req->overlap_bytes = req->bytes;
>>>> +
>>>> +        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
>>>
>>> Not sure, how much should we care about request_alignment here, I think,
>>> it's enough to just set req->overlap_bytes = INT64_MAX -
>>> req->overlap_offest, but it doesn't really matter.
>>
>> As long as req->bytes is adjusted, we have to care, or the overlap_bytes
>> calculation in bdrv_mark_request_serialising will overflow.
>>
>> Well, one could argue that it doesn’t matter because the MAX() will
>> still do the right thing, but overflowing is never nice.
> 
> Hmm I think, if reduce it to just INT64_MAX, we should pass 1 as align
> to bdrv_mark_request_serialising.

True.

>> (Of course, it probably doesn’t matter at all if we just wouldn’t touch
>> req->bytes.)
>>
> 
> OK, thanks for the answer, I'll prepare a patch.

OK?  I’m not sure where the benefit is (apart from the perhaps failing
assertions).  So it still looks to me like putting too much energy into
a hack.

(I think the original reason I set both req->bytes and
req->overlap_bytes was actually because I just wanted to be sure, and
didn’t want to have to look too hard whether either would be sufficient.)

((Please also note that I can’t guarantee I will review your patch in a
timely manner, for one thing because I can already rarely give that
promise (as you are probably painfully aware...); and now there’s also
two weeks of mail on top for me to wade through after PTO.  So if
there’s no reason to change anything apart from saving two LoC, well.
Failing assertions are a different matter altogether, though, of course.))

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2020-06-02 16:38         ` Max Reitz
@ 2020-06-02 17:01           ` Vladimir Sementsov-Ogievskiy
  2020-06-02 17:08             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-02 17:01 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, qemu-stable,
	Stefan Hajnoczi, Denis V . Lunev

02.06.2020 19:38, Max Reitz wrote:
> On 02.06.20 18:16, Vladimir Sementsov-Ogievskiy wrote:
>> 02.06.2020 18:46, Max Reitz wrote:
>>> On 02.06.20 16:43, Vladimir Sementsov-Ogievskiy wrote:
>>>> 01.11.2019 18:25, Max Reitz wrote:
>>>>
>>>> Sorry for being late, I have some comments
>>>
>>> Uh, well.  Reasonable, but I hope you don’t mind me having no longer
>>> having this patch fresh on my mind.
>>>
>>>>> The XFS kernel driver has a bug that may cause data corruption for
>>>>> qcow2
>>>>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>>>>> treating post-EOF fallocates as serializing up until infinity
>>>>> (INT64_MAX
>>>>> in practice).
>>>>>
>>>>> Cc: qemu-stable@nongnu.org
>>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>>> ---
>>>>>     block/file-posix.c | 36 ++++++++++++++++++++++++++++++++++++
>>>>>     1 file changed, 36 insertions(+)
>>>>>
>>>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>>>> index 0b7e904d48..1f0f61a02b 100644
>>>>> --- a/block/file-posix.c
>>>>> +++ b/block/file-posix.c
>>>>> @@ -2721,6 +2721,42 @@ raw_do_pwrite_zeroes(BlockDriverState *bs,
>>>>> int64_t offset, int bytes,
>>>>>         RawPosixAIOData acb;
>>>>>         ThreadPoolFunc *handler;
>>>>>     +#ifdef CONFIG_FALLOCATE
>>>>> +    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
>>>>> +        BdrvTrackedRequest *req;
>>>>> +        uint64_t end;
>>>>> +
>>>>> +        /*
>>>>> +         * This is a workaround for a bug in the Linux XFS driver,
>>>>> +         * where writes submitted through the AIO interface will be
>>>>> +         * discarded if they happen beyond a concurrently running
>>>>> +         * fallocate() that increases the file length (i.e., both the
>>>>> +         * write and the fallocate() happen beyond the EOF).
>>>>> +         *
>>>>> +         * To work around it, we extend the tracked request for this
>>>>> +         * zero write until INT64_MAX (effectively infinity), and mark
>>>>> +         * it as serializing.
>>>>> +         *
>>>>> +         * We have to enable this workaround for all filesystems and
>>>>> +         * AIO modes (not just XFS with aio=native), because for
>>>>> +         * remote filesystems we do not know the host configuration.
>>>>> +         */
>>>>> +
>>>>> +        req = bdrv_co_get_self_request(bs);
>>>>> +        assert(req);
>>>>> +        assert(req->type == BDRV_TRACKED_WRITE);
>>>>> +        assert(req->offset <= offset);
>>>>> +        assert(req->offset + req->bytes >= offset + bytes);
>>>>
>>>> Why these assertions?
>>>
>>> Mostly to see that bdrv_co_get_self_request() (introduced by the same
>>> series) actually got the right request.  (I suppose.)
>>>
>>>> TrackedRequest offset and bytes fields correspond
>>>> to the original request. When request is being expanded to satisfy
>>>> request_alignment, these fields are not updated.
>>>
>>> Well, shrunk in this case, but OK.
>>>
>>>> So, maybe, we should assert overlap_offset and overlap_bytes?
>>>
>>> Maybe, but would that have any benefit?  Especially after this patch
>>> having been in qemu for over half a year?
>>>
>>> (Also, intuitively off the top of my head I don’t see how it would make
>>> more sense to check overlap_offset and overlap_bytes, if all the
>>> assertions are for is to see that we got the right request.
>>> overlap_offset and overlap_bytes may still not exactly match @offset or
>>> @bytes, respectively.)
>>>
>>> Your suggestion makes it sound a bit like you have a different purpose
>>> in mind what these assertions might be useful for...?
>>
>> No I just think it may have false-positives, when actual request is larger
>> than original.
> 
> Seems like a bug.  Why would we zero more than originally requested?

Hmm, you are right, seems it's not possible. We may expand the request to do COW,
but it will never produce write-zero request larger than original one.

(I really tried to reproduce to understand it :)

> 
>> So offset may be < req->offset and req->offset +
>> req->bytes may be
>> less than offset + bytes. And we will crash. I should make a reproducer to
>> prove it, but it seems possible.
> 
> I’m definitely curious.
> 
>>>>> +
>>>>> +        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
>>>>> +        req->bytes = end - req->offset;
>>>>
>>>> And I doubt that we should update req->bytes. We never updated it in
>>>> other places, it corresponds to original request. It's enough to update
>>>> overlap_bytes to achieve corresponding serialising.
>>>
>>> Does it hurt?  If so, would you send a patch?
>>>
>>> I assume you reply to this patch instead of writing a patch because you
>>> have the same feeling of “It probably doesn’t really matter, so let’s
>>> have a discussion first”.
>>
>> 1. yes, and
>> 2. I probably don't see the full picture around tracked requests
> 
> Neither do I, that’s for sure.
> 
>>> My stance is: I don’t think it matters and this whole piece of code is a
>>> hack that shouldn’t exist, obviously.  So I don’t really care how it
>>> fits into all of our other code.
>>>
>>> I would like to say I wouldn’t mind a patch to drop the req->bytes
>>> assignment, but OTOH it would mean I’d have to review it and verify that
>>> it’s indeed sufficient to set overlap_bytes.
>>>
>>> If it’s in any way inconvenient for you that req->bytes is adjusted,
>>> then of course please send one.
>>>
>>>>> +        req->overlap_bytes = req->bytes;
>>>>> +
>>>>> +        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
>>>>
>>>> Not sure, how much should we care about request_alignment here, I think,
>>>> it's enough to just set req->overlap_bytes = INT64_MAX -
>>>> req->overlap_offest, but it doesn't really matter.
>>>
>>> As long as req->bytes is adjusted, we have to care, or the overlap_bytes
>>> calculation in bdrv_mark_request_serialising will overflow.
>>>
>>> Well, one could argue that it doesn’t matter because the MAX() will
>>> still do the right thing, but overflowing is never nice.
>>
>> Hmm I think, if reduce it to just INT64_MAX, we should pass 1 as align
>> to bdrv_mark_request_serialising.
> 
> True.
> 
>>> (Of course, it probably doesn’t matter at all if we just wouldn’t touch
>>> req->bytes.)
>>>
>>
>> OK, thanks for the answer, I'll prepare a patch.
> 
> OK?  I’m not sure where the benefit is (apart from the perhaps failing
> assertions).  So it still looks to me like putting too much energy into
> a hack.
> 
> (I think the original reason I set both req->bytes and
> req->overlap_bytes was actually because I just wanted to be sure, and
> didn’t want to have to look too hard whether either would be sufficient.)
> 
> ((Please also note that I can’t guarantee I will review your patch in a
> timely manner, for one thing because I can already rarely give that
> promise (as you are probably painfully aware...); and now there’s also
> two weeks of mail on top for me to wade through after PTO.  So if
> there’s no reason to change anything apart from saving two LoC, well.
> Failing assertions are a different matter altogether, though, of course.))
> 
> Max
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2020-06-02 17:01           ` Vladimir Sementsov-Ogievskiy
@ 2020-06-02 17:08             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-02 17:08 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, qemu-stable,
	Stefan Hajnoczi, Denis V . Lunev

02.06.2020 20:01, Vladimir Sementsov-Ogievskiy wrote:
> 02.06.2020 19:38, Max Reitz wrote:
>> On 02.06.20 18:16, Vladimir Sementsov-Ogievskiy wrote:
>>> 02.06.2020 18:46, Max Reitz wrote:
>>>> On 02.06.20 16:43, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 01.11.2019 18:25, Max Reitz wrote:
>>>>>
>>>>> Sorry for being late, I have some comments
>>>>
>>>> Uh, well.  Reasonable, but I hope you don’t mind me having no longer
>>>> having this patch fresh on my mind.
>>>>
>>>>>> The XFS kernel driver has a bug that may cause data corruption for
>>>>>> qcow2
>>>>>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>>>>>> treating post-EOF fallocates as serializing up until infinity
>>>>>> (INT64_MAX
>>>>>> in practice).
>>>>>>
>>>>>> Cc: qemu-stable@nongnu.org
>>>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>>>> ---
>>>>>>     block/file-posix.c | 36 ++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 36 insertions(+)
>>>>>>
>>>>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>>>>> index 0b7e904d48..1f0f61a02b 100644
>>>>>> --- a/block/file-posix.c
>>>>>> +++ b/block/file-posix.c
>>>>>> @@ -2721,6 +2721,42 @@ raw_do_pwrite_zeroes(BlockDriverState *bs,
>>>>>> int64_t offset, int bytes,
>>>>>>         RawPosixAIOData acb;
>>>>>>         ThreadPoolFunc *handler;
>>>>>>     +#ifdef CONFIG_FALLOCATE
>>>>>> +    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
>>>>>> +        BdrvTrackedRequest *req;
>>>>>> +        uint64_t end;
>>>>>> +
>>>>>> +        /*
>>>>>> +         * This is a workaround for a bug in the Linux XFS driver,
>>>>>> +         * where writes submitted through the AIO interface will be
>>>>>> +         * discarded if they happen beyond a concurrently running
>>>>>> +         * fallocate() that increases the file length (i.e., both the
>>>>>> +         * write and the fallocate() happen beyond the EOF).
>>>>>> +         *
>>>>>> +         * To work around it, we extend the tracked request for this
>>>>>> +         * zero write until INT64_MAX (effectively infinity), and mark
>>>>>> +         * it as serializing.
>>>>>> +         *
>>>>>> +         * We have to enable this workaround for all filesystems and
>>>>>> +         * AIO modes (not just XFS with aio=native), because for
>>>>>> +         * remote filesystems we do not know the host configuration.
>>>>>> +         */
>>>>>> +
>>>>>> +        req = bdrv_co_get_self_request(bs);
>>>>>> +        assert(req);
>>>>>> +        assert(req->type == BDRV_TRACKED_WRITE);
>>>>>> +        assert(req->offset <= offset);
>>>>>> +        assert(req->offset + req->bytes >= offset + bytes);
>>>>>
>>>>> Why these assertions?
>>>>
>>>> Mostly to see that bdrv_co_get_self_request() (introduced by the same
>>>> series) actually got the right request.  (I suppose.)
>>>>
>>>>> TrackedRequest offset and bytes fields correspond
>>>>> to the original request. When request is being expanded to satisfy
>>>>> request_alignment, these fields are not updated.
>>>>
>>>> Well, shrunk in this case, but OK.
>>>>
>>>>> So, maybe, we should assert overlap_offset and overlap_bytes?
>>>>
>>>> Maybe, but would that have any benefit?  Especially after this patch
>>>> having been in qemu for over half a year?
>>>>
>>>> (Also, intuitively off the top of my head I don’t see how it would make
>>>> more sense to check overlap_offset and overlap_bytes, if all the
>>>> assertions are for is to see that we got the right request.
>>>> overlap_offset and overlap_bytes may still not exactly match @offset or
>>>> @bytes, respectively.)
>>>>
>>>> Your suggestion makes it sound a bit like you have a different purpose
>>>> in mind what these assertions might be useful for...?
>>>
>>> No I just think it may have false-positives, when actual request is larger
>>> than original.
>>
>> Seems like a bug.  Why would we zero more than originally requested?
> 
> Hmm, you are right, seems it's not possible. We may expand the request to do COW,
> but it will never produce write-zero request larger than original one.
> 
> (I really tried to reproduce to understand it :)
> 
>>
>>> So offset may be < req->offset and req->offset +
>>> req->bytes may be
>>> less than offset + bytes. And we will crash. I should make a reproducer to
>>> prove it, but it seems possible.
>>
>> I’m definitely curious.
>>
>>>>>> +
>>>>>> +        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
>>>>>> +        req->bytes = end - req->offset;
>>>>>
>>>>> And I doubt that we should update req->bytes. We never updated it in
>>>>> other places, it corresponds to original request. It's enough to update
>>>>> overlap_bytes to achieve corresponding serialising.
>>>>
>>>> Does it hurt?  If so, would you send a patch?
>>>>
>>>> I assume you reply to this patch instead of writing a patch because you
>>>> have the same feeling of “It probably doesn’t really matter, so let’s
>>>> have a discussion first”.
>>>
>>> 1. yes, and
>>> 2. I probably don't see the full picture around tracked requests
>>
>> Neither do I, that’s for sure.
>>
>>>> My stance is: I don’t think it matters and this whole piece of code is a
>>>> hack that shouldn’t exist, obviously.  So I don’t really care how it
>>>> fits into all of our other code.
>>>>
>>>> I would like to say I wouldn’t mind a patch to drop the req->bytes
>>>> assignment, but OTOH it would mean I’d have to review it and verify that
>>>> it’s indeed sufficient to set overlap_bytes.
>>>>
>>>> If it’s in any way inconvenient for you that req->bytes is adjusted,
>>>> then of course please send one.
>>>>
>>>>>> +        req->overlap_bytes = req->bytes;
>>>>>> +
>>>>>> +        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
>>>>>
>>>>> Not sure, how much should we care about request_alignment here, I think,
>>>>> it's enough to just set req->overlap_bytes = INT64_MAX -
>>>>> req->overlap_offest, but it doesn't really matter.
>>>>
>>>> As long as req->bytes is adjusted, we have to care, or the overlap_bytes
>>>> calculation in bdrv_mark_request_serialising will overflow.
>>>>
>>>> Well, one could argue that it doesn’t matter because the MAX() will
>>>> still do the right thing, but overflowing is never nice.
>>>
>>> Hmm I think, if reduce it to just INT64_MAX, we should pass 1 as align
>>> to bdrv_mark_request_serialising.
>>
>> True.
>>
>>>> (Of course, it probably doesn’t matter at all if we just wouldn’t touch
>>>> req->bytes.)
>>>>
>>>
>>> OK, thanks for the answer, I'll prepare a patch.
>>
>> OK?  I’m not sure where the benefit is (apart from the perhaps failing
>> assertions).  So it still looks to me like putting too much energy into
>> a hack.
>>

Yes, now when I see that assertions should not fail I don't want to care
more neither send a patch :)

>> (I think the original reason I set both req->bytes and
>> req->overlap_bytes was actually because I just wanted to be sure, and
>> didn’t want to have to look too hard whether either would be sufficient.)
>>
>> ((Please also note that I can’t guarantee I will review your patch in a
>> timely manner, for one thing because I can already rarely give that
>> promise (as you are probably painfully aware...); and now there’s also
>> two weeks of mail on top for me to wade through after PTO.  So if
>> there’s no reason to change anything apart from saving two LoC, well.
>> Failing assertions are a different matter altogether, though, of course.))
>>
>> Max
>>
> 
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2019-11-01 15:25 ` [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize Max Reitz
  2019-11-14 16:27   ` Christoph Hellwig
  2020-06-02 14:43   ` Vladimir Sementsov-Ogievskiy
@ 2020-08-22 17:03   ` Vladimir Sementsov-Ogievskiy
  2020-08-22 17:04     ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-22 17:03 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, qemu-stable,
	Stefan Hajnoczi, Denis V . Lunev

01.11.2019 18:25, Max Reitz wrote:
> The XFS kernel driver has a bug that may cause data corruption for qcow2
> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
> in practice).

Hi! I'm doing some investigation, and here is an interesting result:

Consider the following test:

img=/ssd/x.qcow2; ./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 50000 -d 64 -f qcow2 -o 1k -s 64k -t none -w $img


Bisecting results changes between 2.12 and 5.1, I found the following:

2.12: ~20s

....

c8bb23cbdbe32 "qcow2: skip writing zero buffers to empty COW areas"  -> becomes ~12s  [1]

....


292d06b925b27 "block/file-posix: Let post-EOF fallocate serialize"   -> becomes ~9s  [2]

....

v5.1 ~9s


And [1] is obvious, it is the main purpose of c8bb23cbdbe32. But [2] is a surprise for me.. Any ideas?

===

just to check: staying at c8bb23cbdbe32 I revert c8bb23cbdbe32 and get again ~19.7s. So [2] doesn't substitute [1].

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2020-08-22 17:03   ` Vladimir Sementsov-Ogievskiy
@ 2020-08-22 17:04     ` Vladimir Sementsov-Ogievskiy
  2020-08-26  8:23       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-22 17:04 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, qemu-stable,
	Stefan Hajnoczi, Denis V . Lunev

22.08.2020 20:03, Vladimir Sementsov-Ogievskiy wrote:
> 01.11.2019 18:25, Max Reitz wrote:
>> The XFS kernel driver has a bug that may cause data corruption for qcow2
>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
>> in practice).
> 
> Hi! I'm doing some investigation, and here is an interesting result:
> 
> Consider the following test:
> 
> img=/ssd/x.qcow2; ./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 50000 -d 64 -f qcow2 -o 1k -s 64k -t none -w $img
> 
> 
> Bisecting results changes between 2.12 and 5.1, I found the following:
> 
> 2.12: ~20s
> 
> ....
> 
> c8bb23cbdbe32 "qcow2: skip writing zero buffers to empty COW areas"  -> becomes ~12s  [1]
> 
> ....
> 
> 
> 292d06b925b27 "block/file-posix: Let post-EOF fallocate serialize"   -> becomes ~9s  [2]
> 
> ....
> 
> v5.1 ~9s
> 
> 
> And [1] is obvious, it is the main purpose of c8bb23cbdbe32. But [2] is a surprise for me.. Any ideas?
> 
> ===
> 
> just to check: staying at c8bb23cbdbe32 I revert c8bb23cbdbe32 and get again ~19.7s. So [2] doesn't substitute [1].
> 

Note, it's all ext4.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2020-08-22 17:04     ` Vladimir Sementsov-Ogievskiy
@ 2020-08-26  8:23       ` Vladimir Sementsov-Ogievskiy
  2020-08-26 11:20         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-26  8:23 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, qemu-stable,
	Stefan Hajnoczi, Denis V . Lunev

22.08.2020 20:04, Vladimir Sementsov-Ogievskiy wrote:
> 22.08.2020 20:03, Vladimir Sementsov-Ogievskiy wrote:
>> 01.11.2019 18:25, Max Reitz wrote:
>>> The XFS kernel driver has a bug that may cause data corruption for qcow2
>>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>>> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
>>> in practice).
>>
>> Hi! I'm doing some investigation, and here is an interesting result:
>>
>> Consider the following test:
>>
>> img=/ssd/x.qcow2; ./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 50000 -d 64 -f qcow2 -o 1k -s 64k -t none -w $img
>>
>>
>> Bisecting results changes between 2.12 and 5.1, I found the following:
>>
>> 2.12: ~20s
>>
>> ....
>>
>> c8bb23cbdbe32 "qcow2: skip writing zero buffers to empty COW areas"  -> becomes ~12s  [1]
>>
>> ....
>>
>>
>> 292d06b925b27 "block/file-posix: Let post-EOF fallocate serialize"   -> becomes ~9s  [2]
>>
>> ....
>>
>> v5.1 ~9s
>>
>>
>> And [1] is obvious, it is the main purpose of c8bb23cbdbe32. But [2] is a surprise for me.. Any ideas?
>>
>> ===
>>
>> just to check: staying at c8bb23cbdbe32 I revert c8bb23cbdbe32 and get again ~19.7s. So [2] doesn't substitute [1].
>>
> 
> Note, it's all ext4.
> 


Let's go further:

If I add -n, to use aio native:
./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 50000 -d 64 -f qcow2 -o 1k -s 64k -t none -n -w $img;

Then this patch doesn't have such effect..

But if consider the test without an offset on hdd, the effect is very significant:

./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 45000 -d 64 -f qcow2  -s 16k -t none -n -w $img;

v2.12 ~10.4s
c8bb23cbdbe32^ ~10.6s
c8bb23cbdbe32 qcow2: skip writing zero buffers to empty COW areas: ~16.7s : degradation!!!
292d06b925b27^ ~16.4s
292d06b925 block/file-posix: Let post-EOF fallocate serialize: ~7.6s: great improvement !

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
  2020-08-26  8:23       ` Vladimir Sementsov-Ogievskiy
@ 2020-08-26 11:20         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-08-26 11:20 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: qemu-devel, qemu-stable, Denis V . Lunev, Kevin Wolf,
	Stefan Hajnoczi, Anton Nefedov

26.08.2020 11:23, Vladimir Sementsov-Ogievskiy wrote:
> 22.08.2020 20:04, Vladimir Sementsov-Ogievskiy wrote:
>> 22.08.2020 20:03, Vladimir Sementsov-Ogievskiy wrote:
>>> 01.11.2019 18:25, Max Reitz wrote:
>>>> The XFS kernel driver has a bug that may cause data corruption for qcow2
>>>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>>>> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
>>>> in practice).
>>>
>>> Hi! I'm doing some investigation, and here is an interesting result:
>>>
>>> Consider the following test:
>>>
>>> img=/ssd/x.qcow2; ./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 50000 -d 64 -f qcow2 -o 1k -s 64k -t none -w $img
>>>
>>>
>>> Bisecting results changes between 2.12 and 5.1, I found the following:
>>>
>>> 2.12: ~20s
>>>
>>> ....
>>>
>>> c8bb23cbdbe32 "qcow2: skip writing zero buffers to empty COW areas"  -> becomes ~12s  [1]
>>>
>>> ....
>>>
>>>
>>> 292d06b925b27 "block/file-posix: Let post-EOF fallocate serialize"   -> becomes ~9s  [2]
>>>
>>> ....
>>>
>>> v5.1 ~9s
>>>
>>>
>>> And [1] is obvious, it is the main purpose of c8bb23cbdbe32. But [2] is a surprise for me.. Any ideas?
>>>
>>> ===
>>>
>>> just to check: staying at c8bb23cbdbe32 I revert c8bb23cbdbe32 and get again ~19.7s. So [2] doesn't substitute [1].
>>>
>>
>> Note, it's all ext4.
>>
> 
> 
> Let's go further:
> 
> If I add -n, to use aio native:
> ./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 50000 -d 64 -f qcow2 -o 1k -s 64k -t none -n -w $img;
> 
> Then this patch doesn't have such effect..
> 
> But if consider the test without an offset on hdd, the effect is very significant:
> 
> ./qemu-img create -f qcow2 $img 16G; ./qemu-img bench -c 45000 -d 64 -f qcow2  -s 16k -t none -n -w $img;
> 
> v2.12 ~10.4s
> c8bb23cbdbe32^ ~10.6s
> c8bb23cbdbe32 qcow2: skip writing zero buffers to empty COW areas: ~16.7s : degradation!!!
> 292d06b925b27^ ~16.4s
> 292d06b925 block/file-posix: Let post-EOF fallocate serialize: ~7.6s: great improvement !
> 

Hmm, a bit more accurate results (use separate partition on non-root hard drive). I post just numbers I get from iterations, not calculating the average.

v2.12.0 9.227 8.999 9.059 8.793
c8bb23cbdbe32^ 9.264 8.996 8.969
c8bb23cbdbe32 20.694 16.050 15.898 15.918 16.136
292d06b925b27^ 15.509 15.343 16.549
292d06b925 8.368 8.529 8.342 7.503 8.361 8.353 8.416 13.043 12.594 7.898 7.907 11.914

292d06b925 + don't-handle-alloc 18.730 9.321 9.016 9.037 18.286 18.268

(don't handle alloc is don't call qcow2_handle_alloc function which is actual revert of c8bb23cbdbe32)

very interesting..

292d06b925^ + don't-handle-alloc 9.200 9.285 9.297 9.296 9.194 9.332 9.155

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-08-26 11:21 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-01 15:25 [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS Max Reitz
2019-11-01 15:25 ` [PATCH for-4.2 v2 1/3] block: Make wait/mark serialising requests public Max Reitz
2019-11-01 15:25 ` [PATCH for-4.2 v2 2/3] block: Add bdrv_co_get_self_request() Max Reitz
2019-11-01 15:25 ` [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize Max Reitz
2019-11-14 16:27   ` Christoph Hellwig
2019-11-14 17:15     ` Max Reitz
2019-11-14 17:16       ` Max Reitz
2020-06-02 14:43   ` Vladimir Sementsov-Ogievskiy
2020-06-02 15:46     ` Max Reitz
2020-06-02 16:16       ` Vladimir Sementsov-Ogievskiy
2020-06-02 16:38         ` Max Reitz
2020-06-02 17:01           ` Vladimir Sementsov-Ogievskiy
2020-06-02 17:08             ` Vladimir Sementsov-Ogievskiy
2020-08-22 17:03   ` Vladimir Sementsov-Ogievskiy
2020-08-22 17:04     ` Vladimir Sementsov-Ogievskiy
2020-08-26  8:23       ` Vladimir Sementsov-Ogievskiy
2020-08-26 11:20         ` Vladimir Sementsov-Ogievskiy
2019-11-01 15:48 ` [PATCH for-4.2 v2 0/3] qcow2: Fix data corruption on XFS no-reply
2019-11-04  8:29   ` Max Reitz
2019-11-04  9:09 ` Max Reitz
2019-11-04  9:45 ` Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).