qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/3] block/file-posix: Work around XFS bug
@ 2019-10-25  9:58 Max Reitz
  2019-10-25  9:58 ` [RFC 1/3] block: Make wait/mark serialising requests public Max Reitz
                   ` (7 more replies)
  0 siblings, 8 replies; 47+ messages in thread
From: Max Reitz @ 2019-10-25  9:58 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, Max Reitz,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi

Hi,

It seems to me that there is a bug in Linux’s XFS kernel driver, as
I’ve explained here:

https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html

In combination with our commit c8bb23cbdbe32f, this may lead to guest
data corruption when using qcow2 images on XFS with aio=native.

We can’t wait until the XFS kernel driver is fixed, we should work
around the problem ourselves.

This is an RFC for two reasons:
(1) I don’t know whether this is the right way to address the issue,
(2) Ideally, we should detect whether the XFS kernel driver is fixed and
    if so stop applying the workaround.
    I don’t know how we would go about this, so this series doesn’t do
    it.  (Hence it’s an RFC.)
(3) Perhaps it’s a bit of a layering violation to let the file-posix
    driver access and modify a BdrvTrackedRequest object.

As for how we can address the issue, I see three ways:
(1) The one presented in this series: On XFS with aio=native, we extend
    tracked requests for post-EOF fallocate() calls (i.e., write-zero
    operations) to reach until infinity (INT64_MAX in practice), mark
    them serializing and wait for other conflicting requests.

    Advantages:
    + Limits the impact to very specific cases
      (And that means it wouldn’t hurt too much to keep this workaround
      even when the XFS driver has been fixed)
    + Works around the bug where it happens, namely in file-posix

    Disadvantages:
    - A bit complex
    - A bit of a layering violation (should file-posix have access to
      tracked requests?)

(2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
    becomes visible due to that function: I don’t think qcow2 writes
    zeroes in any other I/O path, and raw images are fixed in size so
    post-EOF writes won’t happen.

    Advantages:
    + Maybe simpler, depending on how difficult it is to handle the
      layering violation
    + Also fixes the performance problem of handle_alloc_space() being
      slow on ppc64+XFS.

    Disadvantages:
    - Huge layering violation because qcow2 would need to know whether
      the image is stored on XFS or not.
    - We’d definitely want to skip this workaround when the XFS driver
      has been fixed, so we need some method to find out whether it has

(3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
    To my knowledge I’m the only one who has provided any benchmarks for
    this commit, and even then I was a bit skeptical because it performs
    well in some cases and bad in others.  I concluded that it’s
    probably worth it because the “some cases” are more likely to occur.

    Now we have this problem of corruption here (granted due to a bug in
    the XFS driver), and another report of massively degraded
    performance on ppc64
    (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
    private BZ; I hate that :-/  The report is about 40 % worse
    performance for an in-guest fio write benchmark.)

    So I have to ask the question about what the justification for
    keeping c8bb23cbdbe32f is.  How much does performance increase with
    it actually?  (On non-(ppc64+XFS) machines, obviously)

    Advantages:
    + Trivial
    + No layering violations
    + We wouldn’t need to keep track of whether the kernel bug has been
      fixed or not
    + Fixes the ppc64+XFS performance problem

    Disadvantages:
    - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
      levels, whatever that means

So this is the main reason this is an RFC: What should we do?  Is (1)
really the best choice?


In any case, I’ve ran the test case I showed in
https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
more than ten times with this series applied and the installation
succeeded every time.  (Without this series, it fails like every other
time.)


Max Reitz (3):
  block: Make wait/mark serialising requests public
  block/file-posix: Detect XFS with CONFIG_FALLOCATE
  block/file-posix: Let post-EOF fallocate serialize

 include/block/block_int.h |  3 +++
 block/file-posix.c        | 46 +++++++++++++++++++++++++++++++++++++--
 block/io.c                | 24 ++++++++++----------
 3 files changed, 59 insertions(+), 14 deletions(-)

-- 
2.21.0



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC 1/3] block: Make wait/mark serialising requests public
  2019-10-25  9:58 [RFC 0/3] block/file-posix: Work around XFS bug Max Reitz
@ 2019-10-25  9:58 ` Max Reitz
  2019-10-25  9:58 ` [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE Max Reitz
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 47+ messages in thread
From: Max Reitz @ 2019-10-25  9:58 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, Max Reitz,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi

Make both bdrv_mark_request_serialising() and
bdrv_wait_serialising_requests() public so they can be used from block
drivers.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h |  3 +++
 block/io.c                | 24 ++++++++++++------------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index ca4ccac4c1..c85733293d 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -984,6 +984,9 @@ extern unsigned int bdrv_drain_all_count;
 void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
 void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
 
+bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self);
+void bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align);
+
 int get_tmp_filename(char *filename, int size);
 BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size,
                             const char *filename);
diff --git a/block/io.c b/block/io.c
index f0b86c1d19..a65cc7fb61 100644
--- a/block/io.c
+++ b/block/io.c
@@ -715,7 +715,7 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
     qemu_co_mutex_unlock(&bs->reqs_lock);
 }
 
-static void mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
+void bdrv_mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
 {
     int64_t overlap_offset = req->offset & ~(align - 1);
     uint64_t overlap_bytes = ROUND_UP(req->offset + req->bytes, align)
@@ -805,7 +805,7 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
     bdrv_wakeup(bs);
 }
 
-static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+bool coroutine_fn bdrv_wait_serialising_requests(BdrvTrackedRequest *self)
 {
     BlockDriverState *bs = self->bs;
     BdrvTrackedRequest *req;
@@ -1437,14 +1437,14 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
          * with each other for the same cluster.  For example, in copy-on-read
          * it ensures that the CoR read and write operations are atomic and
          * guest writes cannot interleave between them. */
-        mark_request_serialising(req, bdrv_get_cluster_size(bs));
+        bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
     }
 
     /* BDRV_REQ_SERIALISING is only for write operation */
     assert(!(flags & BDRV_REQ_SERIALISING));
 
     if (!(flags & BDRV_REQ_NO_SERIALISING)) {
-        wait_serialising_requests(req);
+        bdrv_wait_serialising_requests(req);
     }
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
@@ -1841,10 +1841,10 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
     assert(!(flags & ~BDRV_REQ_MASK));
 
     if (flags & BDRV_REQ_SERIALISING) {
-        mark_request_serialising(req, bdrv_get_cluster_size(bs));
+        bdrv_mark_request_serialising(req, bdrv_get_cluster_size(bs));
     }
 
-    waited = wait_serialising_requests(req);
+    waited = bdrv_wait_serialising_requests(req);
 
     assert(!waited || !req->serialising ||
            is_request_serialising_and_aligned(req));
@@ -2008,8 +2008,8 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
 
     padding = bdrv_init_padding(bs, offset, bytes, &pad);
     if (padding) {
-        mark_request_serialising(req, align);
-        wait_serialising_requests(req);
+        bdrv_mark_request_serialising(req, align);
+        bdrv_wait_serialising_requests(req);
 
         bdrv_padding_rmw_read(child, req, &pad, true);
 
@@ -2111,8 +2111,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
     }
 
     if (bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad)) {
-        mark_request_serialising(&req, align);
-        wait_serialising_requests(&req);
+        bdrv_mark_request_serialising(&req, align);
+        bdrv_wait_serialising_requests(&req);
         bdrv_padding_rmw_read(child, &req, &pad, false);
     }
 
@@ -3205,7 +3205,7 @@ static int coroutine_fn bdrv_co_copy_range_internal(
         /* BDRV_REQ_SERIALISING is only for write operation */
         assert(!(read_flags & BDRV_REQ_SERIALISING));
         if (!(read_flags & BDRV_REQ_NO_SERIALISING)) {
-            wait_serialising_requests(&req);
+            bdrv_wait_serialising_requests(&req);
         }
 
         ret = src->bs->drv->bdrv_co_copy_range_from(src->bs,
@@ -3332,7 +3332,7 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset,
      * new area, we need to make sure that no write requests are made to it
      * concurrently or they might be overwritten by preallocation. */
     if (new_bytes) {
-        mark_request_serialising(&req, 1);
+        bdrv_mark_request_serialising(&req, 1);
     }
     if (bs->read_only) {
         error_setg(errp, "Image is read-only");
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE
  2019-10-25  9:58 [RFC 0/3] block/file-posix: Work around XFS bug Max Reitz
  2019-10-25  9:58 ` [RFC 1/3] block: Make wait/mark serialising requests public Max Reitz
@ 2019-10-25  9:58 ` Max Reitz
  2019-10-25 10:19   ` Kevin Wolf
  2019-10-26 17:26   ` Nir Soffer
  2019-10-25  9:58 ` [RFC 3/3] block/file-posix: Let post-EOF fallocate serialize Max Reitz
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 47+ messages in thread
From: Max Reitz @ 2019-10-25  9:58 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, Max Reitz,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi

We will need this for the next patch.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/file-posix.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 695fcf740d..5cd54c8bff 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -149,7 +149,7 @@ typedef struct BDRVRawState {
     int perm_change_flags;
     BDRVReopenState *reopen_state;
 
-#ifdef CONFIG_XFS
+#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
     bool is_xfs:1;
 #endif
     bool has_discard:1;
@@ -667,7 +667,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif
 
-#ifdef CONFIG_XFS
+#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
     if (platform_test_xfs_fd(s->fd)) {
         s->is_xfs = true;
     }
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC 3/3] block/file-posix: Let post-EOF fallocate serialize
  2019-10-25  9:58 [RFC 0/3] block/file-posix: Work around XFS bug Max Reitz
  2019-10-25  9:58 ` [RFC 1/3] block: Make wait/mark serialising requests public Max Reitz
  2019-10-25  9:58 ` [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE Max Reitz
@ 2019-10-25  9:58 ` Max Reitz
  2019-10-26 17:28   ` Nir Soffer
  2019-10-25 13:40 ` [RFC 0/3] block/file-posix: Work around XFS bug Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-25  9:58 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, Max Reitz,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi

The XFS kernel driver has a bug that may cause data corruption for qcow2
images as of qemu commit c8bb23cbdbe32f.  We can work around it by
treating post-EOF fallocates as serializing up until infinity (INT64_MAX
in practice).

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/file-posix.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 5cd54c8bff..1f5a01df70 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2713,6 +2713,48 @@ raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int bytes,
     RawPosixAIOData acb;
     ThreadPoolFunc *handler;
 
+#ifdef CONFIG_FALLOCATE
+    if (s->is_xfs && s->use_linux_aio &&
+        offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE)
+    {
+        BdrvTrackedRequest *req;
+        uint64_t end;
+
+        /*
+         * The Linux XFS driver has a bug where it will discard writes
+         * submitted through the AIO interface if they happen beyond a
+         * concurrently running fallocate() that increases the file
+         * length (i.e., both the write and the fallocate() happen
+         * beyond the EOF).
+         *
+         * To work around it, we look for the tracked request for this
+         * zero write, extend it until INT64_MAX (effectively
+         * infinity), and mark it as serializing.
+         *
+         * TODO: Detect whether this has been fixed in the XFS driver.
+         */
+
+        QLIST_FOREACH(req, &bs->tracked_requests, list) {
+            if (req->co == qemu_coroutine_self() &&
+                req->type == BDRV_TRACKED_WRITE)
+            {
+                break;
+            }
+        }
+
+        assert(req);
+        assert(req->offset <= offset);
+        assert(req->offset + req->bytes >= offset + bytes);
+
+        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
+        req->bytes = end - req->offset;
+        req->overlap_bytes = req->bytes;
+
+        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
+        bdrv_wait_serialising_requests(req);
+    }
+#endif
+
     acb = (RawPosixAIOData) {
         .bs             = bs,
         .aio_fildes     = s->fd,
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE
  2019-10-25  9:58 ` [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE Max Reitz
@ 2019-10-25 10:19   ` Kevin Wolf
  2019-10-25 10:22     ` Max Reitz
  2019-10-26 17:26   ` Nir Soffer
  1 sibling, 1 reply; 47+ messages in thread
From: Kevin Wolf @ 2019-10-25 10:19 UTC (permalink / raw)
  To: Max Reitz
  Cc: Vladimir Sementsov-Ogievskiy, Alberto Garcia, qemu-block,
	qemu-devel, Anton Nefedov, Stefan Hajnoczi

Am 25.10.2019 um 11:58 hat Max Reitz geschrieben:
> We will need this for the next patch.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/file-posix.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 695fcf740d..5cd54c8bff 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -149,7 +149,7 @@ typedef struct BDRVRawState {
>      int perm_change_flags;
>      BDRVReopenState *reopen_state;
>  
> -#ifdef CONFIG_XFS
> +#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
>      bool is_xfs:1;
>  #endif
>      bool has_discard:1;
> @@ -667,7 +667,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      }
>  #endif
>  
> -#ifdef CONFIG_XFS
> +#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
>      if (platform_test_xfs_fd(s->fd)) {

platform_test_xfs_fd() is defined in a header file from xfsprogs. I
don't think you can call that without CONFIG_XFS, it would break the
build.

>          s->is_xfs = true;
>      }

Kevin



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE
  2019-10-25 10:19   ` Kevin Wolf
@ 2019-10-25 10:22     ` Max Reitz
  2019-10-25 10:35       ` Kevin Wolf
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-25 10:22 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Vladimir Sementsov-Ogievskiy, Alberto Garcia, qemu-block,
	qemu-devel, Anton Nefedov, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 1264 bytes --]

On 25.10.19 12:19, Kevin Wolf wrote:
> Am 25.10.2019 um 11:58 hat Max Reitz geschrieben:
>> We will need this for the next patch.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  block/file-posix.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index 695fcf740d..5cd54c8bff 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -149,7 +149,7 @@ typedef struct BDRVRawState {
>>      int perm_change_flags;
>>      BDRVReopenState *reopen_state;
>>  
>> -#ifdef CONFIG_XFS
>> +#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
>>      bool is_xfs:1;
>>  #endif
>>      bool has_discard:1;
>> @@ -667,7 +667,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>      }
>>  #endif
>>  
>> -#ifdef CONFIG_XFS
>> +#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
>>      if (platform_test_xfs_fd(s->fd)) {
> 
> platform_test_xfs_fd() is defined in a header file from xfsprogs. I
> don't think you can call that without CONFIG_XFS, it would break the
> build.

Aw.

OK.  Should we then just assume is_xfs to be true (for the next patch)
with !defined(CONFIG_XFS) && defined(CONFIG_FALLOCATE)?

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE
  2019-10-25 10:22     ` Max Reitz
@ 2019-10-25 10:35       ` Kevin Wolf
  2019-10-25 10:41         ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Kevin Wolf @ 2019-10-25 10:35 UTC (permalink / raw)
  To: Max Reitz
  Cc: Vladimir Sementsov-Ogievskiy, Alberto Garcia, qemu-block,
	qemu-devel, Anton Nefedov, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 1713 bytes --]

Am 25.10.2019 um 12:22 hat Max Reitz geschrieben:
> On 25.10.19 12:19, Kevin Wolf wrote:
> > Am 25.10.2019 um 11:58 hat Max Reitz geschrieben:
> >> We will need this for the next patch.
> >>
> >> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >> ---
> >>  block/file-posix.c | 4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/block/file-posix.c b/block/file-posix.c
> >> index 695fcf740d..5cd54c8bff 100644
> >> --- a/block/file-posix.c
> >> +++ b/block/file-posix.c
> >> @@ -149,7 +149,7 @@ typedef struct BDRVRawState {
> >>      int perm_change_flags;
> >>      BDRVReopenState *reopen_state;
> >>  
> >> -#ifdef CONFIG_XFS
> >> +#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
> >>      bool is_xfs:1;
> >>  #endif
> >>      bool has_discard:1;
> >> @@ -667,7 +667,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> >>      }
> >>  #endif
> >>  
> >> -#ifdef CONFIG_XFS
> >> +#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
> >>      if (platform_test_xfs_fd(s->fd)) {
> > 
> > platform_test_xfs_fd() is defined in a header file from xfsprogs. I
> > don't think you can call that without CONFIG_XFS, it would break the
> > build.
> 
> Aw.
> 
> OK.  Should we then just assume is_xfs to be true (for the next patch)
> with !defined(CONFIG_XFS) && defined(CONFIG_FALLOCATE)?

It's a small inline function that basically just calls statfs() and then
checks f_type.

I think we can have a small function to implement this in the file-posix
code. Don't copy it from the header because of licensing (LGPL, while
file-posix is MIT); refer to the statfs() manpage instead and write it
yourself.

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE
  2019-10-25 10:35       ` Kevin Wolf
@ 2019-10-25 10:41         ` Max Reitz
  0 siblings, 0 replies; 47+ messages in thread
From: Max Reitz @ 2019-10-25 10:41 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Vladimir Sementsov-Ogievskiy, Alberto Garcia, qemu-block,
	qemu-devel, Anton Nefedov, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 1997 bytes --]

On 25.10.19 12:35, Kevin Wolf wrote:
> Am 25.10.2019 um 12:22 hat Max Reitz geschrieben:
>> On 25.10.19 12:19, Kevin Wolf wrote:
>>> Am 25.10.2019 um 11:58 hat Max Reitz geschrieben:
>>>> We will need this for the next patch.
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>  block/file-posix.c | 4 ++--
>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>>> index 695fcf740d..5cd54c8bff 100644
>>>> --- a/block/file-posix.c
>>>> +++ b/block/file-posix.c
>>>> @@ -149,7 +149,7 @@ typedef struct BDRVRawState {
>>>>      int perm_change_flags;
>>>>      BDRVReopenState *reopen_state;
>>>>  
>>>> -#ifdef CONFIG_XFS
>>>> +#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
>>>>      bool is_xfs:1;
>>>>  #endif
>>>>      bool has_discard:1;
>>>> @@ -667,7 +667,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>>>      }
>>>>  #endif
>>>>  
>>>> -#ifdef CONFIG_XFS
>>>> +#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
>>>>      if (platform_test_xfs_fd(s->fd)) {
>>>
>>> platform_test_xfs_fd() is defined in a header file from xfsprogs. I
>>> don't think you can call that without CONFIG_XFS, it would break the
>>> build.
>>
>> Aw.
>>
>> OK.  Should we then just assume is_xfs to be true (for the next patch)
>> with !defined(CONFIG_XFS) && defined(CONFIG_FALLOCATE)?
> 
> It's a small inline function that basically just calls statfs() and then
> checks f_type.

Yes, this is where my error came from.  I asked cscope, which happily
referred me to the inline implementation and so I didn’t bother looking
at the filename and just assumed it’d be some code that belongs to qemu.

> I think we can have a small function to implement this in the file-posix
> code. Don't copy it from the header because of licensing (LGPL, while
> file-posix is MIT); refer to the statfs() manpage instead and write it
> yourself.

OK.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25  9:58 [RFC 0/3] block/file-posix: Work around XFS bug Max Reitz
                   ` (2 preceding siblings ...)
  2019-10-25  9:58 ` [RFC 3/3] block/file-posix: Let post-EOF fallocate serialize Max Reitz
@ 2019-10-25 13:40 ` Vladimir Sementsov-Ogievskiy
  2019-10-25 13:56   ` Vladimir Sementsov-Ogievskiy
  2019-10-25 13:46 ` Peter Maydell
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-10-25 13:40 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, Stefan Hajnoczi

25.10.2019 12:58, Max Reitz wrote:
> Hi,
> 
> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> I’ve explained here:
> 
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
> 
> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> data corruption when using qcow2 images on XFS with aio=native.
> 
> We can’t wait until the XFS kernel driver is fixed, we should work
> around the problem ourselves.
> 
> This is an RFC for two reasons:
> (1) I don’t know whether this is the right way to address the issue,
> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>      if so stop applying the workaround.
>      I don’t know how we would go about this, so this series doesn’t do
>      it.  (Hence it’s an RFC.)
> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>      driver access and modify a BdrvTrackedRequest object.
> 
> As for how we can address the issue, I see three ways:
> (1) The one presented in this series: On XFS with aio=native, we extend
>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>      operations) to reach until infinity (INT64_MAX in practice), mark
>      them serializing and wait for other conflicting requests.
> 
>      Advantages:
>      + Limits the impact to very specific cases
>        (And that means it wouldn’t hurt too much to keep this workaround
>        even when the XFS driver has been fixed)
>      + Works around the bug where it happens, namely in file-posix
> 
>      Disadvantages:
>      - A bit complex
>      - A bit of a layering violation (should file-posix have access to
>        tracked requests?)
> 
> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>      becomes visible due to that function: I don’t think qcow2 writes
>      zeroes in any other I/O path, and raw images are fixed in size so
>      post-EOF writes won’t happen.
> 
>      Advantages:
>      + Maybe simpler, depending on how difficult it is to handle the
>        layering violation
>      + Also fixes the performance problem of handle_alloc_space() being
>        slow on ppc64+XFS.
> 
>      Disadvantages:
>      - Huge layering violation because qcow2 would need to know whether
>        the image is stored on XFS or not.
>      - We’d definitely want to skip this workaround when the XFS driver
>        has been fixed, so we need some method to find out whether it has
> 
> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>      To my knowledge I’m the only one who has provided any benchmarks for
>      this commit, and even then I was a bit skeptical because it performs
>      well in some cases and bad in others.  I concluded that it’s
>      probably worth it because the “some cases” are more likely to occur.
> 
>      Now we have this problem of corruption here (granted due to a bug in
>      the XFS driver), and another report of massively degraded
>      performance on ppc64
>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>      private BZ; I hate that :-/  The report is about 40 % worse
>      performance for an in-guest fio write benchmark.)
> 
>      So I have to ask the question about what the justification for
>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>      it actually?  (On non-(ppc64+XFS) machines, obviously)
> 
>      Advantages:
>      + Trivial
>      + No layering violations
>      + We wouldn’t need to keep track of whether the kernel bug has been
>        fixed or not
>      + Fixes the ppc64+XFS performance problem
> 
>      Disadvantages:
>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>        levels, whatever that means
> 
> So this is the main reason this is an RFC: What should we do?  Is (1)
> really the best choice?
> 
> 
> In any case, I’ve ran the test case I showed in
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
> more than ten times with this series applied and the installation
> succeeded every time.  (Without this series, it fails like every other
> time.)
> 
> 

Hi!

First, great thanks for your investigation!

We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
in time.

I've tested a bit:

test:
for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done

on master:

/ssd/test.img  cl=64K step=4K    : 0.291
/ssd/test.img  cl=64K step=64K   : 0.813
/ssd/test.img  cl=64K step=1M    : 2.799
/ssd/test.img  cl=1M  step=4K    : 0.217
/ssd/test.img  cl=1M  step=64K   : 0.332
/ssd/test.img  cl=1M  step=1M    : 0.685
/test.img      cl=64K step=4K    : 1.751
/test.img      cl=64K step=64K   : 14.811
/test.img      cl=64K step=1M    : 18.321
/test.img      cl=1M  step=4K    : 0.759
/test.img      cl=1M  step=64K   : 13.574
/test.img      cl=1M  step=1M    : 28.970

rerun on master:

/ssd/test.img  cl=64K step=4K    : 0.295
/ssd/test.img  cl=64K step=64K   : 0.803
/ssd/test.img  cl=64K step=1M    : 2.921
/ssd/test.img  cl=1M  step=4K    : 0.233
/ssd/test.img  cl=1M  step=64K   : 0.321
/ssd/test.img  cl=1M  step=1M    : 0.762
/test.img      cl=64K step=4K    : 1.873
/test.img      cl=64K step=64K   : 15.621
/test.img      cl=64K step=1M    : 18.428
/test.img      cl=1M  step=4K    : 0.883
/test.img      cl=1M  step=64K   : 13.484
/test.img      cl=1M  step=1M    : 26.244


on master + revert c8bb23cbdbe32f5c326

/ssd/test.img  cl=64K step=4K    : 0.395
/ssd/test.img  cl=64K step=64K   : 4.231
/ssd/test.img  cl=64K step=1M    : 5.598
/ssd/test.img  cl=1M  step=4K    : 0.352
/ssd/test.img  cl=1M  step=64K   : 2.519
/ssd/test.img  cl=1M  step=1M    : 38.919
/test.img      cl=64K step=4K    : 1.758
/test.img      cl=64K step=64K   : 9.838
/test.img      cl=64K step=1M    : 13.384
/test.img      cl=1M  step=4K    : 1.849
/test.img      cl=1M  step=64K   : 19.405
/test.img      cl=1M  step=1M    : 157.090

rerun:

/ssd/test.img  cl=64K step=4K    : 0.407
/ssd/test.img  cl=64K step=64K   : 3.325
/ssd/test.img  cl=64K step=1M    : 5.641
/ssd/test.img  cl=1M  step=4K    : 0.346
/ssd/test.img  cl=1M  step=64K   : 2.583
/ssd/test.img  cl=1M  step=1M    : 39.692
/test.img      cl=64K step=4K    : 1.727
/test.img      cl=64K step=64K   : 10.058
/test.img      cl=64K step=1M    : 13.441
/test.img      cl=1M  step=4K    : 1.926
/test.img      cl=1M  step=64K   : 19.738
/test.img      cl=1M  step=1M    : 158.268


So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
wrong, we need smarter heuristics..

So, I'd prefer (1) or (2).

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25  9:58 [RFC 0/3] block/file-posix: Work around XFS bug Max Reitz
                   ` (3 preceding siblings ...)
  2019-10-25 13:40 ` [RFC 0/3] block/file-posix: Work around XFS bug Vladimir Sementsov-Ogievskiy
@ 2019-10-25 13:46 ` Peter Maydell
  2019-10-25 14:16   ` Max Reitz
  2019-10-26  0:14 ` no-reply
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Maydell @ 2019-10-25 13:46 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Alberto Garcia,
	Qemu-block, QEMU Developers, Anton Nefedov, Stefan Hajnoczi

On Fri, 25 Oct 2019 at 11:09, Max Reitz <mreitz@redhat.com> wrote:
>
> Hi,
>
> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> I’ve explained here:
>
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>
> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> data corruption when using qcow2 images on XFS with aio=native.

Have we reported that upstream to the xfs folks?

thanks
-- PMM


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25 13:40 ` [RFC 0/3] block/file-posix: Work around XFS bug Vladimir Sementsov-Ogievskiy
@ 2019-10-25 13:56   ` Vladimir Sementsov-Ogievskiy
  2019-10-25 14:19     ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-10-25 13:56 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, Stefan Hajnoczi

25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
> 25.10.2019 12:58, Max Reitz wrote:
>> Hi,
>>
>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>> I’ve explained here:
>>
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>
>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>> data corruption when using qcow2 images on XFS with aio=native.
>>
>> We can’t wait until the XFS kernel driver is fixed, we should work
>> around the problem ourselves.
>>
>> This is an RFC for two reasons:
>> (1) I don’t know whether this is the right way to address the issue,
>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>      if so stop applying the workaround.
>>      I don’t know how we would go about this, so this series doesn’t do
>>      it.  (Hence it’s an RFC.)
>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>      driver access and modify a BdrvTrackedRequest object.
>>
>> As for how we can address the issue, I see three ways:
>> (1) The one presented in this series: On XFS with aio=native, we extend
>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>      them serializing and wait for other conflicting requests.
>>
>>      Advantages:
>>      + Limits the impact to very specific cases
>>        (And that means it wouldn’t hurt too much to keep this workaround
>>        even when the XFS driver has been fixed)
>>      + Works around the bug where it happens, namely in file-posix
>>
>>      Disadvantages:
>>      - A bit complex
>>      - A bit of a layering violation (should file-posix have access to
>>        tracked requests?)
>>
>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>      becomes visible due to that function: I don’t think qcow2 writes
>>      zeroes in any other I/O path, and raw images are fixed in size so
>>      post-EOF writes won’t happen.
>>
>>      Advantages:
>>      + Maybe simpler, depending on how difficult it is to handle the
>>        layering violation
>>      + Also fixes the performance problem of handle_alloc_space() being
>>        slow on ppc64+XFS.
>>
>>      Disadvantages:
>>      - Huge layering violation because qcow2 would need to know whether
>>        the image is stored on XFS or not.
>>      - We’d definitely want to skip this workaround when the XFS driver
>>        has been fixed, so we need some method to find out whether it has
>>
>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>      To my knowledge I’m the only one who has provided any benchmarks for
>>      this commit, and even then I was a bit skeptical because it performs
>>      well in some cases and bad in others.  I concluded that it’s
>>      probably worth it because the “some cases” are more likely to occur.
>>
>>      Now we have this problem of corruption here (granted due to a bug in
>>      the XFS driver), and another report of massively degraded
>>      performance on ppc64
>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>      private BZ; I hate that :-/  The report is about 40 % worse
>>      performance for an in-guest fio write benchmark.)
>>
>>      So I have to ask the question about what the justification for
>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>
>>      Advantages:
>>      + Trivial
>>      + No layering violations
>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>        fixed or not
>>      + Fixes the ppc64+XFS performance problem
>>
>>      Disadvantages:
>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>        levels, whatever that means
>>
>> So this is the main reason this is an RFC: What should we do?  Is (1)
>> really the best choice?
>>
>>
>> In any case, I’ve ran the test case I showed in
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
>> more than ten times with this series applied and the installation
>> succeeded every time.  (Without this series, it fails like every other
>> time.)
>>
>>
> 
> Hi!
> 
> First, great thanks for your investigation!
> 
> We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
> in time.
> 
> I've tested a bit:
> 
> test:
> for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done
> 
> on master:
> 
> /ssd/test.img  cl=64K step=4K    : 0.291
> /ssd/test.img  cl=64K step=64K   : 0.813
> /ssd/test.img  cl=64K step=1M    : 2.799
> /ssd/test.img  cl=1M  step=4K    : 0.217
> /ssd/test.img  cl=1M  step=64K   : 0.332
> /ssd/test.img  cl=1M  step=1M    : 0.685
> /test.img      cl=64K step=4K    : 1.751
> /test.img      cl=64K step=64K   : 14.811
> /test.img      cl=64K step=1M    : 18.321
> /test.img      cl=1M  step=4K    : 0.759
> /test.img      cl=1M  step=64K   : 13.574
> /test.img      cl=1M  step=1M    : 28.970
> 
> rerun on master:
> 
> /ssd/test.img  cl=64K step=4K    : 0.295
> /ssd/test.img  cl=64K step=64K   : 0.803
> /ssd/test.img  cl=64K step=1M    : 2.921
> /ssd/test.img  cl=1M  step=4K    : 0.233
> /ssd/test.img  cl=1M  step=64K   : 0.321
> /ssd/test.img  cl=1M  step=1M    : 0.762
> /test.img      cl=64K step=4K    : 1.873
> /test.img      cl=64K step=64K   : 15.621
> /test.img      cl=64K step=1M    : 18.428
> /test.img      cl=1M  step=4K    : 0.883
> /test.img      cl=1M  step=64K   : 13.484
> /test.img      cl=1M  step=1M    : 26.244
> 
> 
> on master + revert c8bb23cbdbe32f5c326
> 
> /ssd/test.img  cl=64K step=4K    : 0.395
> /ssd/test.img  cl=64K step=64K   : 4.231
> /ssd/test.img  cl=64K step=1M    : 5.598
> /ssd/test.img  cl=1M  step=4K    : 0.352
> /ssd/test.img  cl=1M  step=64K   : 2.519
> /ssd/test.img  cl=1M  step=1M    : 38.919
> /test.img      cl=64K step=4K    : 1.758
> /test.img      cl=64K step=64K   : 9.838
> /test.img      cl=64K step=1M    : 13.384
> /test.img      cl=1M  step=4K    : 1.849
> /test.img      cl=1M  step=64K   : 19.405
> /test.img      cl=1M  step=1M    : 157.090
> 
> rerun:
> 
> /ssd/test.img  cl=64K step=4K    : 0.407
> /ssd/test.img  cl=64K step=64K   : 3.325
> /ssd/test.img  cl=64K step=1M    : 5.641
> /ssd/test.img  cl=1M  step=4K    : 0.346
> /ssd/test.img  cl=1M  step=64K   : 2.583
> /ssd/test.img  cl=1M  step=1M    : 39.692
> /test.img      cl=64K step=4K    : 1.727
> /test.img      cl=64K step=64K   : 10.058
> /test.img      cl=64K step=1M    : 13.441
> /test.img      cl=1M  step=4K    : 1.926
> /test.img      cl=1M  step=64K   : 19.738
> /test.img      cl=1M  step=1M    : 158.268
> 
> 
> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
> disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
> wrong, we need smarter heuristics..
> 
> So, I'd prefer (1) or (2).
> 

About degradation in some cases: I think the problem is that one (a bit larger)
write may be faster than fast-write-zeroes + small write, as the latter means
additional write to metadata. And it's expected for small clusters in
conjunction with rotational disk. But the actual limit is dependent on specific
disk. So, I think possible solution is just sometimes try work with
handle_alloc_space and sometimes without, remember time and length of request
and make dynamic limit...
-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25 13:46 ` Peter Maydell
@ 2019-10-25 14:16   ` Max Reitz
  2019-10-25 14:17     ` Peter Maydell
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-25 14:16 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Alberto Garcia,
	Qemu-block, QEMU Developers, Anton Nefedov, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 797 bytes --]

On 25.10.19 15:46, Peter Maydell wrote:
> On Fri, 25 Oct 2019 at 11:09, Max Reitz <mreitz@redhat.com> wrote:
>>
>> Hi,
>>
>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>> I’ve explained here:
>>
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>
>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>> data corruption when using qcow2 images on XFS with aio=native.
> 
> Have we reported that upstream to the xfs folks?

I’ve created an RH BZ here:

https://bugzilla.redhat.com/show_bug.cgi?id=1765547

So at least some XFS folks are aware of it (and I trust them to inform
the others as necessary :-)).  (Eric Sandeen has been part of the
discussion for the last couple of days already.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25 14:16   ` Max Reitz
@ 2019-10-25 14:17     ` Peter Maydell
  2019-10-25 14:21       ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Maydell @ 2019-10-25 14:17 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Alberto Garcia,
	Qemu-block, QEMU Developers, Anton Nefedov, Stefan Hajnoczi

On Fri, 25 Oct 2019 at 15:16, Max Reitz <mreitz@redhat.com> wrote:
>
> On 25.10.19 15:46, Peter Maydell wrote:
> > On Fri, 25 Oct 2019 at 11:09, Max Reitz <mreitz@redhat.com> wrote:
> >>
> >> Hi,
> >>
> >> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> >> I’ve explained here:
> >>
> >> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
> >>
> >> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> >> data corruption when using qcow2 images on XFS with aio=native.
> >
> > Have we reported that upstream to the xfs folks?
>
> I’ve created an RH BZ here:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1765547

Currently "You are not authorized to access bug #1765547." for
anonymous browsers, just fyi.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25 13:56   ` Vladimir Sementsov-Ogievskiy
@ 2019-10-25 14:19     ` Max Reitz
  2019-10-25 14:35       ` Kevin Wolf
                         ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Max Reitz @ 2019-10-25 14:19 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 9069 bytes --]

On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
> 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
>> 25.10.2019 12:58, Max Reitz wrote:
>>> Hi,
>>>
>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>>> I’ve explained here:
>>>
>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>>
>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>>> data corruption when using qcow2 images on XFS with aio=native.
>>>
>>> We can’t wait until the XFS kernel driver is fixed, we should work
>>> around the problem ourselves.
>>>
>>> This is an RFC for two reasons:
>>> (1) I don’t know whether this is the right way to address the issue,
>>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>>      if so stop applying the workaround.
>>>      I don’t know how we would go about this, so this series doesn’t do
>>>      it.  (Hence it’s an RFC.)
>>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>>      driver access and modify a BdrvTrackedRequest object.
>>>
>>> As for how we can address the issue, I see three ways:
>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>      them serializing and wait for other conflicting requests.
>>>
>>>      Advantages:
>>>      + Limits the impact to very specific cases
>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>        even when the XFS driver has been fixed)
>>>      + Works around the bug where it happens, namely in file-posix
>>>
>>>      Disadvantages:
>>>      - A bit complex
>>>      - A bit of a layering violation (should file-posix have access to
>>>        tracked requests?)
>>>
>>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>>      becomes visible due to that function: I don’t think qcow2 writes
>>>      zeroes in any other I/O path, and raw images are fixed in size so
>>>      post-EOF writes won’t happen.
>>>
>>>      Advantages:
>>>      + Maybe simpler, depending on how difficult it is to handle the
>>>        layering violation
>>>      + Also fixes the performance problem of handle_alloc_space() being
>>>        slow on ppc64+XFS.
>>>
>>>      Disadvantages:
>>>      - Huge layering violation because qcow2 would need to know whether
>>>        the image is stored on XFS or not.
>>>      - We’d definitely want to skip this workaround when the XFS driver
>>>        has been fixed, so we need some method to find out whether it has
>>>
>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>      To my knowledge I’m the only one who has provided any benchmarks for
>>>      this commit, and even then I was a bit skeptical because it performs
>>>      well in some cases and bad in others.  I concluded that it’s
>>>      probably worth it because the “some cases” are more likely to occur.
>>>
>>>      Now we have this problem of corruption here (granted due to a bug in
>>>      the XFS driver), and another report of massively degraded
>>>      performance on ppc64
>>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>      private BZ; I hate that :-/  The report is about 40 % worse
>>>      performance for an in-guest fio write benchmark.)
>>>
>>>      So I have to ask the question about what the justification for
>>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>
>>>      Advantages:
>>>      + Trivial
>>>      + No layering violations
>>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>>        fixed or not
>>>      + Fixes the ppc64+XFS performance problem
>>>
>>>      Disadvantages:
>>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>        levels, whatever that means
>>>
>>> So this is the main reason this is an RFC: What should we do?  Is (1)
>>> really the best choice?
>>>
>>>
>>> In any case, I’ve ran the test case I showed in
>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
>>> more than ten times with this series applied and the installation
>>> succeeded every time.  (Without this series, it fails like every other
>>> time.)
>>>
>>>
>>
>> Hi!
>>
>> First, great thanks for your investigation!
>>
>> We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
>> in time.
>>
>> I've tested a bit:
>>
>> test:
>> for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done
>>
>> on master:
>>
>> /ssd/test.img  cl=64K step=4K    : 0.291
>> /ssd/test.img  cl=64K step=64K   : 0.813
>> /ssd/test.img  cl=64K step=1M    : 2.799
>> /ssd/test.img  cl=1M  step=4K    : 0.217
>> /ssd/test.img  cl=1M  step=64K   : 0.332
>> /ssd/test.img  cl=1M  step=1M    : 0.685
>> /test.img      cl=64K step=4K    : 1.751
>> /test.img      cl=64K step=64K   : 14.811
>> /test.img      cl=64K step=1M    : 18.321
>> /test.img      cl=1M  step=4K    : 0.759
>> /test.img      cl=1M  step=64K   : 13.574
>> /test.img      cl=1M  step=1M    : 28.970
>>
>> rerun on master:
>>
>> /ssd/test.img  cl=64K step=4K    : 0.295
>> /ssd/test.img  cl=64K step=64K   : 0.803
>> /ssd/test.img  cl=64K step=1M    : 2.921
>> /ssd/test.img  cl=1M  step=4K    : 0.233
>> /ssd/test.img  cl=1M  step=64K   : 0.321
>> /ssd/test.img  cl=1M  step=1M    : 0.762
>> /test.img      cl=64K step=4K    : 1.873
>> /test.img      cl=64K step=64K   : 15.621
>> /test.img      cl=64K step=1M    : 18.428
>> /test.img      cl=1M  step=4K    : 0.883
>> /test.img      cl=1M  step=64K   : 13.484
>> /test.img      cl=1M  step=1M    : 26.244
>>
>>
>> on master + revert c8bb23cbdbe32f5c326
>>
>> /ssd/test.img  cl=64K step=4K    : 0.395
>> /ssd/test.img  cl=64K step=64K   : 4.231
>> /ssd/test.img  cl=64K step=1M    : 5.598
>> /ssd/test.img  cl=1M  step=4K    : 0.352
>> /ssd/test.img  cl=1M  step=64K   : 2.519
>> /ssd/test.img  cl=1M  step=1M    : 38.919
>> /test.img      cl=64K step=4K    : 1.758
>> /test.img      cl=64K step=64K   : 9.838
>> /test.img      cl=64K step=1M    : 13.384
>> /test.img      cl=1M  step=4K    : 1.849
>> /test.img      cl=1M  step=64K   : 19.405
>> /test.img      cl=1M  step=1M    : 157.090
>>
>> rerun:
>>
>> /ssd/test.img  cl=64K step=4K    : 0.407
>> /ssd/test.img  cl=64K step=64K   : 3.325
>> /ssd/test.img  cl=64K step=1M    : 5.641
>> /ssd/test.img  cl=1M  step=4K    : 0.346
>> /ssd/test.img  cl=1M  step=64K   : 2.583
>> /ssd/test.img  cl=1M  step=1M    : 39.692
>> /test.img      cl=64K step=4K    : 1.727
>> /test.img      cl=64K step=64K   : 10.058
>> /test.img      cl=64K step=1M    : 13.441
>> /test.img      cl=1M  step=4K    : 1.926
>> /test.img      cl=1M  step=64K   : 19.738
>> /test.img      cl=1M  step=1M    : 158.268
>>
>>
>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
>> disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
>> wrong, we need smarter heuristics..
>>
>> So, I'd prefer (1) or (2).

OK.  I wonder whether that problem would go away with Berto’s subcluster
series, though.

> About degradation in some cases: I think the problem is that one (a bit larger)
> write may be faster than fast-write-zeroes + small write, as the latter means
> additional write to metadata. And it's expected for small clusters in
> conjunction with rotational disk. But the actual limit is dependent on specific
> disk. So, I think possible solution is just sometimes try work with
> handle_alloc_space and sometimes without, remember time and length of request
> and make dynamic limit...

Maybe make a decision based both on the ratio of data size to COW area
length (only invoke handle_alloc_space() under a certain threshold), and
the absolute COW area length (always invoke it above a certain
threshold, unless the ratio doesn’t allow it)?

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25 14:17     ` Peter Maydell
@ 2019-10-25 14:21       ` Max Reitz
  2019-10-25 14:56         ` Peter Maydell
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-25 14:21 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Alberto Garcia,
	Qemu-block, QEMU Developers, Anton Nefedov, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 973 bytes --]

On 25.10.19 16:17, Peter Maydell wrote:
> On Fri, 25 Oct 2019 at 15:16, Max Reitz <mreitz@redhat.com> wrote:
>>
>> On 25.10.19 15:46, Peter Maydell wrote:
>>> On Fri, 25 Oct 2019 at 11:09, Max Reitz <mreitz@redhat.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>>>> I’ve explained here:
>>>>
>>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>>>
>>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>>>> data corruption when using qcow2 images on XFS with aio=native.
>>>
>>> Have we reported that upstream to the xfs folks?
>>
>> I’ve created an RH BZ here:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1765547
> 
> Currently "You are not authorized to access bug #1765547." for
> anonymous browsers, just fyi.

Err.  Oops.  That wasn’t my intention.  I hate that misfeature.

Thanks for telling me, I fixed it. :-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25 14:19     ` Max Reitz
@ 2019-10-25 14:35       ` Kevin Wolf
  2019-10-25 14:36       ` Vladimir Sementsov-Ogievskiy
  2019-11-04 14:03       ` Alberto Garcia
  2 siblings, 0 replies; 47+ messages in thread
From: Kevin Wolf @ 2019-10-25 14:35 UTC (permalink / raw)
  To: Max Reitz
  Cc: Anton Nefedov, Alberto Garcia, qemu-block, qemu-devel,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 9589 bytes --]

Am 25.10.2019 um 16:19 hat Max Reitz geschrieben:
> On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
> > 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
> >> 25.10.2019 12:58, Max Reitz wrote:
> >>> Hi,
> >>>
> >>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> >>> I’ve explained here:
> >>>
> >>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
> >>>
> >>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> >>> data corruption when using qcow2 images on XFS with aio=native.
> >>>
> >>> We can’t wait until the XFS kernel driver is fixed, we should work
> >>> around the problem ourselves.
> >>>
> >>> This is an RFC for two reasons:
> >>> (1) I don’t know whether this is the right way to address the issue,
> >>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
> >>>      if so stop applying the workaround.
> >>>      I don’t know how we would go about this, so this series doesn’t do
> >>>      it.  (Hence it’s an RFC.)
> >>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
> >>>      driver access and modify a BdrvTrackedRequest object.
> >>>
> >>> As for how we can address the issue, I see three ways:
> >>> (1) The one presented in this series: On XFS with aio=native, we extend
> >>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
> >>>      operations) to reach until infinity (INT64_MAX in practice), mark
> >>>      them serializing and wait for other conflicting requests.
> >>>
> >>>      Advantages:
> >>>      + Limits the impact to very specific cases
> >>>        (And that means it wouldn’t hurt too much to keep this workaround
> >>>        even when the XFS driver has been fixed)
> >>>      + Works around the bug where it happens, namely in file-posix
> >>>
> >>>      Disadvantages:
> >>>      - A bit complex
> >>>      - A bit of a layering violation (should file-posix have access to
> >>>        tracked requests?)
> >>>
> >>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
> >>>      becomes visible due to that function: I don’t think qcow2 writes
> >>>      zeroes in any other I/O path, and raw images are fixed in size so
> >>>      post-EOF writes won’t happen.
> >>>
> >>>      Advantages:
> >>>      + Maybe simpler, depending on how difficult it is to handle the
> >>>        layering violation
> >>>      + Also fixes the performance problem of handle_alloc_space() being
> >>>        slow on ppc64+XFS.
> >>>
> >>>      Disadvantages:
> >>>      - Huge layering violation because qcow2 would need to know whether
> >>>        the image is stored on XFS or not.
> >>>      - We’d definitely want to skip this workaround when the XFS driver
> >>>        has been fixed, so we need some method to find out whether it has
> >>>
> >>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
> >>>      To my knowledge I’m the only one who has provided any benchmarks for
> >>>      this commit, and even then I was a bit skeptical because it performs
> >>>      well in some cases and bad in others.  I concluded that it’s
> >>>      probably worth it because the “some cases” are more likely to occur.
> >>>
> >>>      Now we have this problem of corruption here (granted due to a bug in
> >>>      the XFS driver), and another report of massively degraded
> >>>      performance on ppc64
> >>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
> >>>      private BZ; I hate that :-/  The report is about 40 % worse
> >>>      performance for an in-guest fio write benchmark.)
> >>>
> >>>      So I have to ask the question about what the justification for
> >>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
> >>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
> >>>
> >>>      Advantages:
> >>>      + Trivial
> >>>      + No layering violations
> >>>      + We wouldn’t need to keep track of whether the kernel bug has been
> >>>        fixed or not
> >>>      + Fixes the ppc64+XFS performance problem
> >>>
> >>>      Disadvantages:
> >>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
> >>>        levels, whatever that means
> >>>
> >>> So this is the main reason this is an RFC: What should we do?  Is (1)
> >>> really the best choice?
> >>>
> >>>
> >>> In any case, I’ve ran the test case I showed in
> >>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
> >>> more than ten times with this series applied and the installation
> >>> succeeded every time.  (Without this series, it fails like every other
> >>> time.)
> >>>
> >>>
> >>
> >> Hi!
> >>
> >> First, great thanks for your investigation!
> >>
> >> We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
> >> in time.
> >>
> >> I've tested a bit:
> >>
> >> test:
> >> for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done
> >>
> >> on master:
> >>
> >> /ssd/test.img  cl=64K step=4K    : 0.291
> >> /ssd/test.img  cl=64K step=64K   : 0.813
> >> /ssd/test.img  cl=64K step=1M    : 2.799
> >> /ssd/test.img  cl=1M  step=4K    : 0.217
> >> /ssd/test.img  cl=1M  step=64K   : 0.332
> >> /ssd/test.img  cl=1M  step=1M    : 0.685
> >> /test.img      cl=64K step=4K    : 1.751
> >> /test.img      cl=64K step=64K   : 14.811
> >> /test.img      cl=64K step=1M    : 18.321
> >> /test.img      cl=1M  step=4K    : 0.759
> >> /test.img      cl=1M  step=64K   : 13.574
> >> /test.img      cl=1M  step=1M    : 28.970
> >>
> >> rerun on master:
> >>
> >> /ssd/test.img  cl=64K step=4K    : 0.295
> >> /ssd/test.img  cl=64K step=64K   : 0.803
> >> /ssd/test.img  cl=64K step=1M    : 2.921
> >> /ssd/test.img  cl=1M  step=4K    : 0.233
> >> /ssd/test.img  cl=1M  step=64K   : 0.321
> >> /ssd/test.img  cl=1M  step=1M    : 0.762
> >> /test.img      cl=64K step=4K    : 1.873
> >> /test.img      cl=64K step=64K   : 15.621
> >> /test.img      cl=64K step=1M    : 18.428
> >> /test.img      cl=1M  step=4K    : 0.883
> >> /test.img      cl=1M  step=64K   : 13.484
> >> /test.img      cl=1M  step=1M    : 26.244
> >>
> >>
> >> on master + revert c8bb23cbdbe32f5c326
> >>
> >> /ssd/test.img  cl=64K step=4K    : 0.395
> >> /ssd/test.img  cl=64K step=64K   : 4.231
> >> /ssd/test.img  cl=64K step=1M    : 5.598
> >> /ssd/test.img  cl=1M  step=4K    : 0.352
> >> /ssd/test.img  cl=1M  step=64K   : 2.519
> >> /ssd/test.img  cl=1M  step=1M    : 38.919
> >> /test.img      cl=64K step=4K    : 1.758
> >> /test.img      cl=64K step=64K   : 9.838
> >> /test.img      cl=64K step=1M    : 13.384
> >> /test.img      cl=1M  step=4K    : 1.849
> >> /test.img      cl=1M  step=64K   : 19.405
> >> /test.img      cl=1M  step=1M    : 157.090
> >>
> >> rerun:
> >>
> >> /ssd/test.img  cl=64K step=4K    : 0.407
> >> /ssd/test.img  cl=64K step=64K   : 3.325
> >> /ssd/test.img  cl=64K step=1M    : 5.641
> >> /ssd/test.img  cl=1M  step=4K    : 0.346
> >> /ssd/test.img  cl=1M  step=64K   : 2.583
> >> /ssd/test.img  cl=1M  step=1M    : 39.692
> >> /test.img      cl=64K step=4K    : 1.727
> >> /test.img      cl=64K step=64K   : 10.058
> >> /test.img      cl=64K step=1M    : 13.441
> >> /test.img      cl=1M  step=4K    : 1.926
> >> /test.img      cl=1M  step=64K   : 19.738
> >> /test.img      cl=1M  step=1M    : 158.268
> >>
> >>
> >> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
> >> disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
> >> wrong, we need smarter heuristics..
> >>
> >> So, I'd prefer (1) or (2).
> 
> OK.  I wonder whether that problem would go away with Berto’s subcluster
> series, though.
> 
> > About degradation in some cases: I think the problem is that one (a bit larger)
> > write may be faster than fast-write-zeroes + small write, as the latter means
> > additional write to metadata. And it's expected for small clusters in
> > conjunction with rotational disk. But the actual limit is dependent on specific
> > disk. So, I think possible solution is just sometimes try work with
> > handle_alloc_space and sometimes without, remember time and length of request
> > and make dynamic limit...
> 
> Maybe make a decision based both on the ratio of data size to COW area
> length (only invoke handle_alloc_space() under a certain threshold), and
> the absolute COW area length (always invoke it above a certain
> threshold, unless the ratio doesn’t allow it)?

I'm not sure that I would like this level of complexity in this code
path...

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25 14:19     ` Max Reitz
  2019-10-25 14:35       ` Kevin Wolf
@ 2019-10-25 14:36       ` Vladimir Sementsov-Ogievskiy
  2019-10-27 12:21         ` Stefan Hajnoczi
  2019-11-04 14:03       ` Alberto Garcia
  2 siblings, 1 reply; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-10-25 14:36 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, Stefan Hajnoczi

25.10.2019 17:19, Max Reitz wrote:
> On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
>> 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
>>> 25.10.2019 12:58, Max Reitz wrote:
>>>> Hi,
>>>>
>>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>>>> I’ve explained here:
>>>>
>>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>>>
>>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>>>> data corruption when using qcow2 images on XFS with aio=native.
>>>>
>>>> We can’t wait until the XFS kernel driver is fixed, we should work
>>>> around the problem ourselves.
>>>>
>>>> This is an RFC for two reasons:
>>>> (1) I don’t know whether this is the right way to address the issue,
>>>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>>>       if so stop applying the workaround.
>>>>       I don’t know how we would go about this, so this series doesn’t do
>>>>       it.  (Hence it’s an RFC.)
>>>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>>>       driver access and modify a BdrvTrackedRequest object.
>>>>
>>>> As for how we can address the issue, I see three ways:
>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>       tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>       operations) to reach until infinity (INT64_MAX in practice), mark
>>>>       them serializing and wait for other conflicting requests.
>>>>
>>>>       Advantages:
>>>>       + Limits the impact to very specific cases
>>>>         (And that means it wouldn’t hurt too much to keep this workaround
>>>>         even when the XFS driver has been fixed)
>>>>       + Works around the bug where it happens, namely in file-posix
>>>>
>>>>       Disadvantages:
>>>>       - A bit complex
>>>>       - A bit of a layering violation (should file-posix have access to
>>>>         tracked requests?)
>>>>
>>>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>>>       becomes visible due to that function: I don’t think qcow2 writes
>>>>       zeroes in any other I/O path, and raw images are fixed in size so
>>>>       post-EOF writes won’t happen.
>>>>
>>>>       Advantages:
>>>>       + Maybe simpler, depending on how difficult it is to handle the
>>>>         layering violation
>>>>       + Also fixes the performance problem of handle_alloc_space() being
>>>>         slow on ppc64+XFS.
>>>>
>>>>       Disadvantages:
>>>>       - Huge layering violation because qcow2 would need to know whether
>>>>         the image is stored on XFS or not.
>>>>       - We’d definitely want to skip this workaround when the XFS driver
>>>>         has been fixed, so we need some method to find out whether it has
>>>>
>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>       To my knowledge I’m the only one who has provided any benchmarks for
>>>>       this commit, and even then I was a bit skeptical because it performs
>>>>       well in some cases and bad in others.  I concluded that it’s
>>>>       probably worth it because the “some cases” are more likely to occur.
>>>>
>>>>       Now we have this problem of corruption here (granted due to a bug in
>>>>       the XFS driver), and another report of massively degraded
>>>>       performance on ppc64
>>>>       (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>       private BZ; I hate that :-/  The report is about 40 % worse
>>>>       performance for an in-guest fio write benchmark.)
>>>>
>>>>       So I have to ask the question about what the justification for
>>>>       keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>       it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>
>>>>       Advantages:
>>>>       + Trivial
>>>>       + No layering violations
>>>>       + We wouldn’t need to keep track of whether the kernel bug has been
>>>>         fixed or not
>>>>       + Fixes the ppc64+XFS performance problem
>>>>
>>>>       Disadvantages:
>>>>       - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>         levels, whatever that means
>>>>
>>>> So this is the main reason this is an RFC: What should we do?  Is (1)
>>>> really the best choice?
>>>>
>>>>
>>>> In any case, I’ve ran the test case I showed in
>>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
>>>> more than ten times with this series applied and the installation
>>>> succeeded every time.  (Without this series, it fails like every other
>>>> time.)
>>>>
>>>>
>>>
>>> Hi!
>>>
>>> First, great thanks for your investigation!
>>>
>>> We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
>>> in time.
>>>
>>> I've tested a bit:
>>>
>>> test:
>>> for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done
>>>
>>> on master:
>>>
>>> /ssd/test.img  cl=64K step=4K    : 0.291
>>> /ssd/test.img  cl=64K step=64K   : 0.813
>>> /ssd/test.img  cl=64K step=1M    : 2.799
>>> /ssd/test.img  cl=1M  step=4K    : 0.217
>>> /ssd/test.img  cl=1M  step=64K   : 0.332
>>> /ssd/test.img  cl=1M  step=1M    : 0.685
>>> /test.img      cl=64K step=4K    : 1.751
>>> /test.img      cl=64K step=64K   : 14.811
>>> /test.img      cl=64K step=1M    : 18.321
>>> /test.img      cl=1M  step=4K    : 0.759
>>> /test.img      cl=1M  step=64K   : 13.574
>>> /test.img      cl=1M  step=1M    : 28.970
>>>
>>> rerun on master:
>>>
>>> /ssd/test.img  cl=64K step=4K    : 0.295
>>> /ssd/test.img  cl=64K step=64K   : 0.803
>>> /ssd/test.img  cl=64K step=1M    : 2.921
>>> /ssd/test.img  cl=1M  step=4K    : 0.233
>>> /ssd/test.img  cl=1M  step=64K   : 0.321
>>> /ssd/test.img  cl=1M  step=1M    : 0.762
>>> /test.img      cl=64K step=4K    : 1.873
>>> /test.img      cl=64K step=64K   : 15.621
>>> /test.img      cl=64K step=1M    : 18.428
>>> /test.img      cl=1M  step=4K    : 0.883
>>> /test.img      cl=1M  step=64K   : 13.484
>>> /test.img      cl=1M  step=1M    : 26.244
>>>
>>>
>>> on master + revert c8bb23cbdbe32f5c326
>>>
>>> /ssd/test.img  cl=64K step=4K    : 0.395
>>> /ssd/test.img  cl=64K step=64K   : 4.231
>>> /ssd/test.img  cl=64K step=1M    : 5.598
>>> /ssd/test.img  cl=1M  step=4K    : 0.352
>>> /ssd/test.img  cl=1M  step=64K   : 2.519
>>> /ssd/test.img  cl=1M  step=1M    : 38.919
>>> /test.img      cl=64K step=4K    : 1.758
>>> /test.img      cl=64K step=64K   : 9.838
>>> /test.img      cl=64K step=1M    : 13.384
>>> /test.img      cl=1M  step=4K    : 1.849
>>> /test.img      cl=1M  step=64K   : 19.405
>>> /test.img      cl=1M  step=1M    : 157.090
>>>
>>> rerun:
>>>
>>> /ssd/test.img  cl=64K step=4K    : 0.407
>>> /ssd/test.img  cl=64K step=64K   : 3.325
>>> /ssd/test.img  cl=64K step=1M    : 5.641
>>> /ssd/test.img  cl=1M  step=4K    : 0.346
>>> /ssd/test.img  cl=1M  step=64K   : 2.583
>>> /ssd/test.img  cl=1M  step=1M    : 39.692
>>> /test.img      cl=64K step=4K    : 1.727
>>> /test.img      cl=64K step=64K   : 10.058
>>> /test.img      cl=64K step=1M    : 13.441
>>> /test.img      cl=1M  step=4K    : 1.926
>>> /test.img      cl=1M  step=64K   : 19.738
>>> /test.img      cl=1M  step=1M    : 158.268
>>>
>>>
>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
>>> disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
>>> wrong, we need smarter heuristics..
>>>
>>> So, I'd prefer (1) or (2).
> 
> OK.  I wonder whether that problem would go away with Berto’s subcluster
> series, though.

Very possible, I thought about it too.

> 
>> About degradation in some cases: I think the problem is that one (a bit larger)
>> write may be faster than fast-write-zeroes + small write, as the latter means
>> additional write to metadata. And it's expected for small clusters in
>> conjunction with rotational disk. But the actual limit is dependent on specific
>> disk. So, I think possible solution is just sometimes try work with
>> handle_alloc_space and sometimes without, remember time and length of request
>> and make dynamic limit...
> 
> Maybe make a decision based both on the ratio of data size to COW area
> length (only invoke handle_alloc_space() under a certain threshold), and
> the absolute COW area length (always invoke it above a certain
> threshold, unless the ratio doesn’t allow it)?
> 

Yes, something like this..

without handle_alloc_space, time = time(write aligned up to cluster)
with handle_alloc_space, time = time(fast zero write) + time(original write)

If we have some statistics on normal-write vs zero-write timing, we can just
calculate both variants and choose the faster one.

if (predict_zero_write_time(aligned up request) + predict_write_time(request) < predict_write_time(aligned up request)) {
    use handle_alloc_space()
}
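
Something like the following, purely as an illustration — the WriteStats
bookkeeping and the function names are made up, not existing QEMU code:

#include <stdbool.h>
#include <stdint.h>

typedef struct WriteStats {
    double write_ns_per_byte;      /* smoothed cost of plain writes */
    double zero_write_ns_per_byte; /* smoothed cost of write-zeroes */
} WriteStats;

/* Exponential moving average, fed from completed requests */
static void write_stats_update(double *avg, uint64_t bytes, uint64_t ns)
{
    double sample = (double)ns / (double)bytes;

    *avg = *avg == 0.0 ? sample : 0.9 * *avg + 0.1 * sample;
}

static double predict_write_time(const WriteStats *s, uint64_t bytes)
{
    return s->write_ns_per_byte * (double)bytes;
}

static double predict_zero_write_time(const WriteStats *s, uint64_t bytes)
{
    return s->zero_write_ns_per_byte * (double)bytes;
}

/* Zero the whole cluster and then write the data only if that is
 * predicted to be faster than writing the request aligned up to the
 * cluster size (i.e. doing explicit COW). */
static bool should_use_handle_alloc_space(const WriteStats *s,
                                          uint64_t request_bytes,
                                          uint64_t cluster_bytes)
{
    return predict_zero_write_time(s, cluster_bytes) +
           predict_write_time(s, request_bytes) <
           predict_write_time(s, cluster_bytes);
}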


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25 14:21       ` Max Reitz
@ 2019-10-25 14:56         ` Peter Maydell
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Maydell @ 2019-10-25 14:56 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Alberto Garcia,
	Qemu-block, QEMU Developers, Anton Nefedov, Stefan Hajnoczi

On Fri, 25 Oct 2019 at 15:21, Max Reitz <mreitz@redhat.com> wrote:
>
> On 25.10.19 16:17, Peter Maydell wrote:
> > On Fri, 25 Oct 2019 at 15:16, Max Reitz <mreitz@redhat.com> wrote:
> >> I’ve created an RH BZ here:
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1765547
> >
> > Currently "You are not authorized to access bug #1765547." for
> > anonymous browsers, just fyi.
>
> Err.  Oops.  That wasn’t my intention.  I hate that misfeature.
>
> Thanks for telling me, I fixed it. :-)

Yeah, that's a really well written bug report, you definitely
don't want to hide it away :-) Thanks for making it public.

-- PMM


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25  9:58 [RFC 0/3] block/file-posix: Work around XFS bug Max Reitz
                   ` (4 preceding siblings ...)
  2019-10-25 13:46 ` Peter Maydell
@ 2019-10-26  0:14 ` no-reply
  2019-10-26 17:37 ` Nir Soffer
  2019-10-27 12:35 ` Stefan Hajnoczi
  7 siblings, 0 replies; 47+ messages in thread
From: no-reply @ 2019-10-26  0:14 UTC (permalink / raw)
  To: mreitz
  Cc: kwolf, anton.nefedov, berto, qemu-block, qemu-devel, mreitz,
	vsementsov, stefanha

Patchew URL: https://patchew.org/QEMU/20191025095849.25283-1-mreitz@redhat.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      block/sheepdog.o
  CC      block/accounting.o
/tmp/qemu-test/src/block/file-posix.c: In function 'raw_open_common':
/tmp/qemu-test/src/block/file-posix.c:671:5: error: implicit declaration of function 'platform_test_xfs_fd' [-Werror=implicit-function-declaration]
     if (platform_test_xfs_fd(s->fd)) {
     ^
/tmp/qemu-test/src/block/file-posix.c:671:5: error: nested extern declaration of 'platform_test_xfs_fd' [-Werror=nested-externs]
cc1: all warnings being treated as errors
make: *** [block/file-posix.o] Error 1
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in <module>
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=c74d47e9bfb24107b6e94885fa8a2151', '-u', '1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-ytazf4e4/src/docker-src.2019-10-25-20.11.53.7609:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=c74d47e9bfb24107b6e94885fa8a2151
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-ytazf4e4/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    2m32.235s
user    0m8.092s


The full log is available at
http://patchew.org/logs/20191025095849.25283-1-mreitz@redhat.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE
  2019-10-25  9:58 ` [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE Max Reitz
  2019-10-25 10:19   ` Kevin Wolf
@ 2019-10-26 17:26   ` Nir Soffer
  1 sibling, 0 replies; 47+ messages in thread
From: Nir Soffer @ 2019-10-26 17:26 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Alberto Garcia,
	qemu-block, QEMU Developers, Anton Nefedov, Stefan Hajnoczi,
	Krutika Dhananjay

On Fri, Oct 25, 2019 at 1:22 PM Max Reitz <mreitz@redhat.com> wrote:
>
> We will need this for the next patch.
>
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/file-posix.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 695fcf740d..5cd54c8bff 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -149,7 +149,7 @@ typedef struct BDRVRawState {
>      int perm_change_flags;
>      BDRVReopenState *reopen_state;
>
> -#ifdef CONFIG_XFS
> +#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
>      bool is_xfs:1;
>  #endif
>      bool has_discard:1;
> @@ -667,7 +667,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      }
>  #endif
>
> -#ifdef CONFIG_XFS
> +#if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE)
>      if (platform_test_xfs_fd(s->fd)) {
>          s->is_xfs = true;

What about a remote XFS filesystem, e.g. glusterfs on top of XFS mounted
via FUSE?  (This is how oVirt uses glusterfs.)

The buggy behavior with concurrent fallocate()/pwrite() can affect this
case too, and platform_test_xfs_fd() will probably fail to detect XFS there.
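
For reference, a minimal sketch of how local XFS detection could be done
via fstatfs(2) without libxfs (illustrative only, not what this patch
does).  A FUSE mount reports the FUSE filesystem type, so this would not
catch the gluster-over-XFS case either:

#include <stdbool.h>
#include <sys/vfs.h>

#ifndef XFS_SUPER_MAGIC
#define XFS_SUPER_MAGIC 0x58465342  /* "XFSB" */
#endif

/* True if fd lives on a local XFS filesystem; glusterfs mounted via
 * FUSE reports FUSE_SUPER_MAGIC, so the underlying XFS stays hidden. */
static bool fd_is_on_xfs(int fd)
{
    struct statfs st;

    if (fstatfs(fd, &st) < 0) {
        return false;
    }
    return st.f_type == XFS_SUPER_MAGIC;
}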

Nir

>      }
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 3/3] block/file-posix: Let post-EOF fallocate serialize
  2019-10-25  9:58 ` [RFC 3/3] block/file-posix: Let post-EOF fallocate serialize Max Reitz
@ 2019-10-26 17:28   ` Nir Soffer
  0 siblings, 0 replies; 47+ messages in thread
From: Nir Soffer @ 2019-10-26 17:28 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Anton Nefedov, qemu-block, QEMU Developers,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi, Krutika Dhananjay

On Fri, Oct 25, 2019 at 1:24 PM Max Reitz <mreitz@redhat.com> wrote:
>
> The XFS kernel driver has a bug that may cause data corruption for qcow2
> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
> in practice).
>
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/file-posix.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 5cd54c8bff..1f5a01df70 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -2713,6 +2713,48 @@ raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int bytes,
>      RawPosixAIOData acb;
>      ThreadPoolFunc *handler;
>
> +#ifdef CONFIG_FALLOCATE
> +    if (s->is_xfs && s->use_linux_aio &&

This limits the fix to local XFS filesystems, but the fix may also be
needed for remote filesystems such as gluster over XFS.

> +        offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE)
> +    {
> +        BdrvTrackedRequest *req;
> +        uint64_t end;
> +
> +        /*
> +         * The Linux XFS driver has a bug where it will discard writes
> +         * submitted through the AIO interface if they happen beyond a
> +         * concurrently running fallocate() that increases the file
> +         * length (i.e., both the write and the fallocate() happen
> +         * beyond the EOF).
> +         *
> +         * To work around it, we look for the tracked request for this
> +         * zero write, extend it until INT64_MAX (effectively
> +         * infinity), and mark it as serializing.
> +         *
> +         * TODO: Detect whether this has been fixed in the XFS driver.
> +         */
> +
> +        QLIST_FOREACH(req, &bs->tracked_requests, list) {
> +            if (req->co == qemu_coroutine_self() &&
> +                req->type == BDRV_TRACKED_WRITE)
> +            {
> +                break;
> +            }
> +        }
> +
> +        assert(req);
> +        assert(req->offset <= offset);
> +        assert(req->offset + req->bytes >= offset + bytes);
> +
> +        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
> +        req->bytes = end - req->offset;
> +        req->overlap_bytes = req->bytes;
> +
> +        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
> +        bdrv_wait_serialising_requests(req);
> +    }
> +#endif
> +
>      acb = (RawPosixAIOData) {
>          .bs             = bs,
>          .aio_fildes     = s->fd,
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25  9:58 [RFC 0/3] block/file-posix: Work around XFS bug Max Reitz
                   ` (5 preceding siblings ...)
  2019-10-26  0:14 ` no-reply
@ 2019-10-26 17:37 ` Nir Soffer
  2019-10-26 17:52   ` Vladimir Sementsov-Ogievskiy
  2019-10-27 12:35 ` Stefan Hajnoczi
  7 siblings, 1 reply; 47+ messages in thread
From: Nir Soffer @ 2019-10-26 17:37 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, Alberto Garcia,
	qemu-block, QEMU Developers, Anton Nefedov, Stefan Hajnoczi

On Fri, Oct 25, 2019 at 1:11 PM Max Reitz <mreitz@redhat.com> wrote:
>
> Hi,
>
> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> I’ve explained here:
>
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>
> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> data corruption when using qcow2 images on XFS with aio=native.
>
> We can’t wait until the XFS kernel driver is fixed, we should work
> around the problem ourselves.
>
> This is an RFC for two reasons:
> (1) I don’t know whether this is the right way to address the issue,
> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>     if so stop applying the workaround.
>     I don’t know how we would go about this, so this series doesn’t do
>     it.  (Hence it’s an RFC.)
> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>     driver access and modify a BdrvTrackedRequest object.
>
> As for how we can address the issue, I see three ways:
> (1) The one presented in this series: On XFS with aio=native, we extend
>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>     operations) to reach until infinity (INT64_MAX in practice), mark
>     them serializing and wait for other conflicting requests.
>
>     Advantages:
>     + Limits the impact to very specific cases
>       (And that means it wouldn’t hurt too much to keep this workaround
>       even when the XFS driver has been fixed)
>     + Works around the bug where it happens, namely in file-posix
>
>     Disadvantages:
>     - A bit complex
>     - A bit of a layering violation (should file-posix have access to
>       tracked requests?)
>
> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>     becomes visible due to that function: I don’t think qcow2 writes
>     zeroes in any other I/O path, and raw images are fixed in size so
>     post-EOF writes won’t happen.
>
>     Advantages:
>     + Maybe simpler, depending on how difficult it is to handle the
>       layering violation
>     + Also fixes the performance problem of handle_alloc_space() being
>       slow on ppc64+XFS.
>
>     Disadvantages:
>     - Huge layering violation because qcow2 would need to know whether
>       the image is stored on XFS or not.
>     - We’d definitely want to skip this workaround when the XFS driver
>       has been fixed, so we need some method to find out whether it has
>
> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>     To my knowledge I’m the only one who has provided any benchmarks for
>     this commit, and even then I was a bit skeptical because it performs
>     well in some cases and bad in others.  I concluded that it’s
>     probably worth it because the “some cases” are more likely to occur.
>
>     Now we have this problem of corruption here (granted due to a bug in
>     the XFS driver), and another report of massively degraded
>     performance on ppc64
>     (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>     private BZ; I hate that :-/  The report is about 40 % worse
>     performance for an in-guest fio write benchmark.)
>
>     So I have to ask the question about what the justification for
>     keeping c8bb23cbdbe32f is.  How much does performance increase with
>     it actually?  (On non-(ppc64+XFS) machines, obviously)
>
>     Advantages:
>     + Trivial
>     + No layering violations
>     + We wouldn’t need to keep track of whether the kernel bug has been
>       fixed or not
>     + Fixes the ppc64+XFS performance problem
>
>     Disadvantages:
>     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>       levels, whatever that means

Correctness is more important than performance, so this is my
preference as a user.

Nir

> So this is the main reason this is an RFC: What should we do?  Is (1)
> really the best choice?
>
>
> In any case, I’ve ran the test case I showed in
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
> more than ten times with this series applied and the installation
> succeeded every time.  (Without this series, it fails like every other
> time.)
>
>
> Max Reitz (3):
>   block: Make wait/mark serialising requests public
>   block/file-posix: Detect XFS with CONFIG_FALLOCATE
>   block/file-posix: Let post-EOF fallocate serialize
>
>  include/block/block_int.h |  3 +++
>  block/file-posix.c        | 46 +++++++++++++++++++++++++++++++++++++--
>  block/io.c                | 24 ++++++++++----------
>  3 files changed, 59 insertions(+), 14 deletions(-)
>
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-26 17:37 ` Nir Soffer
@ 2019-10-26 17:52   ` Vladimir Sementsov-Ogievskiy
  2019-10-28  8:56     ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-10-26 17:52 UTC (permalink / raw)
  To: Nir Soffer, Max Reitz
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-block,
	QEMU Developers, Stefan Hajnoczi

26.10.2019 20:37, Nir Soffer wrote:
> On Fri, Oct 25, 2019 at 1:11 PM Max Reitz <mreitz@redhat.com> wrote:
>>
>> Hi,
>>
>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>> I’ve explained here:
>>
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>
>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>> data corruption when using qcow2 images on XFS with aio=native.
>>
>> We can’t wait until the XFS kernel driver is fixed, we should work
>> around the problem ourselves.
>>
>> This is an RFC for two reasons:
>> (1) I don’t know whether this is the right way to address the issue,
>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>      if so stop applying the workaround.
>>      I don’t know how we would go about this, so this series doesn’t do
>>      it.  (Hence it’s an RFC.)
>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>      driver access and modify a BdrvTrackedRequest object.
>>
>> As for how we can address the issue, I see three ways:
>> (1) The one presented in this series: On XFS with aio=native, we extend
>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>      them serializing and wait for other conflicting requests.
>>
>>      Advantages:
>>      + Limits the impact to very specific cases
>>        (And that means it wouldn’t hurt too much to keep this workaround
>>        even when the XFS driver has been fixed)
>>      + Works around the bug where it happens, namely in file-posix
>>
>>      Disadvantages:
>>      - A bit complex
>>      - A bit of a layering violation (should file-posix have access to
>>        tracked requests?)
>>
>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>      becomes visible due to that function: I don’t think qcow2 writes
>>      zeroes in any other I/O path, and raw images are fixed in size so
>>      post-EOF writes won’t happen.
>>
>>      Advantages:
>>      + Maybe simpler, depending on how difficult it is to handle the
>>        layering violation
>>      + Also fixes the performance problem of handle_alloc_space() being
>>        slow on ppc64+XFS.
>>
>>      Disadvantages:
>>      - Huge layering violation because qcow2 would need to know whether
>>        the image is stored on XFS or not.
>>      - We’d definitely want to skip this workaround when the XFS driver
>>        has been fixed, so we need some method to find out whether it has
>>
>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>      To my knowledge I’m the only one who has provided any benchmarks for
>>      this commit, and even then I was a bit skeptical because it performs
>>      well in some cases and bad in others.  I concluded that it’s
>>      probably worth it because the “some cases” are more likely to occur.
>>
>>      Now we have this problem of corruption here (granted due to a bug in
>>      the XFS driver), and another report of massively degraded
>>      performance on ppc64
>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>      private BZ; I hate that :-/  The report is about 40 % worse
>>      performance for an in-guest fio write benchmark.)
>>
>>      So I have to ask the question about what the justification for
>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>
>>      Advantages:
>>      + Trivial
>>      + No layering violations
>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>        fixed or not
>>      + Fixes the ppc64+XFS performance problem
>>
>>      Disadvantages:
>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>        levels, whatever that means
> 
> Correctness is more important than performance, so this is my
> preference as a user.
> 

Hmm, still, it is XFS that is incorrect, not QEMU. This bug may be triggered
by other software, or perhaps by another scenario in QEMU (not sure).

> 
>> So this is the main reason this is an RFC: What should we do?  Is (1)
>> really the best choice?
>>
>>
>> In any case, I’ve ran the test case I showed in
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
>> more than ten times with this series applied and the installation
>> succeeded every time.  (Without this series, it fails like every other
>> time.)
>>
>>
>> Max Reitz (3):
>>    block: Make wait/mark serialising requests public
>>    block/file-posix: Detect XFS with CONFIG_FALLOCATE
>>    block/file-posix: Let post-EOF fallocate serialize
>>
>>   include/block/block_int.h |  3 +++
>>   block/file-posix.c        | 46 +++++++++++++++++++++++++++++++++++++--
>>   block/io.c                | 24 ++++++++++----------
>>   3 files changed, 59 insertions(+), 14 deletions(-)
>>
>> --
>> 2.21.0
>>
>>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25 14:36       ` Vladimir Sementsov-Ogievskiy
@ 2019-10-27 12:21         ` Stefan Hajnoczi
  0 siblings, 0 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2019-10-27 12:21 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-block,
	qemu-devel, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 10538 bytes --]

On Fri, Oct 25, 2019 at 02:36:49PM +0000, Vladimir Sementsov-Ogievskiy wrote:
> 25.10.2019 17:19, Max Reitz wrote:
> > On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
> >> 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
> >>> 25.10.2019 12:58, Max Reitz wrote:
> >>>> Hi,
> >>>>
> >>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> >>>> I’ve explained here:
> >>>>
> >>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
> >>>>
> >>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> >>>> data corruption when using qcow2 images on XFS with aio=native.
> >>>>
> >>>> We can’t wait until the XFS kernel driver is fixed, we should work
> >>>> around the problem ourselves.
> >>>>
> >>>> This is an RFC for two reasons:
> >>>> (1) I don’t know whether this is the right way to address the issue,
> >>>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
> >>>>       if so stop applying the workaround.
> >>>>       I don’t know how we would go about this, so this series doesn’t do
> >>>>       it.  (Hence it’s an RFC.)
> >>>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
> >>>>       driver access and modify a BdrvTrackedRequest object.
> >>>>
> >>>> As for how we can address the issue, I see three ways:
> >>>> (1) The one presented in this series: On XFS with aio=native, we extend
> >>>>       tracked requests for post-EOF fallocate() calls (i.e., write-zero
> >>>>       operations) to reach until infinity (INT64_MAX in practice), mark
> >>>>       them serializing and wait for other conflicting requests.
> >>>>
> >>>>       Advantages:
> >>>>       + Limits the impact to very specific cases
> >>>>         (And that means it wouldn’t hurt too much to keep this workaround
> >>>>         even when the XFS driver has been fixed)
> >>>>       + Works around the bug where it happens, namely in file-posix
> >>>>
> >>>>       Disadvantages:
> >>>>       - A bit complex
> >>>>       - A bit of a layering violation (should file-posix have access to
> >>>>         tracked requests?)
> >>>>
> >>>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
> >>>>       becomes visible due to that function: I don’t think qcow2 writes
> >>>>       zeroes in any other I/O path, and raw images are fixed in size so
> >>>>       post-EOF writes won’t happen.
> >>>>
> >>>>       Advantages:
> >>>>       + Maybe simpler, depending on how difficult it is to handle the
> >>>>         layering violation
> >>>>       + Also fixes the performance problem of handle_alloc_space() being
> >>>>         slow on ppc64+XFS.
> >>>>
> >>>>       Disadvantages:
> >>>>       - Huge layering violation because qcow2 would need to know whether
> >>>>         the image is stored on XFS or not.
> >>>>       - We’d definitely want to skip this workaround when the XFS driver
> >>>>         has been fixed, so we need some method to find out whether it has
> >>>>
> >>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
> >>>>       To my knowledge I’m the only one who has provided any benchmarks for
> >>>>       this commit, and even then I was a bit skeptical because it performs
> >>>>       well in some cases and bad in others.  I concluded that it’s
> >>>>       probably worth it because the “some cases” are more likely to occur.
> >>>>
> >>>>       Now we have this problem of corruption here (granted due to a bug in
> >>>>       the XFS driver), and another report of massively degraded
> >>>>       performance on ppc64
> >>>>       (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
> >>>>       private BZ; I hate that :-/  The report is about 40 % worse
> >>>>       performance for an in-guest fio write benchmark.)
> >>>>
> >>>>       So I have to ask the question about what the justification for
> >>>>       keeping c8bb23cbdbe32f is.  How much does performance increase with
> >>>>       it actually?  (On non-(ppc64+XFS) machines, obviously)
> >>>>
> >>>>       Advantages:
> >>>>       + Trivial
> >>>>       + No layering violations
> >>>>       + We wouldn’t need to keep track of whether the kernel bug has been
> >>>>         fixed or not
> >>>>       + Fixes the ppc64+XFS performance problem
> >>>>
> >>>>       Disadvantages:
> >>>>       - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
> >>>>         levels, whatever that means
> >>>>
> >>>> So this is the main reason this is an RFC: What should we do?  Is (1)
> >>>> really the best choice?
> >>>>
> >>>>
> >>>> In any case, I’ve ran the test case I showed in
> >>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
> >>>> more than ten times with this series applied and the installation
> >>>> succeeded every time.  (Without this series, it fails like every other
> >>>> time.)
> >>>>
> >>>>
> >>>
> >>> Hi!
> >>>
> >>> First, great thanks for your investigation!
> >>>
> >>> We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
> >>> in time.
> >>>
> >>> I've tested a bit:
> >>>
> >>> test:
> >>> for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done
> >>>
> >>> on master:
> >>>
> >>> /ssd/test.img  cl=64K step=4K    : 0.291
> >>> /ssd/test.img  cl=64K step=64K   : 0.813
> >>> /ssd/test.img  cl=64K step=1M    : 2.799
> >>> /ssd/test.img  cl=1M  step=4K    : 0.217
> >>> /ssd/test.img  cl=1M  step=64K   : 0.332
> >>> /ssd/test.img  cl=1M  step=1M    : 0.685
> >>> /test.img      cl=64K step=4K    : 1.751
> >>> /test.img      cl=64K step=64K   : 14.811
> >>> /test.img      cl=64K step=1M    : 18.321
> >>> /test.img      cl=1M  step=4K    : 0.759
> >>> /test.img      cl=1M  step=64K   : 13.574
> >>> /test.img      cl=1M  step=1M    : 28.970
> >>>
> >>> rerun on master:
> >>>
> >>> /ssd/test.img  cl=64K step=4K    : 0.295
> >>> /ssd/test.img  cl=64K step=64K   : 0.803
> >>> /ssd/test.img  cl=64K step=1M    : 2.921
> >>> /ssd/test.img  cl=1M  step=4K    : 0.233
> >>> /ssd/test.img  cl=1M  step=64K   : 0.321
> >>> /ssd/test.img  cl=1M  step=1M    : 0.762
> >>> /test.img      cl=64K step=4K    : 1.873
> >>> /test.img      cl=64K step=64K   : 15.621
> >>> /test.img      cl=64K step=1M    : 18.428
> >>> /test.img      cl=1M  step=4K    : 0.883
> >>> /test.img      cl=1M  step=64K   : 13.484
> >>> /test.img      cl=1M  step=1M    : 26.244
> >>>
> >>>
> >>> on master + revert c8bb23cbdbe32f5c326
> >>>
> >>> /ssd/test.img  cl=64K step=4K    : 0.395
> >>> /ssd/test.img  cl=64K step=64K   : 4.231
> >>> /ssd/test.img  cl=64K step=1M    : 5.598
> >>> /ssd/test.img  cl=1M  step=4K    : 0.352
> >>> /ssd/test.img  cl=1M  step=64K   : 2.519
> >>> /ssd/test.img  cl=1M  step=1M    : 38.919
> >>> /test.img      cl=64K step=4K    : 1.758
> >>> /test.img      cl=64K step=64K   : 9.838
> >>> /test.img      cl=64K step=1M    : 13.384
> >>> /test.img      cl=1M  step=4K    : 1.849
> >>> /test.img      cl=1M  step=64K   : 19.405
> >>> /test.img      cl=1M  step=1M    : 157.090
> >>>
> >>> rerun:
> >>>
> >>> /ssd/test.img  cl=64K step=4K    : 0.407
> >>> /ssd/test.img  cl=64K step=64K   : 3.325
> >>> /ssd/test.img  cl=64K step=1M    : 5.641
> >>> /ssd/test.img  cl=1M  step=4K    : 0.346
> >>> /ssd/test.img  cl=1M  step=64K   : 2.583
> >>> /ssd/test.img  cl=1M  step=1M    : 39.692
> >>> /test.img      cl=64K step=4K    : 1.727
> >>> /test.img      cl=64K step=64K   : 10.058
> >>> /test.img      cl=64K step=1M    : 13.441
> >>> /test.img      cl=1M  step=4K    : 1.926
> >>> /test.img      cl=1M  step=64K   : 19.738
> >>> /test.img      cl=1M  step=1M    : 158.268
> >>>
> >>>
> >>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
> >>> disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
> >>> wrong, we need smarter heuristics..
> >>>
> >>> So, I'd prefer (1) or (2).
> > 
> > OK.  I wonder whether that problem would go away with Berto’s subcluster
> > series, though.
> 
> Very possible, I thought about it too.
> 
> > 
> >> About degradation in some cases: I think the problem is that one (a bit larger)
> >> write may be faster than fast-write-zeroes + small write, as the latter means
> >> additional write to metadata. And it's expected for small clusters in
> >> conjunction with rotational disk. But the actual limit is dependent on specific
> >> disk. So, I think possible solution is just sometimes try work with
> >> handle_alloc_space and sometimes without, remember time and length of request
> >> and make dynamic limit...
> > 
> > Maybe make a decision based both on the ratio of data size to COW area
> > length (only invoke handle_alloc_space() under a certain threshold), and
> > the absolute COW area length (always invoke it above a certain
> > threshold, unless the ratio doesn’t allow it)?
> > 
> 
> Yes, something like this..
> 
> without handle_alloc_space, time = time(write aligned up to cluster)
> with handle_alloc_space, time = time(fast zero write) + time(original write)
> 
> If we have some statistics on normal-write vs zero-write timing, we can just
> calculate both variants and choose faster.
> 
> if (predict_zero_write_time(aligned up request) + predict_write_time(request) < predict_write_time(aligned up request)) {
>     use handle_alloc_space()
> }

Self-tuning based on request latency works great on a quiet host.  If
there are other processes submitting I/O to the disk then measured
latencies may be bogus.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25  9:58 [RFC 0/3] block/file-posix: Work around XFS bug Max Reitz
                   ` (6 preceding siblings ...)
  2019-10-26 17:37 ` Nir Soffer
@ 2019-10-27 12:35 ` Stefan Hajnoczi
  2019-10-28  9:24   ` Max Reitz
  2019-10-28 11:04   ` Kevin Wolf
  7 siblings, 2 replies; 47+ messages in thread
From: Stefan Hajnoczi @ 2019-10-27 12:35 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-block,
	qemu-devel, Vladimir Sementsov-Ogievskiy

[-- Attachment #1: Type: text/plain, Size: 3154 bytes --]

On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
> As for how we can address the issue, I see three ways:
> (1) The one presented in this series: On XFS with aio=native, we extend
>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>     operations) to reach until infinity (INT64_MAX in practice), mark
>     them serializing and wait for other conflicting requests.
> 
>     Advantages:
>     + Limits the impact to very specific cases
>       (And that means it wouldn’t hurt too much to keep this workaround
>       even when the XFS driver has been fixed)
>     + Works around the bug where it happens, namely in file-posix
> 
>     Disadvantages:
>     - A bit complex
>     - A bit of a layering violation (should file-posix have access to
>       tracked requests?)

Your patch series is reasonable.  I don't think it's too bad.

The main question is how to detect the XFS fix once it ships.  XFS
already has a ton of ioctls, so maybe they don't mind adding a
feature/quirk bit map ioctl for publishing information about bug fixes
to userspace.  I didn't see another obvious way of doing it, maybe a
mount option that the kernel automatically sets and that gets reported
to userspace?

If we imagine that XFS will not provide a mechanism to detect the
presence of the fix, then could we ask QEMU package maintainers to
./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
in the future when their distro has been shipping a fixed kernel for a
while?  It's ugly because it doesn't work if the user installs an older
custom-built kernel on the host.  But at least it will cover 98% of
users...
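
As an illustration only (the CONFIG_XFS_FALLOCATE_BEYOND_EOF_WORKAROUND
macro and the helper below are hypothetical, nothing like this exists
today), the build-time switch could boil down to:

#include <stdbool.h>
#include <stdint.h>

/* ./configure would define this by default and leave it out when
 * --disable-xfs-fallocate-beyond-eof-workaround is given. */
static bool need_post_eof_serializing(bool is_xfs, bool use_linux_aio,
                                      int64_t offset, int64_t bytes,
                                      int64_t file_length)
{
#ifdef CONFIG_XFS_FALLOCATE_BEYOND_EOF_WORKAROUND
    /* Same condition as in patch 3 */
    return is_xfs && use_linux_aio && offset + bytes > file_length;
#else
    /* Maintainer declared the distro kernel fixed */
    (void)is_xfs; (void)use_linux_aio;
    (void)offset; (void)bytes; (void)file_length;
    return false;
#endif
}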

> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>     To my knowledge I’m the only one who has provided any benchmarks for
>     this commit, and even then I was a bit skeptical because it performs
>     well in some cases and bad in others.  I concluded that it’s
>     probably worth it because the “some cases” are more likely to occur.
> 
>     Now we have this problem of corruption here (granted due to a bug in
>     the XFS driver), and another report of massively degraded
>     performance on ppc64
>     (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>     private BZ; I hate that :-/  The report is about 40 % worse
>     performance for an in-guest fio write benchmark.)
> 
>     So I have to ask the question about what the justification for
>     keeping c8bb23cbdbe32f is.  How much does performance increase with
>     it actually?  (On non-(ppc64+XFS) machines, obviously)
> 
>     Advantages:
>     + Trivial
>     + No layering violations
>     + We wouldn’t need to keep track of whether the kernel bug has been
>       fixed or not
>     + Fixes the ppc64+XFS performance problem
> 
>     Disadvantages:
>     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>       levels, whatever that means

My favorite because it is clean and simple, but Vladimir has a valid
use case for requiring this performance optimization, so reverting isn't
an option.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-26 17:52   ` Vladimir Sementsov-Ogievskiy
@ 2019-10-28  8:56     ` Max Reitz
  0 siblings, 0 replies; 47+ messages in thread
From: Max Reitz @ 2019-10-28  8:56 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Nir Soffer
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-block,
	QEMU Developers, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 6299 bytes --]

On 26.10.19 19:52, Vladimir Sementsov-Ogievskiy wrote:
> 26.10.2019 20:37, Nir Soffer wrote:
>> On Fri, Oct 25, 2019 at 1:11 PM Max Reitz <mreitz@redhat.com> wrote:
>>>
>>> Hi,
>>>
>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>>> I’ve explained here:
>>>
>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>>
>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>>> data corruption when using qcow2 images on XFS with aio=native.
>>>
>>> We can’t wait until the XFS kernel driver is fixed, we should work
>>> around the problem ourselves.
>>>
>>> This is an RFC for two reasons:
>>> (1) I don’t know whether this is the right way to address the issue,
>>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>>      if so stop applying the workaround.
>>>      I don’t know how we would go about this, so this series doesn’t do
>>>      it.  (Hence it’s an RFC.)
>>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>>      driver access and modify a BdrvTrackedRequest object.
>>>
>>> As for how we can address the issue, I see three ways:
>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>      them serializing and wait for other conflicting requests.
>>>
>>>      Advantages:
>>>      + Limits the impact to very specific cases
>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>        even when the XFS driver has been fixed)
>>>      + Works around the bug where it happens, namely in file-posix
>>>
>>>      Disadvantages:
>>>      - A bit complex
>>>      - A bit of a layering violation (should file-posix have access to
>>>        tracked requests?)
>>>
>>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>>      becomes visible due to that function: I don’t think qcow2 writes
>>>      zeroes in any other I/O path, and raw images are fixed in size so
>>>      post-EOF writes won’t happen.
>>>
>>>      Advantages:
>>>      + Maybe simpler, depending on how difficult it is to handle the
>>>        layering violation
>>>      + Also fixes the performance problem of handle_alloc_space() being
>>>        slow on ppc64+XFS.
>>>
>>>      Disadvantages:
>>>      - Huge layering violation because qcow2 would need to know whether
>>>        the image is stored on XFS or not.
>>>      - We’d definitely want to skip this workaround when the XFS driver
>>>        has been fixed, so we need some method to find out whether it has
>>>
>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>      To my knowledge I’m the only one who has provided any benchmarks for
>>>      this commit, and even then I was a bit skeptical because it performs
>>>      well in some cases and bad in others.  I concluded that it’s
>>>      probably worth it because the “some cases” are more likely to occur.
>>>
>>>      Now we have this problem of corruption here (granted due to a bug in
>>>      the XFS driver), and another report of massively degraded
>>>      performance on ppc64
>>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>      private BZ; I hate that :-/  The report is about 40 % worse
>>>      performance for an in-guest fio write benchmark.)
>>>
>>>      So I have to ask the question about what the justification for
>>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>
>>>      Advantages:
>>>      + Trivial
>>>      + No layering violations
>>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>>        fixed or not
>>>      + Fixes the ppc64+XFS performance problem
>>>
>>>      Disadvantages:
>>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>        levels, whatever that means
>>
>> Correctness is more important than performance, so this is my
>> preference as a user.
>>
> 
> Hmm, still, incorrect is XFS, not Qemu. This bug may be triggered by another
> software, or may be another scenario in Qemu (not sure).

I don’t see any other image format (besides qcow2, raw, and filters)
creating zero-writes, apart from the parallels format, which does so
before writes can happen to that area.  So with parallels, it’s
impossible to get a data write behind an ongoing zero-write.

Raw and filters do not initiate zero-writes on their own.  So there must
be something else that requests them.  Block jobs and guests are limited
to generally fixed-size disks, so I don’t see a way to generate
zero-writes beyond the EOF there.

Which leaves us with qcow2.  To my knowledge the only place where qcow2
generates zero-writes in an I/O path (so they can occur concurrently to
data writes) is in handle_alloc_space().  Therefore, dropping it should
remove the only place where the XFS bug can be triggered.


That the bug is in XFS and not in qemu is a good argument from a moral
standpoint, but it isn’t very pragmatic or reasonable from a technical
standpoint.  There is no fix for XFS yet, so right now the situation is
that there is a bug that causes guest data loss, that qemu can prevent
it, and that there is no other workaround or fix.

So we must work around it.

Of course, that doesn’t mean that we have to do everything in our power
to implement a workaround no matter the cost.  We do indeed have to
balance between how far we’re willing to go to fix it and how much we
impact non-XFS workloads.


Honestly, I caught myself saying “Well, that’s too bad for XFS over
gluster if XFS is broken”.  But that just isn’t a reasonable thing to say.

I suppose what we could do is drop the is_xfs check in patch 3.  The
only outcome is that data writes can no longer run concurrently with
(and beyond) zero-writes past the EOF.  And as I’ve claimed above, this
only affects handle_alloc_space() from qcow2.  Would it be so bad if we
simply did that on every filesystem?
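
Concretely, that would amount to dropping s->is_xfs from the condition
in raw_do_pwrite_zeroes() from patch 3, roughly (untested sketch):

-    if (s->is_xfs && s->use_linux_aio &&
+    if (s->use_linux_aio &&
         offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE)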

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-27 12:35 ` Stefan Hajnoczi
@ 2019-10-28  9:24   ` Max Reitz
  2019-10-28  9:30     ` Max Reitz
  2019-10-28 11:04   ` Kevin Wolf
  1 sibling, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-28  9:24 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-block,
	qemu-devel, Vladimir Sementsov-Ogievskiy


[-- Attachment #1.1: Type: text/plain, Size: 2119 bytes --]

On 27.10.19 13:35, Stefan Hajnoczi wrote:
> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>> As for how we can address the issue, I see three ways:
>> (1) The one presented in this series: On XFS with aio=native, we extend
>>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>     operations) to reach until infinity (INT64_MAX in practice), mark
>>     them serializing and wait for other conflicting requests.
>>
>>     Advantages:
>>     + Limits the impact to very specific cases
>>       (And that means it wouldn’t hurt too much to keep this workaround
>>       even when the XFS driver has been fixed)
>>     + Works around the bug where it happens, namely in file-posix
>>
>>     Disadvantages:
>>     - A bit complex
>>     - A bit of a layering violation (should file-posix have access to
>>       tracked requests?)
> 
> Your patch series is reasonable.  I don't think it's too bad.
> 
> The main question is how to detect the XFS fix once it ships.  XFS
> already has a ton of ioctls, so maybe they don't mind adding a
> feature/quirk bit map ioctl for publishing information about bug fixes
> to userspace.  I didn't see another obvious way of doing it, maybe a
> mount option that the kernel automatically sets and that gets reported
> to userspace?

I’ll add a note to the RH BZ.

> If we imagine that XFS will not provide a mechanism to detect the
> presence of the fix, then could we ask QEMU package maintainers to
> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
> in the future when their distro has been shipping a fixed kernel for a
> while?  It's ugly because it doesn't work if the user installs an older
> custom-built kernel on the host.  But at least it will cover 98% of
> users...

:-/

I don’t like it, but I suppose it would work.  We could also
automatically enable this disabling option in configure when uname
reports a kernel version that must include the fix.  (This wouldn’t
work for kernels with backported fixes, but those disappear over
time...)
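
Just to illustrate the idea, the configure check could boil down to
something like this (the release number that actually contains the fix
is a pure placeholder, since no fixed kernel exists yet):

#include <stdio.h>
#include <sys/utsname.h>

/* Placeholders: whatever kernel release ends up shipping the XFS fix. */
#define FIXED_KERNEL_MAJOR 9
#define FIXED_KERNEL_MINOR 9

/* Returns 1 if the build machine's kernel is at least the (assumed) fixed
 * release; returns 0 otherwise, i.e. keep the workaround enabled. */
static int build_kernel_has_xfs_fix(void)
{
    struct utsname u;
    int major, minor;

    if (uname(&u) != 0 || sscanf(u.release, "%d.%d", &major, &minor) != 2) {
        return 0;
    }
    return major > FIXED_KERNEL_MAJOR ||
           (major == FIXED_KERNEL_MAJOR && minor >= FIXED_KERNEL_MINOR);
}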

Max



* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-28  9:24   ` Max Reitz
@ 2019-10-28  9:30     ` Max Reitz
  2019-10-28  9:56       ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-28  9:30 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-block,
	qemu-devel, Vladimir Sementsov-Ogievskiy



On 28.10.19 10:24, Max Reitz wrote:
> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>> As for how we can address the issue, I see three ways:
>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>     operations) to reach until infinity (INT64_MAX in practice), mark
>>>     them serializing and wait for other conflicting requests.
>>>
>>>     Advantages:
>>>     + Limits the impact to very specific cases
>>>       (And that means it wouldn’t hurt too much to keep this workaround
>>>       even when the XFS driver has been fixed)
>>>     + Works around the bug where it happens, namely in file-posix
>>>
>>>     Disadvantages:
>>>     - A bit complex
>>>     - A bit of a layering violation (should file-posix have access to
>>>       tracked requests?)
>>
>> Your patch series is reasonable.  I don't think it's too bad.
>>
>> The main question is how to detect the XFS fix once it ships.  XFS
>> already has a ton of ioctls, so maybe they don't mind adding a
>> feature/quirk bit map ioctl for publishing information about bug fixes
>> to userspace.  I didn't see another obvious way of doing it, maybe a
>> mount option that the kernel automatically sets and that gets reported
>> to userspace?
> 
> I’ll add a note to the RH BZ.
> 
>> If we imagine that XFS will not provide a mechanism to detect the
>> presence of the fix, then could we ask QEMU package maintainers to
>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>> in the future when their distro has been shipping a fixed kernel for a
>> while?  It's ugly because it doesn't work if the user installs an older
>> custom-built kernel on the host.  But at least it will cover 98% of
>> users...
> 
> :-/
> 
> I don’t like it, but I suppose it would work.  We could also
> automatically enable this disabling option in configure when we detect
> uname to report a kernel version that must include the fix.  (This
> wouldn’t work for kernel with backported fixes, but those disappear over
> time...)
I just realized that none of this is going to work for the gluster case
brought up by Nir.  The affected kernel is the remote one and we have no
insight into that.  I don’t think we can do ioctls to XFS over gluster,
can we?

I also realized that uname is a stupid idea because the user may be
running a different kernel version than what runs on the build machine... :(

Max



* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-28  9:30     ` Max Reitz
@ 2019-10-28  9:56       ` Max Reitz
  2019-10-28 10:07         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-28  9:56 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-block,
	qemu-devel, Vladimir Sementsov-Ogievskiy



On 28.10.19 10:30, Max Reitz wrote:
> On 28.10.19 10:24, Max Reitz wrote:
>> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>> As for how we can address the issue, I see three ways:
>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>     operations) to reach until infinity (INT64_MAX in practice), mark
>>>>     them serializing and wait for other conflicting requests.
>>>>
>>>>     Advantages:
>>>>     + Limits the impact to very specific cases
>>>>       (And that means it wouldn’t hurt too much to keep this workaround
>>>>       even when the XFS driver has been fixed)
>>>>     + Works around the bug where it happens, namely in file-posix
>>>>
>>>>     Disadvantages:
>>>>     - A bit complex
>>>>     - A bit of a layering violation (should file-posix have access to
>>>>       tracked requests?)
>>>
>>> Your patch series is reasonable.  I don't think it's too bad.
>>>
>>> The main question is how to detect the XFS fix once it ships.  XFS
>>> already has a ton of ioctls, so maybe they don't mind adding a
>>> feature/quirk bit map ioctl for publishing information about bug fixes
>>> to userspace.  I didn't see another obvious way of doing it, maybe a
>>> mount option that the kernel automatically sets and that gets reported
>>> to userspace?
>>
>> I’ll add a note to the RH BZ.
>>
>>> If we imagine that XFS will not provide a mechanism to detect the
>>> presence of the fix, then could we ask QEMU package maintainers to
>>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>>> in the future when their distro has been shipping a fixed kernel for a
>>> while?  It's ugly because it doesn't work if the user installs an older
>>> custom-built kernel on the host.  But at least it will cover 98% of
>>> users...
>>
>> :-/
>>
>> I don’t like it, but I suppose it would work.  We could also
>> automatically enable this disabling option in configure when we detect
>> uname to report a kernel version that must include the fix.  (This
>> wouldn’t work for kernel with backported fixes, but those disappear over
>> time...)
> I just realized that none of this is going to work for the gluster case
> brought up by Nir.  The affected kernel is the remote one and we have no
> insight into that.  I don’t think we can do ioctls to XFS over gluster,
> can we?

On third thought, we could try to detect whether the file is on a remote
filesystem, and if so enable the workaround unconditionally.  I suppose
it wouldn’t hurt performance-wise, given that it’s a remote filesystem
anyway.
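
For locally mounted network filesystems (NFS, CIFS, a FUSE-mounted
gluster volume, ...) that detection could be as crude as an fstatfs()
check.  A sketch, where the set of filesystem magic numbers that matter
is my assumption (and it obviously cannot help when the image is
accessed through the gluster protocol driver directly):

#include <stdbool.h>
#include <sys/vfs.h>        /* fstatfs(), struct statfs */

#ifndef NFS_SUPER_MAGIC
#define NFS_SUPER_MAGIC   0x6969
#endif
#ifndef CIFS_MAGIC_NUMBER
#define CIFS_MAGIC_NUMBER 0xff534d42
#endif
#ifndef SMB_SUPER_MAGIC
#define SMB_SUPER_MAGIC   0x517b
#endif
#ifndef FUSE_SUPER_MAGIC
#define FUSE_SUPER_MAGIC  0x65735546
#endif

/* Best-effort guess whether fd lives on a remote/network filesystem
 * (which would include a FUSE-mounted gluster volume). */
static bool fd_is_probably_remote(int fd)
{
    struct statfs st;

    if (fstatfs(fd, &st) != 0) {
        return true;  /* unknown -- err on the side of the workaround */
    }
    switch ((unsigned int)st.f_type) {
    case NFS_SUPER_MAGIC:
    case CIFS_MAGIC_NUMBER:
    case SMB_SUPER_MAGIC:
    case FUSE_SUPER_MAGIC:
        return true;
    default:
        return false;
    }
}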

Max



* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-28  9:56       ` Max Reitz
@ 2019-10-28 10:07         ` Vladimir Sementsov-Ogievskiy
  2019-10-28 10:10           ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-10-28 10:07 UTC (permalink / raw)
  To: Max Reitz, Stefan Hajnoczi
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block

28.10.2019 12:56, Max Reitz wrote:
> On 28.10.19 10:30, Max Reitz wrote:
>> On 28.10.19 10:24, Max Reitz wrote:
>>> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>> As for how we can address the issue, I see three ways:
>>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>>>      them serializing and wait for other conflicting requests.
>>>>>
>>>>>      Advantages:
>>>>>      + Limits the impact to very specific cases
>>>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>>>        even when the XFS driver has been fixed)
>>>>>      + Works around the bug where it happens, namely in file-posix
>>>>>
>>>>>      Disadvantages:
>>>>>      - A bit complex
>>>>>      - A bit of a layering violation (should file-posix have access to
>>>>>        tracked requests?)
>>>>
>>>> Your patch series is reasonable.  I don't think it's too bad.
>>>>
>>>> The main question is how to detect the XFS fix once it ships.  XFS
>>>> already has a ton of ioctls, so maybe they don't mind adding a
>>>> feature/quirk bit map ioctl for publishing information about bug fixes
>>>> to userspace.  I didn't see another obvious way of doing it, maybe a
>>>> mount option that the kernel automatically sets and that gets reported
>>>> to userspace?
>>>
>>> I’ll add a note to the RH BZ.
>>>
>>>> If we imagine that XFS will not provide a mechanism to detect the
>>>> presence of the fix, then could we ask QEMU package maintainers to
>>>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>>>> in the future when their distro has been shipping a fixed kernel for a
>>>> while?  It's ugly because it doesn't work if the user installs an older
>>>> custom-built kernel on the host.  But at least it will cover 98% of
>>>> users...
>>>
>>> :-/
>>>
>>> I don’t like it, but I suppose it would work.  We could also
>>> automatically enable this disabling option in configure when we detect
>>> uname to report a kernel version that must include the fix.  (This
>>> wouldn’t work for kernel with backported fixes, but those disappear over
>>> time...)
>> I just realized that none of this is going to work for the gluster case
>> brought up by Nir.  The affected kernel is the remote one and we have no
>> insight into that.  I don’t think we can do ioctls to XFS over gluster,
>> can we?
> 
> On third thought, we could try to detect whether the file is on a remote
> filesystem, and if so enable the workaround unconditionally.  I suppose
> it wouldn’t hurt performance-wise, given that it’s a remote filesystem
> anyway.
> 

I think that, for remote storage, the difference may be even bigger than for
local storage, as the cost of writing real zeroes over the wire, compared to a
fast zero command, is high.

Really, can we just live with a simple config option? Is it so bad?


-- 
Best regards,
Vladimir


* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-28 10:07         ` Vladimir Sementsov-Ogievskiy
@ 2019-10-28 10:10           ` Max Reitz
  2019-10-28 11:19             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-28 10:10 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block



On 28.10.19 11:07, Vladimir Sementsov-Ogievskiy wrote:
> 28.10.2019 12:56, Max Reitz wrote:
>> On 28.10.19 10:30, Max Reitz wrote:
>>> On 28.10.19 10:24, Max Reitz wrote:
>>>> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>>> As for how we can address the issue, I see three ways:
>>>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>>>>      them serializing and wait for other conflicting requests.
>>>>>>
>>>>>>      Advantages:
>>>>>>      + Limits the impact to very specific cases
>>>>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>>>>        even when the XFS driver has been fixed)
>>>>>>      + Works around the bug where it happens, namely in file-posix
>>>>>>
>>>>>>      Disadvantages:
>>>>>>      - A bit complex
>>>>>>      - A bit of a layering violation (should file-posix have access to
>>>>>>        tracked requests?)
>>>>>
>>>>> Your patch series is reasonable.  I don't think it's too bad.
>>>>>
>>>>> The main question is how to detect the XFS fix once it ships.  XFS
>>>>> already has a ton of ioctls, so maybe they don't mind adding a
>>>>> feature/quirk bit map ioctl for publishing information about bug fixes
>>>>> to userspace.  I didn't see another obvious way of doing it, maybe a
>>>>> mount option that the kernel automatically sets and that gets reported
>>>>> to userspace?
>>>>
>>>> I’ll add a note to the RH BZ.
>>>>
>>>>> If we imagine that XFS will not provide a mechanism to detect the
>>>>> presence of the fix, then could we ask QEMU package maintainers to
>>>>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>>>>> in the future when their distro has been shipping a fixed kernel for a
>>>>> while?  It's ugly because it doesn't work if the user installs an older
>>>>> custom-built kernel on the host.  But at least it will cover 98% of
>>>>> users...
>>>>
>>>> :-/
>>>>
>>>> I don’t like it, but I suppose it would work.  We could also
>>>> automatically enable this disabling option in configure when we detect
>>>> uname to report a kernel version that must include the fix.  (This
>>>> wouldn’t work for kernel with backported fixes, but those disappear over
>>>> time...)
>>> I just realized that none of this is going to work for the gluster case
>>> brought up by Nir.  The affected kernel is the remote one and we have no
>>> insight into that.  I don’t think we can do ioctls to XFS over gluster,
>>> can we?
>>
>> On third thought, we could try to detect whether the file is on a remote
>> filesystem, and if so enable the workaround unconditionally.  I suppose
>> it wouldn’t hurt performance-wise, given that it’s a remote filesystem
>> anyway.
>>
> 
> I think, for remote, the difference may be even higher than for local, as cost
> of writing real zeroes through the wire vs fast zero command is high.

I was speaking of a workaround in general, and that includes the
workaround presented in this series.

> Really, can we live with simple config option, is it so bad?

The config option won’t work for remote hosts, though.  That’s exactly
the problem.

Max



* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-27 12:35 ` Stefan Hajnoczi
  2019-10-28  9:24   ` Max Reitz
@ 2019-10-28 11:04   ` Kevin Wolf
  2019-10-28 11:25     ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 47+ messages in thread
From: Kevin Wolf @ 2019-10-28 11:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anton Nefedov, Alberto Garcia, qemu-block, qemu-devel, Max Reitz,
	Vladimir Sementsov-Ogievskiy


Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
> > As for how we can address the issue, I see three ways:
> > (1) The one presented in this series: On XFS with aio=native, we extend
> >     tracked requests for post-EOF fallocate() calls (i.e., write-zero
> >     operations) to reach until infinity (INT64_MAX in practice), mark
> >     them serializing and wait for other conflicting requests.
> > 
> >     Advantages:
> >     + Limits the impact to very specific cases
> >       (And that means it wouldn’t hurt too much to keep this workaround
> >       even when the XFS driver has been fixed)
> >     + Works around the bug where it happens, namely in file-posix
> > 
> >     Disadvantages:
> >     - A bit complex
> >     - A bit of a layering violation (should file-posix have access to
> >       tracked requests?)
> 
> Your patch series is reasonable.  I don't think it's too bad.
> 
> The main question is how to detect the XFS fix once it ships.  XFS
> already has a ton of ioctls, so maybe they don't mind adding a
> feature/quirk bit map ioctl for publishing information about bug fixes
> to userspace.  I didn't see another obvious way of doing it, maybe a
> mount option that the kernel automatically sets and that gets reported
> to userspace?

I think the CC list is too short for this question. We should involve
the XFS people here.

> If we imagine that XFS will not provide a mechanism to detect the
> presence of the fix, then could we ask QEMU package maintainers to
> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
> in the future when their distro has been shipping a fixed kernel for a
> while?  It's ugly because it doesn't work if the user installs an older
> custom-built kernel on the host.  But at least it will cover 98% of
> users...
> 
> > (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
> >     To my knowledge I’m the only one who has provided any benchmarks for
> >     this commit, and even then I was a bit skeptical because it performs
> >     well in some cases and bad in others.  I concluded that it’s
> >     probably worth it because the “some cases” are more likely to occur.
> > 
> >     Now we have this problem of corruption here (granted due to a bug in
> >     the XFS driver), and another report of massively degraded
> >     performance on ppc64
> >     (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
> >     private BZ; I hate that :-/  The report is about 40 % worse
> >     performance for an in-guest fio write benchmark.)
> > 
> >     So I have to ask the question about what the justification for
> >     keeping c8bb23cbdbe32f is.  How much does performance increase with
> >     it actually?  (On non-(ppc64+XFS) machines, obviously)
> > 
> >     Advantages:
> >     + Trivial
> >     + No layering violations
> >     + We wouldn’t need to keep track of whether the kernel bug has been
> >       fixed or not
> >     + Fixes the ppc64+XFS performance problem
> > 
> >     Disadvantages:
> >     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
> >       levels, whatever that means
> 
> My favorite because it is clean and simple, but Vladimir has a valid
> use-case for requiring this performance optimization so reverting isn't
> an option.

Vladimir also said that qcow2 subclusters would probably also solve his
problem, so maybe reverting and applying the subcluster patches instead
is a possible solution, too?

We already have some cases where the existing handle_alloc_space()
causes performance to actually become worse, and serialising requests as
a workaround isn't going to make performance any better. So even on
these grounds, keeping commit c8bb23cbdbe32f is questionable.

Kevin


* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-28 10:10           ` Max Reitz
@ 2019-10-28 11:19             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-10-28 11:19 UTC (permalink / raw)
  To: Max Reitz, Stefan Hajnoczi
  Cc: Kevin Wolf, Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block

28.10.2019 13:10, Max Reitz wrote:
> On 28.10.19 11:07, Vladimir Sementsov-Ogievskiy wrote:
>> 28.10.2019 12:56, Max Reitz wrote:
>>> On 28.10.19 10:30, Max Reitz wrote:
>>>> On 28.10.19 10:24, Max Reitz wrote:
>>>>> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>>>> As for how we can address the issue, I see three ways:
>>>>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>>>>       tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>>>>       operations) to reach until infinity (INT64_MAX in practice), mark
>>>>>>>       them serializing and wait for other conflicting requests.
>>>>>>>
>>>>>>>       Advantages:
>>>>>>>       + Limits the impact to very specific cases
>>>>>>>         (And that means it wouldn’t hurt too much to keep this workaround
>>>>>>>         even when the XFS driver has been fixed)
>>>>>>>       + Works around the bug where it happens, namely in file-posix
>>>>>>>
>>>>>>>       Disadvantages:
>>>>>>>       - A bit complex
>>>>>>>       - A bit of a layering violation (should file-posix have access to
>>>>>>>         tracked requests?)
>>>>>>
>>>>>> Your patch series is reasonable.  I don't think it's too bad.
>>>>>>
>>>>>> The main question is how to detect the XFS fix once it ships.  XFS
>>>>>> already has a ton of ioctls, so maybe they don't mind adding a
>>>>>> feature/quirk bit map ioctl for publishing information about bug fixes
>>>>>> to userspace.  I didn't see another obvious way of doing it, maybe a
>>>>>> mount option that the kernel automatically sets and that gets reported
>>>>>> to userspace?
>>>>>
>>>>> I’ll add a note to the RH BZ.
>>>>>
>>>>>> If we imagine that XFS will not provide a mechanism to detect the
>>>>>> presence of the fix, then could we ask QEMU package maintainers to
>>>>>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>>>>>> in the future when their distro has been shipping a fixed kernel for a
>>>>>> while?  It's ugly because it doesn't work if the user installs an older
>>>>>> custom-built kernel on the host.  But at least it will cover 98% of
>>>>>> users...
>>>>>
>>>>> :-/
>>>>>
>>>>> I don’t like it, but I suppose it would work.  We could also
>>>>> automatically enable this disabling option in configure when we detect
>>>>> uname to report a kernel version that must include the fix.  (This
>>>>> wouldn’t work for kernel with backported fixes, but those disappear over
>>>>> time...)
>>>> I just realized that none of this is going to work for the gluster case
>>>> brought up by Nir.  The affected kernel is the remote one and we have no
>>>> insight into that.  I don’t think we can do ioctls to XFS over gluster,
>>>> can we?
>>>
>>> On third thought, we could try to detect whether the file is on a remote
>>> filesystem, and if so enable the workaround unconditionally.  I suppose
>>> it wouldn’t hurt performance-wise, given that it’s a remote filesystem
>>> anyway.
>>>
>>
>> I think, for remote, the difference may be even higher than for local, as cost
>> of writing real zeroes through the wire vs fast zero command is high.
> 
> I was speaking of a workaround in general, and that includes the
> workaround presented in this series.

Ah, yes, sorry. Then it should be OK.

> 
>> Really, can we live with simple config option, is it so bad?
> 
> The config option won’t work for remote hosts, though.  That’s exactly
> the problem.
> 

Still, a config option will help people who have XFS setups and want to be absolutely safe.
(Or, with the other default, it will help people who don't have XFS setups (Virtuozzo) and want to be fast.)


-- 
Best regards,
Vladimir


* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-28 11:04   ` Kevin Wolf
@ 2019-10-28 11:25     ` Vladimir Sementsov-Ogievskiy
  2019-10-29  8:50       ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-10-28 11:25 UTC (permalink / raw)
  To: Kevin Wolf, Stefan Hajnoczi
  Cc: Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block, Max Reitz

28.10.2019 14:04, Kevin Wolf wrote:
> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>> As for how we can address the issue, I see three ways:
>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>      them serializing and wait for other conflicting requests.
>>>
>>>      Advantages:
>>>      + Limits the impact to very specific cases
>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>        even when the XFS driver has been fixed)
>>>      + Works around the bug where it happens, namely in file-posix
>>>
>>>      Disadvantages:
>>>      - A bit complex
>>>      - A bit of a layering violation (should file-posix have access to
>>>        tracked requests?)
>>
>> Your patch series is reasonable.  I don't think it's too bad.
>>
>> The main question is how to detect the XFS fix once it ships.  XFS
>> already has a ton of ioctls, so maybe they don't mind adding a
>> feature/quirk bit map ioctl for publishing information about bug fixes
>> to userspace.  I didn't see another obvious way of doing it, maybe a
>> mount option that the kernel automatically sets and that gets reported
>> to userspace?
> 
> I think the CC list is too short for this question. We should involve
> the XFS people here.
> 
>> If we imagine that XFS will not provide a mechanism to detect the
>> presence of the fix, then could we ask QEMU package maintainers to
>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>> in the future when their distro has been shipping a fixed kernel for a
>> while?  It's ugly because it doesn't work if the user installs an older
>> custom-built kernel on the host.  But at least it will cover 98% of
>> users...
>>
>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>      To my knowledge I’m the only one who has provided any benchmarks for
>>>      this commit, and even then I was a bit skeptical because it performs
>>>      well in some cases and bad in others.  I concluded that it’s
>>>      probably worth it because the “some cases” are more likely to occur.
>>>
>>>      Now we have this problem of corruption here (granted due to a bug in
>>>      the XFS driver), and another report of massively degraded
>>>      performance on ppc64
>>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>      private BZ; I hate that :-/  The report is about 40 % worse
>>>      performance for an in-guest fio write benchmark.)
>>>
>>>      So I have to ask the question about what the justification for
>>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>
>>>      Advantages:
>>>      + Trivial
>>>      + No layering violations
>>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>>        fixed or not
>>>      + Fixes the ppc64+XFS performance problem
>>>
>>>      Disadvantages:
>>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>        levels, whatever that means
>>
>> My favorite because it is clean and simple, but Vladimir has a valid
>> use-case for requiring this performance optimization so reverting isn't
>> an option.
> 
> Vladimir also said that qcow2 subclusters would probably also solve his
> problem, so maybe reverting and applying the subcluster patches instead
> is a possible solution, too?

I'm not sure about the SSD case; it may need the write-zero optimization anyway.

> 
> We already have some cases where the existing handle_alloc_space()
> causes performance to actually become worse, and serialising requests as
> a workaround isn't going to make performance any better. So even on
> these grounds, keeping commit c8bb23cbdbe32f is questionable.
> 

Can keeping handle_alloc_space under some config option be an option?
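
Something as simple as this is what I mean -- the ./configure switch
would only pick the compile-time default (the names here are made up):

#include <stdbool.h>

/* CONFIG_QCOW2_ALLOC_SPACE would be defined (or not) by a hypothetical
 * ./configure switch; both names are invented for this sketch. */
static bool qcow2_use_handle_alloc_space(void)
{
#ifdef CONFIG_QCOW2_ALLOC_SPACE
    return true;   /* keep the fast zero-write path for COW areas */
#else
    return false;  /* write explicit zeroes and avoid the XFS bug */
#endif
}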


-- 
Best regards,
Vladimir


* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-28 11:25     ` Vladimir Sementsov-Ogievskiy
@ 2019-10-29  8:50       ` Max Reitz
  2019-10-29 11:48         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-29  8:50 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Kevin Wolf, Stefan Hajnoczi
  Cc: Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block



On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
> 28.10.2019 14:04, Kevin Wolf wrote:
>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:

[...]

>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>      To my knowledge I’m the only one who has provided any benchmarks for
>>>>      this commit, and even then I was a bit skeptical because it performs
>>>>      well in some cases and bad in others.  I concluded that it’s
>>>>      probably worth it because the “some cases” are more likely to occur.
>>>>
>>>>      Now we have this problem of corruption here (granted due to a bug in
>>>>      the XFS driver), and another report of massively degraded
>>>>      performance on ppc64
>>>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>      private BZ; I hate that :-/  The report is about 40 % worse
>>>>      performance for an in-guest fio write benchmark.)
>>>>
>>>>      So I have to ask the question about what the justification for
>>>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>
>>>>      Advantages:
>>>>      + Trivial
>>>>      + No layering violations
>>>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>>>        fixed or not
>>>>      + Fixes the ppc64+XFS performance problem
>>>>
>>>>      Disadvantages:
>>>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>        levels, whatever that means
>>>
>>> My favorite because it is clean and simple, but Vladimir has a valid
>>> use-case for requiring this performance optimization so reverting isn't
>>> an option.
>>
>> Vladimir also said that qcow2 subclusters would probably also solve his
>> problem, so maybe reverting and applying the subcluster patches instead
>> is a possible solution, too?
> 
> I'm not sure about ssd case, it may need write-zero optimization anyway.

What exactly do you need?  Do you actually need to write zeroes (e.g.
because you’re storing images on block devices) or would it be
sufficient to just drop the COW areas when bdrv_has_zero_init() and
bdrv_has_zero_init_truncate() are true?

I’m asking because Dave Chinner said
(https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
fallocate() is always slow at least with aio=native because it needs to
wait for all concurrent AIO writes to finish, and so it causes the AIO
pipeline to stall.

(He suggested using XFS extent size hints to get the same effect as
write-zeroes for free, basically, but I don’t know whether that’s really
useful to us; as unallocated areas on XFS read back as zero anyway.)
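
For reference, such an extent size hint can be set per file with the
FS_IOC_FSSETXATTR ioctl.  A sketch (the hint size is chosen by the
caller, e.g. the qcow2 cluster size, and should be applied while the
file is still empty):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR/FSSETXATTR */

/* Ask the filesystem (XFS honors this) to allocate extents for this
 * file in multiples of `hint` bytes. */
static int set_extent_size_hint(int fd, uint32_t hint)
{
    struct fsxattr attr;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &attr) != 0) {
        return -1;
    }
    attr.fsx_xflags |= FS_XFLAG_EXTSIZE;
    attr.fsx_extsize = hint;
    return ioctl(fd, FS_IOC_FSSETXATTR, &attr);
}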

>> We already have some cases where the existing handle_alloc_space()
>> causes performance to actually become worse, and serialising requests as
>> a workaround isn't going to make performance any better. So even on
>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>
> 
> Can keeping handle_alloc_space under some config option be an option?

Hm.  A config option is weird if you’re the only one who’s going to
enable it.  But other than that I don’t have anything against it.

Max



* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-29  8:50       ` Max Reitz
@ 2019-10-29 11:48         ` Vladimir Sementsov-Ogievskiy
  2019-10-29 11:55           ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-10-29 11:48 UTC (permalink / raw)
  To: Max Reitz, Kevin Wolf, Stefan Hajnoczi
  Cc: Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block

29.10.2019 11:50, Max Reitz wrote:
> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>> 28.10.2019 14:04, Kevin Wolf wrote:
>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
> 
> [...]
> 
>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>       To my knowledge I’m the only one who has provided any benchmarks for
>>>>>       this commit, and even then I was a bit skeptical because it performs
>>>>>       well in some cases and bad in others.  I concluded that it’s
>>>>>       probably worth it because the “some cases” are more likely to occur.
>>>>>
>>>>>       Now we have this problem of corruption here (granted due to a bug in
>>>>>       the XFS driver), and another report of massively degraded
>>>>>       performance on ppc64
>>>>>       (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>       private BZ; I hate that :-/  The report is about 40 % worse
>>>>>       performance for an in-guest fio write benchmark.)
>>>>>
>>>>>       So I have to ask the question about what the justification for
>>>>>       keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>       it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>
>>>>>       Advantages:
>>>>>       + Trivial
>>>>>       + No layering violations
>>>>>       + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>         fixed or not
>>>>>       + Fixes the ppc64+XFS performance problem
>>>>>
>>>>>       Disadvantages:
>>>>>       - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>         levels, whatever that means
>>>>
>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>> use-case for requiring this performance optimization so reverting isn't
>>>> an option.
>>>
>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>> problem, so maybe reverting and applying the subcluster patches instead
>>> is a possible solution, too?
>>
>> I'm not sure about ssd case, it may need write-zero optimization anyway.
> 
> What exactly do you need?  Do you actually need to write zeroes (e.g.
> because you’re storing images on block devices) or would it be
> sufficient to just drop the COW areas when bdrv_has_zero_init() and
> bdrv_has_zero_init_truncate() are true?

Hmm, what do you mean? We need to zero the COW areas. So, the original way is to
write real zeroes, and the optimized way is fallocate. What do you mean by drop?
Mark subclusters as zeroes in the metadata?

But we'll still have COW areas within a subcluster, and we'll need to zero them
directly. And fallocate will most probably be faster in the SSD ext4 case.

> 
> I’m asking because Dave Chinner said
> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
> fallocate() is always slow at least with aio=native because it needs to
> wait for all concurrent AIO writes to finish, and so it causes the AIO
> pipeline to stall.
> 
> (He suggested using XFS extent size hints to get the same effect as
> write-zeroes for free, basically, but I don’t know whether that’s really
> useful to us; as unallocated areas on XFS read back as zero anyway.)
> 
>>> We already have some cases where the existing handle_alloc_space()
>>> causes performance to actually become worse, and serialising requests as
>>> a workaround isn't going to make performance any better. So even on
>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>
>>
>> Can keeping handle_alloc_space under some config option be an option?
> 
> Hm.  A config option is weird if you’re the only one who’s going to
> enable it.  But other than that I don’t have anything against it.
> 

It's just a bit easier for us to maintain a config option than an out-of-tree patch.
On the other hand, it's not a real problem to maintain this one patch separately.
It may return to the tree once the XFS bug is fixed.

-- 
Best regards,
Vladimir


* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-29 11:48         ` Vladimir Sementsov-Ogievskiy
@ 2019-10-29 11:55           ` Max Reitz
  2019-10-29 12:05             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-29 11:55 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Kevin Wolf, Stefan Hajnoczi
  Cc: Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block



On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
> 29.10.2019 11:50, Max Reitz wrote:
>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>
>> [...]
>>
>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>       To my knowledge I’m the only one who has provided any benchmarks for
>>>>>>       this commit, and even then I was a bit skeptical because it performs
>>>>>>       well in some cases and bad in others.  I concluded that it’s
>>>>>>       probably worth it because the “some cases” are more likely to occur.
>>>>>>
>>>>>>       Now we have this problem of corruption here (granted due to a bug in
>>>>>>       the XFS driver), and another report of massively degraded
>>>>>>       performance on ppc64
>>>>>>       (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>       private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>       performance for an in-guest fio write benchmark.)
>>>>>>
>>>>>>       So I have to ask the question about what the justification for
>>>>>>       keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>>       it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>
>>>>>>       Advantages:
>>>>>>       + Trivial
>>>>>>       + No layering violations
>>>>>>       + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>>         fixed or not
>>>>>>       + Fixes the ppc64+XFS performance problem
>>>>>>
>>>>>>       Disadvantages:
>>>>>>       - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>         levels, whatever that means
>>>>>
>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>> an option.
>>>>
>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>> is a possible solution, too?
>>>
>>> I'm not sure about ssd case, it may need write-zero optimization anyway.
>>
>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>> because you’re storing images on block devices) or would it be
>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>> bdrv_has_zero_init_truncate() are true?
> 
> Hmm, what do you mean? We need to zero COW areas. So, original way is to
> write real zeroes, optimized way is fallocate.. What do you mean by drop?
> Mark sublusters as zeroes by metadata?

Why do you need to zero COW areas?  For normal files, any data will read
as zero if you didn’t write anything there.

> But still we'll have COW areas in subcluster, and we'll need to directly zero
> them.. And fallocate will most probably be faster on ssd ext4 case..
> 
>>
>> I’m asking because Dave Chinner said
>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
>> fallocate() is always slow at least with aio=native because it needs to
>> wait for all concurrent AIO writes to finish, and so it causes the AIO
>> pipeline to stall.
>>
>> (He suggested using XFS extent size hints to get the same effect as
>> write-zeroes for free, basically, but I don’t know whether that’s really
>> useful to us; as unallocated areas on XFS read back as zero anyway.)
>>
>>>> We already have some cases where the existing handle_alloc_space()
>>>> causes performance to actually become worse, and serialising requests as
>>>> a workaround isn't going to make performance any better. So even on
>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>>
>>>
>>> Can keeping handle_alloc_space under some config option be an option?
>>
>> Hm.  A config option is weird if you’re the only one who’s going to
>> enable it.  But other than that I don’t have anything against it.
>>
> 
> It's just a bit easier for us to maintain config option, than out-of-tree patch.
> On the other hand, it's not a real problem to maintain this one patch in separate.
> It may return again to the tree, when XFS bug fixed.

We’ll still have the problem that fallocate() must stall aio=native
requests.

Max



* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-29 11:55           ` Max Reitz
@ 2019-10-29 12:05             ` Vladimir Sementsov-Ogievskiy
  2019-10-29 12:11               ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-10-29 12:05 UTC (permalink / raw)
  To: Max Reitz, Kevin Wolf, Stefan Hajnoczi
  Cc: Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block

29.10.2019 14:55, Max Reitz wrote:
> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
>> 29.10.2019 11:50, Max Reitz wrote:
>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>
>>> [...]
>>>
>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>>        To my knowledge I’m the only one who has provided any benchmarks for
>>>>>>>        this commit, and even then I was a bit skeptical because it performs
>>>>>>>        well in some cases and bad in others.  I concluded that it’s
>>>>>>>        probably worth it because the “some cases” are more likely to occur.
>>>>>>>
>>>>>>>        Now we have this problem of corruption here (granted due to a bug in
>>>>>>>        the XFS driver), and another report of massively degraded
>>>>>>>        performance on ppc64
>>>>>>>        (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>>        private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>>        performance for an in-guest fio write benchmark.)
>>>>>>>
>>>>>>>        So I have to ask the question about what the justification for
>>>>>>>        keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>>>        it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>>
>>>>>>>        Advantages:
>>>>>>>        + Trivial
>>>>>>>        + No layering violations
>>>>>>>        + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>>>          fixed or not
>>>>>>>        + Fixes the ppc64+XFS performance problem
>>>>>>>
>>>>>>>        Disadvantages:
>>>>>>>        - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>>          levels, whatever that means
>>>>>>
>>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>>> an option.
>>>>>
>>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>>> is a possible solution, too?
>>>>
>>>> I'm not sure about ssd case, it may need write-zero optimization anyway.
>>>
>>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>>> because you’re storing images on block devices) or would it be
>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>>> bdrv_has_zero_init_truncate() are true?
>>
>> Hmm, what do you mean? We need to zero COW areas. So, original way is to
>> write real zeroes, optimized way is fallocate.. What do you mean by drop?
>> Mark sublusters as zeroes by metadata?
> 
> Why do you need to zero COW areas?  For normal files, any data will read
> as zero if you didn’t write anything there.

Hmm, but when allocating a new cluster in qcow2, it's not guaranteed to be zero,
as it may be a reused, previously allocated cluster.

> 
>> But still we'll have COW areas in subcluster, and we'll need to directly zero
>> them.. And fallocate will most probably be faster on ssd ext4 case..
>>
>>>
>>> I’m asking because Dave Chinner said
>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
>>> fallocate() is always slow at least with aio=native because it needs to
>>> wait for all concurrent AIO writes to finish, and so it causes the AIO
>>> pipeline to stall.
>>>
>>> (He suggested using XFS extent size hints to get the same effect as
>>> write-zeroes for free, basically, but I don’t know whether that’s really
>>> useful to us; as unallocated areas on XFS read back as zero anyway.)
>>>
>>>>> We already have some cases where the existing handle_alloc_space()
>>>>> causes performance to actually become worse, and serialising requests as
>>>>> a workaround isn't going to make performance any better. So even on
>>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>>>
>>>>
>>>> Can keeping handle_alloc_space under some config option be an option?
>>>
>>> Hm.  A config option is weird if you’re the only one who’s going to
>>> enable it.  But other than that I don’t have anything against it.
>>>
>>
>> It's just a bit easier for us to maintain config option, than out-of-tree patch.
>> On the other hand, it's not a real problem to maintain this one patch in separate.
>> It may return again to the tree, when XFS bug fixed.
> 
> We’ll still have the problem that fallocate() must stall aio=native
> requests.
> 

Does that mean that fallocate is bad in general? Practice shows the opposite.
At least I have my examples with qemu-img bench. Can that effect be shown with
qemu-img bench or something?


-- 
Best regards,
Vladimir


* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-29 12:05             ` Vladimir Sementsov-Ogievskiy
@ 2019-10-29 12:11               ` Max Reitz
  2019-10-29 12:19                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-10-29 12:11 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Kevin Wolf, Stefan Hajnoczi
  Cc: Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block



On 29.10.19 13:05, Vladimir Sementsov-Ogievskiy wrote:
> 29.10.2019 14:55, Max Reitz wrote:
>> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
>>> 29.10.2019 11:50, Max Reitz wrote:
>>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>
>>>> [...]
>>>>
>>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>>>        To my knowledge I’m the only one who has provided any benchmarks for
>>>>>>>>        this commit, and even then I was a bit skeptical because it performs
>>>>>>>>        well in some cases and bad in others.  I concluded that it’s
>>>>>>>>        probably worth it because the “some cases” are more likely to occur.
>>>>>>>>
>>>>>>>>        Now we have this problem of corruption here (granted due to a bug in
>>>>>>>>        the XFS driver), and another report of massively degraded
>>>>>>>>        performance on ppc64
>>>>>>>>        (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>>>        private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>>>        performance for an in-guest fio write benchmark.)
>>>>>>>>
>>>>>>>>        So I have to ask the question about what the justification for
>>>>>>>>        keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>>>>        it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>>>
>>>>>>>>        Advantages:
>>>>>>>>        + Trivial
>>>>>>>>        + No layering violations
>>>>>>>>        + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>>>>          fixed or not
>>>>>>>>        + Fixes the ppc64+XFS performance problem
>>>>>>>>
>>>>>>>>        Disadvantages:
>>>>>>>>        - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>>>          levels, whatever that means
>>>>>>>
>>>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>>>> an option.
>>>>>>
>>>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>>>> is a possible solution, too?
>>>>>
>>>>> I'm not sure about ssd case, it may need write-zero optimization anyway.
>>>>
>>>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>>>> because you’re storing images on block devices) or would it be
>>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>>>> bdrv_has_zero_init_truncate() are true?
>>>
>>> Hmm, what do you mean? We need to zero COW areas. So, original way is to
>>> write real zeroes, optimized way is fallocate.. What do you mean by drop?
>>> Mark sublusters as zeroes by metadata?
>>
>> Why do you need to zero COW areas?  For normal files, any data will read
>> as zero if you didn’t write anything there.
> 
> Hmm, but when allocating new cluster in qcow2, it's not guaranteed to be zero,
> as it may be reused previously allocated cluster..

Hm, right.  We could special-case something for beyond the EOF, but I
don’t know whether that really makes it better.

OTOH, maybe allocations at the EOF are the real bottleneck.  Reusing
existing clusters should be rare enough that maybe the existing code
which explicitly writes zeroes is sufficient.
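
In code, that distinction would look roughly like this (a sketch;
`old_file_size` stands for the file length before the allocation and
`zero_init` for what bdrv_has_zero_init() & co. report):

#include <stdbool.h>
#include <stdint.h>

typedef enum { COW_SKIP_ZEROING, COW_WRITE_ZEROES } CowZeroMode;

/* Decide how to handle the COW areas of a newly allocated cluster. */
static CowZeroMode cow_zero_mode(uint64_t cluster_off, uint64_t old_file_size,
                                 bool zero_init)
{
    if (zero_init && cluster_off >= old_file_size) {
        /* Fresh space at EOF: it reads back as zero by itself, so no
         * zero-write (and thus no post-EOF fallocate) is needed. */
        return COW_SKIP_ZEROING;
    }
    /* Reused cluster, or no zero-init guarantee: write explicit zeroes
     * for the COW head and tail, as the pre-c8bb23cbdbe32f code did. */
    return COW_WRITE_ZEROES;
}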

>>
>>> But still we'll have COW areas in subcluster, and we'll need to directly zero
>>> them.. And fallocate will most probably be faster on ssd ext4 case..
>>>
>>>>
>>>> I’m asking because Dave Chinner said
>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
>>>> fallocate() is always slow at least with aio=native because it needs to
>>>> wait for all concurrent AIO writes to finish, and so it causes the AIO
>>>> pipeline to stall.
>>>>
>>>> (He suggested using XFS extent size hints to get the same effect as
>>>> write-zeroes for free, basically, but I don’t know whether that’s really
>>>> useful to us; as unallocated areas on XFS read back as zero anyway.)
>>>>
>>>>>> We already have some cases where the existing handle_alloc_space()
>>>>>> causes performance to actually become worse, and serialising requests as
>>>>>> a workaround isn't going to make performance any better. So even on
>>>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>>>>
>>>>>
>>>>> Can keeping handle_alloc_space under some config option be an option?
>>>>
>>>> Hm.  A config option is weird if you’re the only one who’s going to
>>>> enable it.  But other than that I don’t have anything against it.
>>>>
>>>
>>> It's just a bit easier for us to maintain config option, than out-of-tree patch.
>>> On the other hand, it's not a real problem to maintain this one patch in separate.
>>> It may return again to the tree, when XFS bug fixed.
>>
>> We’ll still have the problem that fallocate() must stall aio=native
>> requests.
>>
> 
> Does it mean that fallocate is bad in general? Practice shows the opposite..
> At least I have my examples with qemu-img bench. Can that thing be shown with
> qemu-img bench or something?

I haven’t done benchmarks yet, but I don’t think Laurent will mind if I
paste his fio benchmark results from the unfortunately private BZ here:

         QCOW2 ON |          EXT4         |          XFS          |
                  |                       |                       |
(rw, bs, iodepth) |   2.12.0  |   4.1.0   |   2.12.0  |   4.1.0   |
------------------+-----------+-----------+-----------+-----------+
(write, 16k, 1)   | 1868KiB/s | 1865KiB/s | 18.6MiB/s | 13.7MiB/s |
------------------+-----------+-----------+-----------+-----------+
(write, 64k, 1)   | 7036KiB/s | 7413KiB/s | 27.0MiB/s | 7465KiB/s |
------------------+-----------+-----------+-----------+-----------+
(write,  4k, 8)   |  535KiB/s |  524KiB/s | 13.9MiB/s | 1662KiB/s |
------------------+-----------+-----------+-----------+-----------+

And that just looks like ext4 performs worse with aio=native in general.
But I’ll have to do my own benchmarks first.

Max



* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-29 12:11               ` Max Reitz
@ 2019-10-29 12:19                 ` Vladimir Sementsov-Ogievskiy
  2019-10-29 12:23                   ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-10-29 12:19 UTC (permalink / raw)
  To: Max Reitz, Kevin Wolf, Stefan Hajnoczi
  Cc: Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block

29.10.2019 15:11, Max Reitz wrote:
> On 29.10.19 13:05, Vladimir Sementsov-Ogievskiy wrote:
>> 29.10.2019 14:55, Max Reitz wrote:
>>> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
>>>> 29.10.2019 11:50, Max Reitz wrote:
>>>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>>>>         To my knowledge I’m the only one who has provided any benchmarks for
>>>>>>>>>         this commit, and even then I was a bit skeptical because it performs
>>>>>>>>>         well in some cases and bad in others.  I concluded that it’s
>>>>>>>>>         probably worth it because the “some cases” are more likely to occur.
>>>>>>>>>
>>>>>>>>>         Now we have this problem of corruption here (granted due to a bug in
>>>>>>>>>         the XFS driver), and another report of massively degraded
>>>>>>>>>         performance on ppc64
>>>>>>>>>         (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>>>>         private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>>>>         performance for an in-guest fio write benchmark.)
>>>>>>>>>
>>>>>>>>>         So I have to ask the question about what the justification for
>>>>>>>>>         keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>>>>>         it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>>>>
>>>>>>>>>         Advantages:
>>>>>>>>>         + Trivial
>>>>>>>>>         + No layering violations
>>>>>>>>>         + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>>>>>           fixed or not
>>>>>>>>>         + Fixes the ppc64+XFS performance problem
>>>>>>>>>
>>>>>>>>>         Disadvantages:
>>>>>>>>>         - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>>>>           levels, whatever that means
>>>>>>>>
>>>>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>>>>> an option.
>>>>>>>
>>>>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>>>>> is a possible solution, too?
>>>>>>
>>>>>> I'm not sure about ssd case, it may need write-zero optimization anyway.
>>>>>
>>>>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>>>>> because you’re storing images on block devices) or would it be
>>>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>>>>> bdrv_has_zero_init_truncate() are true?
>>>>
>>>> Hmm, what do you mean? We need to zero COW areas. So, original way is to
>>>> write real zeroes, optimized way is fallocate.. What do you mean by drop?
>>>> Mark subclusters as zeroes by metadata?
>>>
>>> Why do you need to zero COW areas?  For normal files, any data will read
>>> as zero if you didn’t write anything there.
>>
>> Hmm, but when allocating a new cluster in qcow2, it's not guaranteed to be zero,
>> as it may be a reused, previously allocated cluster.
> 
> Hm, right.  We could special-case something for beyond the EOF, but I
> don’t know whether that really makes it better.
> 
> OTOH, maybe allocations at the EOF are the real bottleneck.  Reusing
> existing clusters should be rare enough that maybe the existing code
> which explicitly writes zeroes is sufficient.

But, as I understand it, pre-EOF fallocates are safe on XFS?  So we may just
drop calling fallocate past EOF (that range reads as zero anyway) and keep the
fallocate pre-EOF (which is safe)?  (The only special case is a request
intersecting EOF, so it would be better to keep the file length at a
cluster-size boundary, truncating on exit.)
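
Roughly something like this minimal sketch (plain POSIX, invented helper name,
only to illustrate the idea; not the actual file-posix code):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdint.h>
    #include <sys/stat.h>

    /* Zero only the part of the request that lies before the current EOF;
     * the part past EOF reads back as zero anyway once the file grows. */
    static int zero_range_skip_past_eof(int fd, uint64_t offset, uint64_t bytes)
    {
        struct stat st;

        if (fstat(fd, &st) < 0) {
            return -errno;
        }

        if (offset >= (uint64_t)st.st_size) {
            return 0;   /* entirely past EOF: nothing to do */
        }

        if (offset + bytes > (uint64_t)st.st_size) {
            /* intersects EOF: zero only the pre-EOF part (keeping the file
             * length at a cluster-size boundary would avoid this case) */
            bytes = st.st_size - offset;
        }

        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, bytes) < 0) {
            return -errno;
        }
        return 0;
    }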

> 
>>>
>>>> But we'll still have COW areas within a subcluster, and we'll need to zero
>>>> them directly.  And fallocate will most probably be faster in the ssd ext4 case.
>>>>
>>>>>
>>>>> I’m asking because Dave Chinner said
>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
>>>>> fallocate() is always slow at least with aio=native because it needs to
>>>>> wait for all concurrent AIO writes to finish, and so it causes the AIO
>>>>> pipeline to stall.
>>>>>
>>>>> (He suggested using XFS extent size hints to get the same effect as
>>>>> write-zeroes for free, basically, but I don’t know whether that’s really
>>>>> useful to us; as unallocated areas on XFS read back as zero anyway.)
>>>>>
>>>>>>> We already have some cases where the existing handle_alloc_space()
>>>>>>> causes performance to actually become worse, and serialising requests as
>>>>>>> a workaround isn't going to make performance any better. So even on
>>>>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>>>>>
>>>>>>
>>>>>> Can keeping handle_alloc_space under some config option be an option?
>>>>>
>>>>> Hm.  A config option is weird if you’re the only one who’s going to
>>>>> enable it.  But other than that I don’t have anything against it.
>>>>>
>>>>
>>>> It's just a bit easier for us to maintain a config option than an out-of-tree patch.
>>>> On the other hand, it's not a real problem to maintain this one patch separately.
>>>> It may return to the tree once the XFS bug is fixed.
>>>
>>> We’ll still have the problem that fallocate() must stall aio=native
>>> requests.
>>>
>>
>> Does that mean that fallocate is bad in general? Practice shows the opposite:
>> at least I have my examples with qemu-img bench. Can that effect be shown with
>> qemu-img bench or something?
> 
> I haven’t done benchmarks yet, but I don’t think Laurent will mind if I
> paste his fio benchmark results from the unfortunately private BZ here:
> 
>           QCOW2 ON |          EXT4         |          XFS          |
>                    |                       |                       |
> (rw, bs, iodepth) |   2.12.0  |   4.1.0   |   2.12.0  |   4.1.0   |
> ------------------+-----------+-----------+-----------+-----------+
> (write, 16k, 1)   | 1868KiB/s | 1865KiB/s | 18.6MiB/s | 13.7MiB/s |
> ------------------+-----------+-----------+-----------+-----------+
> (write, 64k, 1)   | 7036KiB/s | 7413KiB/s | 27.0MiB/s | 7465KiB/s |
> ------------------+-----------+-----------+-----------+-----------+
> (write,  4k, 8)   |  535KiB/s |  524KiB/s | 13.9MiB/s | 1662KiB/s |
> ------------------+-----------+-----------+-----------+-----------+
> 
> And that just looks like ext4 performs worse with aio=native in general.
> But I’ll have to do my own benchmarks first.
> 
> Max
> 


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-29 12:19                 ` Vladimir Sementsov-Ogievskiy
@ 2019-10-29 12:23                   ` Max Reitz
  0 siblings, 0 replies; 47+ messages in thread
From: Max Reitz @ 2019-10-29 12:23 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Kevin Wolf, Stefan Hajnoczi
  Cc: Anton Nefedov, Alberto Garcia, qemu-devel, qemu-block


[-- Attachment #1.1: Type: text/plain, Size: 4097 bytes --]

On 29.10.19 13:19, Vladimir Sementsov-Ogievskiy wrote:
> 29.10.2019 15:11, Max Reitz wrote:
>> On 29.10.19 13:05, Vladimir Sementsov-Ogievskiy wrote:
>>> 29.10.2019 14:55, Max Reitz wrote:
>>>> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 29.10.2019 11:50, Max Reitz wrote:
>>>>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>>>>>         To my knowledge I’m the only one who has provided any benchmarks for
>>>>>>>>>>         this commit, and even then I was a bit skeptical because it performs
>>>>>>>>>>         well in some cases and bad in others.  I concluded that it’s
>>>>>>>>>>         probably worth it because the “some cases” are more likely to occur.
>>>>>>>>>>
>>>>>>>>>>         Now we have this problem of corruption here (granted due to a bug in
>>>>>>>>>>         the XFS driver), and another report of massively degraded
>>>>>>>>>>         performance on ppc64
>>>>>>>>>>         (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>>>>>         private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>>>>>         performance for an in-guest fio write benchmark.)
>>>>>>>>>>
>>>>>>>>>>         So I have to ask the question about what the justification for
>>>>>>>>>>         keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>>>>>>         it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>>>>>
>>>>>>>>>>         Advantages:
>>>>>>>>>>         + Trivial
>>>>>>>>>>         + No layering violations
>>>>>>>>>>         + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>>>>>>           fixed or not
>>>>>>>>>>         + Fixes the ppc64+XFS performance problem
>>>>>>>>>>
>>>>>>>>>>         Disadvantages:
>>>>>>>>>>         - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>>>>>           levels, whatever that means
>>>>>>>>>
>>>>>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>>>>>> an option.
>>>>>>>>
>>>>>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>>>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>>>>>> is a possible solution, too?
>>>>>>>
>>>>>>> I'm not sure about ssd case, it may need write-zero optimization anyway.
>>>>>>
>>>>>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>>>>>> because you’re storing images on block devices) or would it be
>>>>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>>>>>> bdrv_has_zero_init_truncate() are true?
>>>>>
>>>>> Hmm, what do you mean? We need to zero COW areas. So, original way is to
>>>>> write real zeroes, optimized way is fallocate.. What do you mean by drop?
>>>>> Mark subclusters as zeroes by metadata?
>>>>
>>>> Why do you need to zero COW areas?  For normal files, any data will read
>>>> as zero if you didn’t write anything there.
>>>
>>> Hmm, but when allocating a new cluster in qcow2, it's not guaranteed to be zero,
>>> as it may be a reused, previously allocated cluster.
>>
>> Hm, right.  We could special-case something for beyond the EOF, but I
>> don’t know whether that really makes it better.
>>
>> OTOH, maybe allocations at the EOF are the real bottleneck.  Reusing
>> existing clusters should be rare enough that maybe the existing code
>> which explicitly writes zeroes is sufficient.
> 
> But, as I understand it, pre-EOF fallocates are safe on XFS?  So we may just
> drop calling fallocate past EOF (that range reads as zero anyway) and keep the
> fallocate pre-EOF (which is safe)?

But probably still slow.  And there’s the question of how much
complexity we want to heap onto this.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-10-25 14:19     ` Max Reitz
  2019-10-25 14:35       ` Kevin Wolf
  2019-10-25 14:36       ` Vladimir Sementsov-Ogievskiy
@ 2019-11-04 14:03       ` Alberto Garcia
  2019-11-04 14:25         ` Max Reitz
  2 siblings, 1 reply; 47+ messages in thread
From: Alberto Garcia @ 2019-11-04 14:03 UTC (permalink / raw)
  To: Max Reitz, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi

On Fri 25 Oct 2019 04:19:30 PM CEST, Max Reitz wrote:
>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
>>> cluster-size, even on rotational disk, which means that previous
>>> assumption about calling handle_alloc_space() only for ssd is wrong,
>>> we need smarter heuristics..
>>>
>>> So, I'd prefer (1) or (2).
>
> OK.  I wonder whether that problem would go away with Berto’s subcluster
> series, though.

Catching up with this now. I was told about this last week at the KVM
Forum, but if the problem comes from the use of fallocate() on XFS,
then I don't think subclusters will solve it.

handle_alloc_space() is used to fill a cluster with zeroes when there's
COW, and the same thing happens with subclusters, just at the subcluster
level instead, of course.

What can happen, if the subcluster size matches the filesystem block
size, is that there's no need for any COW and therefore the bug is never
triggered. But that's not quite the same as a fix :-)
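
Just to make the "same mechanism, smaller unit" point concrete, here is a toy
sketch (not the actual qcow2 code; it assumes the guest write fits inside one
allocation unit):

    #include <stdint.h>

    typedef struct CowAreas {
        uint64_t head;   /* zeroes to write before the guest data */
        uint64_t tail;   /* zeroes to write after the guest data  */
    } CowAreas;

    /* unit_size is the cluster size today, or the subcluster size with
     * subclusters; the computation itself does not change. */
    static CowAreas cow_areas(uint64_t unit_size, uint64_t write_ofs,
                              uint64_t write_len)
    {
        uint64_t unit_start = write_ofs - (write_ofs % unit_size);
        CowAreas c = {
            .head = write_ofs - unit_start,
            .tail = unit_start + unit_size - (write_ofs + write_len),
        };
        return c;
    }

For a 4 KB write at the start of a fresh 2 MB cluster that is almost 2 MB of
zeroes; with 64 KB subclusters it shrinks to at most 60 KB, but it does not go
away.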

> Maybe make a decision based both on the ratio of data size to COW area
> length (only invoke handle_alloc_space() under a certain threshold),
> and the absolute COW area length (always invoke it above a certain
> threshold, unless the ratio doesn’t allow it)?

Maybe combining that with the smaller clusters/subclusters can work
around the problem. The maximum subcluster size is 64KB (for a 2MB
cluster).
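
For illustration, one way to read the threshold idea quoted above could look
like this (the names and numbers are invented, this is not actual qcow2 code):

    #include <stdbool.h>
    #include <stdint.h>

    enum {
        COW_RATIO_MAX = 8,            /* data must be <= 1/8 of the COW area */
        COW_ABS_MIN   = 256 * 1024,   /* and the COW area at least 256 KB    */
    };

    /* Use the write-zeroes path only when the COW area is large both
     * relative to the guest data and in absolute terms; otherwise just
     * write the zeroes explicitly. */
    static bool use_handle_alloc_space(uint64_t data_len, uint64_t cow_len)
    {
        bool ratio_allows = data_len * COW_RATIO_MAX <= cow_len;
        bool cow_is_large = cow_len >= COW_ABS_MIN;

        return ratio_allows && cow_is_large;
    }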

Berto


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-11-04 14:03       ` Alberto Garcia
@ 2019-11-04 14:25         ` Max Reitz
  2019-11-04 15:12           ` Alberto Garcia
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-11-04 14:25 UTC (permalink / raw)
  To: Alberto Garcia, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 2315 bytes --]

On 04.11.19 15:03, Alberto Garcia wrote:
> On Fri 25 Oct 2019 04:19:30 PM CEST, Max Reitz wrote:
>>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
>>>> cluster-size, even on rotational disk, which means that previous
>>>> assumption about calling handle_alloc_space() only for ssd is wrong,
>>>> we need smarter heuristics..
>>>>
>>>> So, I'd prefer (1) or (2).
>>
>> OK.  I wonder whether that problem would go away with Berto’s subcluster
>> series, though.
> 
> Catching up with this now. I was told about this last week at the KVM
> Forum, but if the problem comes from the use of fallocate() on XFS,
> then I don't think subclusters will solve it.
> 
> handle_alloc_space() is used to fill a cluster with zeroes when there's
> COW, and the same thing happens with subclusters, just at the subcluster
> level instead, of course.
> 
> What can happen, if the subcluster size matches the filesystem block
> size, is that there's no need for any COW and therefore the bug is never
> triggered. But that's not quite the same as a fix :-)

No, what I meant was that the original problem that led to c8bb23cbdbe
would go away.

c8bb23cbdbe was added because small writes to new clusters are slow when
the clusters are large (because you need to do a 2 MB write on the host
for a 4 kB write from the guest).  So handle_alloc_space() was added to
alleviate the problem with a zero-write instead of actually writing zeroes.
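
To spell out the two paths being compared, a plain-POSIX toy sketch (invented
helpers, no O_DIRECT alignment handling, not the real qcow2 code):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* (a) the old path: explicitly write zeroes for the COW head and tail
     *     around the guest data, i.e. a whole cluster's worth of I/O */
    static int cow_with_explicit_zeroes(int fd, off_t cluster_ofs,
                                        size_t cluster_size, off_t data_ofs,
                                        const void *data, size_t data_len)
    {
        size_t head = data_ofs - cluster_ofs;
        size_t tail = cluster_size - head - data_len;
        char *zeroes = calloc(1, cluster_size);
        int ret = 0;

        if (!zeroes) {
            return -1;
        }
        if (pwrite(fd, zeroes, head, cluster_ofs) < 0 ||
            pwrite(fd, data, data_len, data_ofs) < 0 ||
            pwrite(fd, zeroes, tail, data_ofs + data_len) < 0) {
            ret = -1;
        }
        free(zeroes);
        return ret;
    }

    /* (b) the handle_alloc_space() idea: one write-zeroes call over the
     *     whole cluster (cheap on most filesystems), then only the data */
    static int cow_with_write_zeroes(int fd, off_t cluster_ofs,
                                     size_t cluster_size, off_t data_ofs,
                                     const void *data, size_t data_len)
    {
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, cluster_ofs, cluster_size) < 0) {
            return -1;
        }
        return pwrite(fd, data, data_len, data_ofs) < 0 ? -1 : 0;
    }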

The question is whether there is no need for handle_alloc_space() with
subclusters because a normal write with explicit zeroes being written in
the COW areas would be sufficiently quick.  (Because the subclusters for
2 MB clusters are just 64 kB in size.)

If that were so (right now it doesn’t look like it), we could revert
c8bb23cbdbe and wouldn’t see the bug anymore.

Max

>> Maybe make a decision based both on the ratio of data size to COW area
>> length (only invoke handle_alloc_space() under a certain threshold),
>> and the absolute COW area length (always invoke it above a certain
>> threshold, unless the ratio doesn’t allow it)?
> 
> Maybe combining that with the smaller clusters/subclusters can work
> around the problem. The maximum subcluster size is 64KB (for a 2MB
> cluster).
> 
> Berto
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-11-04 14:25         ` Max Reitz
@ 2019-11-04 15:12           ` Alberto Garcia
  2019-11-04 15:14             ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Alberto Garcia @ 2019-11-04 15:12 UTC (permalink / raw)
  To: Max Reitz, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi

On Mon 04 Nov 2019 03:25:12 PM CET, Max Reitz wrote:
>>>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
>>>>> cluster-size, even on rotational disk, which means that previous
>>>>> assumption about calling handle_alloc_space() only for ssd is wrong,
>>>>> we need smarter heuristics..
>>>>>
>>>>> So, I'd prefer (1) or (2).
>>>
>>> OK.  I wonder whether that problem would go away with Berto’s subcluster
>>> series, though.
>> 
>> Catching up with this now. I was told about this last week at the KVM
>> Forum, but if the problem comes from the use of fallocate() on XFS,
>> then I don't think subclusters will solve it.
>> 
>> handle_alloc_space() is used to fill a cluster with zeroes when there's
>> COW, and the same thing happens with subclusters, just at the subcluster
>> level instead, of course.
>> 
>> What can happen, if the subcluster size matches the filesystem block
>> size, is that there's no need for any COW and therefore the bug is never
>> triggered. But that's not quite the same as a fix :-)
>
> No, what I meant was that the original problem that led to c8bb23cbdbe
> would go away.

Ah, right. Not quite, according to my numbers:

|--------------+----------------+-----------------+-------------|
| Cluster size | subclusters=on | subclusters=off | fallocate() |
|--------------+----------------+-----------------+-------------|
|       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
|       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
|      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
|      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
|--------------+----------------+-----------------+-------------|

There's obviously no backing image, and only the last column uses
handle_alloc_space() / fallocate().

Berto


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-11-04 15:12           ` Alberto Garcia
@ 2019-11-04 15:14             ` Max Reitz
  2019-11-04 15:49               ` Alberto Garcia
  0 siblings, 1 reply; 47+ messages in thread
From: Max Reitz @ 2019-11-04 15:14 UTC (permalink / raw)
  To: Alberto Garcia, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 2200 bytes --]

On 04.11.19 16:12, Alberto Garcia wrote:
> On Mon 04 Nov 2019 03:25:12 PM CET, Max Reitz wrote:
>>>>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
>>>>>> cluster-size, even on rotational disk, which means that previous
>>>>>> assumption about calling handle_alloc_space() only for ssd is wrong,
>>>>>> we need smarter heuristics..
>>>>>>
>>>>>> So, I'd prefer (1) or (2).
>>>>
>>>> OK.  I wonder whether that problem would go away with Berto’s subcluster
>>>> series, though.
>>>
>>> Catching up with this now. I was told about this last week at the KVM
>>> Forum, but if the problem comes from the use of fallocate() on XFS,
>>> then I don't think subclusters will solve it.
>>>
>>> handle_alloc_space() is used to fill a cluster with zeroes when there's
>>> COW, and the same thing happens with subclusters, just at the subcluster
>>> level instead, of course.
>>>
>>> What can happen, if the subcluster size matches the filesystem block
>>> size, is that there's no need for any COW and therefore the bug is never
>>> triggered. But that's not quite the same as a fix :-)
>>
>> No, what I meant was that the original problem that led to c8bb23cbdbe
>> would go away.
> 
> Ah, right. Not quite, according to my numbers:
> 
> |--------------+----------------+-----------------+-------------|
> | Cluster size | subclusters=on | subclusters=off | fallocate() |
> |--------------+----------------+-----------------+-------------|
> |       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
> |       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
> |      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
> |      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
> |--------------+----------------+-----------------+-------------|
> 
> There's obviously no backing image, and only the last column uses
> handle_alloc_space() / fallocate().

Thanks for providing some numbers!

It was my impression, too, that subclusters wouldn’t solve it.  But it
didn’t seem like that big of a difference to me.  Did you run this with
aio=native?  (Because that’s where we have the XFS problem)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-11-04 15:14             ` Max Reitz
@ 2019-11-04 15:49               ` Alberto Garcia
  2019-11-04 16:07                 ` Max Reitz
  0 siblings, 1 reply; 47+ messages in thread
From: Alberto Garcia @ 2019-11-04 15:49 UTC (permalink / raw)
  To: Max Reitz, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi

On Mon 04 Nov 2019 04:14:56 PM CET, Max Reitz wrote:

>>> No, what I meant was that the original problem that led to
>>> c8bb23cbdbe would go away.
>> 
>> Ah, right. Not quite, according to my numbers:
>> 
>> |--------------+----------------+-----------------+-------------|
>> | Cluster size | subclusters=on | subclusters=off | fallocate() |
>> |--------------+----------------+-----------------+-------------|
>> |       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
>> |       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
>> |      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
>> |      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
>> |--------------+----------------+-----------------+-------------|
>> 
>> There's obviously no backing image, and only the last column uses
>> handle_alloc_space() / fallocate().
>
> Thanks for providing some numbers!
>
> It was my impression, too, that subclusters wouldn’t solve it.  But it
> didn’t seem like that big of a difference to me.  Did you run this
> with aio=native?  (Because that’s where we have the XFS problem)

Here's with aio=native

|--------------+----------------+-----------------+-------------|
| Cluster size | subclusters=on | subclusters=off | fallocate() |
|--------------+----------------+-----------------+-------------|
|       256 KB |     19897 IOPS |        891 IOPS |  32194 IOPS |
|       512 KB |     17881 IOPS |        436 IOPS |  33092 IOPS |
|      1024 KB |     17050 IOPS |        341 IOPS |  32768 IOPS |
|      2048 KB |      7854 IOPS |        207 IOPS |  30944 IOPS |
|--------------+----------------+-----------------+-------------|

Berto


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC 0/3] block/file-posix: Work around XFS bug
  2019-11-04 15:49               ` Alberto Garcia
@ 2019-11-04 16:07                 ` Max Reitz
  0 siblings, 0 replies; 47+ messages in thread
From: Max Reitz @ 2019-11-04 16:07 UTC (permalink / raw)
  To: Alberto Garcia, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Anton Nefedov, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 1930 bytes --]

On 04.11.19 16:49, Alberto Garcia wrote:
> On Mon 04 Nov 2019 04:14:56 PM CET, Max Reitz wrote:
> 
>>>> No, what I meant was that the original problem that led to
>>>> c8bb23cbdbe would go away.
>>>
>>> Ah, right. Not quite, according to my numbers:
>>>
>>> |--------------+----------------+-----------------+-------------|
>>> | Cluster size | subclusters=on | subclusters=off | fallocate() |
>>> |--------------+----------------+-----------------+-------------|
>>> |       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
>>> |       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
>>> |      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
>>> |      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
>>> |--------------+----------------+-----------------+-------------|
>>>
>>> There's obviously no backing image, and only the last column uses
>>> handle_alloc_space() / fallocate().
>>
>> Thanks for providing some numbers!
>>
>> It was my impression, too, that subclusters wouldn’t solve it.  But it
>> didn’t seem like that big of a difference to me.  Did you run this
>> with aio=native?  (Because that’s where we have the XFS problem)
> 
> Here's with aio=native
> 
> |--------------+----------------+-----------------+-------------|
> | Cluster size | subclusters=on | subclusters=off | fallocate() |
> |--------------+----------------+-----------------+-------------|
> |       256 KB |     19897 IOPS |        891 IOPS |  32194 IOPS |
> |       512 KB |     17881 IOPS |        436 IOPS |  33092 IOPS |
> |      1024 KB |     17050 IOPS |        341 IOPS |  32768 IOPS |
> |      2048 KB |      7854 IOPS |        207 IOPS |  30944 IOPS |
> |--------------+----------------+-----------------+-------------|

OK, thanks again. :-)

So it seems right not to revert the commit and just work around the XFS
bug in file-posix.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2019-11-04 16:09 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-25  9:58 [RFC 0/3] block/file-posix: Work around XFS bug Max Reitz
2019-10-25  9:58 ` [RFC 1/3] block: Make wait/mark serialising requests public Max Reitz
2019-10-25  9:58 ` [RFC 2/3] block/file-posix: Detect XFS with CONFIG_FALLOCATE Max Reitz
2019-10-25 10:19   ` Kevin Wolf
2019-10-25 10:22     ` Max Reitz
2019-10-25 10:35       ` Kevin Wolf
2019-10-25 10:41         ` Max Reitz
2019-10-26 17:26   ` Nir Soffer
2019-10-25  9:58 ` [RFC 3/3] block/file-posix: Let post-EOF fallocate serialize Max Reitz
2019-10-26 17:28   ` Nir Soffer
2019-10-25 13:40 ` [RFC 0/3] block/file-posix: Work around XFS bug Vladimir Sementsov-Ogievskiy
2019-10-25 13:56   ` Vladimir Sementsov-Ogievskiy
2019-10-25 14:19     ` Max Reitz
2019-10-25 14:35       ` Kevin Wolf
2019-10-25 14:36       ` Vladimir Sementsov-Ogievskiy
2019-10-27 12:21         ` Stefan Hajnoczi
2019-11-04 14:03       ` Alberto Garcia
2019-11-04 14:25         ` Max Reitz
2019-11-04 15:12           ` Alberto Garcia
2019-11-04 15:14             ` Max Reitz
2019-11-04 15:49               ` Alberto Garcia
2019-11-04 16:07                 ` Max Reitz
2019-10-25 13:46 ` Peter Maydell
2019-10-25 14:16   ` Max Reitz
2019-10-25 14:17     ` Peter Maydell
2019-10-25 14:21       ` Max Reitz
2019-10-25 14:56         ` Peter Maydell
2019-10-26  0:14 ` no-reply
2019-10-26 17:37 ` Nir Soffer
2019-10-26 17:52   ` Vladimir Sementsov-Ogievskiy
2019-10-28  8:56     ` Max Reitz
2019-10-27 12:35 ` Stefan Hajnoczi
2019-10-28  9:24   ` Max Reitz
2019-10-28  9:30     ` Max Reitz
2019-10-28  9:56       ` Max Reitz
2019-10-28 10:07         ` Vladimir Sementsov-Ogievskiy
2019-10-28 10:10           ` Max Reitz
2019-10-28 11:19             ` Vladimir Sementsov-Ogievskiy
2019-10-28 11:04   ` Kevin Wolf
2019-10-28 11:25     ` Vladimir Sementsov-Ogievskiy
2019-10-29  8:50       ` Max Reitz
2019-10-29 11:48         ` Vladimir Sementsov-Ogievskiy
2019-10-29 11:55           ` Max Reitz
2019-10-29 12:05             ` Vladimir Sementsov-Ogievskiy
2019-10-29 12:11               ` Max Reitz
2019-10-29 12:19                 ` Vladimir Sementsov-Ogievskiy
2019-10-29 12:23                   ` Max Reitz
