qemu-devel.nongnu.org archive mirror
From: Max Reitz <mreitz@redhat.com>
To: Qemu-block <qemu-block@nongnu.org>
Cc: Alberto Garcia <berto@igalia.com>,
	Anton Nefedov <anton.nefedov@virtuozzo.com>,
	Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: Problems with c8bb23cbdbe3 on ppc64le
Date: Thu, 24 Oct 2019 11:08:22 +0200
Message-ID: <4dd781ed-b695-1610-438c-b459fe9027c4@redhat.com>
In-Reply-To: <2e7d321c-89f4-f3fd-8331-6bc276880de2@redhat.com>


On 10.10.19 17:17, Max Reitz wrote:
> Hi everyone,
> 
> (CCs just based on tags in the commit in question)
> 
> I have two bug reports which claim problems of qcow2 on XFS on ppc64le
> machines since qemu 4.1.0.  One of those is about bad performance
> (sorry, it isn’t public :-/), the other about data corruption
> (https://bugzilla.redhat.com/show_bug.cgi?id=1751934).
> 
> It looks like in both cases reverting c8bb23cbdbe3 solves the problem
> (which optimized COW of unallocated areas).
> 
> I think I’ve looked at every angle but can’t find what could be wrong
> with it.  Do any of you have any idea? :-/

It looks to me like an XFS bug.

On XFS, if you issue FALLOC_FL_ZERO_RANGE on a range past the EOF and an
AIO pwrite to an offset beyond that range, the pwrite will be discarded
if the fallocate settles after the pwrite (and both were started before
either had finished).  That is, the file length will be increased as if
only the fallocate had been executed, but not the pwrite, so the
pwrite’s data is lost.

(Interestingly, this is pretty similar to the bug I introduced in qemu
in 50ba5b2d994853b38fed10e0841b119da0f8b8e5, where the ftruncate() would
not consider parallel in-flight writes.)

I’ve attached a C program to show the problem.  It creates an empty
file, issues FALLOC_FL_ZERO_RANGE on the first 4 kB in a thread, and an
AIO pwrite in parallel on the second 4 kB.  It then runs hexdump -C on
the file.

On XFS, the hexdump shows only 4 kB of 0s.  On ext4 and btrfs, it shows
4 kB of 0s and 4 kB of 42s (printed as 2a, since hexdump -C prints hex).
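
As a rough sketch (not part of the attached program), the same check
could be done programmatically instead of by eyeballing the hexdump.
The names check_scratch_file() and enum check_result are made up for
illustration; the helper assumes the layout the reproducer uses, i.e.
4 kB of zeroes followed by 4 kB of bytes with the value 42 (0x2a).

```c
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 4096

/* Hypothetical result codes, not taken from the attached program */
enum check_result {
    CHECK_OK,          /* 4 kB of 0x00 followed by 4 kB of 0x2a */
    CHECK_WRITE_LOST,  /* file ends right after the zeroed 4 kB */
    CHECK_UNEXPECTED,  /* anything else, including I/O errors */
};

/* Classify the scratch file left behind by the reproducer */
static enum check_result check_scratch_file(const char *path)
{
    /* One extra byte so that a file longer than 8 kB is detected */
    unsigned char buf[2 * BLOCK + 1];
    FILE *f = fopen(path, "rb");

    if (f == NULL) {
        return CHECK_UNEXPECTED;
    }

    size_t n = fread(buf, 1, sizeof(buf), f);
    int err = ferror(f);
    fclose(f);

    if (err || (n != BLOCK && n != 2 * BLOCK)) {
        return CHECK_UNEXPECTED;
    }

    /* The first block must have been zeroed by the fallocate */
    for (size_t i = 0; i < BLOCK; i++) {
        if (buf[i] != 0) {
            return CHECK_UNEXPECTED;
        }
    }

    if (n == BLOCK) {
        /* Length grew as if only the fallocate had been executed */
        return CHECK_WRITE_LOST;
    }

    /* The second block must contain the pwrite's data */
    for (size_t i = BLOCK; i < 2 * BLOCK; i++) {
        if (buf[i] != 42) {
            return CHECK_UNEXPECTED;
        }
    }

    return CHECK_OK;
}
```

Calling check_scratch_file() on the scratch file after a run would be
expected to give CHECK_WRITE_LOST in the failing XFS case described
above and CHECK_OK otherwise, which makes it easier to run the test in
a loop and count failures.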

(You can uncomment the IN_ORDER define to execute the fallocate and
pwrite sequentially; XFS will then show the same output as ext4 and
btrfs.)

(Note that the two operations may not actually overlap: one may finish
before the other is issued, or the fallocate may settle before the
pwrite.  In such cases, the file will probably be written correctly.
However, I see the wrong result pretty much 100 % of the time, so on my
machine, pwrite and fallocate pretty much always run in parallel and
fallocate finishes after pwrite.)

Compile the program like so:

$ gcc parallel-falloc-and-pwrite.c -pthread -laio -Wall -Wextra \
      -pedantic -std=c11

And run it like so:

$ ./a.out tmp-file

Max

[-- Attachment #1.1.2: parallel-falloc-and-pwrite.c --]
[-- Type: text/x-csrc, Size: 1794 bytes --]

#define _GNU_SOURCE

#include <assert.h>
#include <fcntl.h>
#include <libaio.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Define this to perform the fallocate and the pwrite sequentially
// instead of in parallel

// #define IN_ORDER


int fd;

void *falloc_thread(void *arg)
{
    int ret;

    (void)arg;

    puts("starting fallocate");

    ret = fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, 4096);
    assert(ret == 0);

    puts("fallocate done");

    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t falloc_thr;
    int ret;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <scratch file>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_CREAT | O_RDWR | O_TRUNC | O_DIRECT, 0666);
    assert(fd >= 0);

    void *buf = aligned_alloc(4096, 4096);
    assert(buf != NULL);
    memset(buf, 42, 4096);

    io_context_t ctx = 0;
    ret = io_setup(1, &ctx);
    assert(ret == 0);

    ret = pthread_create(&falloc_thr, NULL, &falloc_thread, NULL);
    assert(ret == 0);

#ifdef IN_ORDER
    ret = pthread_join(falloc_thr, NULL);
    assert(ret == 0);
#endif

    struct iocb ior;
    io_prep_pwrite(&ior, fd, buf, 4096, 4096);

    puts("submitting pwrite");

    struct iocb *ios[] = { &ior };
    ret = io_submit(ctx, 1, ios);
    assert(ret == 1);

    struct io_event evs[1];
    ret = io_getevents(ctx, 1, 1, evs, NULL);
    assert(ret == 1);

    puts("pwrite done");

#ifndef IN_ORDER
    ret = pthread_join(falloc_thr, NULL);
    assert(ret == 0);
#endif

    close(fd);
    free(buf);

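    puts("\nHexdump should show 4k of 0s and 4k of 42s (0x2a):\n");

    // Flush stdio buffers, they would be lost when exec replaces the process
    fflush(stdout);

    // The variadic argument list must end with a null pointer of type char *
    execlp("hexdump", "hexdump", "-C", argv[1], (char *)NULL);
    perror("execlp"); // only reached if hexdump could not be executed
    return 1;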
}
