From: Max Reitz <mreitz@redhat.com>
To: Nir Soffer <nsoffer@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-block <qemu-block@nongnu.org>,
	Nir Soffer <nirsof@gmail.com>,
	QEMU Developers <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block
Date: Fri, 23 Aug 2019 15:58:09 +0200
Message-ID: <e8db1edb-b1ee-8244-c772-8e08794181f0@redhat.com>
In-Reply-To: <CAMRbyytxF8r9LoX4J_7ca2QPRtnpWgdTtyaKq=p=7ZaoMu-uug@mail.gmail.com>



On 22.08.19 21:01, Nir Soffer wrote:
> On Thu, Aug 22, 2019 at 9:11 PM Max Reitz <mreitz@redhat.com> wrote:
> 
>     On 22.08.19 18:39, Nir Soffer wrote:
>     > On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <mreitz@redhat.com> wrote:
>     >
>     >     On 16.08.19 23:21, Nir Soffer wrote:
>     >     > When creating an image with preallocation "off" or "falloc", the first
>     >     > block of the image is typically not allocated. When using Gluster
>     >     > storage backed by XFS filesystem, reading this block using direct I/O
>     >     > succeeds regardless of request length, fooling alignment detection.
>     >     >
>     >     > In this case we fallback to a safe value (4096) instead of the optimal
>     >     > value (512), which may lead to unneeded data copying when aligning
>     >     > requests.  Allocating the first block avoids the fallback.
>     >     >
>     >     > When using preallocation=off, we always allocate at least one filesystem
>     >     > block:
>     >     >
>     >     >     $ ./qemu-img create -f raw test.raw 1g
>     >     >     Formatting 'test.raw', fmt=raw size=1073741824
>     >     >
>     >     >     $ ls -lhs test.raw
>     >     >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
>     >     >
>     >     > I did quick performance tests for these flows:
>     >     > - Provisioning a VM with a new raw image.
>     >     > - Copying disks with qemu-img convert to new raw target image
>     >     >
>     >     > I installed Fedora 29 server on raw sparse image, measuring the time
>     >     > from clicking "Begin installation" until the "Reboot" button appears:
>     >     >
>     >     > Before(s)  After(s)     Diff(%)
>     >     > -------------------------------
>     >     >      356        389        +8.4
>     >     >
>     >     > I ran this only once, so we cannot tell much from these results.
>     >
>     >     So you’d expect it to be fast but it was slower?  Well, you only ran it
>     >     once and it isn’t really a precise benchmark...
>     >
>     >     > The second test was cloning the installation image with qemu-img
>     >     > convert, doing 10 runs:
>     >     >
>     >     >     for i in $(seq 10); do
>     >     >         rm -f dst.raw
>     >     >         sleep 10
>     >     >         time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
>     >     >     done
>     >     >
>     >     > Here is a table comparing the total time spent:
>     >     >
>     >     > Type    Before(s)   After(s)    Diff(%)
>     >     > ---------------------------------------
>     >     > real      530.028    469.123      -11.4
>     >     > user       17.204     10.768      -37.4
>     >     > sys        17.881      7.011      -60.7
>     >     >
>     >     > Here we see very clear improvement in CPU usage.
>     >     >
>     >     > Signed-off-by: Nir Soffer <nsoffer@redhat.com>
>     >     > ---
>     >     >  block/file-posix.c         | 25 +++++++++++++++++++++++++
>     >     >  tests/qemu-iotests/150.out |  1 +
>     >     >  tests/qemu-iotests/160     |  4 ++++
>     >     >  tests/qemu-iotests/175     | 19 +++++++++++++------
>     >     >  tests/qemu-iotests/175.out |  8 ++++----
>     >     >  tests/qemu-iotests/221.out | 12 ++++++++----
>     >     >  tests/qemu-iotests/253.out | 12 ++++++++----
>     >     >  7 files changed, 63 insertions(+), 18 deletions(-)
>     >     >
>     >     > diff --git a/block/file-posix.c b/block/file-posix.c
>     >     > index b9c33c8f6c..3964dd2021 100644
>     >     > --- a/block/file-posix.c
>     >     > +++ b/block/file-posix.c
>     >     > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
>     >     >      return ret;
>     >     >  }
>     >     > 
>     >     > +/*
>     >     > + * Help alignment detection by allocating the first block.
>     >     > + *
>     >     > + * When reading with direct I/O from unallocated area on Gluster backed by XFS,
>     >     > + * reading succeeds regardless of request length. In this case we fallback to
>     >     > + * safe aligment which is not optimal. Allocating the first block avoids this
>     >     > + * fallback.
>     >     > + *
>     >     > + * Returns: 0 on success, -errno on failure.
>     >     > + */
>     >     > +static int allocate_first_block(int fd)
>     >     > +{
>     >     > +    ssize_t n;
>     >     > +
>     >     > +    do {
>     >     > +        n = pwrite(fd, "\0", 1, 0);
>     >
>     >     This breaks when fd has been opened with O_DIRECT.
>     >
>     >
>     > It seems that we always open images without O_DIRECT when creating an image
>     > in qemu-img create, or when creating a target image in qemu-img convert.
> 
>     Yes.  But you don’t call this function directly from image creation code
>     but instead from the truncation function.  (The former also calls the
>     latter, but truncating is also an operation on its own.)
> 
>     [...]
> 
>     >     (Which happens when you open some file with cache.direct=on, and then
>     >     use e.g. QMP’s block_resize.)
>     >
>     >
>     > What would be a command triggering this? I can add a test.
> 
>     block_resize, as I’ve said:
> 
>     $ ./qemu-img create -f raw empty.img 0
> 
> 
> This is extreme edge case - why would someone create such image?

Because it works?

This is generally the first step of image creation with blockdev-create,
because you don’t care about the size of the protocol layer.

If you have a format layer that truncates the image to a fixed size and
does not write anything into the first block itself (say, because it uses
a footer), then with O_DIRECT allocate_first_block() will fail and the
first block will not actually be allocated.  The failure is silent: the
function does return an error value, but it is never checked, and there
is no comment explaining why we don’t check it.
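
To make the O_DIRECT problem concrete, here is a minimal sketch of what an
O_DIRECT-friendly variant could look like.  It is not the code from the
patch: the name allocate_first_block_aligned() and the block_size parameter
are made up for illustration, and it assumes the caller only uses it on an
image whose old length was 0, so writing zeroes over the first block loses
nothing.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical O_DIRECT-friendly variant of allocate_first_block().
 *
 * Assumptions (not taken from the patch): block_size is a power of two
 * that satisfies the alignment requirements of the underlying storage
 * (e.g. 4096), and the file is at least block_size bytes long.
 *
 * Returns: 0 on success, -errno on failure.
 */
static int allocate_first_block_aligned(int fd, size_t block_size)
{
    void *buf;
    ssize_t n;
    int ret;

    /* O_DIRECT needs an aligned buffer as well as an aligned length. */
    ret = posix_memalign(&buf, block_size, block_size);
    if (ret != 0) {
        return -ret;    /* posix_memalign() returns the error number */
    }
    memset(buf, 0, block_size);

    /* Write one full block of zeroes at offset 0, retrying on EINTR. */
    do {
        n = pwrite(fd, buf, block_size, 0);
    } while (n == -1 && errno == EINTR);

    /* Capture errno before free() gets a chance to change it. */
    ret = (n == -1) ? -errno : 0;
    free(buf);
    return ret;
}
```

Whatever the final shape is, the caller should either check the return
value or carry a comment that says why ignoring it is fine.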

I could show you that with VPC (which supports a fixed subformat where
it uses a footer), but unfortunately that’s a bit broken right now
(because of a bug in blockdev-create; I’ll send a patch).

The test would go like this:

$ x86_64-softmmu/qemu-system-x86_64 -qmp stdio
{"execute":"qmp_capabilities"}

{"execute":"blockdev-create",
 "arguments":{
    "job-id":"create",
    "options":{"driver":"file",
               "filename":"test.img",
               "size":0}}}

[Wait until the job is pending]

{"execute":"job-dismiss","arguments":{"id":"create"}}

{"execute":"blockdev-add",
 "arguments":{
    "driver":"file",
    "node-name":"protocol-node",
    "filename":"test.img",
    "cache":{"direct":true}}}

{"execute":"blockdev-create",
 "arguments":{
    "job-id":"create",
    "options":{"driver":"vpc",
               "file":"protocol-node",
               "subformat":"fixed",
               "size":67108864,
               "force-size":true}}}

[Wait until the job is pending]

{"execute":"job-dismiss","arguments":{"id":"create"}}

{"execute":"quit"}

And then:

$ ./qemu-img map test.img
Offset          Length          Mapped to       File
0x4000000       0x200           0x4000000       test.img

The footer is mapped, but the first block is not allocated.


As I said, for that to work, you need a patch (because of a bug), namely:

[Start of patch]

diff --git a/block/create.c b/block/create.c
index 1bd00ed5f8..572d3a4176 100644
--- a/block/create.c
+++ b/block/create.c
@@ -48,7 +48,7 @@ static int coroutine_fn blockdev_create_run(Job *job, Error **errp)

     qapi_free_BlockdevCreateOptions(s->opts);

-    return ret;
+    return ret < 0 ? ret : 0;
 }

 static const JobDriver blockdev_create_job_driver = {

[End of patch]

(The reason being that the vpc block driver returns 512 here to signify
success, but the job infrastructure treats anything but 0 as a failure.)
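
Spelled out as a standalone function, the convention looks like the sketch
below.  The helper name job_ret_from_driver_ret() is hypothetical; it just
isolates the expression from the hunk above so the two conventions are easy
to compare.

```c
/*
 * Hypothetical helper isolating the expression from the hunk above.
 * A driver may signal success with a positive value (vpc returns 512),
 * while the job layer expects exactly 0 for success and a negative
 * errno value for failure.
 */
static int job_ret_from_driver_ret(int ret)
{
    return ret < 0 ? ret : 0;
}
```

With this mapping, a return value of 512 from the driver becomes 0 for the
job, and negative errors pass through unchanged.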

>     $ x86_64-softmmu/qemu-system-x86_64 \
>         -qmp stdio \
>         -blockdev file,node-name=file,filename=empty.img,cache.direct=on \
>          <<EOF
>     {'execute':'qmp_capabilities'}
> 
> 
> This is probably too late for the allocation, since we already probed
> the alignment before executing block_resize, and used a safe fallback
> (4096).
> It can help if the image is reopened, since we probe alignment again.

I’m not talking about getting the alignment right when you have a
zero-length image.  That can probably never work with probing.  (Well, I
mean, technically you could make allocate_first_block() probe.  I won’t
ask for that because that really seems like too little gain for too much
effort.)
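
For context, the kind of request-length probing we are talking about can be
sketched roughly as follows.  This is an illustration only, not the probing
code in file-posix.c: the function name, the list of sizes, and the
PROBE_MAX_ALIGN constant are assumptions.  The 4096 fallback is the safe
value mentioned in the commit message.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>

#define PROBE_MAX_ALIGN 4096    /* safe fallback value */

/*
 * Hypothetical request-length probe for an fd opened with O_DIRECT.
 *
 * Tries increasing read lengths at offset 0 and returns the first one
 * that succeeds.  If even a 1-byte read succeeds (as on an unallocated
 * area on Gluster backed by XFS, or on an fd without O_DIRECT), the
 * probe is inconclusive and the safe fallback is returned.
 *
 * Returns the detected length, or 0 if the buffer cannot be allocated.
 */
static size_t probe_request_alignment(int fd)
{
    static const size_t sizes[] = { 1, 512, 1024, 2048, 4096 };
    size_t result = PROBE_MAX_ALIGN;
    void *buf;

    if (posix_memalign(&buf, PROBE_MAX_ALIGN, PROBE_MAX_ALIGN) != 0) {
        return 0;
    }

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        if (pread(fd, buf, sizes[i], 0) >= 0) {
            /* A successful 1-byte read tells us nothing: fall back. */
            result = (sizes[i] == 1) ? PROBE_MAX_ALIGN : sizes[i];
            break;
        }
    }

    free(buf);
    return result;
}
```

On a zero-length image every read just returns 0 bytes, so a probe like
this always ends up at the fallback.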

I’m just talking about the fact that this allocating write will fail, so
when the image is used the next time, it will not have the first block
allocated.

[...]

>     >     > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
>     >     >                  /* posix_fallocate() doesn't set errno. */
>     >     >                  error_setg_errno(errp, -result,
>     >     >                                   "Could not preallocate new data");
>     >     > +            } else if (current_length == 0) {
>     >     > +                allocate_first_block(fd);
>     >
>     >     Should posix_fallocate() not take care of precisely this?
>     >
>     >
>     > Only if the filesystem does not support fallocate() (e.g. NFS < 4.2).
>     >
>     > In this case posix_fallocate() is doing:
>     >
>     >   for (offset += (len - 1) % increment; len > 0; offset += increment)
>     >     {
>     >       len -= increment;
>     >       if (offset < st.st_size)
>     >         {
>     >           unsigned char c;
>     >           ssize_t rsize = __pread (fd, &c, 1, offset);
>     >           if (rsize < 0)
>     >             return errno;
>     >           /* If there is a non-zero byte, the block must have been
>     >              allocated already.  */
>     >           else if (rsize == 1 && c != 0)
>     >             continue;
>     >         }
>     >       if (__pwrite (fd, "", 1, offset) != 1)
>     >         return errno;
>     >     }
>     >
>     >
>     https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
>     >
>     > So opening a file with O_DIRECT will break preallocation=falloc on such
>     > filesystems,
> 
>     But won’t the function above just fail with EINVAL?
>     allocate_first_block() is executed only in case of success.
> 
> 
> Sure, but if posix_fallocate() fails, we fail qemu-img create/convert.

Exactly.  But if posix_fallocate() works, it should have allocated the
first block.

Max



Thread overview: 18+ messages
2019-08-16 21:21 [Qemu-devel] [PATCH] block: posix: Always allocate the first block Nir Soffer
2019-08-16 21:57 ` [Qemu-devel] [Qemu-block] " John Snow
2019-08-16 22:45   ` Nir Soffer
2019-08-16 23:00     ` John Snow
2019-08-22 11:30 ` [Qemu-devel] " Nir Soffer
2019-08-22 14:28 ` Max Reitz
2019-08-22 16:39   ` Nir Soffer
2019-08-22 18:11     ` Max Reitz
2019-08-22 19:01       ` Nir Soffer
2019-08-23 13:58         ` Max Reitz [this message]
2019-08-23 16:30           ` Nir Soffer
2019-08-23 17:41             ` Max Reitz
2019-08-23 16:48           ` Nir Soffer
2019-08-23 17:53             ` Max Reitz
2019-08-24 22:57               ` Nir Soffer
2019-08-25  7:44 ` [Qemu-devel] [Qemu-block] " Maxim Levitsky
2019-08-25 19:51   ` Nir Soffer
2019-08-25 22:17     ` Maxim Levitsky
