All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nir Soffer <nsoffer@redhat.com>
To: Max Reitz <mreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-block <qemu-block@nongnu.org>,
	Nir Soffer <nirsof@gmail.com>,
	QEMU Developers <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block
Date: Thu, 22 Aug 2019 22:01:09 +0300	[thread overview]
Message-ID: <CAMRbyytxF8r9LoX4J_7ca2QPRtnpWgdTtyaKq=p=7ZaoMu-uug@mail.gmail.com> (raw)
In-Reply-To: <271a2dd3-9f2b-013d-896f-f4059c804967@redhat.com>

On Thu, Aug 22, 2019 at 9:11 PM Max Reitz <mreitz@redhat.com> wrote:

> On 22.08.19 18:39, Nir Soffer wrote:
> > On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <mreitz@redhat.com
> > <mailto:mreitz@redhat.com>> wrote:
> >
> >     On 16.08.19 23:21, Nir Soffer wrote:
> >     > When creating an image with preallocation "off" or "falloc", the
> first
> >     > block of the image is typically not allocated. When using Gluster
> >     > storage backed by XFS filesystem, reading this block using direct
> I/O
> >     > succeeds regardless of request length, fooling alignment detection.
> >     >
> >     > In this case we fallback to a safe value (4096) instead of the
> optimal
> >     > value (512), which may lead to unneeded data copying when aligning
> >     > requests.  Allocating the first block avoids the fallback.
> >     >
> >     > When using preallocation=off, we always allocate at least one
> >     filesystem
> >     > block:
> >     >
> >     >     $ ./qemu-img create -f raw test.raw 1g
> >     >     Formatting 'test.raw', fmt=raw size=1073741824
> >     >
> >     >     $ ls -lhs test.raw
> >     >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >     >
> >     > I did quick performance tests for these flows:
> >     > - Provisioning a VM with a new raw image.
> >     > - Copying disks with qemu-img convert to new raw target image
> >     >
> >     > I installed Fedora 29 server on raw sparse image, measuring the
> time
> >     > from clicking "Begin installation" until the "Reboot" button
> appears:
> >     >
> >     > Before(s)  After(s)     Diff(%)
> >     > -------------------------------
> >     >      356        389        +8.4
> >     >
> >     > I ran this only once, so we cannot tell much from these results.
> >
> >     So you’d expect it to be fast but it was slower?  Well, you only ran
> it
> >     once and it isn’t really a precise benchmark...
> >
> >     > The second test was cloning the installation image with qemu-img
> >     > convert, doing 10 runs:
> >     >
> >     >     for i in $(seq 10); do
> >     >         rm -f dst.raw
> >     >         sleep 10
> >     >         time ./qemu-img convert -f raw -O raw -t none -T none
> >     src.raw dst.raw
> >     >     done
> >     >
> >     > Here is a table comparing the total time spent:
> >     >
> >     > Type    Before(s)   After(s)    Diff(%)
> >     > ---------------------------------------
> >     > real      530.028    469.123      -11.4
> >     > user       17.204     10.768      -37.4
> >     > sys        17.881      7.011      -60.7
> >     >
> >     > Here we see very clear improvement in CPU usage.
> >     >
> >     > Signed-off-by: Nir Soffer <nsoffer@redhat.com
> >     <mailto:nsoffer@redhat.com>>
> >     > ---
> >     >  block/file-posix.c         | 25 +++++++++++++++++++++++++
> >     >  tests/qemu-iotests/150.out |  1 +
> >     >  tests/qemu-iotests/160     |  4 ++++
> >     >  tests/qemu-iotests/175     | 19 +++++++++++++------
> >     >  tests/qemu-iotests/175.out |  8 ++++----
> >     >  tests/qemu-iotests/221.out | 12 ++++++++----
> >     >  tests/qemu-iotests/253.out | 12 ++++++++----
> >     >  7 files changed, 63 insertions(+), 18 deletions(-)
> >     >
> >     > diff --git a/block/file-posix.c b/block/file-posix.c
> >     > index b9c33c8f6c..3964dd2021 100644
> >     > --- a/block/file-posix.c
> >     > +++ b/block/file-posix.c
> >     > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void
> *opaque)
> >     >      return ret;
> >     >  }
> >     >
> >     > +/*
> >     > + * Help alignment detection by allocating the first block.
> >     > + *
> >     > + * When reading with direct I/O from unallocated area on Gluster
> >     backed by XFS,
> >     > + * reading succeeds regardless of request length. In this case we
> >     fallback to
> >     > + * safe aligment which is not optimal. Allocating the first block
> >     avoids this
> >     > + * fallback.
> >     > + *
> >     > + * Returns: 0 on success, -errno on failure.
> >     > + */
> >     > +static int allocate_first_block(int fd)
> >     > +{
> >     > +    ssize_t n;
> >     > +
> >     > +    do {
> >     > +        n = pwrite(fd, "\0", 1, 0);
> >
> >     This breaks when fd has been opened with O_DIRECT.
> >
> >
> > It seems that we always open images without O_DIRECT when creating an
> image
> > in qemu-img create, or when creating a target image in qemu-img convert.
>
> Yes.  But you don’t call this function directly from image creation code
> but instead from the truncation function.  (The former also calls the
> latter, but truncating is also an operation on its own.)
>
> [...]
>
> >     (Which happens when you open some file with cache.direct=on, and then
> >     use e.g. QMP’s block_resize.)
> >
> >
> > What would be a command triggering this? I can add a test.
>
> block_resize, as I’ve said:
>
> $ ./qemu-img create -f raw empty.img 0
>

This is extreme edge case - why would someone create such image?


> $ x86_64-softmmu/qemu-system-x86_64 \
>     -qmp stdio \
>     -blockdev file,node-name=file,filename=empty.img,cache.direct=on \
>      <<EOF
> {'execute':'qmp_capabilities'}
>

This is probably too late for the allocation, since we already probed
the alignment before executing block_resize, and used a safe fallback
(4096).
It can help if the image is reopened, since we probe alignment again.

> {'execute':'block_resize',
>  'arguments':{'node-name':'file',
>               'size':1048576}}

EOF
> $ ./qemu-img map empty.img
> Offset          Length          Mapped to       File
>
> (You’d expect a data chunk here.)

I suppose you can get the same effect with blockdev-create and some
> format that explicitly resizes the file to some target length (LUKS does
> this, I think), but this is the most direct route.
>

I will try to handle -blockdev in the next version.

>     It isn’t that bad because eventually you simply ignore the error.  But
> >     it still makes me wonder whether we shouldn’t write like the biggest
> >     power of two that does not exceed the new file length or
> MAX_BLOCKSIZE.
> >
> >
> > It makes sense if there is a way to cause qemu-img to use O_DIRECT when
> > creating an image.
> >
> >     > +    } while (n == -1 && errno == EINTR);
> >     > +
> >     > +    return (n == -1) ? -errno : 0;
> >     > +}
> >     > +
> >     >  static int handle_aiocb_truncate(void *opaque)
> >     >  {
> >     >      RawPosixAIOData *aiocb = opaque;
> >     > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void
> *opaque)
> >     >                  /* posix_fallocate() doesn't set errno. */
> >     >                  error_setg_errno(errp, -result,
> >     >                                   "Could not preallocate new
> data");
> >     > +            } else if (current_length == 0) {
> >     > +                allocate_first_block(fd);
> >
> >     Should posix_fallocate() not take care of precisely this?
> >
> >
> > Only if the filesystem does not support fallocate() (e.g. NFS < 4.2).
> >
> > In this case posix_fallocate() is doing:
> >
> >   for (offset += (len - 1) % increment; len > 0; offset += increment)
> >     {
> >       len -= increment;
> >       if (offset < st.st_size)
> >         {
> >           unsigned char c;
> >           ssize_t rsize = __pread (fd, &c, 1, offset);
> >           if (rsize < 0)
> >             return errno;
> >           /* If there is a non-zero byte, the block must have been
> >              allocated already.  */
> >           else if (rsize == 1 && c != 0)
> >             continue;
> >         }
> >       if (__pwrite (fd, "", 1, offset) != 1)
> >         return errno;
> >     }
> >
> >
> https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
> >
> > So opening a file with O_DIRECT will break preallocation=falloc on such
> > filesystems,
>
> But won’t the function above just fail with EINVAL?
> allocate_first_block() is executed only in case of success.
>

Sure, but if posix_fallocate() fails, we fail qemu-img create/convert.

> and writing one byte in allocate_first_block() is safe.
> >
> >     >              }
> >     >          } else {
> >     >              result = 0;
> >
> >     [...]
> >
> >     > diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> >     > index df89d3864b..ad2d054a47 100755
> >     > --- a/tests/qemu-iotests/160
> >     > +++ b/tests/qemu-iotests/160
> >     > @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
> >     >      $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip"
> >     -O "$IMGFMT" \
> >     >          2> /dev/null
> >     >      TEST_IMG="$TEST_IMG.out" _check_test_img
> >     > +
> >     > +    # We always write the first byte of an image.
> >     > +    printf "\0" > "$TEST_IMG.out.dd"
> >     > +
> >     >      dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip"
> status=none
> >
> >     Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given
> >     the lack of conv=notrunc)?
> >
> >
> > There is an issue only if dd open the file with O_TRUNC.
>
> It isn’t an issue, I just don’t understand why the printf would be
> necessary at all.
>
> dd should always truncate the output image unless conv=notrunc is
> specified.  But even if it didn’t do that, in all of these test cases it
> should copy some data from $TEST_IMG to the output, and thus should
> always overwrite the first byte anyway.
>

Right, this change is not needed.

> I will test
> > this again.
> >
> >     >
> >     >      echo
> >     > diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> >     > index 51e62c8276..c6a3a7bb1e 100755
> >     > --- a/tests/qemu-iotests/175
> >     > +++ b/tests/qemu-iotests/175
> >     > @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
> >     >  # the file size.  This function hides the resulting difference in
> the
> >     >  # stat -c '%b' output.
> >     >  # Parameter 1: Number of blocks an empty file occupies
> >     > -# Parameter 2: Image size in bytes
> >     > +# Parameter 2: Minimal number of blocks in an image
> >     > +# Parameter 3: Image size in bytes
> >     >  _filter_blocks()
> >     >  {
> >     >      extra_blocks=$1
> >     > -    img_size=$2
> >     > +    min_blocks=$2
> >     > +    img_size=$3
> >     >
> >     > -    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing
> >     allocated/" \
> >     > -        -e "s/blocks=$((extra_blocks + img_size /
> >     512))\\(\$\\|[^0-9]\\)/everything allocated/"
> >     > +    sed -e "s/blocks=$((extra_blocks +
> >     min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
> >
> >     I don’t think adding extra_blocks to min_blocks makes sense.  Just
> >     min_blocks alone should be what we want here.
> >
> >
> > We had failing tests (in vdsm) showing that filesystem may return more
> > blocs than
> > expected even for non-empty files, so this may be needed.
>
> But min_blocks is exactly the number of blocks of a file that has one
> allocated block.  I don’t see how adding the number of blocks an empty
> file occupies makes sense.
>

You are right. The test fails on filesystem that allocates extra blocks.

Nir

  reply	other threads:[~2019-08-22 19:03 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-16 21:21 [Qemu-devel] [PATCH] block: posix: Always allocate the first block Nir Soffer
2019-08-16 21:57 ` [Qemu-devel] [Qemu-block] " John Snow
2019-08-16 22:45   ` Nir Soffer
2019-08-16 23:00     ` John Snow
2019-08-22 11:30 ` [Qemu-devel] " Nir Soffer
2019-08-22 14:28 ` Max Reitz
2019-08-22 16:39   ` Nir Soffer
2019-08-22 18:11     ` Max Reitz
2019-08-22 19:01       ` Nir Soffer [this message]
2019-08-23 13:58         ` Max Reitz
2019-08-23 16:30           ` Nir Soffer
2019-08-23 17:41             ` Max Reitz
2019-08-23 16:48           ` Nir Soffer
2019-08-23 17:53             ` Max Reitz
2019-08-24 22:57               ` Nir Soffer
2019-08-25  7:44 ` [Qemu-devel] [Qemu-block] " Maxim Levitsky
2019-08-25 19:51   ` Nir Soffer
2019-08-25 22:17     ` Maxim Levitsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAMRbyytxF8r9LoX4J_7ca2QPRtnpWgdTtyaKq=p=7ZaoMu-uug@mail.gmail.com' \
    --to=nsoffer@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=nirsof@gmail.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.