All of lore.kernel.org
 help / color / mirror / Atom feed
From: Richard Laager <rlaager@wiktel.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
Date: Sat, 10 Mar 2012 12:02:40 -0600	[thread overview]
Message-ID: <1331402560.8577.46.camel@watermelon.coderich.net> (raw)
In-Reply-To: <4F5A46A1.4000508@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4099 bytes --]

I'm believe your patch set provides these behaviors now:
      * QEMU block drivers report discard_granularity. 
              * discard_granularity = 0 means no discard 
                      * The guest is told there's no discard support.
              * discard_granularity < 0 is undefined.
                discard_granularity > 0 is reported to the guest as
                discard support.
      * QEMU block drivers report discard_zeros_data.
                This is passed to the guest when discard_granularity >
                0.

I propose adding the following behaviors in any event:
      * If a QEMU block device reports a discard_granularity > 0, it
        must be equal to 2^n (n >= 0), or QEMU's block core will change
        it to 0. (Non-power-of-two granularities are not likely to exist
        in the real world, and this assumption greatly simplifies
        ensuring correctness.)
      * For SCSI, report an unmap_granularity to the guest as follows:
      max(logical_block_size, discard_granularity) / logical_block_size


Regarding emulating discard_zeros_data...

I agree that when discard_zeros_data is set, we will need to write
zeroes in some cases. As you noted, IDE has a fixed granularity of one
sector. And the SCSI granularity is a hint only; guests are not
guaranteed to align to that value either. [0]

As a design concept, instead of guaranteeing that 512B zero'ing discards
are supported, I think the QEMU block layer should instead guarantee
aligned discards to QEMU block devices, emulating any misaligned
discards (or portions thereof) by writing zeroes if (and only if)
discard_zeros_data is set. When the QEMU block layer gets a discard:
      * Of the specified discard range, see if it includes an aligned
        multiple of discard granularity. If so, save that as the
        starting point of a subrange. Then find the last aligned
        multiple, if any, and pass that subrange (if start != end) down
        to the block driver's discard function.
      * If the discard really fails (i.e. returns failure and sets errno
        to something other than "not supported" or equivalent), return
        failure to the guest. For "not supported", fall through to the
        code below with the full range.
      * At this point, we have zero, one, or two subranges to handle.
      * If and only if discard_zeros_data is set, write zeros to the
        remaining subranges, if any. (This would use a lower-level
        write_zeroes call which does not attempt to use discard.) If
        this fails, return failure to the guest.
      * Return success.

This leaves one remaining issue: In raw-posix.c, for files (i.e. not
devices), I assume you're going to advertise discard_granularity=1 and
discard_zeros_data=1 when compiled with support for
fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually
guarantees that it zeros the data when punching holes. I haven't
verified this.

If the guest does a big discard (think mkfs) and fallocate() returns
EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
which, as you noted, will also allocate it (unless you explicitly check
for holes). This is bad. It can be avoided by not advertising
discard_zeros_data, but as you noted, that's unfortunate.

If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
going to work. This would side step these problems. You said it wasn't
possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
by extending the file by one byte and then punching that:
        char buf = 0;
        fstat(s->fd, &st);
        pwrite(s->fd, &buf, 1, st.st_size + 1);
        has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                 st.st_size + 1, 1);
        ftruncate(s->fd, st.st_size);


[0] See the last paragraph starting on page 8:
    http://mkp.net/pubs/linux-advanced-storage.pdf

-- 
Richard


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

  reply	other threads:[~2012-03-10 18:03 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 01/17] qemu-iotests: add a simple test for write_zeroes Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 02/17] qed: make write-zeroes bounce buffer smaller than a single cluster Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 03/17] block: add discard properties to BlockDriverInfo Paolo Bonzini
2012-03-09 16:47   ` Kevin Wolf
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 04/17] qed: implement bdrv_aio_discard Paolo Bonzini
2012-03-09 16:31   ` Kevin Wolf
2012-03-09 17:53     ` Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 05/17] block: pass around qiov for write_zeroes operation Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations Paolo Bonzini
2012-03-09 16:37   ` Kevin Wolf
2012-03-09 18:06     ` Paolo Bonzini
2012-03-10 18:02       ` Richard Laager [this message]
2012-03-12 12:27         ` Paolo Bonzini
2012-03-12 13:04           ` Kevin Wolf
2012-03-13 19:13           ` Richard Laager
2012-03-14  7:41             ` Paolo Bonzini
2012-03-14 12:01               ` Kevin Wolf
2012-03-14 12:14                 ` Paolo Bonzini
2012-03-14 12:37                   ` Kevin Wolf
2012-03-14 12:49                     ` Paolo Bonzini
2012-03-14 13:04                       ` Kevin Wolf
2012-03-24 15:33                       ` Christoph Hellwig
2012-03-24 15:30                   ` Christoph Hellwig
2012-03-26 19:40                     ` Richard Laager
2012-03-27 10:20                       ` Kevin Wolf
2012-03-24 15:29                 ` Christoph Hellwig
2012-03-26  9:44                   ` Daniel P. Berrange
2012-03-26  9:56                     ` Christoph Hellwig
2012-03-15  0:42               ` Richard Laager
2012-03-15  9:36                 ` Paolo Bonzini
2012-03-16  0:47                   ` Richard Laager
2012-03-16  9:34                     ` Paolo Bonzini
2012-03-24 15:27         ` Christoph Hellwig
2012-03-26 19:40           ` Richard Laager
2012-03-27  9:08             ` Christoph Hellwig
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero Paolo Bonzini
2012-03-08 17:55   ` Avi Kivity
2012-03-09 16:42     ` Kevin Wolf
2012-03-12 10:42       ` Avi Kivity
2012-03-12 11:04         ` Kevin Wolf
2012-03-12 12:03           ` Avi Kivity
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 08/17] block: kill the write zeroes operation Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 09/17] ide: issue discard asynchronously but serialize the pieces Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 10/17] ide/scsi: add discard_zeroes_data property Paolo Bonzini
2012-03-08 18:13   ` Avi Kivity
2012-03-08 18:14     ` Avi Kivity
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 11/17] ide/scsi: prepare for flipping the discard defaults Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 12/17] ide/scsi: turn on discard Paolo Bonzini
2012-03-08 18:17   ` Avi Kivity
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 13/17] block: fallback from discard to writes Paolo Bonzini
2012-03-24 15:35   ` Christoph Hellwig
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming Paolo Bonzini
2012-03-09  8:20   ` Chris Wedgwood
2012-03-09  8:31     ` Paolo Bonzini
2012-03-09  8:35       ` Chris Wedgwood
2012-03-09  8:40         ` Paolo Bonzini
2012-03-09 10:31   ` Stefan Hajnoczi
2012-03-09 10:43     ` Paolo Bonzini
2012-03-09 10:53       ` Stefan Hajnoczi
2012-03-09 10:57         ` Paolo Bonzini
2012-03-09 20:36   ` Richard Laager
2012-03-12  9:34     ` Paolo Bonzini
2012-03-24 15:40     ` Christoph Hellwig
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 15/17] raw: add get_info Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 16/17] qemu-io: fix the alloc command Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 17/17] raw: implement is_allocated Paolo Bonzini
2012-03-24 15:42   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1331402560.8577.46.camel@watermelon.coderich.net \
    --to=rlaager@wiktel.com \
    --cc=kwolf@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.