Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations

From: Richard Laager <rlaager@wiktel.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
Date: Sat, 10 Mar 2012 12:02:40 -0600
Message-ID: <1331402560.8577.46.camel@watermelon.coderich.net>
In-Reply-To: <4F5A46A1.4000508@redhat.com>

I believe your patch set provides these behaviors now:
      * QEMU block drivers report discard_granularity. 
              * discard_granularity = 0 means no discard 
                      * The guest is told there's no discard support.
              * discard_granularity < 0 is undefined.
              * discard_granularity > 0 is reported to the guest as
                discard support.
      * QEMU block drivers report discard_zeros_data.
                This is passed to the guest when discard_granularity >
                0.

I propose adding the following behaviors in any event:
      * If a QEMU block device reports a discard_granularity > 0, it
        must be equal to 2^n (n >= 0), or QEMU's block core will change
        it to 0. (Non-power-of-two granularities are not likely to exist
        in the real world, and this assumption greatly simplifies
        ensuring correctness.)
      * For SCSI, report an unmap_granularity to the guest as follows:
            max(logical_block_size, discard_granularity) / logical_block_size
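
The two proposed rules can be sketched in C as follows. This is only an
illustration: the function names are hypothetical and not part of QEMU,
and mapping negative values to 0 is an assumption, since that case is
undefined above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: keep a reported discard granularity only if it is
 * a power of two (2^n, n >= 0); otherwise fall back to 0 (no discard).
 * Assumption: negative or oversized values are also mapped to 0.
 */
static uint32_t normalize_discard_granularity(int64_t granularity)
{
    if (granularity <= 0 || granularity > UINT32_MAX) {
        return 0;
    }
    if (granularity & (granularity - 1)) {
        return 0;   /* more than one bit set: not a power of two */
    }
    return (uint32_t)granularity;
}

/*
 * Hypothetical helper: unmap granularity in logical blocks, computed as
 * max(logical_block_size, discard_granularity) / logical_block_size.
 */
static uint32_t scsi_unmap_granularity(uint32_t logical_block_size,
                                       uint32_t discard_granularity)
{
    assert(logical_block_size > 0);
    uint32_t bytes = discard_granularity > logical_block_size
                     ? discard_granularity : logical_block_size;
    return bytes / logical_block_size;
}
```

With 512-byte logical blocks, a 4096-byte discard granularity would be
reported as 8 blocks, and a 1-byte granularity (e.g. a file) as 1 block.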

Regarding emulating discard_zeros_data...

I agree that when discard_zeros_data is set, we will need to write
zeroes in some cases. As you noted, IDE has a fixed granularity of one
sector. And the SCSI granularity is a hint only; guests are not
guaranteed to align to that value either. [0]

As a design concept, rather than guaranteeing that 512B zeroing discards
are supported, I think the QEMU block layer should guarantee aligned
discards to QEMU block devices, emulating any misaligned discards (or
portions thereof) by writing zeroes if (and only if) discard_zeros_data
is set. When the QEMU block layer gets a discard:
      * Of the specified discard range, see if it includes an aligned
        multiple of discard granularity. If so, save that as the
        starting point of a subrange. Then find the last aligned
        multiple, if any, and pass that subrange (if start != end) down
        to the block driver's discard function.
      * If the discard really fails (i.e. returns failure and sets errno
        to something other than "not supported" or equivalent), return
        failure to the guest. For "not supported", fall through to the
        code below with the full range.
      * At this point, we have zero, one, or two subranges to handle.
      * If and only if discard_zeros_data is set, write zeros to the
        remaining subranges, if any. (This would use a lower-level
        write_zeroes call which does not attempt to use discard.) If
        this fails, return failure to the guest.
      * Return success.
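
The range-splitting step above could look roughly like this in C. It is
a sketch under stated assumptions, not QEMU code: the type and function
names are hypothetical, ranges are half-open byte ranges, the
granularity is a nonzero power of two, and offset + len does not
overflow. It covers only the split; the error handling and the
zero-writing fallback are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [start, end); empty when start == end. */
typedef struct {
    uint64_t start;
    uint64_t end;
} ByteRange;

typedef struct {
    ByteRange aligned;  /* would go to the driver's discard function */
    ByteRange head;     /* misaligned leading part, if any */
    ByteRange tail;     /* misaligned trailing part, if any */
} DiscardSplit;

/* Hypothetical helper: split a discard request at granularity boundaries. */
static DiscardSplit split_discard(uint64_t offset, uint64_t len,
                                  uint64_t granularity)
{
    assert(granularity > 0 && (granularity & (granularity - 1)) == 0);

    uint64_t end = offset + len;
    uint64_t mask = granularity - 1;
    uint64_t first = (offset + mask) & ~mask;   /* round start up */
    uint64_t last = end & ~mask;                /* round end down */
    DiscardSplit s;

    if (first >= last) {
        /* No aligned portion: the whole request is one leftover range. */
        s.aligned = (ByteRange){ offset, offset };
        s.head = (ByteRange){ offset, end };
        s.tail = (ByteRange){ end, end };
    } else {
        s.aligned = (ByteRange){ first, last };
        s.head = (ByteRange){ offset, first };
        s.tail = (ByteRange){ last, end };
    }
    return s;
}
```

The head and tail are the "zero, one, or two subranges" that would be
zeroed only when discard_zeros_data is set.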

This leaves one remaining issue: In raw-posix.c, for files (i.e. not
devices), I assume you're going to advertise discard_granularity=1 and
discard_zeros_data=1 when compiled with support for
fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually
guarantees that it zeros the data when punching holes. I haven't
verified this.

If the guest does a big discard (think mkfs) and fallocate() returns
EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
which, as you noted, will also allocate it (unless you explicitly check
for holes). This is bad. It can be avoided by not advertising
discard_zeros_data, but as you noted, that's unfortunate.

If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
going to work. This would sidestep these problems. You said it wasn't
possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
by extending the file by one byte and then punching that:
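        /* Probe: append one byte, try to punch it, then restore the size. */
        char buf = 0;
        fstat(s->fd, &st);
        /* Writing at offset st_size extends the file by exactly one byte. */
        pwrite(s->fd, &buf, 1, st.st_size);
        has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                 st.st_size, 1);
        ftruncate(s->fd, st.st_size);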

[0] See the last paragraph starting on page 8:
    http://mkp.net/pubs/linux-advanced-storage.pdf

-- 
Richard
