From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([208.118.235.92]:50652) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1S6QdF-0007kN-MN for qemu-devel@nongnu.org; Sat, 10 Mar 2012 13:03:07 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1S6Qcv-0005de-4D for qemu-devel@nongnu.org; Sat, 10 Mar 2012 13:03:05 -0500 Received: from spam1.wiktel.com ([69.89.207.151]:53450) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1S6Qcu-0005dK-UW for qemu-devel@nongnu.org; Sat, 10 Mar 2012 13:02:45 -0500 From: Richard Laager In-Reply-To: <4F5A46A1.4000508@redhat.com> References: <1331226917-6658-1-git-send-email-pbonzini@redhat.com> <1331226917-6658-7-git-send-email-pbonzini@redhat.com> <4F5A31B2.3050701@redhat.com> <4F5A46A1.4000508@redhat.com> Content-Type: multipart/signed; micalg="pgp-sha1"; protocol="application/pgp-signature"; boundary="=-9eSfTKWNkbSBBd6Vbap4" Date: Sat, 10 Mar 2012 12:02:40 -0600 Message-ID: <1331402560.8577.46.camel@watermelon.coderich.net> Mime-Version: 1.0 Subject: Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Paolo Bonzini Cc: Kevin Wolf , qemu-devel@nongnu.org --=-9eSfTKWNkbSBBd6Vbap4 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable I'm believe your patch set provides these behaviors now: * QEMU block drivers report discard_granularity.=20 * discard_granularity =3D 0 means no discard=20 * The guest is told there's no discard support. * discard_granularity < 0 is undefined. discard_granularity > 0 is reported to the guest as discard support. * QEMU block drivers report discard_zeros_data. This is passed to the guest when discard_granularity > 0. I propose adding the following behaviors in any event: * If a QEMU block device reports a discard_granularity > 0, it must be equal to 2^n (n >=3D 0), or QEMU's block core will change it to 0. (Non-power-of-two granularities are not likely to exist in the real world, and this assumption greatly simplifies ensuring correctness.) * For SCSI, report an unmap_granularity to the guest as follows: max(logical_block_size, discard_granularity) / logical_block_size Regarding emulating discard_zeros_data... I agree that when discard_zeros_data is set, we will need to write zeroes in some cases. As you noted, IDE has a fixed granularity of one sector. And the SCSI granularity is a hint only; guests are not guaranteed to align to that value either. [0] As a design concept, instead of guaranteeing that 512B zero'ing discards are supported, I think the QEMU block layer should instead guarantee aligned discards to QEMU block devices, emulating any misaligned discards (or portions thereof) by writing zeroes if (and only if) discard_zeros_data is set. When the QEMU block layer gets a discard: * Of the specified discard range, see if it includes an aligned multiple of discard granularity. If so, save that as the starting point of a subrange. Then find the last aligned multiple, if any, and pass that subrange (if start !=3D end) down to the block driver's discard function. * If the discard really fails (i.e. returns failure and sets errno to something other than "not supported" or equivalent), return failure to the guest. For "not supported", fall through to the code below with the full range. * At this point, we have zero, one, or two subranges to handle. * If and only if discard_zeros_data is set, write zeros to the remaining subranges, if any. (This would use a lower-level write_zeroes call which does not attempt to use discard.) If this fails, return failure to the guest. * Return success. This leaves one remaining issue: In raw-posix.c, for files (i.e. not devices), I assume you're going to advertise discard_granularity=3D1 and discard_zeros_data=3D1 when compiled with support for fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually guarantees that it zeros the data when punching holes. I haven't verified this. If the guest does a big discard (think mkfs) and fallocate() returns EOPNOTSUPP, you'll have to zero essentially the whole virtual disk, which, as you noted, will also allocate it (unless you explicitly check for holes). This is bad. It can be avoided by not advertising discard_zeros_data, but as you noted, that's unfortunate. If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not going to work. This would side step these problems. You said it wasn't possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing by extending the file by one byte and then punching that: char buf =3D 0; fstat(s->fd, &st); pwrite(s->fd, &buf, 1, st.st_size + 1); has_discard =3D !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_= KEEP_SIZE, st.st_size + 1, 1); ftruncate(s->fd, st.st_size); [0] See the last paragraph starting on page 8: http://mkp.net/pubs/linux-advanced-storage.pdf --=20 Richard --=-9eSfTKWNkbSBBd6Vbap4 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iEYEABECAAYFAk9blzoACgkQbfU6uV4fG85SogCeLJJVEpSMu0BZ5IRvGp57HQzf 66AAoL/sd3IOs2EwXmXrjntykIdRa8ei =QTOC -----END PGP SIGNATURE----- --=-9eSfTKWNkbSBBd6Vbap4--