Re: [Qemu-devel] [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO

From: Eric Blake <eblake@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	"nbd@other.debian.org" <nbd@other.debian.org>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"qemu-block@nongnu.org" <qemu-block@nongnu.org>,
	"libguestfs@redhat.com" <libguestfs@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO
Date: Wed, 28 Aug 2019 08:04:56 -0500	[thread overview]
Message-ID: <810779d1-0289-d635-0446-93b3dd32ec95@redhat.com> (raw)
In-Reply-To: <274bc60d-f57d-2f97-4be9-8de1aabb0949@virtuozzo.com>

[-- Attachment #1.1: Type: text/plain, Size: 6448 bytes --]

On 8/28/19 4:57 AM, Vladimir Sementsov-Ogievskiy wrote:

>> Hence, it is desirable to have a way for clients to specify that a
>> particular write zero request is being attempted for a fast wipe, and
>> get an immediate failure if the zero request would otherwise take the
>> same time as a write.  Conversely, if the client is not performing a
>> pre-initialization pass, it is still more efficient in terms of
>> networking traffic to send NBD_CMD_WRITE_ZERO requests where the
>> server implements the fallback to the slower write, than it is for the
>> client to have to perform the fallback to send NBD_CMD_WRITE with a
>> zeroed buffer.
> 
> How are you going to finally use it in qemu-img convert?

It's already in use there (in fact, the cover letter shows actual timing
examples of how qemu-img's use of BDRV_REQ_NO_FALLBACK, which translates
to NBD_CMD_FLAG_FAST_ZERO, observably affects timing).

> Ok, we have a loop
> of sending write-zero requests. And on first ENOTSUP we'll assume that there
> is no benefit to continue? But what if actually server returns ENOTSUP only
> once when we have 1000 iterations? Seems we should still do zeroing if we
> have only a few ENOTSUPs...

When attempting a bulk zero, you try to wipe the entire device,
presumably with something that is large and aligned.  Yes, if you have
to split the write zero request due to size limitations, then you risk
that the first write zero succeeds but later ones fail, then you didn't
wipe the entire disk, but you also don't need to redo the work on the
first half of the image.  But it is much more likely that the first
write of the bulk zero is representative of the overall operation (and
so in practice, it only takes one fast zero attempt to learn if bulk
zeroing is worthwhile, then continue with fast zeroes without issues).

> 
> I understand that fail-on-first ENOTSUP is OK for raw-without-fallocte vs qcow2,
> as first will always return ENOTSUP and second will never fail.. But in such way
> we'll OK with simpler extension, which only have one server-advirtised negotiation
> flag NBD_FLAG_ZERO_IS_FAST.

Advertising that a server's zero behavior is always going to be
successfully fast is a much harder flag to implement.  The justification
for the semantics I chose (advertising that the server can quickly
report failure if success is not fast, but not requiring fast zero)
covers the case when the decision of whether a zero is fast may also
depend on other factors - for example, if the server knows the image
starts in an all-zero state, then it can track a boolean: all write zero
requests while the boolean is set return immediate success (nothing to
do), but after the first regular write, the boolean is set to false, and
all further write zero requests fail as being potentially slow; and such
an implementation is still useful for the qemu-img convert case.

> 
> There is not such problem if we have only one iteration, so may be new command
> FILL_ZERO, filling the whole device by zeros?

Or better yet, implement support for 64-bit commands.  Yes, my cover
letter called out further orthogonal extensions, and implementing 64-bit
zeroing (so that you can issue a write zero request over the entire
image in one command), as well as a way for a server to advertise when
the image begins life in an all-zero state, are also further extensions
coming down the pipeline.  But as not all servers have to implement all
of the extensions, each extension that can be orthogonally implemented
and show an improvement on its own is still worthwhile; and my cover
letter has shown that fast zeroes on their own make a measurable
difference to certain workloads.

>> +    If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
>> +    `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
>> +    with `NBD_ENOTSUP`, even if the operation is no faster than a
>> +    corresponding `NBD_CMD_WRITE`. Conversely, if
>> +    `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with
>> +    `NBD_ENOTSUP` unless the request can be serviced in less time than
>> +    a corresponding `NBD_CMD_WRITE`, and SHOULD NOT alter the contents
>> +    of the export when returning this failure. The server's
>> +    determination of a fast request MAY depend on a number of factors,
>> +    such as whether the request was suitably aligned, on whether the
>> +    `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on whether a
>> +    previous `NBD_CMD_TRIM` had been performed on the region.  If the
>> +    server did not advertise `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD
>> +    NOT fail with `NBD_ENOTSUP`, regardless of the speed of servicing
>> +    a request, and SHOULD fail with `NBD_EINVAL` if the
>> +    `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise
>> +    `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast
>> +    zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when
>> +    the flag is set if the server cannot quickly determine in advance
>> +    whether that request would have been fast, even if it turns out
>> +    that the same request without the flag would be fast after all.
>> +
> 
> What if WRITE_ZERO in the average is faster than WRITE (for example by 20%),
> but server never can guarantee performance for one WRITE_ZERO operation, do
> you restrict such case? Hmm, OK, SHOULD is not MUST actually..

I think my followup mail, based on Wouter's questions, covers this: the
goal is to document the use case of optimizing the copy of a sparse
image, by probing whether a bulk pre-zeroing pass is worthwhile.  That
should be the measuring rod - if the implementation can perform a faster
sparse copy because of write zeroes that are sometimes, but not always,
faster than writes, in spite of the duplicated I/O that happens to the
data portions of the image that were touched twice by the pre-zero pass
then the actual data pass, then succeeding on fast zero requests is
okay.  But if it makes the overall image copy slower, then failing with
ENOTSUP is probably better.  And at the end of the day, it is really
just a heuristic - if the server guessed wrong, the worst that happens
is slower performance (and not data corruption).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]