Re: [Qemu-devel] [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO - Vladimir Sementsov-Ogievskiy

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Eric Blake <eblake@redhat.com>,
	"nbd@other.debian.org" <nbd@other.debian.org>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"qemu-block@nongnu.org" <qemu-block@nongnu.org>,
	"libguestfs@redhat.com" <libguestfs@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO
Date: Wed, 28 Aug 2019 13:45:08 +0000	[thread overview]
Message-ID: <3b8fd5cb-4c9d-801e-61e0-3a620e758cb7@virtuozzo.com> (raw)
In-Reply-To: <810779d1-0289-d635-0446-93b3dd32ec95@redhat.com>

28.08.2019 16:04, Eric Blake wrote:
> On 8/28/19 4:57 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>> Hence, it is desirable to have a way for clients to specify that a
>>> particular write zero request is being attempted for a fast wipe, and
>>> get an immediate failure if the zero request would otherwise take the
>>> same time as a write.  Conversely, if the client is not performing a
>>> pre-initialization pass, it is still more efficient in terms of
>>> networking traffic to send NBD_CMD_WRITE_ZERO requests where the
>>> server implements the fallback to the slower write, than it is for the
>>> client to have to perform the fallback to send NBD_CMD_WRITE with a
>>> zeroed buffer.
>>
>> How are you going to finally use it in qemu-img convert?
> 
> It's already in use there (in fact, the cover letter shows actual timing
> examples of how qemu-img's use of BDRV_REQ_NO_FALLBACK, which translates
> to NBD_CMD_FLAG_FAST_ZERO, observably affects timing).
> 
>> Ok, we have a loop
>> of sending write-zero requests. And on first ENOTSUP we'll assume that there
>> is no benefit to continue? But what if actually server returns ENOTSUP only
>> once when we have 1000 iterations? Seems we should still do zeroing if we
>> have only a few ENOTSUPs...
> 
> When attempting a bulk zero, you try to wipe the entire device,
> presumably with something that is large and aligned.  Yes, if you have
> to split the write zero request due to size limitations, then you risk
> that the first write zero succeeds but later ones fail, then you didn't
> wipe the entire disk, but you also don't need to redo the work on the
> first half of the image.  But it is much more likely that the first
> write of the bulk zero is representative of the overall operation (and
> so in practice, it only takes one fast zero attempt to learn if bulk
> zeroing is worthwhile, then continue with fast zeroes without issues).
> 
>>
>> I understand that fail-on-first ENOTSUP is OK for raw-without-fallocte vs qcow2,
>> as first will always return ENOTSUP and second will never fail.. But in such way
>> we'll OK with simpler extension, which only have one server-advirtised negotiation
>> flag NBD_FLAG_ZERO_IS_FAST.
> 
> Advertising that a server's zero behavior is always going to be
> successfully fast is a much harder flag to implement.  The justification
> for the semantics I chose (advertising that the server can quickly
> report failure if success is not fast, but not requiring fast zero)
> covers the case when the decision of whether a zero is fast may also
> depend on other factors - for example, if the server knows the image
> starts in an all-zero state, then it can track a boolean: all write zero
> requests while the boolean is set return immediate success (nothing to
> do), but after the first regular write, the boolean is set to false, and
> all further write zero requests fail as being potentially slow; and such
> an implementation is still useful for the qemu-img convert case.

Agreed, thanks for this example)

> 
>>
>> There is not such problem if we have only one iteration, so may be new command
>> FILL_ZERO, filling the whole device by zeros?
> 
> Or better yet, implement support for 64-bit commands.  Yes, my cover
> letter called out further orthogonal extensions, and implementing 64-bit
> zeroing (so that you can issue a write zero request over the entire
> image in one command), as well as a way for a server to advertise when
> the image begins life in an all-zero state, are also further extensions
> coming down the pipeline.  But as not all servers have to implement all
> of the extensions, each extension that can be orthogonally implemented
> and show an improvement on its own is still worthwhile; and my cover
> letter has shown that fast zeroes on their own make a measurable
> difference to certain workloads.
> 
>>> +    If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
>>> +    `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
>>> +    with `NBD_ENOTSUP`, even if the operation is no faster than a
>>> +    corresponding `NBD_CMD_WRITE`. Conversely, if
>>> +    `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with
>>> +    `NBD_ENOTSUP` unless the request can be serviced in less time than
>>> +    a corresponding `NBD_CMD_WRITE`, and SHOULD NOT alter the contents
>>> +    of the export when returning this failure. The server's
>>> +    determination of a fast request MAY depend on a number of factors,
>>> +    such as whether the request was suitably aligned, on whether the
>>> +    `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on whether a
>>> +    previous `NBD_CMD_TRIM` had been performed on the region.  If the
>>> +    server did not advertise `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD
>>> +    NOT fail with `NBD_ENOTSUP`, regardless of the speed of servicing
>>> +    a request, and SHOULD fail with `NBD_EINVAL` if the
>>> +    `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise
>>> +    `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast
>>> +    zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when
>>> +    the flag is set if the server cannot quickly determine in advance
>>> +    whether that request would have been fast, even if it turns out
>>> +    that the same request without the flag would be fast after all.
>>> +
>>
>> What if WRITE_ZERO in the average is faster than WRITE (for example by 20%),
>> but server never can guarantee performance for one WRITE_ZERO operation, do
>> you restrict such case? Hmm, OK, SHOULD is not MUST actually..
> 
> I think my followup mail, based on Wouter's questions, covers this: the

Hmm, OK, sorry for duplication

> goal is to document the use case of optimizing the copy of a sparse
> image, by probing whether a bulk pre-zeroing pass is worthwhile.  That
> should be the measuring rod - if the implementation can perform a faster
> sparse copy because of write zeroes that are sometimes, but not always,
> faster than writes, in spite of the duplicated I/O that happens to the
> data portions of the image that were touched twice by the pre-zero pass
> then the actual data pass, then succeeding on fast zero requests is
> okay.  But if it makes the overall image copy slower, then failing with
> ENOTSUP is probably better.  And at the end of the day, it is really
> just a heuristic - if the server guessed wrong, the worst that happens
> is slower performance (and not data corruption).
> 

OK, I understand, than it's all sounds good

-- 
Best regards,
Vladimir