QEMU-Devel Archive on lore.kernel.org
 help / color / Atom feed
From: Eric Blake <eblake@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	"nbd@other.debian.org" <nbd@other.debian.org>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"qemu-block@nongnu.org" <qemu-block@nongnu.org>,
	"libguestfs@redhat.com" <libguestfs@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO
Date: Wed, 28 Aug 2019 08:04:56 -0500
Message-ID: <810779d1-0289-d635-0446-93b3dd32ec95@redhat.com> (raw)
In-Reply-To: <274bc60d-f57d-2f97-4be9-8de1aabb0949@virtuozzo.com>

[-- Attachment #1.1: Type: text/plain, Size: 6448 bytes --]

On 8/28/19 4:57 AM, Vladimir Sementsov-Ogievskiy wrote:

>> Hence, it is desirable to have a way for clients to specify that a
>> particular write zero request is being attempted for a fast wipe, and
>> get an immediate failure if the zero request would otherwise take the
>> same time as a write.  Conversely, if the client is not performing a
>> pre-initialization pass, it is still more efficient in terms of
>> networking traffic to send NBD_CMD_WRITE_ZERO requests where the
>> server implements the fallback to the slower write, than it is for the
>> client to have to perform the fallback to send NBD_CMD_WRITE with a
>> zeroed buffer.
> How are you going to finally use it in qemu-img convert?

It's already in use there (in fact, the cover letter shows actual timing
examples of how qemu-img's use of BDRV_REQ_NO_FALLBACK, which translates
to NBD_CMD_FLAG_FAST_ZERO, observably affects timing).

> Ok, we have a loop
> of sending write-zero requests. And on first ENOTSUP we'll assume that there
> is no benefit to continue? But what if actually server returns ENOTSUP only
> once when we have 1000 iterations? Seems we should still do zeroing if we
> have only a few ENOTSUPs...

When attempting a bulk zero, you try to wipe the entire device,
presumably with something that is large and aligned.  Yes, if you have
to split the write zero request due to size limitations, then you risk
that the first write zero succeeds but later ones fail, then you didn't
wipe the entire disk, but you also don't need to redo the work on the
first half of the image.  But it is much more likely that the first
write of the bulk zero is representative of the overall operation (and
so in practice, it only takes one fast zero attempt to learn if bulk
zeroing is worthwhile, then continue with fast zeroes without issues).

> I understand that fail-on-first ENOTSUP is OK for raw-without-fallocte vs qcow2,
> as first will always return ENOTSUP and second will never fail.. But in such way
> we'll OK with simpler extension, which only have one server-advirtised negotiation

Advertising that a server's zero behavior is always going to be
successfully fast is a much harder flag to implement.  The justification
for the semantics I chose (advertising that the server can quickly
report failure if success is not fast, but not requiring fast zero)
covers the case when the decision of whether a zero is fast may also
depend on other factors - for example, if the server knows the image
starts in an all-zero state, then it can track a boolean: all write zero
requests while the boolean is set return immediate success (nothing to
do), but after the first regular write, the boolean is set to false, and
all further write zero requests fail as being potentially slow; and such
an implementation is still useful for the qemu-img convert case.

> There is not such problem if we have only one iteration, so may be new command
> FILL_ZERO, filling the whole device by zeros?

Or better yet, implement support for 64-bit commands.  Yes, my cover
letter called out further orthogonal extensions, and implementing 64-bit
zeroing (so that you can issue a write zero request over the entire
image in one command), as well as a way for a server to advertise when
the image begins life in an all-zero state, are also further extensions
coming down the pipeline.  But as not all servers have to implement all
of the extensions, each extension that can be orthogonally implemented
and show an improvement on its own is still worthwhile; and my cover
letter has shown that fast zeroes on their own make a measurable
difference to certain workloads.

>> +    If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
>> +    `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
>> +    with `NBD_ENOTSUP`, even if the operation is no faster than a
>> +    corresponding `NBD_CMD_WRITE`. Conversely, if
>> +    `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with
>> +    `NBD_ENOTSUP` unless the request can be serviced in less time than
>> +    a corresponding `NBD_CMD_WRITE`, and SHOULD NOT alter the contents
>> +    of the export when returning this failure. The server's
>> +    determination of a fast request MAY depend on a number of factors,
>> +    such as whether the request was suitably aligned, on whether the
>> +    `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on whether a
>> +    previous `NBD_CMD_TRIM` had been performed on the region.  If the
>> +    server did not advertise `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD
>> +    NOT fail with `NBD_ENOTSUP`, regardless of the speed of servicing
>> +    a request, and SHOULD fail with `NBD_EINVAL` if the
>> +    `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise
>> +    `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast
>> +    zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when
>> +    the flag is set if the server cannot quickly determine in advance
>> +    whether that request would have been fast, even if it turns out
>> +    that the same request without the flag would be fast after all.
>> +
> What if WRITE_ZERO in the average is faster than WRITE (for example by 20%),
> but server never can guarantee performance for one WRITE_ZERO operation, do
> you restrict such case? Hmm, OK, SHOULD is not MUST actually..

I think my followup mail, based on Wouter's questions, covers this: the
goal is to document the use case of optimizing the copy of a sparse
image, by probing whether a bulk pre-zeroing pass is worthwhile.  That
should be the measuring rod - if the implementation can perform a faster
sparse copy because of write zeroes that are sometimes, but not always,
faster than writes, in spite of the duplicated I/O that happens to the
data portions of the image that were touched twice by the pre-zero pass
then the actual data pass, then succeeding on fast zero requests is
okay.  But if it makes the overall image copy slower, then failing with
ENOTSUP is probably better.  And at the end of the day, it is really
just a heuristic - if the server guessed wrong, the worst that happens
is slower performance (and not data corruption).

Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply index

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-23 14:30 [Qemu-devel] cross-project patches: Add NBD Fast Zero support Eric Blake
2019-08-23 14:34 ` [Qemu-devel] [PATCH 0/1] NBD protocol change to add fast zero support Eric Blake
2019-08-23 14:34   ` [Qemu-devel] [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO Eric Blake
2019-08-23 18:48     ` Wouter Verhelst
2019-08-23 18:58       ` Eric Blake
2019-08-24  6:44         ` Wouter Verhelst
2019-08-28  9:57     ` Vladimir Sementsov-Ogievskiy
2019-08-28 13:04       ` Eric Blake [this message]
2019-08-28 13:45         ` Vladimir Sementsov-Ogievskiy
2019-09-03 20:53   ` [Qemu-devel] [Libguestfs] [PATCH 0/1] NBD protocol change to add fast zero support Eric Blake
2019-08-23 14:37 ` [Qemu-devel] [PATCH 0/5] Add NBD fast zero support to qemu client and server Eric Blake
2019-08-23 14:37   ` [Qemu-devel] [PATCH 1/5] nbd: Improve per-export flag handling in server Eric Blake
2019-08-30 18:00     ` Vladimir Sementsov-Ogievskiy
2019-08-30 23:10       ` Eric Blake
2019-08-30 23:32         ` Eric Blake
2019-09-03 16:39           ` Eric Blake
2019-09-04 17:08     ` Vladimir Sementsov-Ogievskiy
2019-08-23 14:37   ` [Qemu-devel] [PATCH 2/5] nbd: Prepare for NBD_CMD_FLAG_FAST_ZERO Eric Blake
2019-08-30 18:07     ` Vladimir Sementsov-Ogievskiy
2019-08-30 23:37       ` Eric Blake
2019-08-31  8:11         ` Vladimir Sementsov-Ogievskiy
2019-09-03 18:49       ` Eric Blake
2019-08-31  8:20     ` Vladimir Sementsov-Ogievskiy
2019-08-23 14:37   ` [Qemu-devel] [PATCH 3/5] nbd: Implement client use of NBD FAST_ZERO Eric Blake
2019-08-30 18:11     ` Vladimir Sementsov-Ogievskiy
2019-08-23 14:37   ` [Qemu-devel] [PATCH 4/5] nbd: Implement server " Eric Blake
2019-08-30 18:40     ` Vladimir Sementsov-Ogievskiy
2019-08-23 14:37   ` [Qemu-devel] [PATCH 5/5] nbd: Tolerate more errors to structured reply request Eric Blake
2019-08-23 16:41     ` Eric Blake
2019-08-28 13:55   ` [Qemu-devel] [PATCH 0/5] Add NBD fast zero support to qemu client and server Vladimir Sementsov-Ogievskiy
2019-08-28 14:05     ` Eric Blake
2019-08-23 14:38 ` [Qemu-devel] [libnbd PATCH 0/1] libnbd support for new fast zero Eric Blake
2019-08-23 14:38   ` [Qemu-devel] [libnbd PATCH 1/1] api: Add support for FAST_ZERO flag Eric Blake
2019-08-27 12:25     ` [Qemu-devel] [Libguestfs] " Richard W.M. Jones
2019-08-23 14:40 ` [Qemu-devel] [nbdkit PATCH 0/3] nbdkit support for new NBD fast zero Eric Blake
2019-08-23 14:40   ` [Qemu-devel] [nbdkit PATCH 1/3] server: Add internal support for NBDKIT_FLAG_FAST_ZERO Eric Blake
2019-08-23 14:40   ` [Qemu-devel] [nbdkit PATCH 2/3] filters: Add .can_fast_zero hook Eric Blake
2019-08-23 14:40   ` [Qemu-devel] [nbdkit PATCH 3/3] plugins: " Eric Blake
2019-08-23 21:16     ` [Qemu-devel] [Libguestfs] " Eric Blake
2019-08-27 15:43     ` Richard W.M. Jones
2019-08-23 15:05 ` [Qemu-devel] cross-project patches: Add NBD Fast Zero support Vladimir Sementsov-Ogievskiy
2019-08-27 12:14 ` [Qemu-devel] [Libguestfs] " Richard W.M. Jones
2019-08-27 13:23   ` Eric Blake

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=810779d1-0289-d635-0446-93b3dd32ec95@redhat.com \
    --to=eblake@redhat.com \
    --cc=libguestfs@redhat.com \
    --cc=nbd@other.debian.org \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=vsementsov@virtuozzo.com \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

QEMU-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/qemu-devel/0 qemu-devel/git/0.git
	git clone --mirror https://lore.kernel.org/qemu-devel/1 qemu-devel/git/1.git
	git clone --mirror https://lore.kernel.org/qemu-devel/2 qemu-devel/git/2.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 qemu-devel qemu-devel/ https://lore.kernel.org/qemu-devel \
	public-inbox-index qemu-devel

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git