This is a cover letter to a series of patches being proposed in tandem
to four different projects:
- nbd: Document a new NBD_CMD_FLAG_FAST_ZERO command flag
- qemu: Implement the flag for both clients and server
- libnbd: Implement the flag for clients
- nbdkit: Implement the flag for servers, including the nbd passthrough
client

If you want to test the patches together, I've pushed a 'fast-zero'
branch to each of:
https://repo.or.cz/nbd/ericb.git/shortlog/refs/heads/fast-zero
https://repo.or.cz/qemu/ericb.git/shortlog/refs/heads/fast-zero
https://repo.or.cz/libnbd/ericb.git/shortlog/refs/heads/fast-zero
https://repo.or.cz/nbdkit/ericb.git/shortlog/refs/heads/fast-zero


I've run several tests to demonstrate why this is useful, as well as
prove that because I have multiple interoperable projects, it is worth
including in the NBD standard.  The original proposal was here:
https://lists.debian.org/nbd/2019/03/msg00004.html
where I stated:

> I will not push this without both:
> - a positive review (for example, we may decide that burning another
> NBD_FLAG_* is undesirable, and that we should instead have some sort
> of NBD_OPT_ handshake for determining when the server supports
> NBD_CMF_FLAG_FAST_ZERO)
> - a reference client and server implementation (probably both via qemu,
> since it was qemu that raised the problem in the first place)

Consensus on that thread seemed to be that a new NBD_FLAG was okay; and
this thread solves the second bullet of having reference implementations.

Here's what I did for testing full-path interoperability:

nbdkit memory -> qemu-nbd -> nbdkit nbd -> nbdsh

$ nbdkit -p 10810 --filter=nozero --filter=delay memory 1m delay-write=3
zeromode=emulate
$ qemu-nbd -p 10811 -f raw nbd://localhost:10810
$ nbdkit -p 10812 nbd nbd://localhost:10811
$ time nbdsh --connect nbd://localhost:10812 -c 'buf = h.zero(512, 0)'
# takes more than 3 seconds, but succeeds
$ time nbdsh --connect nbd://localhost:10812 -c 'buf = h.zero(512, 0,
nbd.CMD_FLAG_FAST_ZERO)'
# takes less than 1 second to fail with ENOTSUP

And here's some demonstrations on why the feature matters, starting with
this qemu thread as justification:
https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html

First, I had to create a scenario where falling back to writes is
noticeably slower than performing a zero operation, and where
pre-zeroing also shows an effect.  My choice: let's test 'qemu-img
convert' on an image that is half-sparse (every other megabyte is a
hole) to an in-memory nbd destination.  Then I use a series of nbdkit
filters to force the destination to behave in various manners:
 log logfile=>(sed ...|uniq -c) (track how many normal/fast zero
requests the client makes)
 nozero $params (fine-tune how zero requests behave - the parameters
zeromode and fastzeromode are the real drivers of my various tests)
 blocksize maxdata=256k (allows large zero requests, but forces large
writes into smaller chunks, to magnify the effects of write delays and
allow testing to provide obvious results with a smaller image)
 delay delay-write=20ms delay-zero=5ms (also to magnify the effects on a
smaller image, with writes penalized more than zeroing)
 stats statsfile=/dev/stderr (to track overall time and a decent summary
of how much I/O occurred).
 noextents (forces the entire image to report that it is allocated,
which eliminates any testing variability based on whether qemu-img uses
that to bypass a zeroing operation [1])

So here's my one-time setup, followed by repetitions of the nbdkit
command with different parameters to the nozero filter to explore
different behaviors.

$ qemu-img create -f qcow2 src 100m
$ for i in `seq 0 2 99`; do qemu-io -f qcow2 -c "w ${i}m 1m" src; done
$ nbdkit -U - --filter=log --filter=nozero --filter=blocksize \
  --filter=delay --filter=stats --filter=noextents memory 100m \
  logfile=>(sed -n '/Zero.*\.\./ s/.*\(fast=.\).*/\1/p' |sort|uniq -c) \
  statsfile=/dev/stderr delay-write=20ms delay-zero=5s maxdata=256k \
  --run 'qemu-img convert -n -f qcow2 -O raw src $nbd' $params

Establish a baseline: when qemu-img does not see write zero support at
all (such as when talking to /dev/nbd0, because the kernel NBD
implementation still does not support write zeroes), qemu is forced to
write the entire disk, including the holes, but doesn't waste any time
pre-zeroing or checking block status for whether the disk is zero (the
default of the nozero filter is to turn off write zero advertisement):

params=
elapsed time: 8.54488 s
write: 400 ops, 104857600 bytes, 9.81712e+07 bits/s

Next, let's emulate what qemu 3.1 was like, with a blind pre-zeroing
pass of the entire image without regards to whether that pass is fast or
slow.  For this test, it was easier to use modern qemu and merely ignore
the fast zero bit in nbdkit, but the numbers should be similar when
actually using older qemu.  If qemu guessed right that pre-zeroing is
fast, we see:

params='zeromode=plugin fastzeromode=ignore'
elapsed time: 4.30183 s
write: 200 ops, 52428800 bytes, 9.75005e+07 bits/s
zero: 4 ops, 104857600 bytes, 1.95001e+08 bits/s
      4 fast=1

which is definite win - instead of having to write the half of the image
that was zero on the source, the fast pre-zeroing pass already cleared
it (qemu-img currently breaks write zeroes into 32M chunks [1], and thus
requires 4 zero requests to pre-zero the image).  But if qemu guesses wrong:

params='zeromode=emulate fastzeromode=ignore'
elapsed time: 12.5065 s
write: 600 ops, 157286400 bytes, 1.00611e+08 bits/s
      4 fast=1

Ouch - that is actually slower than the case when zeroing is not used at
all, because the zeroes turned into writes result in performing double
the I/O over the data portions of the file (once during the pre-zero
pass, then again during the data).  The qemu 3.1 behavior is very
bi-polar in nature, and we don't like that.

So qemu 4.0 introduced BDRV_REQ_NO_FALLBACK, which qemu uses during the
pre-zero request to fail quickly if pre-zeroing is not viable. At the
time, NBD did not have a way to support fast zero requests, so qemu
blindly assumes that pre-zeroing is not viable over NBD:

params='zeromode=emulate fastzeromode=none'
elapsed time: 8.32433 s
write: 400 ops, 104857600 bytes, 1.00772e+08 bits/s
     50 fast=0

When zeroing is slow, our time actually beats the baseline by about 0.2
seconds (although zeroing still turned into writes, the use of zero
requests results in less network traffic; you also see that there are 50
zero requests, one per hole, rather than 4 requests for pre-zeroing the
image).  So we've avoided the pre-zeroing penalty.  However:

params='zeromode=plugin fastzeromode=none'
elapsed time: 4.53951 s
write: 200 ops, 52428800 bytes, 9.23955e+07 bits/s
zero: 50 ops, 52428800 bytes, 9.23955e+07 bits/s
     50 fast=0

when zeroing is fast, we're still 0.2 seconds slower than the
pre-zeroing behavior (zeroing runs fast, but one request per hole is
still more transactions than pre-zeroing used to use).  The qemu 4.0
decision thus regained the worst degradation seen in 3.1 when zeroing is
slow, but at a penalty to the case when zeroing is fast.

Since guessing is never as nice as knowing, let's repeat the test, but
now exploiting the new NBD fast zero:

params='zeromode=emulate'
elapsed time: 8.41174 s
write: 400 ops, 104857600 bytes, 9.9725e+07 bits/s
     50 fast=0
      1 fast=1

Good: when zeroes are not fast, qemu-img's initial fast-zero request
immediately fails, and then it switches back to writing the entire image
using regular zeroing for the holes; performance is comparable to the
baseline and to the qemu 4.0 behavior.

params='zeromode=plugin'
elapsed time: 4.31356 s
write: 200 ops, 52428800 bytes, 9.72354e+07 bits/s
zero: 4 ops, 104857600 bytes, 1.94471e+08 bits/s
      4 fast=1

Good: when zeroes are fast, qemu-img is able to use pre-zeroing on the
entire image, resulting in fewer zero transactions overall, getting us
back to the qemu 3.1 maximum performance (and better than the 4.0 behavior).

I hope you enjoyed reading this far, and agree with my interpretation of
the numbers about why this feature is useful!



[1] Orthogonal to these patches are other ideas I have for improving the
NBD protocol in its effects to qemu-img convert, which will result in
later cross-project patches:
- NBD should have a way to advertise (probably via NBD_INFO_ during
NBD_OPT_GO) if the initial image is known to begin life with all zeroes
(if that is the case, qemu-img can skip the extents calls and
pre-zeroing pass altogether)
- improving support to allow NBD to pass larger zero requests (qemu is
currently capping zero requests at 32m based on NBD_INFO_BLOCK_SIZE, but
could easily go up to ~4G with proper info advertisement of maximum zero
request sizing, or if we introduce 64-bit commands to the NBD protocol)

Given that NBD extensions need not be present in every server, each
orthogonal improvement should be tested in isolation to show that it
helps, even though qemu-img will probably use all of the extensions at
once when the server supports all of them.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org