This is a cover letter to a series of patches being proposed in tandem
to four different projects:
- nbd: Document a new NBD_CMD_FLAG_FAST_ZERO command flag
- qemu: Implement the flag for both clients and server
- libnbd: Implement the flag for clients
- nbdkit: Implement the flag for servers, including the nbd passthrough
  client

If you want to test the patches together, I've pushed a 'fast-zero'
branch to each of:
https://repo.or.cz/nbd/ericb.git/shortlog/refs/heads/fast-zero
https://repo.or.cz/qemu/ericb.git/shortlog/refs/heads/fast-zero
https://repo.or.cz/libnbd/ericb.git/shortlog/refs/heads/fast-zero
https://repo.or.cz/nbdkit/ericb.git/shortlog/refs/heads/fast-zero

I've run several tests to demonstrate why this is useful, and to show
that, with multiple interoperable implementations in hand, it is worth
including in the NBD standard. The original proposal was here:
https://lists.debian.org/nbd/2019/03/msg00004.html
where I stated:

> I will not push this without both:
> - a positive review (for example, we may decide that burning another
> NBD_FLAG_* is undesirable, and that we should instead have some sort
> of NBD_OPT_ handshake for determining when the server supports
> NBD_CMF_FLAG_FAST_ZERO)
> - a reference client and server implementation (probably both via qemu,
> since it was qemu that raised the problem in the first place)

Consensus on that thread seemed to be that a new NBD_FLAG was okay; and
this thread solves the second bullet of having reference implementations.

Here's what I did for testing full-path interoperability:

nbdkit memory -> qemu-nbd -> nbdkit nbd -> nbdsh

$ nbdkit -p 10810 --filter=nozero --filter=delay memory 1m \
  delay-write=3 zeromode=emulate
$ qemu-nbd -p 10811 -f raw nbd://localhost:10810
$ nbdkit -p 10812 nbd nbd://localhost:10811
$ time nbdsh --connect nbd://localhost:10812 -c 'buf = h.zero(512, 0)'
# takes more than 3 seconds, but succeeds
$ time nbdsh --connect nbd://localhost:10812 \
  -c 'buf = h.zero(512, 0, nbd.CMD_FLAG_FAST_ZERO)'
# takes less than 1 second to fail with ENOTSUP

And here are some demonstrations of why the feature matters, starting
with this qemu thread as justification:
https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html

First, I had to create a scenario where falling back to writes is
noticeably slower than performing a zero operation, and where
pre-zeroing also shows an effect. My choice: let's test 'qemu-img
convert' on an image that is half-sparse (every other megabyte is a
hole) to an in-memory nbd destination. Then I use a series of nbdkit
filters to force the destination to behave in various manners:

log logfile=>(sed ...|uniq -c)
  (track how many normal/fast zero requests the client makes)
nozero $params
  (fine-tune how zero requests behave - the parameters zeromode and
  fastzeromode are the real drivers of my various tests)
blocksize maxdata=256k
  (allows large zero requests, but forces large writes into smaller
  chunks, to magnify the effects of write delays and allow testing to
  provide obvious results with a smaller image)
delay delay-write=20ms delay-zero=5ms
  (also to magnify the effects on a smaller image, with writes
  penalized more than zeroing)
stats statsfile=/dev/stderr
  (to track overall time and a decent summary of how much I/O occurred)
noextents
  (forces the entire image to report that it is allocated, which
  eliminates any testing variability based on whether qemu-img uses
  that to bypass a zeroing operation [1])

So here's my one-time setup, followed by repetitions of the nbdkit
command with different parameters to the nozero filter to explore
different behaviors.

$ qemu-img create -f qcow2 src 100m
$ for i in `seq 0 2 99`; do qemu-io -f qcow2 -c "w ${i}m 1m" src; done
$ nbdkit -U - --filter=log --filter=nozero --filter=blocksize \
  --filter=delay --filter=stats --filter=noextents memory 100m \
  logfile=>(sed -n '/Zero.*\.\./ s/.*\(fast=.\).*/\1/p' |sort|uniq -c) \
  statsfile=/dev/stderr delay-write=20ms delay-zero=5ms maxdata=256k \
  --run 'qemu-img convert -n -f qcow2 -O raw src $nbd' $params

Establish a baseline: when qemu-img does not see write zero support at
all (such as when talking to /dev/nbd0, because the kernel NBD
implementation still does not support write zeroes), qemu is forced to
write the entire disk, including the holes, but doesn't waste any time
pre-zeroing or checking block status for whether the disk is zero (the
default of the nozero filter is to turn off write zero advertisement):

params=
elapsed time: 8.54488 s
write: 400 ops, 104857600 bytes, 9.81712e+07 bits/s

Next, let's emulate what qemu 3.1 was like, with a blind pre-zeroing
pass of the entire image without regard to whether that pass is fast
or slow. For this test, it was easier to use modern qemu and merely
ignore the fast zero bit in nbdkit, but the numbers should be similar
when actually using older qemu.
If qemu guessed right that pre-zeroing is fast, we see:

params='zeromode=plugin fastzeromode=ignore'
elapsed time: 4.30183 s
write: 200 ops, 52428800 bytes, 9.75005e+07 bits/s
zero: 4 ops, 104857600 bytes, 1.95001e+08 bits/s
      4 fast=1

which is a definite win - instead of having to write the half of the
image that was zero on the source, the fast pre-zeroing pass already
cleared it (qemu-img currently breaks write zeroes into 32M chunks [1],
and thus requires 4 zero requests to pre-zero the image). But if qemu
guesses wrong:

params='zeromode=emulate fastzeromode=ignore'
elapsed time: 12.5065 s
write: 600 ops, 157286400 bytes, 1.00611e+08 bits/s
      4 fast=1

Ouch - that is actually slower than the case when zeroing is not used
at all, because the zeroes turned into writes result in performing
double the I/O over the data portions of the file (once during the
pre-zero pass, then again during the data). The qemu 3.1 behavior is
very bi-polar in nature, and we don't like that.

So qemu 4.0 introduced BDRV_REQ_NO_FALLBACK, which qemu uses during the
pre-zero request to fail quickly if pre-zeroing is not viable. At the
time, NBD did not have a way to support fast zero requests, so qemu
blindly assumes that pre-zeroing is not viable over NBD:

params='zeromode=emulate fastzeromode=none'
elapsed time: 8.32433 s
write: 400 ops, 104857600 bytes, 1.00772e+08 bits/s
     50 fast=0

When zeroing is slow, our time actually beats the baseline by about 0.2
seconds (although zeroing still turned into writes, the use of zero
requests results in less network traffic; you also see that there are
50 zero requests, one per hole, rather than 4 requests for pre-zeroing
the image). So we've avoided the pre-zeroing penalty.
However:

params='zeromode=plugin fastzeromode=none'
elapsed time: 4.53951 s
write: 200 ops, 52428800 bytes, 9.23955e+07 bits/s
zero: 50 ops, 52428800 bytes, 9.23955e+07 bits/s
     50 fast=0

when zeroing is fast, we're still 0.2 seconds slower than the
pre-zeroing behavior (zeroing runs fast, but one request per hole is
still more transactions than pre-zeroing used to use). The qemu 4.0
decision thus avoided the worst degradation seen in 3.1 when zeroing is
slow, but at a penalty to the case when zeroing is fast.

Since guessing is never as nice as knowing, let's repeat the test, but
now exploiting the new NBD fast zero:

params='zeromode=emulate'
elapsed time: 8.41174 s
write: 400 ops, 104857600 bytes, 9.9725e+07 bits/s
     50 fast=0
      1 fast=1

Good: when zeroes are not fast, qemu-img's initial fast-zero request
immediately fails, and then it switches back to writing the entire
image using regular zeroing for the holes; performance is comparable to
the baseline and to the qemu 4.0 behavior.

params='zeromode=plugin'
elapsed time: 4.31356 s
write: 200 ops, 52428800 bytes, 9.72354e+07 bits/s
zero: 4 ops, 104857600 bytes, 1.94471e+08 bits/s
      4 fast=1

Good: when zeroes are fast, qemu-img is able to use pre-zeroing on the
entire image, resulting in fewer zero transactions overall, getting us
back to the qemu 3.1 maximum performance (and better than the 4.0
behavior).

I hope you enjoyed reading this far, and agree with my interpretation
of the numbers about why this feature is useful!

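As a rough cross-check of the figures above, here is a small Python
sketch (not part of any of the patches) that predicts each elapsed time
from the op counts reported by the stats filter. It assumes the run
time is dominated by the delays injected by the delay filter (20ms per
write, 5ms per zero) and that everything else is fixed overhead; the
function and table names are made up for illustration.

```python
# Delays injected by the nbdkit delay filter in the tests above.
WRITE_DELAY = 0.020  # delay-write=20ms
ZERO_DELAY = 0.005   # delay-zero=5ms

# (scenario, write ops, zero ops, measured elapsed seconds), copied from
# the stats output above.  With zeromode=emulate the zero requests show
# up as writes, so they are already included in the write count.
RUNS = [
    ("baseline, no zero support", 400, 0, 8.54488),
    ("qemu 3.1 style, zero fast", 200, 4, 4.30183),
    ("qemu 3.1 style, zero slow", 600, 0, 12.5065),
    ("qemu 4.0 style, zero slow", 400, 0, 8.32433),
    ("qemu 4.0 style, zero fast", 200, 50, 4.53951),
    ("fast zero, zero slow", 400, 0, 8.41174),
    ("fast zero, zero fast", 200, 4, 4.31356),
]


def predicted(writes, zeros):
    """Time spent purely in the injected delays (assumed cost model)."""
    return writes * WRITE_DELAY + zeros * ZERO_DELAY


for label, writes, zeros, measured in RUNS:
    p = predicted(writes, zeros)
    print(f"{label:<28} predicted {p:5.2f} s  measured {measured:8.5f} s"
          f"  overhead {measured - p:4.2f} s")
```

Under that assumption, the injected delays alone account for most of
each measured time, which supports reading the differences between
scenarios as differences in the number of write and zero operations.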
[1] Orthogonal to these patches are other ideas I have for improving
the NBD protocol in how it affects qemu-img convert, which will result
in later cross-project patches:
- NBD should have a way to advertise (probably via NBD_INFO_ during
  NBD_OPT_GO) if the initial image is known to begin life with all
  zeroes (if that is the case, qemu-img can skip the extents calls and
  pre-zeroing pass altogether)
- improving support to allow NBD to pass larger zero requests (qemu is
  currently capping zero requests at 32m based on NBD_INFO_BLOCK_SIZE,
  but could easily go up to ~4G with proper info advertisement of
  maximum zero request sizing, or if we introduce 64-bit commands to
  the NBD protocol)

Given that NBD extensions need not be present in every server, each
orthogonal improvement should be tested in isolation to show that it
helps, even though qemu-img will probably use all of the extensions at
once when the server supports all of them.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org