All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes
@ 2016-07-19  4:07 Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 01/14] nbd: Fix bad flag detection on server Eric Blake
                   ` (15 more replies)
  0 siblings, 16 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini

The upstream NBD protocol is proposing an extension for efficient
write zeroes; having a qemu implementation will be one of the reasons
to promote the proposal from experimental to standard:
https://github.com/yoe/nbd/blob/extension-write-zeroes/doc/proto.md

v4 was here:
https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg07463.html

Prerequisites:
v3 Auto-fragment large transactions at the block layer:
https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg03550.html
v2 byte-based block discard:
https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg03592.html

Also available as a tag at:
git fetch git://repo.or.cz/qemu/ericb.git nbd-zero-v5

changes since v4:
- patch 2 - Rebase to latest
- patch 6, 10 - Address review comments [Paolo]
- patch 14 - Update commit message to address review comments [Paolo]

001/14:[----] [--] 'nbd: Fix bad flag detection on server'
002/14:[0004] [FC] 'nbd: Add qemu-nbd -D for human-readable description'
003/14:[----] [--] 'nbd: Limit nbdflags to 16 bits'
004/14:[----] [--] 'nbd: Treat flags vs. command type as separate fields'
005/14:[----] [--] 'nbd: Share common reply-sending code in server'
006/14:[0010] [FC] 'nbd: Send message along with server NBD_REP_ERR errors'
007/14:[----] [--] 'nbd: Share common option-sending code in client'
008/14:[----] [--] 'nbd: Let server know when client gives up negotiation'
009/14:[----] [--] 'nbd: Let client skip portions of server reply'
010/14:[0059] [FC] 'nbd: Less allocation during NBD_OPT_LIST'
011/14:[----] [--] 'nbd: Support shorter handshake'
012/14:[----] [--] 'nbd: Improve server handling of shutdown requests'
013/14:[----] [--] 'nbd: Implement NBD_CMD_WRITE_ZEROES on server'
014/14:[----] [--] 'nbd: Implement NBD_CMD_WRITE_ZEROES on client'

Eric Blake (14):
  nbd: Fix bad flag detection on server
  nbd: Add qemu-nbd -D for human-readable description
  nbd: Limit nbdflags to 16 bits
  nbd: Treat flags vs. command type as separate fields
  nbd: Share common reply-sending code in server
  nbd: Send message along with server NBD_REP_ERR errors
  nbd: Share common option-sending code in client
  nbd: Let server know when client gives up negotiation
  nbd: Let client skip portions of server reply
  nbd: Less allocation during NBD_OPT_LIST
  nbd: Support shorter handshake
  nbd: Improve server handling of shutdown requests
  nbd: Implement NBD_CMD_WRITE_ZEROES on server
  nbd: Implement NBD_CMD_WRITE_ZEROES on client

 block/nbd-client.h  |   4 +-
 include/block/nbd.h |  69 ++++++--
 nbd/nbd-internal.h  |  12 +-
 block/nbd-client.c  |  44 ++++-
 block/nbd.c         |   4 +
 nbd/client.c        | 501 +++++++++++++++++++++++++++-------------------------
 nbd/server.c        | 271 +++++++++++++++++++---------
 qemu-nbd.c          |  16 +-
 qemu-nbd.texi       |   5 +-
 9 files changed, 576 insertions(+), 350 deletions(-)

-- 
2.5.5

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 01/14] nbd: Fix bad flag detection on server
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 02/14] nbd: Add qemu-nbd -D for human-readable description Eric Blake
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini

Commit ab7c548e added a check for invalid flags, but used an
early return on error instead of properly going through the
cleanup label.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v4: new patch
---
 nbd/server.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/nbd/server.c b/nbd/server.c
index 29e2099..3c1e2b3 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1057,7 +1057,8 @@ static ssize_t nbd_co_receive_request(NBDRequest *req,
     if (request->type & ~NBD_CMD_MASK_COMMAND & ~NBD_CMD_FLAG_FUA) {
         LOG("unsupported flags (got 0x%x)",
             request->type & ~NBD_CMD_MASK_COMMAND);
-        return -EINVAL;
+        rc = -EINVAL;
+        goto out;
     }

     rc = 0;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 02/14] nbd: Add qemu-nbd -D for human-readable description
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 01/14] nbd: Fix bad flag detection on server Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 03/14] nbd: Limit nbdflags to 16 bits Eric Blake
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini, Kevin Wolf, Max Reitz

The NBD protocol allows servers to advertise a human-readable
description alongside an export name during NBD_OPT_LIST.  Add
an option to pass through the user's string to the NBD client.

Doing this also makes it easier to test commit 200650d4, which
is the client counterpart of receiving the description.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v5: rebase to latest
v4: rebase to latest
---
 include/block/nbd.h |  1 +
 nbd/nbd-internal.h  |  5 +++--
 nbd/server.c        | 34 ++++++++++++++++++++++++++--------
 qemu-nbd.c          | 12 +++++++++++-
 qemu-nbd.texi       |  5 ++++-
 5 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index cb91820..309db2b 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -114,6 +114,7 @@ BlockBackend *nbd_export_get_blockdev(NBDExport *exp);

 NBDExport *nbd_export_find(const char *name);
 void nbd_export_set_name(NBDExport *exp, const char *name);
+void nbd_export_set_description(NBDExport *exp, const char *description);
 void nbd_export_close_all(void);

 void nbd_client_new(NBDExport *exp,
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 93a6ca8..7e78064 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -104,9 +104,10 @@ static inline ssize_t read_sync(QIOChannel *ioc, void *buffer, size_t size)
     return nbd_wr_syncv(ioc, &iov, 1, size, true);
 }

-static inline ssize_t write_sync(QIOChannel *ioc, void *buffer, size_t size)
+static inline ssize_t write_sync(QIOChannel *ioc, const void *buffer,
+                                 size_t size)
 {
-    struct iovec iov = { .iov_base = buffer, .iov_len = size };
+    struct iovec iov = { .iov_base = (void *) buffer, .iov_len = size };

     return nbd_wr_syncv(ioc, &iov, 1, size, false);
 }
diff --git a/nbd/server.c b/nbd/server.c
index 3c1e2b3..438944c 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -61,6 +61,7 @@ struct NBDExport {

     BlockBackend *blk;
     char *name;
+    char *description;
     off_t dev_offset;
     off_t size;
     uint32_t nbdflags;
@@ -128,7 +129,8 @@ static ssize_t nbd_negotiate_read(QIOChannel *ioc, void *buffer, size_t size)

 }

-static ssize_t nbd_negotiate_write(QIOChannel *ioc, void *buffer, size_t size)
+static ssize_t nbd_negotiate_write(QIOChannel *ioc, const void *buffer,
+                                   size_t size)
 {
     ssize_t ret;
     guint watch;
@@ -224,11 +226,15 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)

 static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
 {
-    uint64_t magic, name_len;
+    uint64_t magic;
+    size_t name_len, desc_len;
     uint32_t opt, type, len;
+    const char *name = exp->name ? exp->name : "";
+    const char *desc = exp->description ? exp->description : "";

-    TRACE("Advertising export name '%s'", exp->name ? exp->name : "");
-    name_len = strlen(exp->name);
+    TRACE("Advertising export name '%s' description '%s'", name, desc);
+    name_len = strlen(name);
+    desc_len = strlen(desc);
     magic = cpu_to_be64(NBD_REP_MAGIC);
     if (nbd_negotiate_write(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
         LOG("write failed (magic)");
@@ -244,18 +250,22 @@ static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
         LOG("write failed (reply type)");
         return -EINVAL;
     }
-    len = cpu_to_be32(name_len + sizeof(len));
+    len = cpu_to_be32(name_len + desc_len + sizeof(len));
     if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
         LOG("write failed (length)");
         return -EINVAL;
     }
     len = cpu_to_be32(name_len);
     if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
-        LOG("write failed (length)");
+        LOG("write failed (name length)");
         return -EINVAL;
     }
-    if (nbd_negotiate_write(ioc, exp->name, name_len) != name_len) {
-        LOG("write failed (buffer)");
+    if (nbd_negotiate_write(ioc, name, name_len) != name_len) {
+        LOG("write failed (name buffer)");
+        return -EINVAL;
+    }
+    if (nbd_negotiate_write(ioc, desc, desc_len) != desc_len) {
+        LOG("write failed (description buffer)");
         return -EINVAL;
     }
     return 0;
@@ -883,6 +893,12 @@ void nbd_export_set_name(NBDExport *exp, const char *name)
     nbd_export_put(exp);
 }

+void nbd_export_set_description(NBDExport *exp, const char *description)
+{
+    g_free(exp->description);
+    exp->description = g_strdup(description);
+}
+
 void nbd_export_close(NBDExport *exp)
 {
     NBDClient *client, *next;
@@ -892,6 +908,7 @@ void nbd_export_close(NBDExport *exp)
         client_close(client);
     }
     nbd_export_set_name(exp, NULL);
+    nbd_export_set_description(exp, NULL);
     nbd_export_put(exp);
 }

@@ -910,6 +927,7 @@ void nbd_export_put(NBDExport *exp)

     if (--exp->refcount == 0) {
         assert(exp->name == NULL);
+        assert(exp->description == NULL);

         if (exp->close) {
             exp->close(exp);
diff --git a/qemu-nbd.c b/qemu-nbd.c
index 321f02b..34be33d 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -82,6 +82,7 @@ static void usage(const char *name)
 "  -t, --persistent          don't exit on the last connection\n"
 "  -v, --verbose             display extra debugging information\n"
 "  -x, --export-name=NAME    expose export by name\n"
+"  -D, --description=TEXT    with -x, also export a human-readable description\n"
 "\n"
 "Exposing part of the image:\n"
 "  -o, --offset=OFFSET       offset into the image\n"
@@ -474,7 +475,7 @@ int main(int argc, char **argv)
     off_t fd_size;
     QemuOpts *sn_opts = NULL;
     const char *sn_id_or_name = NULL;
-    const char *sopt = "hVb:o:p:rsnP:c:dvk:e:f:tl:x:T:";
+    const char *sopt = "hVb:o:p:rsnP:c:dvk:e:f:tl:x:T:D:";
     struct option lopt[] = {
         { "help", no_argument, NULL, 'h' },
         { "version", no_argument, NULL, 'V' },
@@ -500,6 +501,7 @@ int main(int argc, char **argv)
         { "verbose", no_argument, NULL, 'v' },
         { "object", required_argument, NULL, QEMU_NBD_OPT_OBJECT },
         { "export-name", required_argument, NULL, 'x' },
+        { "description", required_argument, NULL, 'D' },
         { "tls-creds", required_argument, NULL, QEMU_NBD_OPT_TLSCREDS },
         { "image-opts", no_argument, NULL, QEMU_NBD_OPT_IMAGE_OPTS },
         { "trace", required_argument, NULL, 'T' },
@@ -520,6 +522,7 @@ int main(int argc, char **argv)
     BlockdevDetectZeroesOptions detect_zeroes = BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF;
     QDict *options = NULL;
     const char *export_name = NULL;
+    const char *export_description = NULL;
     const char *tlscredsid = NULL;
     bool imageOpts = false;
     bool writethrough = true;
@@ -682,6 +685,9 @@ int main(int argc, char **argv)
         case 'x':
             export_name = optarg;
             break;
+        case 'D':
+            export_description = optarg;
+            break;
         case 'v':
             verbose = 1;
             break;
@@ -918,7 +924,11 @@ int main(int argc, char **argv)
     }
     if (export_name) {
         nbd_export_set_name(exp, export_name);
+        nbd_export_set_description(exp, export_description);
         newproto = true;
+    } else if (export_description) {
+        error_report("Export description requires an export name");
+        exit(EXIT_FAILURE);
     }

     server_ioc = qio_channel_socket_new();
diff --git a/qemu-nbd.texi b/qemu-nbd.texi
index 91ebf04..9ea8826 100644
--- a/qemu-nbd.texi
+++ b/qemu-nbd.texi
@@ -79,9 +79,12 @@ Disconnect the device @var{dev}
 Allow up to @var{num} clients to share the device (default @samp{1})
 @item -t, --persistent
 Don't exit on the last connection
-@item -x NAME, --export-name=NAME
+@item -x, --export-name=@var{name}
 Set the NBD volume export name. This switches the server to use
 the new style NBD protocol negotiation
+@item -D, --description=@var{description}
+Set the NBD volume export description, as a human-readable
+string. Requires the use of @option{-x}
 @item --tls-creds=ID
 Enable mandatory TLS encryption for the server by setting the ID
 of the TLS credentials object previously created with the --object
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 03/14] nbd: Limit nbdflags to 16 bits
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 01/14] nbd: Fix bad flag detection on server Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 02/14] nbd: Add qemu-nbd -D for human-readable description Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 04/14] nbd: Treat flags vs. command type as separate fields Eric Blake
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini, qemu-stable, Kevin Wolf, Max Reitz

Rather than asserting that nbdflags is within range, just give
it the correct type to begin with :)  nbdflags corresponds to
the per-export portion of NBD Protocol "transmission flags", which
is 16 bits in response to NBD_OPT_EXPORT_NAME and NBD_OPT_GO.

Furthermore, upstream NBD has never passed the global flags to
the kernel via ioctl(NBD_SET_FLAGS) (the ioctl was first
introduced in NBD 2.9.22; then a latent bug in NBD 3.1 actually
tried to OR the global flags with the transmission flags, with
the disaster that the addition of NBD_FLAG_NO_ZEROES in 3.9
caused all earlier NBD 3.x clients to treat every export as
read-only; NBD 3.10 and later intentionally clip things to 16
bits to pass only transmission flags).  Qemu should follow suit,
since the current two global flags (NBD_FLAG_FIXED_NEWSTYLE
and NBD_FLAG_NO_ZEROES) have no impact on the kernel's behavior
during transmission.

CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>

---
v4: rebase, cc qemu-stable
v3: expand scope of patch
---
 block/nbd-client.h  |  2 +-
 include/block/nbd.h |  6 +++---
 nbd/client.c        | 28 +++++++++++++++-------------
 nbd/server.c        | 10 ++++------
 qemu-nbd.c          |  4 ++--
 5 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/block/nbd-client.h b/block/nbd-client.h
index fa9817b..044aca4 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -20,7 +20,7 @@
 typedef struct NbdClientSession {
     QIOChannelSocket *sioc; /* The master data channel */
     QIOChannel *ioc; /* The current I/O channel which may differ (eg TLS) */
-    uint32_t nbdflags;
+    uint16_t nbdflags;
     off_t size;

     CoMutex send_mutex;
diff --git a/include/block/nbd.h b/include/block/nbd.h
index 309db2b..fd7e30b 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -90,11 +90,11 @@ ssize_t nbd_wr_syncv(QIOChannel *ioc,
                      size_t niov,
                      size_t length,
                      bool do_read);
-int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint32_t *flags,
+int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
                           QCryptoTLSCreds *tlscreds, const char *hostname,
                           QIOChannel **outioc,
                           off_t *size, Error **errp);
-int nbd_init(int fd, QIOChannelSocket *sioc, uint32_t flags, off_t size);
+int nbd_init(int fd, QIOChannelSocket *sioc, uint16_t flags, off_t size);
 ssize_t nbd_send_request(QIOChannel *ioc, struct nbd_request *request);
 ssize_t nbd_receive_reply(QIOChannel *ioc, struct nbd_reply *reply);
 int nbd_client(int fd);
@@ -104,7 +104,7 @@ typedef struct NBDExport NBDExport;
 typedef struct NBDClient NBDClient;

 NBDExport *nbd_export_new(BlockBackend *blk, off_t dev_offset, off_t size,
-                          uint32_t nbdflags, void (*close)(NBDExport *),
+                          uint16_t nbdflags, void (*close)(NBDExport *),
                           Error **errp);
 void nbd_export_close(NBDExport *exp);
 void nbd_export_get(NBDExport *exp);
diff --git a/nbd/client.c b/nbd/client.c
index 78a7195..a92f1e2 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -408,7 +408,7 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc,
 }


-int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint32_t *flags,
+int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
                           QCryptoTLSCreds *tlscreds, const char *hostname,
                           QIOChannel **outioc,
                           off_t *size, Error **errp)
@@ -468,7 +468,6 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint32_t *flags,
         uint32_t opt;
         uint32_t namesize;
         uint16_t globalflags;
-        uint16_t exportflags;
         bool fixedNewStyle = false;

         if (read_sync(ioc, &globalflags, sizeof(globalflags)) !=
@@ -477,7 +476,6 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint32_t *flags,
             goto fail;
         }
         globalflags = be16_to_cpu(globalflags);
-        *flags = globalflags << 16;
         TRACE("Global flags are %" PRIx32, globalflags);
         if (globalflags & NBD_FLAG_FIXED_NEWSTYLE) {
             fixedNewStyle = true;
@@ -545,17 +543,15 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint32_t *flags,
             goto fail;
         }
         *size = be64_to_cpu(s);
-        TRACE("Size is %" PRIu64, *size);

-        if (read_sync(ioc, &exportflags, sizeof(exportflags)) !=
-            sizeof(exportflags)) {
+        if (read_sync(ioc, flags, sizeof(*flags)) != sizeof(*flags)) {
             error_setg(errp, "Failed to read export flags");
             goto fail;
         }
-        exportflags = be16_to_cpu(exportflags);
-        *flags |= exportflags;
-        TRACE("Export flags are %" PRIx16, exportflags);
+        be16_to_cpus(flags);
     } else if (magic == NBD_CLIENT_MAGIC) {
+        uint32_t oldflags;
+
         if (name) {
             error_setg(errp, "Server does not support export names");
             goto fail;
@@ -572,16 +568,22 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint32_t *flags,
         *size = be64_to_cpu(s);
         TRACE("Size is %" PRIu64, *size);

-        if (read_sync(ioc, flags, sizeof(*flags)) != sizeof(*flags)) {
+        if (read_sync(ioc, &oldflags, sizeof(oldflags)) != sizeof(oldflags)) {
             error_setg(errp, "Failed to read export flags");
             goto fail;
         }
-        *flags = be32_to_cpu(*flags);
+        be32_to_cpus(&oldflags);
+        if (oldflags & ~0xffff) {
+            error_setg(errp, "Unexpected export flags %0x" PRIx32, oldflags);
+            goto fail;
+        }
+        *flags = oldflags;
     } else {
         error_setg(errp, "Bad magic received");
         goto fail;
     }

+    TRACE("Size is %" PRIu64 ", export flags %" PRIx16, *size, *flags);
     if (read_sync(ioc, &buf, 124) != 124) {
         error_setg(errp, "Failed to read reserved block");
         goto fail;
@@ -593,7 +595,7 @@ fail:
 }

 #ifdef __linux__
-int nbd_init(int fd, QIOChannelSocket *sioc, uint32_t flags, off_t size)
+int nbd_init(int fd, QIOChannelSocket *sioc, uint16_t flags, off_t size)
 {
     unsigned long sectors = size / BDRV_SECTOR_SIZE;
     if (size / BDRV_SECTOR_SIZE != sectors) {
@@ -689,7 +691,7 @@ int nbd_disconnect(int fd)
 }

 #else
-int nbd_init(int fd, QIOChannelSocket *ioc, uint32_t flags, off_t size)
+int nbd_init(int fd, QIOChannelSocket *ioc, uint16_t flags, off_t size)
 {
     return -ENOTSUP;
 }
diff --git a/nbd/server.c b/nbd/server.c
index 438944c..6b68706 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -64,7 +64,7 @@ struct NBDExport {
     char *description;
     off_t dev_offset;
     off_t size;
-    uint32_t nbdflags;
+    uint16_t nbdflags;
     QTAILQ_HEAD(, NBDClient) clients;
     QTAILQ_ENTRY(NBDExport) next;

@@ -554,8 +554,8 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
     NBDClient *client = data->client;
     char buf[8 + 8 + 8 + 128];
     int rc;
-    const int myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM |
-                         NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA);
+    const uint16_t myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM |
+                              NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA);
     bool oldStyle;

     /* Old style negotiation header without options
@@ -585,7 +585,6 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)

     oldStyle = client->exp != NULL && !client->tlscreds;
     if (oldStyle) {
-        assert ((client->exp->nbdflags & ~65535) == 0);
         TRACE("advertising size %" PRIu64 " and flags %x",
               client->exp->size, client->exp->nbdflags | myflags);
         stq_be_p(buf + 8, NBD_CLIENT_MAGIC);
@@ -616,7 +615,6 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
             goto fail;
         }

-        assert ((client->exp->nbdflags & ~65535) == 0);
         TRACE("advertising size %" PRIu64 " and flags %x",
               client->exp->size, client->exp->nbdflags | myflags);
         stq_be_p(buf + 18, client->exp->size);
@@ -820,7 +818,7 @@ static void nbd_eject_notifier(Notifier *n, void *data)
 }

 NBDExport *nbd_export_new(BlockBackend *blk, off_t dev_offset, off_t size,
-                          uint32_t nbdflags, void (*close)(NBDExport *),
+                          uint16_t nbdflags, void (*close)(NBDExport *),
                           Error **errp)
 {
     NBDExport *exp = g_malloc0(sizeof(NBDExport));
diff --git a/qemu-nbd.c b/qemu-nbd.c
index 34be33d..64f863c 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -252,7 +252,7 @@ static void *nbd_client_thread(void *arg)
 {
     char *device = arg;
     off_t size;
-    uint32_t nbdflags;
+    uint16_t nbdflags;
     QIOChannelSocket *sioc;
     int fd;
     int ret;
@@ -466,7 +466,7 @@ int main(int argc, char **argv)
     BlockBackend *blk;
     BlockDriverState *bs;
     off_t dev_offset = 0;
-    uint32_t nbdflags = 0;
+    uint16_t nbdflags = 0;
     bool disconnect = false;
     const char *bindto = "0.0.0.0";
     const char *port = NULL;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 04/14] nbd: Treat flags vs. command type as separate fields
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (2 preceding siblings ...)
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 03/14] nbd: Limit nbdflags to 16 bits Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 05/14] nbd: Share common reply-sending code in server Eric Blake
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini, Kevin Wolf, Max Reitz

Current upstream NBD documents that requests have a 16-bit flags,
followed by a 16-bit type integer; although older versions mentioned
only a 32-bit field with masking to find flags.  Since the protocol
is in network order (big-endian over the wire), the ABI is unchanged;
but dealing with the flags as a separate field rather than masking
will make it easier to add support for upcoming NBD extensions that
increase the number of both flags and commands.

Improve some comments in nbd.h based on the current upstream
NBD protocol (https://github.com/yoe/nbd/blob/master/doc/proto.md),
and touch some nearby code to keep checkpatch.pl happy.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v4: rebase to earlier changes
v3: rebase to other changes earlier in series
---
 include/block/nbd.h | 18 ++++++++++++------
 nbd/nbd-internal.h  |  4 ++--
 block/nbd-client.c  |  9 +++------
 nbd/client.c        |  9 ++++++---
 nbd/server.c        | 39 +++++++++++++++++++--------------------
 5 files changed, 42 insertions(+), 37 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index fd7e30b..6904eb4 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -1,4 +1,5 @@
 /*
+ *  Copyright (C) 2016 Red Hat, Inc.
  *  Copyright (C) 2005  Anthony Liguori <anthony@codemonkey.ws>
  *
  *  Network Block Device
@@ -32,7 +33,8 @@ struct nbd_request {
     uint64_t handle;
     uint64_t from;
     uint32_t len;
-    uint32_t type;
+    uint16_t flags;
+    uint16_t type;
 };

 struct nbd_reply {
@@ -40,6 +42,8 @@ struct nbd_reply {
     uint32_t error;
 };

+/* Transmission (export) flags: sent from server to client during handshake,
+   but describe what will happen during transmission */
 #define NBD_FLAG_HAS_FLAGS      (1 << 0)        /* Flags are there */
 #define NBD_FLAG_READ_ONLY      (1 << 1)        /* Device is read-only */
 #define NBD_FLAG_SEND_FLUSH     (1 << 2)        /* Send FLUSH */
@@ -47,10 +51,12 @@ struct nbd_reply {
 #define NBD_FLAG_ROTATIONAL     (1 << 4)        /* Use elevator algorithm - rotational media */
 #define NBD_FLAG_SEND_TRIM      (1 << 5)        /* Send TRIM (discard) */

-/* New-style global flags. */
+/* New-style handshake (global) flags, sent from server to client, and
+   control what will happen during handshake phase. */
 #define NBD_FLAG_FIXED_NEWSTYLE     (1 << 0)    /* Fixed newstyle protocol. */

-/* New-style client flags. */
+/* New-style client flags, sent from client to server to control what happens
+   during handshake phase. */
 #define NBD_FLAG_C_FIXED_NEWSTYLE   (1 << 0)    /* Fixed newstyle protocol. */

 /* Reply types. */
@@ -61,10 +67,10 @@ struct nbd_reply {
 #define NBD_REP_ERR_INVALID     ((UINT32_C(1) << 31) | 3) /* Invalid length. */
 #define NBD_REP_ERR_TLS_REQD    ((UINT32_C(1) << 31) | 5) /* TLS required */

+/* Request flags, sent from client to server during transmission phase */
+#define NBD_CMD_FLAG_FUA        (1 << 0)

-#define NBD_CMD_MASK_COMMAND	0x0000ffff
-#define NBD_CMD_FLAG_FUA	(1 << 16)
-
+/* Supported request types */
 enum {
     NBD_CMD_READ = 0,
     NBD_CMD_WRITE = 1,
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 7e78064..99e5157 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -53,10 +53,10 @@
 /* This is all part of the "official" NBD API.
  *
  * The most up-to-date documentation is available at:
- * https://github.com/yoe/nbd/blob/master/doc/proto.txt
+ * https://github.com/yoe/nbd/blob/master/doc/proto.md
  */

-#define NBD_REQUEST_SIZE        (4 + 4 + 8 + 8 + 4)
+#define NBD_REQUEST_SIZE        (4 + 2 + 2 + 8 + 8 + 4)
 #define NBD_REPLY_SIZE          (4 + 4 + 8)
 #define NBD_REQUEST_MAGIC       0x25609513
 #define NBD_REPLY_MAGIC         0x67446698
diff --git a/block/nbd-client.c b/block/nbd-client.c
index 2cf3237..7e9c3ec 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -1,6 +1,7 @@
 /*
  * QEMU Block driver for  NBD
  *
+ * Copyright (C) 2016 Red Hat, Inc.
  * Copyright (C) 2008 Bull S.A.S.
  *     Author: Laurent Vivier <Laurent.Vivier@bull.net>
  *
@@ -258,7 +259,7 @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,

     if (flags & BDRV_REQ_FUA) {
         assert(client->nbdflags & NBD_FLAG_SEND_FUA);
-        request.type |= NBD_CMD_FLAG_FUA;
+        request.flags |= NBD_CMD_FLAG_FUA;
     }

     assert(bytes <= NBD_MAX_BUFFER_SIZE);
@@ -343,11 +344,7 @@ void nbd_client_attach_aio_context(BlockDriverState *bs,
 void nbd_client_close(BlockDriverState *bs)
 {
     NbdClientSession *client = nbd_get_client_session(bs);
-    struct nbd_request request = {
-        .type = NBD_CMD_DISC,
-        .from = 0,
-        .len = 0
-    };
+    struct nbd_request request = { .type = NBD_CMD_DISC };

     if (client->ioc == NULL) {
         return;
diff --git a/nbd/client.c b/nbd/client.c
index a92f1e2..7c172ed 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1,4 +1,5 @@
 /*
+ *  Copyright (C) 2016 Red Hat, Inc.
  *  Copyright (C) 2005  Anthony Liguori <anthony@codemonkey.ws>
  *
  *  Network Block Device Client Side
@@ -713,11 +714,13 @@ ssize_t nbd_send_request(QIOChannel *ioc, struct nbd_request *request)

     TRACE("Sending request to server: "
           "{ .from = %" PRIu64", .len = %" PRIu32 ", .handle = %" PRIu64
-          ", .type=%" PRIu32 " }",
-          request->from, request->len, request->handle, request->type);
+          ", .flags = %" PRIx16 ", .type = %" PRIu16 " }",
+          request->from, request->len, request->handle,
+          request->flags, request->type);

     stl_be_p(buf, NBD_REQUEST_MAGIC);
-    stl_be_p(buf + 4, request->type);
+    stw_be_p(buf + 4, request->flags);
+    stw_be_p(buf + 6, request->type);
     stq_be_p(buf + 8, request->handle);
     stq_be_p(buf + 16, request->from);
     stl_be_p(buf + 24, request->len);
diff --git a/nbd/server.c b/nbd/server.c
index 6b68706..85c4f5d 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1,4 +1,5 @@
 /*
+ *  Copyright (C) 2016 Red Hat, Inc.
  *  Copyright (C) 2005  Anthony Liguori <anthony@codemonkey.ws>
  *
  *  Network Block Device Server Side
@@ -650,21 +651,23 @@ static ssize_t nbd_receive_request(QIOChannel *ioc, struct nbd_request *request)

     /* Request
        [ 0 ..  3]   magic   (NBD_REQUEST_MAGIC)
-       [ 4 ..  7]   type    (0 == READ, 1 == WRITE)
+       [ 4 ..  5]   flags   (NBD_CMD_FLAG_FUA, ...)
+       [ 6 ..  7]   type    (NBD_CMD_READ, ...)
        [ 8 .. 15]   handle
        [16 .. 23]   from
        [24 .. 27]   len
      */

     magic = ldl_be_p(buf);
-    request->type   = ldl_be_p(buf + 4);
+    request->flags  = lduw_be_p(buf + 4);
+    request->type   = lduw_be_p(buf + 6);
     request->handle = ldq_be_p(buf + 8);
     request->from   = ldq_be_p(buf + 16);
     request->len    = ldl_be_p(buf + 24);

-    TRACE("Got request: { magic = 0x%" PRIx32 ", .type = %" PRIx32
-          ", from = %" PRIu64 " , len = %" PRIu32 " }",
-          magic, request->type, request->from, request->len);
+    TRACE("Got request: { magic = 0x%" PRIx32 ", .flags = %" PRIx16
+          ", .type = %" PRIx16 ", from = %" PRIu64 ", len = %" PRIu32 " }",
+          magic, request->flags, request->type, request->from, request->len);

     if (magic != NBD_REQUEST_MAGIC) {
         LOG("invalid magic (got 0x%" PRIx32 ")", magic);
@@ -997,7 +1000,6 @@ static ssize_t nbd_co_receive_request(NBDRequest *req,
                                       struct nbd_request *request)
 {
     NBDClient *client = req->client;
-    uint32_t command;
     ssize_t rc;

     g_assert(qemu_in_coroutine());
@@ -1014,13 +1016,12 @@ static ssize_t nbd_co_receive_request(NBDRequest *req,

     TRACE("Decoding type");

-    command = request->type & NBD_CMD_MASK_COMMAND;
-    if (command != NBD_CMD_WRITE) {
+    if (request->type != NBD_CMD_WRITE) {
         /* No payload, we are ready to read the next request.  */
         req->complete = true;
     }

-    if (command == NBD_CMD_DISC) {
+    if (request->type == NBD_CMD_DISC) {
         /* Special case: we're going to disconnect without a reply,
          * whether or not flags, from, or len are bogus */
         TRACE("Request type is DISCONNECT");
@@ -1037,7 +1038,7 @@ static ssize_t nbd_co_receive_request(NBDRequest *req,
         goto out;
     }

-    if (command == NBD_CMD_READ || command == NBD_CMD_WRITE) {
+    if (request->type == NBD_CMD_READ || request->type == NBD_CMD_WRITE) {
         if (request->len > NBD_MAX_BUFFER_SIZE) {
             LOG("len (%" PRIu32" ) is larger than max len (%u)",
                 request->len, NBD_MAX_BUFFER_SIZE);
@@ -1051,7 +1052,7 @@ static ssize_t nbd_co_receive_request(NBDRequest *req,
             goto out;
         }
     }
-    if (command == NBD_CMD_WRITE) {
+    if (request->type == NBD_CMD_WRITE) {
         TRACE("Reading %" PRIu32 " byte(s)", request->len);

         if (read_sync(client->ioc, req->data, request->len) != request->len) {
@@ -1067,12 +1068,11 @@ static ssize_t nbd_co_receive_request(NBDRequest *req,
         LOG("operation past EOF; From: %" PRIu64 ", Len: %" PRIu32
             ", Size: %" PRIu64, request->from, request->len,
             (uint64_t)client->exp->size);
-        rc = command == NBD_CMD_WRITE ? -ENOSPC : -EINVAL;
+        rc = request->type == NBD_CMD_WRITE ? -ENOSPC : -EINVAL;
         goto out;
     }
-    if (request->type & ~NBD_CMD_MASK_COMMAND & ~NBD_CMD_FLAG_FUA) {
-        LOG("unsupported flags (got 0x%x)",
-            request->type & ~NBD_CMD_MASK_COMMAND);
+    if (request->flags & ~NBD_CMD_FLAG_FUA) {
+        LOG("unsupported flags (got 0x%x)", request->flags);
         rc = -EINVAL;
         goto out;
     }
@@ -1094,7 +1094,6 @@ static void nbd_trip(void *opaque)
     struct nbd_request request;
     struct nbd_reply reply;
     ssize_t ret;
-    uint32_t command;
     int flags;

     TRACE("Reading request.");
@@ -1118,7 +1117,6 @@ static void nbd_trip(void *opaque)
         reply.error = -ret;
         goto error_reply;
     }
-    command = request.type & NBD_CMD_MASK_COMMAND;

     if (client->closing) {
         /*
@@ -1128,11 +1126,12 @@ static void nbd_trip(void *opaque)
         goto done;
     }

-    switch (command) {
+    switch (request.type) {
     case NBD_CMD_READ:
         TRACE("Request type is READ");

-        if (request.type & NBD_CMD_FLAG_FUA) {
+        /* XXX: NBD Protocol only documents use of FUA with WRITE */
+        if (request.flags & NBD_CMD_FLAG_FUA) {
             ret = blk_co_flush(exp->blk);
             if (ret < 0) {
                 LOG("flush failed");
@@ -1165,7 +1164,7 @@ static void nbd_trip(void *opaque)
         TRACE("Writing to device");

         flags = 0;
-        if (request.type & NBD_CMD_FLAG_FUA) {
+        if (request.flags & NBD_CMD_FLAG_FUA) {
             flags |= BDRV_REQ_FUA;
         }
         ret = blk_pwrite(exp->blk, request.from + exp->dev_offset,
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 05/14] nbd: Share common reply-sending code in server
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (3 preceding siblings ...)
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 04/14] nbd: Treat flags vs. command type as separate fields Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  5:10   ` Fam Zheng
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 06/14] nbd: Send message along with server NBD_REP_ERR errors Eric Blake
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini

Rather than open-coding NBD_REP_SERVER, reuse the code we
already have by adding a length parameter.  Additionally,
the refactoring will make adding NBD_OPT_GO in a later patch
easier.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v4: no change
v3: rebase to changes earlier in series
---
 nbd/server.c | 48 +++++++++++++++++++++++-------------------------
 1 file changed, 23 insertions(+), 25 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 85c4f5d..c8716f1 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -195,12 +195,15 @@ static ssize_t nbd_negotiate_drop_sync(QIOChannel *ioc, size_t size)

 */

-static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
+/* Send a reply header, including length, but no payload.
+ * Return -errno to kill connection, 0 to continue negotiation. */
+static int nbd_negotiate_send_rep_len(QIOChannel *ioc, uint32_t type,
+                                      uint32_t opt, uint32_t len)
 {
     uint64_t magic;
-    uint32_t len;

-    TRACE("Reply opt=%" PRIx32 " type=%" PRIx32, type, opt);
+    TRACE("Reply opt=%" PRIx32 " type=%" PRIx32 " len=%" PRIu32,
+          type, opt, len);

     magic = cpu_to_be64(NBD_REP_MAGIC);
     if (nbd_negotiate_write(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
@@ -217,7 +220,7 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
         LOG("write failed (rep type)");
         return -EINVAL;
     }
-    len = cpu_to_be32(0);
+    len = cpu_to_be32(len);
     if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
         LOG("write failed (rep data length)");
         return -EINVAL;
@@ -225,37 +228,32 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
     return 0;
 }

+/* Send a reply header with default 0 length.
+ * Return -errno to kill connection, 0 to continue negotiation. */
+static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
+{
+    return nbd_negotiate_send_rep_len(ioc, type, opt, 0);
+}
+
+/* Send an NBD_REP_SERVER reply to NBD_OPT_LIST, including payload.
+ * Return -errno to kill connection, 0 to continue negotiation. */
 static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
 {
-    uint64_t magic;
     size_t name_len, desc_len;
-    uint32_t opt, type, len;
+    uint32_t len;
     const char *name = exp->name ? exp->name : "";
     const char *desc = exp->description ? exp->description : "";
+    int rc;

     TRACE("Advertising export name '%s' description '%s'", name, desc);
     name_len = strlen(name);
     desc_len = strlen(desc);
-    magic = cpu_to_be64(NBD_REP_MAGIC);
-    if (nbd_negotiate_write(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
-        LOG("write failed (magic)");
-        return -EINVAL;
-     }
-    opt = cpu_to_be32(NBD_OPT_LIST);
-    if (nbd_negotiate_write(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
-        LOG("write failed (opt)");
-        return -EINVAL;
-    }
-    type = cpu_to_be32(NBD_REP_SERVER);
-    if (nbd_negotiate_write(ioc, &type, sizeof(type)) != sizeof(type)) {
-        LOG("write failed (reply type)");
-        return -EINVAL;
-    }
-    len = cpu_to_be32(name_len + desc_len + sizeof(len));
-    if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
-        LOG("write failed (length)");
-        return -EINVAL;
+    len = name_len + desc_len + sizeof(len);
+    rc = nbd_negotiate_send_rep_len(ioc, NBD_REP_SERVER, NBD_OPT_LIST, len);
+    if (rc < 0) {
+        return rc;
     }
+
     len = cpu_to_be32(name_len);
     if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
         LOG("write failed (name length)");
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 06/14] nbd: Send message along with server NBD_REP_ERR errors
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (4 preceding siblings ...)
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 05/14] nbd: Share common reply-sending code in server Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  5:15   ` Fam Zheng
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 07/14] nbd: Share common option-sending code in client Eric Blake
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini

The NBD Protocol allows us to send human-readable messages
along with any NBD_REP_ERR error during option negotiation;
make use of this fact for clients that know what to do with
our message.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v5: don't leak 'msg'
v4: new patch
---
 nbd/server.c | 78 +++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 59 insertions(+), 19 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index c8716f1..ad31c4a 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -235,6 +235,38 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
     return nbd_negotiate_send_rep_len(ioc, type, opt, 0);
 }

+/* Send an error reply.
+ * Return -errno to kill connection, 0 to continue negotiation. */
+static int GCC_FMT_ATTR(4, 5)
+    nbd_negotiate_send_rep_err(QIOChannel *ioc, uint32_t type,
+                               uint32_t opt, const char *fmt, ...)
+{
+    va_list va;
+    char *msg;
+    int ret;
+    size_t len;
+
+    va_start(va, fmt);
+    msg = g_strdup_vprintf(fmt, va);
+    va_end(va);
+    len = strlen(msg);
+    assert(len < 4096);
+    TRACE("sending error message \"%s\"", msg);
+    ret = nbd_negotiate_send_rep_len(ioc, type, opt, len);
+    if (ret < 0) {
+        goto out;
+    }
+    if (nbd_negotiate_write(ioc, msg, len) != len) {
+        LOG("write failed (error message)");
+        ret = -EIO;
+    } else {
+        ret = 0;
+    }
+out:
+    g_free(msg);
+    return ret;
+}
+
 /* Send an NBD_REP_SERVER reply to NBD_OPT_LIST, including payload.
  * Return -errno to kill connection, 0 to continue negotiation. */
 static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
@@ -278,8 +310,9 @@ static int nbd_negotiate_handle_list(NBDClient *client, uint32_t length)
         if (nbd_negotiate_drop_sync(client->ioc, length) != length) {
             return -EIO;
         }
-        return nbd_negotiate_send_rep(client->ioc,
-                                      NBD_REP_ERR_INVALID, NBD_OPT_LIST);
+        return nbd_negotiate_send_rep_err(client->ioc,
+                                          NBD_REP_ERR_INVALID, NBD_OPT_LIST,
+                                          "OPT_LIST should not have length");
     }

     /* For each export, send a NBD_REP_SERVER reply. */
@@ -326,7 +359,8 @@ fail:
     return rc;
 }

-
+/* Handle NBD_OPT_STARTTLS. Return NULL to drop connection, or else the
+ * new channel for all further (now-encrypted) communication. */
 static QIOChannel *nbd_negotiate_handle_starttls(NBDClient *client,
                                                  uint32_t length)
 {
@@ -340,7 +374,8 @@ static QIOChannel *nbd_negotiate_handle_starttls(NBDClient *client,
         if (nbd_negotiate_drop_sync(ioc, length) != length) {
             return NULL;
         }
-        nbd_negotiate_send_rep(ioc, NBD_REP_ERR_INVALID, NBD_OPT_STARTTLS);
+        nbd_negotiate_send_rep_err(ioc, NBD_REP_ERR_INVALID, NBD_OPT_STARTTLS,
+                                   "OPT_STARTTLS should not have length");
         return NULL;
     }

@@ -468,13 +503,15 @@ static int nbd_negotiate_options(NBDClient *client)
                 return -EINVAL;

             default:
-                TRACE("Option 0x%" PRIx32 " not permitted before TLS",
-                      clientflags);
                 if (nbd_negotiate_drop_sync(client->ioc, length) != length) {
                     return -EIO;
                 }
-                ret = nbd_negotiate_send_rep(client->ioc, NBD_REP_ERR_TLS_REQD,
-                                             clientflags);
+                ret = nbd_negotiate_send_rep_err(client->ioc,
+                                                 NBD_REP_ERR_TLS_REQD,
+                                                 clientflags,
+                                                 "Option 0x%" PRIx32
+                                                 "not permitted before TLS",
+                                                 clientflags);
                 if (ret < 0) {
                     return ret;
                 }
@@ -500,27 +537,30 @@ static int nbd_negotiate_options(NBDClient *client)
                     return -EIO;
                 }
                 if (client->tlscreds) {
-                    TRACE("TLS already enabled");
-                    ret = nbd_negotiate_send_rep(client->ioc,
-                                                 NBD_REP_ERR_INVALID,
-                                                 clientflags);
+                    ret = nbd_negotiate_send_rep_err(client->ioc,
+                                                     NBD_REP_ERR_INVALID,
+                                                     clientflags,
+                                                     "TLS already enabled");
                 } else {
-                    TRACE("TLS not configured");
-                    ret = nbd_negotiate_send_rep(client->ioc,
-                                                 NBD_REP_ERR_POLICY,
-                                                 clientflags);
+                    ret = nbd_negotiate_send_rep_err(client->ioc,
+                                                     NBD_REP_ERR_POLICY,
+                                                     clientflags,
+                                                     "TLS not configured");
                 }
                 if (ret < 0) {
                     return ret;
                 }
                 break;
             default:
-                TRACE("Unsupported option 0x%" PRIx32, clientflags);
                 if (nbd_negotiate_drop_sync(client->ioc, length) != length) {
                     return -EIO;
                 }
-                ret = nbd_negotiate_send_rep(client->ioc, NBD_REP_ERR_UNSUP,
-                                             clientflags);
+                ret = nbd_negotiate_send_rep_err(client->ioc,
+                                                 NBD_REP_ERR_UNSUP,
+                                                 clientflags,
+                                                 "Unsupported option 0x%"
+                                                 PRIx32,
+                                                 clientflags);
                 if (ret < 0) {
                     return ret;
                 }
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 07/14] nbd: Share common option-sending code in client
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (5 preceding siblings ...)
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 06/14] nbd: Send message along with server NBD_REP_ERR errors Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  5:31   ` Fam Zheng
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 08/14] nbd: Let server know when client gives up negotiation Eric Blake
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini, Kevin Wolf, Max Reitz

Rather than open-coding each option request, it's easier to
have common helper functions do the work.  That in turn requires
having convenient packed types for handling option requests
and replies.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v4: rebase
v3: rebase, tweak a debug message
---
 include/block/nbd.h |  25 +++++-
 nbd/nbd-internal.h  |   2 +-
 nbd/client.c        | 250 ++++++++++++++++++++++------------------------------
 3 files changed, 126 insertions(+), 151 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 6904eb4..7c5103f 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -26,15 +26,34 @@
 #include "io/channel-socket.h"
 #include "crypto/tlscreds.h"

-/* Note: these are _NOT_ the same as the network representation of an NBD
+/* Handshake phase structs - this struct is passed on the wire */
+
+struct nbd_option {
+    uint64_t magic; /* NBD_OPTS_MAGIC */
+    uint32_t option; /* NBD_OPT_* */
+    uint32_t length;
+} QEMU_PACKED;
+typedef struct nbd_option nbd_option;
+
+struct nbd_opt_reply {
+    uint64_t magic; /* NBD_REP_MAGIC */
+    uint32_t option; /* NBD_OPT_* */
+    uint32_t type; /* NBD_REP_* */
+    uint32_t length;
+} QEMU_PACKED;
+typedef struct nbd_opt_reply nbd_opt_reply;
+
+/* Transmission phase structs
+ *
+ * Note: these are _NOT_ the same as the network representation of an NBD
  * request and reply!
  */
 struct nbd_request {
     uint64_t handle;
     uint64_t from;
     uint32_t len;
-    uint16_t flags;
-    uint16_t type;
+    uint16_t flags; /* NBD_CMD_FLAG_* */
+    uint16_t type; /* NBD_CMD_* */
 };

 struct nbd_reply {
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index 99e5157..dd57e18 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -62,7 +62,7 @@
 #define NBD_REPLY_MAGIC         0x67446698
 #define NBD_OPTS_MAGIC          0x49484156454F5054LL
 #define NBD_CLIENT_MAGIC        0x0000420281861253LL
-#define NBD_REP_MAGIC           0x3e889045565a9LL
+#define NBD_REP_MAGIC           0x0003e889045565a9LL

 #define NBD_SET_SOCK            _IO(0xab, 0)
 #define NBD_SET_BLKSIZE         _IO(0xab, 1)
diff --git a/nbd/client.c b/nbd/client.c
index 7c172ed..6e5288a 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -75,64 +75,123 @@ static QTAILQ_HEAD(, NBDExport) exports = QTAILQ_HEAD_INITIALIZER(exports);

 */

+/* Send an option request. Return 0 if successful, -1 with errp set if
+ * it is impossible to continue. */
+static int nbd_send_option_request(QIOChannel *ioc, uint32_t opt,
+                                   uint32_t len, const char *data,
+                                   Error **errp)
+{
+    nbd_option req;
+    QEMU_BUILD_BUG_ON(sizeof(req) != 16);

-/* If type represents success, return 1 without further action.
- * If type represents an error reply, consume the rest of the packet on ioc.
- * Then return 0 for unsupported (so the client can fall back to
- * other approaches), or -1 with errp set for other errors.
+    if (len == -1) {
+        req.length = len = strlen(data);
+    }
+    TRACE("Sending option request %"PRIu32", len %"PRIu32, opt, len);
+
+    stq_be_p(&req.magic, NBD_OPTS_MAGIC);
+    stl_be_p(&req.option, opt);
+    stl_be_p(&req.length, len);
+
+    if (write_sync(ioc, &req, sizeof(req)) != sizeof(req)) {
+        error_setg(errp, "Failed to send option request header");
+        return -1;
+    }
+
+    if (len && write_sync(ioc, (char *) data, len) != len) {
+        error_setg(errp, "Failed to send option request data");
+        return -1;
+    }
+
+    return 0;
+}
+
+/* Receive the header of an option reply, which should match the given
+ * opt.  Read through the length field, but NOT the length bytes of
+ * payload. Return 0 if successful, -1 with errp set if it is
+ * impossible to continue. */
+static int nbd_receive_option_reply(QIOChannel *ioc, uint32_t opt,
+                                    nbd_opt_reply *reply, Error **errp)
+{
+    QEMU_BUILD_BUG_ON(sizeof(*reply) != 20);
+    if (read_sync(ioc, reply, sizeof(*reply)) != sizeof(*reply)) {
+        error_setg(errp, "failed to read option reply");
+        return -1;
+    }
+    be64_to_cpus(&reply->magic);
+    be32_to_cpus(&reply->option);
+    be32_to_cpus(&reply->type);
+    be32_to_cpus(&reply->length);
+
+    TRACE("Received option reply %"PRIx32", type %"PRIx32", len %"PRIu32,
+          reply->option, reply->type, reply->length);
+
+    if (reply->magic != NBD_REP_MAGIC) {
+        error_setg(errp, "Unexpected option reply magic");
+        return -1;
+    }
+    if (reply->option != opt) {
+        error_setg(errp, "Unexpected option type %x expected %x",
+                   reply->option, opt);
+        return -1;
+    }
+    return 0;
+}
+
+/* If reply represents success, return 1 without further action.
+ * If reply represents an error, consume the optional payload of
+ * the packet on ioc.  Then return 0 for unsupported (so the client
+ * can fall back to other approaches), or -1 with errp set for other
+ * errors.
  */
-static int nbd_handle_reply_err(QIOChannel *ioc, uint32_t opt, uint32_t type,
+static int nbd_handle_reply_err(QIOChannel *ioc, nbd_opt_reply *reply,
                                 Error **errp)
 {
-    uint32_t len;
     char *msg = NULL;
     int result = -1;

-    if (!(type & (1 << 31))) {
+    if (!(reply->type & (1 << 31))) {
         return 1;
     }

-    if (read_sync(ioc, &len, sizeof(len)) != sizeof(len)) {
-        error_setg(errp, "failed to read option length");
-        return -1;
-    }
-    len = be32_to_cpu(len);
-    if (len) {
-        if (len > NBD_MAX_BUFFER_SIZE) {
+    if (reply->length) {
+        if (reply->length > NBD_MAX_BUFFER_SIZE) {
             error_setg(errp, "server's error message is too long");
             goto cleanup;
         }
-        msg = g_malloc(len + 1);
-        if (read_sync(ioc, msg, len) != len) {
+        msg = g_malloc(reply->length + 1);
+        if (read_sync(ioc, msg, reply->length) != reply->length) {
             error_setg(errp, "failed to read option error message");
             goto cleanup;
         }
-        msg[len] = '\0';
+        msg[reply->length] = '\0';
     }

-    switch (type) {
+    switch (reply->type) {
     case NBD_REP_ERR_UNSUP:
         TRACE("server doesn't understand request %" PRIx32
-              ", attempting fallback", opt);
+              ", attempting fallback", reply->option);
         result = 0;
         goto cleanup;

     case NBD_REP_ERR_POLICY:
-        error_setg(errp, "Denied by server for option %" PRIx32, opt);
+        error_setg(errp, "Denied by server for option %" PRIx32,
+                   reply->option);
         break;

     case NBD_REP_ERR_INVALID:
-        error_setg(errp, "Invalid data length for option %" PRIx32, opt);
+        error_setg(errp, "Invalid data length for option %" PRIx32,
+                   reply->option);
         break;

     case NBD_REP_ERR_TLS_REQD:
         error_setg(errp, "TLS negotiation required before option %" PRIx32,
-                   opt);
+                   reply->option);
         break;

     default:
         error_setg(errp, "Unknown error code when asking for option %" PRIx32,
-                   opt);
+                   reply->option);
         break;
     }

@@ -147,58 +206,29 @@ static int nbd_handle_reply_err(QIOChannel *ioc, uint32_t opt, uint32_t type,

 static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
 {
-    uint64_t magic;
-    uint32_t opt;
-    uint32_t type;
+    nbd_opt_reply reply;
     uint32_t len;
     uint32_t namelen;
     int error;

     *name = NULL;
-    if (read_sync(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
-        error_setg(errp, "failed to read list option magic");
+    if (nbd_receive_option_reply(ioc, NBD_OPT_LIST, &reply, errp) < 0) {
         return -1;
     }
-    magic = be64_to_cpu(magic);
-    if (magic != NBD_REP_MAGIC) {
-        error_setg(errp, "Unexpected option list magic");
-        return -1;
-    }
-    if (read_sync(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
-        error_setg(errp, "failed to read list option");
-        return -1;
-    }
-    opt = be32_to_cpu(opt);
-    if (opt != NBD_OPT_LIST) {
-        error_setg(errp, "Unexpected option type %" PRIx32 " expected %x",
-                   opt, NBD_OPT_LIST);
-        return -1;
-    }
-
-    if (read_sync(ioc, &type, sizeof(type)) != sizeof(type)) {
-        error_setg(errp, "failed to read list option type");
-        return -1;
-    }
-    type = be32_to_cpu(type);
-    error = nbd_handle_reply_err(ioc, opt, type, errp);
+    error = nbd_handle_reply_err(ioc, &reply, errp);
     if (error <= 0) {
         return error;
     }
+    len = reply.length;

-    if (read_sync(ioc, &len, sizeof(len)) != sizeof(len)) {
-        error_setg(errp, "failed to read option length");
-        return -1;
-    }
-    len = be32_to_cpu(len);
-
-    if (type == NBD_REP_ACK) {
+    if (reply.type == NBD_REP_ACK) {
         if (len != 0) {
             error_setg(errp, "length too long for option end");
             return -1;
         }
-    } else if (type == NBD_REP_SERVER) {
+    } else if (reply.type == NBD_REP_SERVER) {
         if (len < sizeof(namelen) || len > NBD_MAX_BUFFER_SIZE) {
-            error_setg(errp, "incorrect option length");
+            error_setg(errp, "incorrect option length %"PRIu32, len);
             return -1;
         }
         if (read_sync(ioc, &namelen, sizeof(namelen)) != sizeof(namelen)) {
@@ -240,7 +270,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
         }
     } else {
         error_setg(errp, "Unexpected reply type %" PRIx32 " expected %x",
-                   type, NBD_REP_SERVER);
+                   reply.type, NBD_REP_SERVER);
         return -1;
     }
     return 1;
@@ -251,24 +281,10 @@ static int nbd_receive_query_exports(QIOChannel *ioc,
                                      const char *wantname,
                                      Error **errp)
 {
-    uint64_t magic = cpu_to_be64(NBD_OPTS_MAGIC);
-    uint32_t opt = cpu_to_be32(NBD_OPT_LIST);
-    uint32_t length = 0;
     bool foundExport = false;

     TRACE("Querying export list");
-    if (write_sync(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
-        error_setg(errp, "Failed to send list option magic");
-        return -1;
-    }
-
-    if (write_sync(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
-        error_setg(errp, "Failed to send list option number");
-        return -1;
-    }
-
-    if (write_sync(ioc, &length, sizeof(length)) != sizeof(length)) {
-        error_setg(errp, "Failed to send list option length");
+    if (nbd_send_option_request(ioc, NBD_OPT_LIST, 0, NULL, errp) < 0) {
         return -1;
     }

@@ -314,72 +330,29 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc,
                                         QCryptoTLSCreds *tlscreds,
                                         const char *hostname, Error **errp)
 {
-    uint64_t magic = cpu_to_be64(NBD_OPTS_MAGIC);
-    uint32_t opt = cpu_to_be32(NBD_OPT_STARTTLS);
-    uint32_t length = 0;
-    uint32_t type;
+    nbd_opt_reply reply;
     QIOChannelTLS *tioc;
     struct NBDTLSHandshakeData data = { 0 };

     TRACE("Requesting TLS from server");
-    if (write_sync(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
-        error_setg(errp, "Failed to send option magic");
-        return NULL;
-    }
-
-    if (write_sync(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
-        error_setg(errp, "Failed to send option number");
-        return NULL;
-    }
-
-    if (write_sync(ioc, &length, sizeof(length)) != sizeof(length)) {
-        error_setg(errp, "Failed to send option length");
-        return NULL;
-    }
-
-    TRACE("Getting TLS reply from server1");
-    if (read_sync(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
-        error_setg(errp, "failed to read option magic");
-        return NULL;
-    }
-    magic = be64_to_cpu(magic);
-    if (magic != NBD_REP_MAGIC) {
-        error_setg(errp, "Unexpected option magic");
-        return NULL;
-    }
-    TRACE("Getting TLS reply from server2");
-    if (read_sync(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
-        error_setg(errp, "failed to read option");
-        return NULL;
-    }
-    opt = be32_to_cpu(opt);
-    if (opt != NBD_OPT_STARTTLS) {
-        error_setg(errp, "Unexpected option type %" PRIx32 " expected %x",
-                   opt, NBD_OPT_STARTTLS);
+    if (nbd_send_option_request(ioc, NBD_OPT_STARTTLS, 0, NULL, errp) < 0) {
         return NULL;
     }

     TRACE("Getting TLS reply from server");
-    if (read_sync(ioc, &type, sizeof(type)) != sizeof(type)) {
-        error_setg(errp, "failed to read option type");
+    if (nbd_receive_option_reply(ioc, NBD_OPT_STARTTLS, &reply, errp) < 0) {
         return NULL;
     }
-    type = be32_to_cpu(type);
-    if (type != NBD_REP_ACK) {
+
+    if (reply.type != NBD_REP_ACK) {
         error_setg(errp, "Server rejected request to start TLS %" PRIx32,
-                   type);
+                   reply.type);
         return NULL;
     }

-    TRACE("Getting TLS reply from server");
-    if (read_sync(ioc, &length, sizeof(length)) != sizeof(length)) {
-        error_setg(errp, "failed to read option length");
-        return NULL;
-    }
-    length = be32_to_cpu(length);
-    if (length != 0) {
+    if (reply.length != 0) {
         error_setg(errp, "Start TLS response was not zero %" PRIu32,
-                   length);
+                   reply.length);
         return NULL;
     }

@@ -466,8 +439,6 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,

     if (magic == NBD_OPTS_MAGIC) {
         uint32_t clientflags = 0;
-        uint32_t opt;
-        uint32_t namesize;
         uint16_t globalflags;
         bool fixedNewStyle = false;

@@ -517,28 +488,13 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
                 goto fail;
             }
         }
-        /* write the export name */
-        magic = cpu_to_be64(magic);
-        if (write_sync(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
-            error_setg(errp, "Failed to send export name magic");
-            goto fail;
-        }
-        opt = cpu_to_be32(NBD_OPT_EXPORT_NAME);
-        if (write_sync(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
-            error_setg(errp, "Failed to send export name option number");
-            goto fail;
-        }
-        namesize = cpu_to_be32(strlen(name));
-        if (write_sync(ioc, &namesize, sizeof(namesize)) !=
-            sizeof(namesize)) {
-            error_setg(errp, "Failed to send export name length");
-            goto fail;
-        }
-        if (write_sync(ioc, (char *)name, strlen(name)) != strlen(name)) {
-            error_setg(errp, "Failed to send export name");
+        /* write the export name request */
+        if (nbd_send_option_request(ioc, NBD_OPT_EXPORT_NAME, -1, name,
+                                    errp) < 0) {
             goto fail;
         }

+        /* Read the response */
         if (read_sync(ioc, &s, sizeof(s)) != sizeof(s)) {
             error_setg(errp, "Failed to read export length");
             goto fail;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 08/14] nbd: Let server know when client gives up negotiation
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (6 preceding siblings ...)
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 07/14] nbd: Share common option-sending code in client Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 09/14] nbd: Let client skip portions of server reply Eric Blake
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini

The NBD spec says that a client should send NBD_OPT_ABORT
rather than just dropping the connection, if the client doesn't
like something the server sent during option negotiation.  This
is a best-effort attempt only, and can only be done in places
where we know the server is still in sync with what we've sent,
whether or not we've read everything the server has sent.
Technically, the server then has to reply with NBD_REP_ACK, but
it's not worth complicating the client to wait around for that
reply.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v4: new patch
---
 nbd/client.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/nbd/client.c b/nbd/client.c
index 6e5288a..a78e81c 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -106,6 +106,19 @@ static int nbd_send_option_request(QIOChannel *ioc, uint32_t opt,
     return 0;
 }

+/* Send NBD_OPT_ABORT as a courtesy to let the server know that we are
+ * not going to attempt further negotiation. */
+static void nbd_send_opt_abort(QIOChannel *ioc)
+{
+    /* Technically, a compliant server is supposed to reply to us; but
+     * older servers disconnected instead. At any rate, we're allowed
+     * to disconnect without waiting for the server reply, so we don't
+     * even care if the request makes it to the server, let alone
+     * waiting around for whether the server replies. */
+    nbd_send_option_request(ioc, NBD_OPT_ABORT, 0, NULL, NULL);
+}
+
+
 /* Receive the header of an option reply, which should match the given
  * opt.  Read through the length field, but NOT the length bytes of
  * payload. Return 0 if successful, -1 with errp set if it is
@@ -116,6 +129,7 @@ static int nbd_receive_option_reply(QIOChannel *ioc, uint32_t opt,
     QEMU_BUILD_BUG_ON(sizeof(*reply) != 20);
     if (read_sync(ioc, reply, sizeof(*reply)) != sizeof(*reply)) {
         error_setg(errp, "failed to read option reply");
+        nbd_send_opt_abort(ioc);
         return -1;
     }
     be64_to_cpus(&reply->magic);
@@ -128,11 +142,13 @@ static int nbd_receive_option_reply(QIOChannel *ioc, uint32_t opt,

     if (reply->magic != NBD_REP_MAGIC) {
         error_setg(errp, "Unexpected option reply magic");
+        nbd_send_opt_abort(ioc);
         return -1;
     }
     if (reply->option != opt) {
         error_setg(errp, "Unexpected option type %x expected %x",
                    reply->option, opt);
+        nbd_send_opt_abort(ioc);
         return -1;
     }
     return 0;
@@ -201,6 +217,9 @@ static int nbd_handle_reply_err(QIOChannel *ioc, nbd_opt_reply *reply,

  cleanup:
     g_free(msg);
+    if (result < 0) {
+        nbd_send_opt_abort(ioc);
+    }
     return result;
 }

@@ -224,25 +243,30 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
     if (reply.type == NBD_REP_ACK) {
         if (len != 0) {
             error_setg(errp, "length too long for option end");
+            nbd_send_opt_abort(ioc);
             return -1;
         }
     } else if (reply.type == NBD_REP_SERVER) {
         if (len < sizeof(namelen) || len > NBD_MAX_BUFFER_SIZE) {
             error_setg(errp, "incorrect option length %"PRIu32, len);
+            nbd_send_opt_abort(ioc);
             return -1;
         }
         if (read_sync(ioc, &namelen, sizeof(namelen)) != sizeof(namelen)) {
             error_setg(errp, "failed to read option name length");
+            nbd_send_opt_abort(ioc);
             return -1;
         }
         namelen = be32_to_cpu(namelen);
         len -= sizeof(namelen);
         if (len < namelen) {
             error_setg(errp, "incorrect option name length");
+            nbd_send_opt_abort(ioc);
             return -1;
         }
         if (namelen > NBD_MAX_NAME_SIZE) {
             error_setg(errp, "export name length too long %" PRIu32, namelen);
+            nbd_send_opt_abort(ioc);
             return -1;
         }

@@ -251,6 +275,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
             error_setg(errp, "failed to read export name");
             g_free(*name);
             *name = NULL;
+            nbd_send_opt_abort(ioc);
             return -1;
         }
         (*name)[namelen] = '\0';
@@ -262,6 +287,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
                 g_free(*name);
                 g_free(buf);
                 *name = NULL;
+                nbd_send_opt_abort(ioc);
                 return -1;
             }
             buf[len] = '\0';
@@ -271,6 +297,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
     } else {
         error_setg(errp, "Unexpected reply type %" PRIx32 " expected %x",
                    reply.type, NBD_REP_SERVER);
+        nbd_send_opt_abort(ioc);
         return -1;
     }
     return 1;
@@ -320,6 +347,7 @@ static int nbd_receive_query_exports(QIOChannel *ioc,

     if (!foundExport) {
         error_setg(errp, "No export with name '%s' available", wantname);
+        nbd_send_opt_abort(ioc);
         return -1;
     }

@@ -347,12 +375,14 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc,
     if (reply.type != NBD_REP_ACK) {
         error_setg(errp, "Server rejected request to start TLS %" PRIx32,
                    reply.type);
+        nbd_send_opt_abort(ioc);
         return NULL;
     }

     if (reply.length != 0) {
         error_setg(errp, "Start TLS response was not zero %" PRIu32,
                    reply.length);
+        nbd_send_opt_abort(ioc);
         return NULL;
     }

-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 09/14] nbd: Let client skip portions of server reply
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (7 preceding siblings ...)
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 08/14] nbd: Let server know when client gives up negotiation Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 10/14] nbd: Less allocation during NBD_OPT_LIST Eric Blake
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini

The server has a nice helper function nbd_negotiate_drop_sync()
which lets it easily ignore fluff from the client (such as the
payload to an unknown option request).  We can't quite make it
common, since it depends on nbd_negotiate_read() which handles
coroutine magic, but we can copy the idea into the client where
we have places where we want to ignore data (such as the
description tacked on the end of NBD_REP_SERVER).

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v4: rebase
v3: rebase
---
 nbd/client.c | 47 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 14 deletions(-)

diff --git a/nbd/client.c b/nbd/client.c
index a78e81c..004dec6 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -75,6 +75,32 @@ static QTAILQ_HEAD(, NBDExport) exports = QTAILQ_HEAD_INITIALIZER(exports);

 */

+/* Discard length bytes from channel.  Return -errno on failure, or
+ * the amount of bytes consumed. */
+static ssize_t drop_sync(QIOChannel *ioc, size_t size)
+{
+    ssize_t ret, dropped = size;
+    char small[1024];
+    char *buffer;
+
+    buffer = sizeof(small) < size ? small : g_malloc(MIN(65536, size));
+    while (size > 0) {
+        ret = read_sync(ioc, buffer, MIN(65536, size));
+        if (ret < 0) {
+            goto cleanup;
+        }
+        assert(ret <= size);
+        size -= ret;
+    }
+    ret = dropped;
+
+ cleanup:
+    if (buffer != small) {
+        g_free(buffer);
+    }
+    return ret;
+}
+
 /* Send an option request. Return 0 if successful, -1 with errp set if
  * it is impossible to continue. */
 static int nbd_send_option_request(QIOChannel *ioc, uint32_t opt,
@@ -280,19 +306,12 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
         }
         (*name)[namelen] = '\0';
         len -= namelen;
-        if (len) {
-            char *buf = g_malloc(len + 1);
-            if (read_sync(ioc, buf, len) != len) {
-                error_setg(errp, "failed to read export description");
-                g_free(*name);
-                g_free(buf);
-                *name = NULL;
-                nbd_send_opt_abort(ioc);
-                return -1;
-            }
-            buf[len] = '\0';
-            TRACE("Ignoring export description: %s", buf);
-            g_free(buf);
+        if (drop_sync(ioc, len) != len) {
+            error_setg(errp, "failed to read export description");
+            g_free(*name);
+            *name = NULL;
+            nbd_send_opt_abort(ioc);
+            return -1;
         }
     } else {
         error_setg(errp, "Unexpected reply type %" PRIx32 " expected %x",
@@ -571,7 +590,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
     }

     TRACE("Size is %" PRIu64 ", export flags %" PRIx16, *size, *flags);
-    if (read_sync(ioc, &buf, 124) != 124) {
+    if (drop_sync(ioc, 124) != 124) {
         error_setg(errp, "Failed to read reserved block");
         goto fail;
     }
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 10/14] nbd: Less allocation during NBD_OPT_LIST
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (8 preceding siblings ...)
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 09/14] nbd: Let client skip portions of server reply Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 11/14] nbd: Support shorter handshake Eric Blake
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini

Since we know that the maximum name we are willing to accept
is small enough to stack-allocate, rework the iteration over
NBD_OPT_LIST responses to reuse a stack buffer rather than
allocating every time.  Furthermore, we don't even have to
allocate if we know the server's length doesn't match what
we are searching for.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v5: alter signature of nbd_receive_list for simpler logic
v4: rebase
v3: tweak commit message
---
 nbd/client.c | 145 +++++++++++++++++++++++++++++------------------------------
 1 file changed, 70 insertions(+), 75 deletions(-)

diff --git a/nbd/client.c b/nbd/client.c
index 004dec6..344ebcc 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -249,19 +249,28 @@ static int nbd_handle_reply_err(QIOChannel *ioc, nbd_opt_reply *reply,
     return result;
 }

-static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
+/* Process another portion of the NBD_OPT_LIST reply.  Set *@match if
+ * the current reply matches @want or if the server does not support
+ * NBD_OPT_LIST, otherwise leave @match alone.  Return 0 if iteration
+ * is complete, positive if more replies are expected, or negative
+ * with @errp set if an unrecoverable error occurred. */
+static int nbd_receive_list(QIOChannel *ioc, const char *want, bool *match,
+                            Error **errp)
 {
     nbd_opt_reply reply;
     uint32_t len;
     uint32_t namelen;
+    char name[NBD_MAX_NAME_SIZE + 1];
     int error;

-    *name = NULL;
     if (nbd_receive_option_reply(ioc, NBD_OPT_LIST, &reply, errp) < 0) {
         return -1;
     }
     error = nbd_handle_reply_err(ioc, &reply, errp);
     if (error <= 0) {
+        /* The server did not support NBD_OPT_LIST, so set *match on
+         * the assumption that any name will be accepted.  */
+        *match = true;
         return error;
     }
     len = reply.length;
@@ -272,105 +281,91 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
             nbd_send_opt_abort(ioc);
             return -1;
         }
-    } else if (reply.type == NBD_REP_SERVER) {
-        if (len < sizeof(namelen) || len > NBD_MAX_BUFFER_SIZE) {
-            error_setg(errp, "incorrect option length %"PRIu32, len);
-            nbd_send_opt_abort(ioc);
-            return -1;
-        }
-        if (read_sync(ioc, &namelen, sizeof(namelen)) != sizeof(namelen)) {
-            error_setg(errp, "failed to read option name length");
-            nbd_send_opt_abort(ioc);
-            return -1;
-        }
-        namelen = be32_to_cpu(namelen);
-        len -= sizeof(namelen);
-        if (len < namelen) {
-            error_setg(errp, "incorrect option name length");
-            nbd_send_opt_abort(ioc);
-            return -1;
-        }
-        if (namelen > NBD_MAX_NAME_SIZE) {
-            error_setg(errp, "export name length too long %" PRIu32, namelen);
-            nbd_send_opt_abort(ioc);
-            return -1;
-        }
-
-        *name = g_new0(char, namelen + 1);
-        if (read_sync(ioc, *name, namelen) != namelen) {
-            error_setg(errp, "failed to read export name");
-            g_free(*name);
-            *name = NULL;
-            nbd_send_opt_abort(ioc);
-            return -1;
-        }
-        (*name)[namelen] = '\0';
-        len -= namelen;
-        if (drop_sync(ioc, len) != len) {
-            error_setg(errp, "failed to read export description");
-            g_free(*name);
-            *name = NULL;
-            nbd_send_opt_abort(ioc);
-            return -1;
-        }
-    } else {
+        return 0;
+    } else if (reply.type != NBD_REP_SERVER) {
         error_setg(errp, "Unexpected reply type %" PRIx32 " expected %x",
                    reply.type, NBD_REP_SERVER);
         nbd_send_opt_abort(ioc);
         return -1;
     }
+
+    if (len < sizeof(namelen) || len > NBD_MAX_BUFFER_SIZE) {
+        error_setg(errp, "incorrect option length %"PRIu32, len);
+        nbd_send_opt_abort(ioc);
+        return -1;
+    }
+    if (read_sync(ioc, &namelen, sizeof(namelen)) != sizeof(namelen)) {
+        error_setg(errp, "failed to read option name length");
+        nbd_send_opt_abort(ioc);
+        return -1;
+    }
+    namelen = be32_to_cpu(namelen);
+    len -= sizeof(namelen);
+    if (len < namelen) {
+        error_setg(errp, "incorrect option name length");
+        nbd_send_opt_abort(ioc);
+        return -1;
+    }
+    if (namelen != strlen(want)) {
+        if (drop_sync(ioc, len) != len) {
+            error_setg(errp, "failed to skip export name with wrong length");
+            nbd_send_opt_abort(ioc);
+            return -1;
+        }
+        return 1;
+    }
+
+    assert(namelen < sizeof(name));
+    if (read_sync(ioc, name, namelen) != namelen) {
+        error_setg(errp, "failed to read export name");
+        nbd_send_opt_abort(ioc);
+        return -1;
+    }
+    name[namelen] = '\0';
+    len -= namelen;
+    if (drop_sync(ioc, len) != len) {
+        error_setg(errp, "failed to read export description");
+        nbd_send_opt_abort(ioc);
+        return -1;
+    }
+    if (!strcmp(name, want)) {
+        *match = true;
+    }
     return 1;
 }


+/* Return -1 on failure, 0 if wantname is an available export. */
 static int nbd_receive_query_exports(QIOChannel *ioc,
                                      const char *wantname,
                                      Error **errp)
 {
     bool foundExport = false;

-    TRACE("Querying export list");
+    TRACE("Querying export list for '%s'", wantname);
     if (nbd_send_option_request(ioc, NBD_OPT_LIST, 0, NULL, errp) < 0) {
         return -1;
     }

     TRACE("Reading available export names");
     while (1) {
-        char *name = NULL;
-        int ret = nbd_receive_list(ioc, &name, errp);
+        int ret = nbd_receive_list(ioc, wantname, &foundExport, errp);

         if (ret < 0) {
-            g_free(name);
-            name = NULL;
+            /* Server gave unexpected reply */
             return -1;
+        } else if (ret == 0) {
+            /* Done iterating. */
+            if (!foundExport) {
+                error_setg(errp, "No export with name '%s' available",
+                           wantname);
+                nbd_send_opt_abort(ioc);
+                return -1;
+            }
+            TRACE("Found desired export name '%s'", wantname);
+            return 0;
         }
-        if (ret == 0) {
-            /* Server doesn't support export listing, so
-             * we will just assume an export with our
-             * wanted name exists */
-            foundExport = true;
-            break;
-        }
-        if (name == NULL) {
-            TRACE("End of export name list");
-            break;
-        }
-        if (g_str_equal(name, wantname)) {
-            foundExport = true;
-            TRACE("Found desired export name '%s'", name);
-        } else {
-            TRACE("Ignored export name '%s'", name);
-        }
-        g_free(name);
-    }
-
-    if (!foundExport) {
-        error_setg(errp, "No export with name '%s' available", wantname);
-        nbd_send_opt_abort(ioc);
-        return -1;
     }
-
-    return 0;
 }

 static QIOChannel *nbd_receive_starttls(QIOChannel *ioc,
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 11/14] nbd: Support shorter handshake
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (9 preceding siblings ...)
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 10/14] nbd: Less allocation during NBD_OPT_LIST Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 12/14] nbd: Improve server handling of shutdown requests Eric Blake
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini, Kevin Wolf, Max Reitz

The NBD Protocol allows the server and client to mutually agree
on a shorter handshake (omit the 124 bytes of reserved 0), via
the server advertising NBD_FLAG_NO_ZEROES and the client
acknowledging with NBD_FLAG_C_NO_ZEROES (only possible in
newstyle, whether or not it is fixed newstyle).  It doesn't
shave much off the wire, but we might as well implement it.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alex Bligh <alex@alex.org.uk>

---
v4: rebase
v3: rebase
---
 include/block/nbd.h |  6 ++++--
 nbd/client.c        |  8 +++++++-
 nbd/server.c        | 15 +++++++++++----
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 7c5103f..6c1d9f4 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -72,11 +72,13 @@ struct nbd_reply {

 /* New-style handshake (global) flags, sent from server to client, and
    control what will happen during handshake phase. */
-#define NBD_FLAG_FIXED_NEWSTYLE     (1 << 0)    /* Fixed newstyle protocol. */
+#define NBD_FLAG_FIXED_NEWSTYLE   (1 << 0) /* Fixed newstyle protocol. */
+#define NBD_FLAG_NO_ZEROES        (1 << 1) /* End handshake without zeroes. */

 /* New-style client flags, sent from client to server to control what happens
    during handshake phase. */
-#define NBD_FLAG_C_FIXED_NEWSTYLE   (1 << 0)    /* Fixed newstyle protocol. */
+#define NBD_FLAG_C_FIXED_NEWSTYLE (1 << 0) /* Fixed newstyle protocol. */
+#define NBD_FLAG_C_NO_ZEROES      (1 << 1) /* End handshake without zeroes. */

 /* Reply types. */
 #define NBD_REP_ACK             (1)             /* Data sending finished. */
diff --git a/nbd/client.c b/nbd/client.c
index 344ebcc..123165b 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -434,6 +434,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
     char buf[256];
     uint64_t magic, s;
     int rc;
+    bool zeroes = true;

     TRACE("Receiving negotiation tlscreds=%p hostname=%s.",
           tlscreds, hostname ? hostname : "<null>");
@@ -498,6 +499,11 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
             TRACE("Server supports fixed new style");
             clientflags |= NBD_FLAG_C_FIXED_NEWSTYLE;
         }
+        if (globalflags & NBD_FLAG_NO_ZEROES) {
+            zeroes = false;
+            TRACE("Server supports no zeroes");
+            clientflags |= NBD_FLAG_C_NO_ZEROES;
+        }
         /* client requested flags */
         clientflags = cpu_to_be32(clientflags);
         if (write_sync(ioc, &clientflags, sizeof(clientflags)) !=
@@ -585,7 +591,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
     }

     TRACE("Size is %" PRIu64 ", export flags %" PRIx16, *size, *flags);
-    if (drop_sync(ioc, 124) != 124) {
+    if (zeroes && drop_sync(ioc, 124) != 124) {
         error_setg(errp, "Failed to read reserved block");
         goto fail;
     }
diff --git a/nbd/server.c b/nbd/server.c
index ad31c4a..d0ccff0 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -80,6 +80,7 @@ struct NBDClient {
     int refcount;
     void (*close)(NBDClient *client);

+    bool no_zeroes;
     NBDExport *exp;
     QCryptoTLSCreds *tlscreds;
     char *tlsaclname;
@@ -444,6 +445,11 @@ static int nbd_negotiate_options(NBDClient *client)
         fixedNewstyle = true;
         flags &= ~NBD_FLAG_C_FIXED_NEWSTYLE;
     }
+    if (flags & NBD_FLAG_C_NO_ZEROES) {
+        TRACE("Client supports no zeroes at handshake end");
+        client->no_zeroes = true;
+        flags &= ~NBD_FLAG_C_NO_ZEROES;
+    }
     if (flags != 0) {
         TRACE("Unknown client flags 0x%" PRIx32 " received", flags);
         return -EIO;
@@ -596,6 +602,7 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
     const uint16_t myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM |
                               NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA);
     bool oldStyle;
+    size_t len;

     /* Old style negotiation header without options
         [ 0 ..   7]   passwd       ("NBDMAGIC")
@@ -612,7 +619,7 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
         ....options sent....
         [18 ..  25]   size
         [26 ..  27]   export flags
-        [28 .. 151]   reserved     (0)
+        [28 .. 151]   reserved     (0, omit if no_zeroes)
      */

     qio_channel_set_blocking(client->ioc, false, NULL);
@@ -631,7 +638,7 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
         stw_be_p(buf + 26, client->exp->nbdflags | myflags);
     } else {
         stq_be_p(buf + 8, NBD_OPTS_MAGIC);
-        stw_be_p(buf + 16, NBD_FLAG_FIXED_NEWSTYLE);
+        stw_be_p(buf + 16, NBD_FLAG_FIXED_NEWSTYLE | NBD_FLAG_NO_ZEROES);
     }

     if (oldStyle) {
@@ -658,8 +665,8 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
               client->exp->size, client->exp->nbdflags | myflags);
         stq_be_p(buf + 18, client->exp->size);
         stw_be_p(buf + 26, client->exp->nbdflags | myflags);
-        if (nbd_negotiate_write(client->ioc, buf + 18, sizeof(buf) - 18) !=
-            sizeof(buf) - 18) {
+        len = client->no_zeroes ? 10 : sizeof(buf) - 18;
+        if (nbd_negotiate_write(client->ioc, buf + 18, len) != len) {
             LOG("write failed");
             goto fail;
         }
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 12/14] nbd: Improve server handling of shutdown requests
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (10 preceding siblings ...)
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 11/14] nbd: Support shorter handshake Eric Blake
@ 2016-07-19  4:07 ` Eric Blake
  2016-07-19  4:08 ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Eric Blake
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini, Kevin Wolf, Max Reitz

NBD commit 6d34500b clarified how clients and servers are supposed
to behave before closing a connection. It added NBD_REP_ERR_SHUTDOWN
(for the server to announce it is about to go away during option
haggling, so the client should quit sending NBD_OPT_* other than
NBD_OPT_ABORT) and ESHUTDOWN (for the server to announce it is about
to go away during transmission, so the client should quit sending
NBD_CMD_* other than NBD_CMD_DISC).  It also clarified that
NBD_OPT_ABORT gets a reply, while NBD_CMD_DISC does not.

This patch merely adds the missing reply to NBD_OPT_ABORT and teaches
the client to recognize server errors.  Actually teaching the server
to send NBD_REP_ERR_SHUTDOWN or ESHUTDOWN would require knowing that
the server has been requested to shut down soon (maybe we could do
that by installing a SIGINT handler in qemu-nbd, which transitions
from RUNNING to a new state that waits for the client to react,
rather than just out-right quitting - but that's a bigger task for
another day).

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v4: new patch
---
 include/block/nbd.h | 13 +++++++++----
 nbd/nbd-internal.h  |  1 +
 nbd/client.c        | 16 ++++++++++++++++
 nbd/server.c        | 10 ++++++++++
 4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 6c1d9f4..fc4426c 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -81,12 +81,17 @@ struct nbd_reply {
 #define NBD_FLAG_C_NO_ZEROES      (1 << 1) /* End handshake without zeroes. */

 /* Reply types. */
+#define NBD_REP_ERR(value) ((UINT32_C(1) << 31) | (value))
+
 #define NBD_REP_ACK             (1)             /* Data sending finished. */
 #define NBD_REP_SERVER          (2)             /* Export description. */
-#define NBD_REP_ERR_UNSUP       ((UINT32_C(1) << 31) | 1) /* Unknown option. */
-#define NBD_REP_ERR_POLICY      ((UINT32_C(1) << 31) | 2) /* Server denied */
-#define NBD_REP_ERR_INVALID     ((UINT32_C(1) << 31) | 3) /* Invalid length. */
-#define NBD_REP_ERR_TLS_REQD    ((UINT32_C(1) << 31) | 5) /* TLS required */
+
+#define NBD_REP_ERR_UNSUP       NBD_REP_ERR(1)  /* Unknown option */
+#define NBD_REP_ERR_POLICY      NBD_REP_ERR(2)  /* Server denied */
+#define NBD_REP_ERR_INVALID     NBD_REP_ERR(3)  /* Invalid length */
+#define NBD_REP_ERR_PLATFORM    NBD_REP_ERR(4)  /* Not compiled in */
+#define NBD_REP_ERR_TLS_REQD    NBD_REP_ERR(5)  /* TLS required */
+#define NBD_REP_ERR_SHUTDOWN    NBD_REP_ERR(7)  /* Server shutting down */

 /* Request flags, sent from client to server during transmission phase */
 #define NBD_CMD_FLAG_FUA        (1 << 0)
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index dd57e18..eee20ab 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -92,6 +92,7 @@
 #define NBD_ENOMEM     12
 #define NBD_EINVAL     22
 #define NBD_ENOSPC     28
+#define NBD_ESHUTDOWN  108

 static inline ssize_t read_sync(QIOChannel *ioc, void *buffer, size_t size)
 {
diff --git a/nbd/client.c b/nbd/client.c
index 123165b..4fb077d 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -34,6 +34,8 @@ static int nbd_errno_to_system_errno(int err)
         return ENOMEM;
     case NBD_ENOSPC:
         return ENOSPC;
+    case NBD_ESHUTDOWN:
+        return ESHUTDOWN;
     default:
         TRACE("Squashing unexpected error %d to EINVAL", err);
         /* fallthrough */
@@ -226,11 +228,21 @@ static int nbd_handle_reply_err(QIOChannel *ioc, nbd_opt_reply *reply,
                    reply->option);
         break;

+    case NBD_REP_ERR_PLATFORM:
+        error_setg(errp, "Server lacks support for option %" PRIx32,
+                   reply->option);
+        break;
+
     case NBD_REP_ERR_TLS_REQD:
         error_setg(errp, "TLS negotiation required before option %" PRIx32,
                    reply->option);
         break;

+    case NBD_REP_ERR_SHUTDOWN:
+        error_setg(errp, "Server shutting down before option %" PRIx32,
+                   reply->option);
+        break;
+
     default:
         error_setg(errp, "Unknown error code when asking for option %" PRIx32,
                    reply->option);
@@ -779,6 +791,10 @@ ssize_t nbd_receive_reply(QIOChannel *ioc, struct nbd_reply *reply)
         LOG("invalid magic (got 0x%" PRIx32 ")", magic);
         return -EINVAL;
     }
+    if (reply->error == ESHUTDOWN) {
+        LOG("server shutting down");
+        return -EINVAL;
+    }
     return 0;
 }

diff --git a/nbd/server.c b/nbd/server.c
index d0ccff0..689636c 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -39,6 +39,8 @@ static int system_errno_to_nbd_errno(int err)
     case EFBIG:
     case ENOSPC:
         return NBD_ENOSPC;
+    case ESHUTDOWN:
+        return NBD_ESHUTDOWN;
     case EINVAL:
     default:
         return NBD_EINVAL;
@@ -521,6 +523,10 @@ static int nbd_negotiate_options(NBDClient *client)
                 if (ret < 0) {
                     return ret;
                 }
+                /* Let the client keep trying, unless they asked to quit */
+                if (clientflags == NBD_OPT_ABORT) {
+                    return -EINVAL;
+                }
                 break;
             }
         } else if (fixedNewstyle) {
@@ -533,6 +539,10 @@ static int nbd_negotiate_options(NBDClient *client)
                 break;

             case NBD_OPT_ABORT:
+                /* NBD spec says we must try to reply before
+                 * disconnecting, but that we must also tolerate
+                 * guests that don't wait for our reply. */
+                nbd_negotiate_send_rep(client->ioc, NBD_REP_ACK, clientflags);
                 return -EINVAL;

             case NBD_OPT_EXPORT_NAME:
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (11 preceding siblings ...)
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 12/14] nbd: Improve server handling of shutdown requests Eric Blake
@ 2016-07-19  4:08 ` Eric Blake
  2016-07-19  6:21   ` Fam Zheng
  2016-08-18 13:50   ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Vladimir Sementsov-Ogievskiy
  2016-07-19  4:08 ` [Qemu-devel] [PATCH v5 14/14] nbd: Implement NBD_CMD_WRITE_ZEROES on client Eric Blake
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:08 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini, Kevin Wolf, Max Reitz

Upstream NBD protocol recently added the ability to efficiently
write zeroes without having to send the zeroes over the wire,
along with a flag to control whether the client wants a hole.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v4: rebase, fix value for constant
v3: abandon NBD_CMD_CLOSE extension, rebase to use blk_pwrite_zeroes
---
 include/block/nbd.h |  8 ++++++--
 nbd/server.c        | 42 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index fc4426c..e23ef73 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -69,6 +69,7 @@ struct nbd_reply {
 #define NBD_FLAG_SEND_FUA       (1 << 3)        /* Send FUA (Force Unit Access) */
 #define NBD_FLAG_ROTATIONAL     (1 << 4)        /* Use elevator algorithm - rotational media */
 #define NBD_FLAG_SEND_TRIM      (1 << 5)        /* Send TRIM (discard) */
+#define NBD_FLAG_SEND_WRITE_ZEROES (1 << 6)     /* Send WRITE_ZEROES */

 /* New-style handshake (global) flags, sent from server to client, and
    control what will happen during handshake phase. */
@@ -94,7 +95,8 @@ struct nbd_reply {
 #define NBD_REP_ERR_SHUTDOWN    NBD_REP_ERR(7)  /* Server shutting down */

 /* Request flags, sent from client to server during transmission phase */
-#define NBD_CMD_FLAG_FUA        (1 << 0)
+#define NBD_CMD_FLAG_FUA        (1 << 0) /* 'force unit access' during write */
+#define NBD_CMD_FLAG_NO_HOLE    (1 << 1) /* don't punch hole on zero run */

 /* Supported request types */
 enum {
@@ -102,7 +104,9 @@ enum {
     NBD_CMD_WRITE = 1,
     NBD_CMD_DISC = 2,
     NBD_CMD_FLUSH = 3,
-    NBD_CMD_TRIM = 4
+    NBD_CMD_TRIM = 4,
+    /* 5 reserved for failed experiment NBD_CMD_CACHE */
+    NBD_CMD_WRITE_ZEROES = 6,
 };

 #define NBD_DEFAULT_PORT	10809
diff --git a/nbd/server.c b/nbd/server.c
index 689636c..3a2fecb 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -610,7 +610,8 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
     char buf[8 + 8 + 8 + 128];
     int rc;
     const uint16_t myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM |
-                              NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA);
+                              NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA |
+                              NBD_FLAG_SEND_WRITE_ZEROES);
     bool oldStyle;
     size_t len;

@@ -1126,11 +1127,17 @@ static ssize_t nbd_co_receive_request(NBDRequest *req,
         rc = request->type == NBD_CMD_WRITE ? -ENOSPC : -EINVAL;
         goto out;
     }
-    if (request->flags & ~NBD_CMD_FLAG_FUA) {
+    if (request->flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_NO_HOLE)) {
         LOG("unsupported flags (got 0x%x)", request->flags);
         rc = -EINVAL;
         goto out;
     }
+    if (request->type != NBD_CMD_WRITE_ZEROES &&
+        (request->flags & NBD_CMD_FLAG_NO_HOLE)) {
+        LOG("unexpected flags (got 0x%x)", request->flags);
+        rc = -EINVAL;
+        goto out;
+    }

     rc = 0;

@@ -1235,6 +1242,37 @@ static void nbd_trip(void *opaque)
         }
         break;

+    case NBD_CMD_WRITE_ZEROES:
+        TRACE("Request type is WRITE_ZEROES");
+
+        if (exp->nbdflags & NBD_FLAG_READ_ONLY) {
+            TRACE("Server is read-only, return error");
+            reply.error = EROFS;
+            goto error_reply;
+        }
+
+        TRACE("Writing to device");
+
+        flags = 0;
+        if (request.flags & NBD_CMD_FLAG_FUA) {
+            flags |= BDRV_REQ_FUA;
+        }
+        if (!(request.flags & NBD_CMD_FLAG_NO_HOLE)) {
+            flags |= BDRV_REQ_MAY_UNMAP;
+        }
+        ret = blk_pwrite_zeroes(exp->blk, request.from + exp->dev_offset,
+                                request.len, flags);
+        if (ret < 0) {
+            LOG("writing to file failed");
+            reply.error = -ret;
+            goto error_reply;
+        }
+
+        if (nbd_co_send_reply(req, &reply, 0) < 0) {
+            goto out;
+        }
+        break;
+
     case NBD_CMD_DISC:
         /* unreachable, thanks to special case in nbd_co_receive_request() */
         abort();
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Qemu-devel] [PATCH v5 14/14] nbd: Implement NBD_CMD_WRITE_ZEROES on client
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (12 preceding siblings ...)
  2016-07-19  4:08 ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Eric Blake
@ 2016-07-19  4:08 ` Eric Blake
  2016-07-19  6:24   ` Fam Zheng
  2016-07-19  6:33 ` [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Fam Zheng
  2016-07-19  8:53 ` Paolo Bonzini
  15 siblings, 1 reply; 55+ messages in thread
From: Eric Blake @ 2016-07-19  4:08 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, pbonzini, Kevin Wolf, Max Reitz

Upstream NBD protocol recently added the ability to efficiently
write zeroes without having to send the zeroes over the wire,
along with a flag to control whether the client wants a hole.

The generic block code takes care of falling back to the obvious
write of lots of zeroes if we return -ENOTSUP because the server
does not have WRITE_ZEROES.

Ideally, since NBD_CMD_WRITE_ZEROES does not involve any data
over the wire, we want to support transactions that are much
larger than the normal 32M limit imposed on NBD_CMD_WRITE.  But
the server may still have a limit smaller than UINT_MAX, so
until experimental NBD protocol additions for advertising various
command sizes is finalized (see [1], [2]), for now we just stick to
the same limits as normal writes.

[1] https://github.com/yoe/nbd/blob/extension-info/doc/proto.md
[2] https://sourceforge.net/p/nbd/mailman/message/35081223/

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v5: enhance commit message
v4: rebase to byte-based limits
v3: rebase, tell block layer about our support
---
 block/nbd-client.h |  2 ++
 block/nbd-client.c | 35 +++++++++++++++++++++++++++++++++++
 block/nbd.c        |  4 ++++
 3 files changed, 41 insertions(+)

diff --git a/block/nbd-client.h b/block/nbd-client.h
index 044aca4..2cfe377 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -48,6 +48,8 @@ int nbd_client_co_pdiscard(BlockDriverState *bs, int64_t offset, int count);
 int nbd_client_co_flush(BlockDriverState *bs);
 int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
                           uint64_t bytes, QEMUIOVector *qiov, int flags);
+int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+                                int count, BdrvRequestFlags flags);
 int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
                          uint64_t bytes, QEMUIOVector *qiov, int flags);

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 7e9c3ec..104ba2f 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -275,6 +275,41 @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
     return -reply.error;
 }

+int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+                                int count, BdrvRequestFlags flags)
+{
+    ssize_t ret;
+    NbdClientSession *client = nbd_get_client_session(bs);
+    struct nbd_request request = {
+        .type = NBD_CMD_WRITE_ZEROES,
+        .from = offset,
+        .len = count,
+    };
+    struct nbd_reply reply;
+
+    if (!(client->nbdflags & NBD_FLAG_SEND_WRITE_ZEROES)) {
+        return -ENOTSUP;
+    }
+
+    if (flags & BDRV_REQ_FUA) {
+        assert(client->nbdflags & NBD_FLAG_SEND_FUA);
+        request.flags |= NBD_CMD_FLAG_FUA;
+    }
+    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+        request.flags |= NBD_CMD_FLAG_NO_HOLE;
+    }
+
+    nbd_coroutine_start(client, &request);
+    ret = nbd_co_send_request(bs, &request, NULL);
+    if (ret < 0) {
+        reply.error = -ret;
+    } else {
+        nbd_co_receive_reply(client, &request, &reply, NULL);
+    }
+    nbd_coroutine_end(client, &request);
+    return -reply.error;
+}
+
 int nbd_client_co_flush(BlockDriverState *bs)
 {
     NbdClientSession *client = nbd_get_client_session(bs);
diff --git a/block/nbd.c b/block/nbd.c
index 8d57220..049d1bd 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -357,6 +357,7 @@ static int nbd_co_flush(BlockDriverState *bs)
 static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
 {
     bs->bl.max_pdiscard = NBD_MAX_BUFFER_SIZE;
+    bs->bl.max_pwrite_zeroes = NBD_MAX_BUFFER_SIZE;
     bs->bl.max_transfer = NBD_MAX_BUFFER_SIZE;
 }

@@ -440,6 +441,7 @@ static BlockDriver bdrv_nbd = {
     .bdrv_file_open             = nbd_open,
     .bdrv_co_preadv             = nbd_client_co_preadv,
     .bdrv_co_pwritev            = nbd_client_co_pwritev,
+    .bdrv_co_pwrite_zeroes      = nbd_client_co_pwrite_zeroes,
     .bdrv_close                 = nbd_close,
     .bdrv_co_flush_to_os        = nbd_co_flush,
     .bdrv_co_pdiscard           = nbd_client_co_pdiscard,
@@ -458,6 +460,7 @@ static BlockDriver bdrv_nbd_tcp = {
     .bdrv_file_open             = nbd_open,
     .bdrv_co_preadv             = nbd_client_co_preadv,
     .bdrv_co_pwritev            = nbd_client_co_pwritev,
+    .bdrv_co_pwrite_zeroes      = nbd_client_co_pwrite_zeroes,
     .bdrv_close                 = nbd_close,
     .bdrv_co_flush_to_os        = nbd_co_flush,
     .bdrv_co_pdiscard           = nbd_client_co_pdiscard,
@@ -476,6 +479,7 @@ static BlockDriver bdrv_nbd_unix = {
     .bdrv_file_open             = nbd_open,
     .bdrv_co_preadv             = nbd_client_co_preadv,
     .bdrv_co_pwritev            = nbd_client_co_pwritev,
+    .bdrv_co_pwrite_zeroes      = nbd_client_co_pwrite_zeroes,
     .bdrv_close                 = nbd_close,
     .bdrv_co_flush_to_os        = nbd_co_flush,
     .bdrv_co_pdiscard           = nbd_client_co_pdiscard,
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/14] nbd: Share common reply-sending code in server
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 05/14] nbd: Share common reply-sending code in server Eric Blake
@ 2016-07-19  5:10   ` Fam Zheng
  2016-07-19 14:52     ` Eric Blake
  0 siblings, 1 reply; 55+ messages in thread
From: Fam Zheng @ 2016-07-19  5:10 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, pbonzini, qemu-block

On Mon, 07/18 22:07, Eric Blake wrote:
> Rather than open-coding NBD_REP_SERVER, reuse the code we
> already have by adding a length parameter.  Additionally,
> the refactoring will make adding NBD_OPT_GO in a later patch
> easier.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> v4: no change
> v3: rebase to changes earlier in series
> ---
>  nbd/server.c | 48 +++++++++++++++++++++++-------------------------
>  1 file changed, 23 insertions(+), 25 deletions(-)
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index 85c4f5d..c8716f1 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -195,12 +195,15 @@ static ssize_t nbd_negotiate_drop_sync(QIOChannel *ioc, size_t size)
> 
>  */
> 
> -static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
> +/* Send a reply header, including length, but no payload.
> + * Return -errno to kill connection, 0 to continue negotiation. */

Not a show stopper but I'm not sure documenting the control logic of the
outermost caller a few layers away is a good idea, the same question applies to
functions below as well.

> +static int nbd_negotiate_send_rep_len(QIOChannel *ioc, uint32_t type,
> +                                      uint32_t opt, uint32_t len)
>  {
>      uint64_t magic;
> -    uint32_t len;
> 
> -    TRACE("Reply opt=%" PRIx32 " type=%" PRIx32, type, opt);
> +    TRACE("Reply opt=%" PRIx32 " type=%" PRIx32 " len=%" PRIu32,
> +          type, opt, len);
> 
>      magic = cpu_to_be64(NBD_REP_MAGIC);
>      if (nbd_negotiate_write(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
> @@ -217,7 +220,7 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
>          LOG("write failed (rep type)");
>          return -EINVAL;
>      }
> -    len = cpu_to_be32(0);
> +    len = cpu_to_be32(len);
>      if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
>          LOG("write failed (rep data length)");
>          return -EINVAL;
> @@ -225,37 +228,32 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
>      return 0;
>  }
> 
> +/* Send a reply header with default 0 length.
> + * Return -errno to kill connection, 0 to continue negotiation. */
> +static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
> +{
> +    return nbd_negotiate_send_rep_len(ioc, type, opt, 0);
> +}
> +
> +/* Send an NBD_REP_SERVER reply to NBD_OPT_LIST, including payload.
> + * Return -errno to kill connection, 0 to continue negotiation. */
>  static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
>  {
> -    uint64_t magic;
>      size_t name_len, desc_len;
> -    uint32_t opt, type, len;
> +    uint32_t len;
>      const char *name = exp->name ? exp->name : "";
>      const char *desc = exp->description ? exp->description : "";
> +    int rc;
> 
>      TRACE("Advertising export name '%s' description '%s'", name, desc);
>      name_len = strlen(name);
>      desc_len = strlen(desc);
> -    magic = cpu_to_be64(NBD_REP_MAGIC);
> -    if (nbd_negotiate_write(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
> -        LOG("write failed (magic)");
> -        return -EINVAL;
> -     }
> -    opt = cpu_to_be32(NBD_OPT_LIST);
> -    if (nbd_negotiate_write(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
> -        LOG("write failed (opt)");
> -        return -EINVAL;
> -    }
> -    type = cpu_to_be32(NBD_REP_SERVER);
> -    if (nbd_negotiate_write(ioc, &type, sizeof(type)) != sizeof(type)) {
> -        LOG("write failed (reply type)");
> -        return -EINVAL;
> -    }
> -    len = cpu_to_be32(name_len + desc_len + sizeof(len));
> -    if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
> -        LOG("write failed (length)");
> -        return -EINVAL;
> +    len = name_len + desc_len + sizeof(len);
> +    rc = nbd_negotiate_send_rep_len(ioc, NBD_REP_SERVER, NBD_OPT_LIST, len);
> +    if (rc < 0) {
> +        return rc;
>      }
> +
>      len = cpu_to_be32(name_len);
>      if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
>          LOG("write failed (name length)");
> -- 
> 2.5.5
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 06/14] nbd: Send message along with server NBD_REP_ERR errors
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 06/14] nbd: Send message along with server NBD_REP_ERR errors Eric Blake
@ 2016-07-19  5:15   ` Fam Zheng
  2016-10-11 15:12     ` Eric Blake
  0 siblings, 1 reply; 55+ messages in thread
From: Fam Zheng @ 2016-07-19  5:15 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, pbonzini, qemu-block

On Mon, 07/18 22:07, Eric Blake wrote:
>  nbd/server.c | 78 +++++++++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 59 insertions(+), 19 deletions(-)
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index c8716f1..ad31c4a 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -235,6 +235,38 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
>      return nbd_negotiate_send_rep_len(ioc, type, opt, 0);
>  }
> 
> +/* Send an error reply.
> + * Return -errno to kill connection, 0 to continue negotiation. */
> +static int GCC_FMT_ATTR(4, 5)
> +    nbd_negotiate_send_rep_err(QIOChannel *ioc, uint32_t type,

Isn't the function name supposed to be place at col 0?

> +                               uint32_t opt, const char *fmt, ...)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/14] nbd: Share common option-sending code in client
  2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 07/14] nbd: Share common option-sending code in client Eric Blake
@ 2016-07-19  5:31   ` Fam Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Fam Zheng @ 2016-07-19  5:31 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, Kevin Wolf, pbonzini, qemu-block, Max Reitz

On Mon, 07/18 22:07, Eric Blake wrote:
> Rather than open-coding each option request, it's easier to
> have common helper functions do the work.  That in turn requires
> having convenient packed types for handling option requests
> and replies.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> v4: rebase
> v3: rebase, tweak a debug message
> ---
>  include/block/nbd.h |  25 +++++-
>  nbd/nbd-internal.h  |   2 +-
>  nbd/client.c        | 250 ++++++++++++++++++++++------------------------------
>  3 files changed, 126 insertions(+), 151 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 6904eb4..7c5103f 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -26,15 +26,34 @@
>  #include "io/channel-socket.h"
>  #include "crypto/tlscreds.h"
> 
> -/* Note: these are _NOT_ the same as the network representation of an NBD
> +/* Handshake phase structs - this struct is passed on the wire */
> +
> +struct nbd_option {
> +    uint64_t magic; /* NBD_OPTS_MAGIC */
> +    uint32_t option; /* NBD_OPT_* */
> +    uint32_t length;
> +} QEMU_PACKED;
> +typedef struct nbd_option nbd_option;
> +
> +struct nbd_opt_reply {
> +    uint64_t magic; /* NBD_REP_MAGIC */
> +    uint32_t option; /* NBD_OPT_* */
> +    uint32_t type; /* NBD_REP_* */
> +    uint32_t length;
> +} QEMU_PACKED;
> +typedef struct nbd_opt_reply nbd_opt_reply;

Maybe someday in the future we should rename nbd type names to CamelCase?

> +
> +/* Transmission phase structs
> + *
> + * Note: these are _NOT_ the same as the network representation of an NBD
>   * request and reply!
>   */
>  struct nbd_request {
>      uint64_t handle;
>      uint64_t from;
>      uint32_t len;
> -    uint16_t flags;
> -    uint16_t type;
> +    uint16_t flags; /* NBD_CMD_FLAG_* */
> +    uint16_t type; /* NBD_CMD_* */
>  };
> 
>  struct nbd_reply {
> diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
> index 99e5157..dd57e18 100644
> --- a/nbd/nbd-internal.h
> +++ b/nbd/nbd-internal.h
> @@ -62,7 +62,7 @@
>  #define NBD_REPLY_MAGIC         0x67446698
>  #define NBD_OPTS_MAGIC          0x49484156454F5054LL
>  #define NBD_CLIENT_MAGIC        0x0000420281861253LL
> -#define NBD_REP_MAGIC           0x3e889045565a9LL
> +#define NBD_REP_MAGIC           0x0003e889045565a9LL
> 
>  #define NBD_SET_SOCK            _IO(0xab, 0)
>  #define NBD_SET_BLKSIZE         _IO(0xab, 1)
> diff --git a/nbd/client.c b/nbd/client.c
> index 7c172ed..6e5288a 100644
> --- a/nbd/client.c
> +++ b/nbd/client.c
> @@ -75,64 +75,123 @@ static QTAILQ_HEAD(, NBDExport) exports = QTAILQ_HEAD_INITIALIZER(exports);
> 
>  */
> 
> +/* Send an option request. Return 0 if successful, -1 with errp set if
> + * it is impossible to continue. */
> +static int nbd_send_option_request(QIOChannel *ioc, uint32_t opt,
> +                                   uint32_t len, const char *data,
> +                                   Error **errp)
> +{
> +    nbd_option req;
> +    QEMU_BUILD_BUG_ON(sizeof(req) != 16);
> 
> -/* If type represents success, return 1 without further action.
> - * If type represents an error reply, consume the rest of the packet on ioc.
> - * Then return 0 for unsupported (so the client can fall back to
> - * other approaches), or -1 with errp set for other errors.
> +    if (len == -1) {

This looks like magic, move the strlen() call to caller, or add a comment?

> +        req.length = len = strlen(data);
> +    }
> +    TRACE("Sending option request %"PRIu32", len %"PRIu32, opt, len);

s/%"PRI/%" PRI/g ?

> +
> +    stq_be_p(&req.magic, NBD_OPTS_MAGIC);
> +    stl_be_p(&req.option, opt);
> +    stl_be_p(&req.length, len);
> +
> +    if (write_sync(ioc, &req, sizeof(req)) != sizeof(req)) {
> +        error_setg(errp, "Failed to send option request header");
> +        return -1;
> +    }
> +
> +    if (len && write_sync(ioc, (char *) data, len) != len) {
> +        error_setg(errp, "Failed to send option request data");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +/* Receive the header of an option reply, which should match the given
> + * opt.  Read through the length field, but NOT the length bytes of
> + * payload. Return 0 if successful, -1 with errp set if it is
> + * impossible to continue. */
> +static int nbd_receive_option_reply(QIOChannel *ioc, uint32_t opt,
> +                                    nbd_opt_reply *reply, Error **errp)
> +{
> +    QEMU_BUILD_BUG_ON(sizeof(*reply) != 20);
> +    if (read_sync(ioc, reply, sizeof(*reply)) != sizeof(*reply)) {
> +        error_setg(errp, "failed to read option reply");
> +        return -1;
> +    }
> +    be64_to_cpus(&reply->magic);
> +    be32_to_cpus(&reply->option);
> +    be32_to_cpus(&reply->type);
> +    be32_to_cpus(&reply->length);
> +
> +    TRACE("Received option reply %"PRIx32", type %"PRIx32", len %"PRIu32,
> +          reply->option, reply->type, reply->length);
> +
> +    if (reply->magic != NBD_REP_MAGIC) {
> +        error_setg(errp, "Unexpected option reply magic");
> +        return -1;
> +    }
> +    if (reply->option != opt) {
> +        error_setg(errp, "Unexpected option type %x expected %x",
> +                   reply->option, opt);
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +/* If reply represents success, return 1 without further action.
> + * If reply represents an error, consume the optional payload of
> + * the packet on ioc.  Then return 0 for unsupported (so the client
> + * can fall back to other approaches), or -1 with errp set for other
> + * errors.
>   */
> -static int nbd_handle_reply_err(QIOChannel *ioc, uint32_t opt, uint32_t type,
> +static int nbd_handle_reply_err(QIOChannel *ioc, nbd_opt_reply *reply,
>                                  Error **errp)
>  {
> -    uint32_t len;
>      char *msg = NULL;
>      int result = -1;
> 
> -    if (!(type & (1 << 31))) {
> +    if (!(reply->type & (1 << 31))) {
>          return 1;
>      }
> 
> -    if (read_sync(ioc, &len, sizeof(len)) != sizeof(len)) {
> -        error_setg(errp, "failed to read option length");
> -        return -1;
> -    }
> -    len = be32_to_cpu(len);
> -    if (len) {
> -        if (len > NBD_MAX_BUFFER_SIZE) {
> +    if (reply->length) {
> +        if (reply->length > NBD_MAX_BUFFER_SIZE) {
>              error_setg(errp, "server's error message is too long");
>              goto cleanup;
>          }
> -        msg = g_malloc(len + 1);
> -        if (read_sync(ioc, msg, len) != len) {
> +        msg = g_malloc(reply->length + 1);
> +        if (read_sync(ioc, msg, reply->length) != reply->length) {
>              error_setg(errp, "failed to read option error message");
>              goto cleanup;
>          }
> -        msg[len] = '\0';
> +        msg[reply->length] = '\0';
>      }
> 
> -    switch (type) {
> +    switch (reply->type) {
>      case NBD_REP_ERR_UNSUP:
>          TRACE("server doesn't understand request %" PRIx32
> -              ", attempting fallback", opt);
> +              ", attempting fallback", reply->option);
>          result = 0;
>          goto cleanup;
> 
>      case NBD_REP_ERR_POLICY:
> -        error_setg(errp, "Denied by server for option %" PRIx32, opt);
> +        error_setg(errp, "Denied by server for option %" PRIx32,
> +                   reply->option);
>          break;
> 
>      case NBD_REP_ERR_INVALID:
> -        error_setg(errp, "Invalid data length for option %" PRIx32, opt);
> +        error_setg(errp, "Invalid data length for option %" PRIx32,
> +                   reply->option);
>          break;
> 
>      case NBD_REP_ERR_TLS_REQD:
>          error_setg(errp, "TLS negotiation required before option %" PRIx32,
> -                   opt);
> +                   reply->option);
>          break;
> 
>      default:
>          error_setg(errp, "Unknown error code when asking for option %" PRIx32,
> -                   opt);
> +                   reply->option);
>          break;
>      }
> 
> @@ -147,58 +206,29 @@ static int nbd_handle_reply_err(QIOChannel *ioc, uint32_t opt, uint32_t type,
> 
>  static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
>  {
> -    uint64_t magic;
> -    uint32_t opt;
> -    uint32_t type;
> +    nbd_opt_reply reply;
>      uint32_t len;
>      uint32_t namelen;
>      int error;
> 
>      *name = NULL;
> -    if (read_sync(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
> -        error_setg(errp, "failed to read list option magic");
> +    if (nbd_receive_option_reply(ioc, NBD_OPT_LIST, &reply, errp) < 0) {
>          return -1;
>      }
> -    magic = be64_to_cpu(magic);
> -    if (magic != NBD_REP_MAGIC) {
> -        error_setg(errp, "Unexpected option list magic");
> -        return -1;
> -    }
> -    if (read_sync(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
> -        error_setg(errp, "failed to read list option");
> -        return -1;
> -    }
> -    opt = be32_to_cpu(opt);
> -    if (opt != NBD_OPT_LIST) {
> -        error_setg(errp, "Unexpected option type %" PRIx32 " expected %x",
> -                   opt, NBD_OPT_LIST);
> -        return -1;
> -    }
> -
> -    if (read_sync(ioc, &type, sizeof(type)) != sizeof(type)) {
> -        error_setg(errp, "failed to read list option type");
> -        return -1;
> -    }
> -    type = be32_to_cpu(type);
> -    error = nbd_handle_reply_err(ioc, opt, type, errp);
> +    error = nbd_handle_reply_err(ioc, &reply, errp);
>      if (error <= 0) {
>          return error;
>      }
> +    len = reply.length;
> 
> -    if (read_sync(ioc, &len, sizeof(len)) != sizeof(len)) {
> -        error_setg(errp, "failed to read option length");
> -        return -1;
> -    }
> -    len = be32_to_cpu(len);
> -
> -    if (type == NBD_REP_ACK) {
> +    if (reply.type == NBD_REP_ACK) {
>          if (len != 0) {
>              error_setg(errp, "length too long for option end");
>              return -1;
>          }
> -    } else if (type == NBD_REP_SERVER) {
> +    } else if (reply.type == NBD_REP_SERVER) {
>          if (len < sizeof(namelen) || len > NBD_MAX_BUFFER_SIZE) {
> -            error_setg(errp, "incorrect option length");
> +            error_setg(errp, "incorrect option length %"PRIu32, len);

s/%"/%" /

>              return -1;
>          }
>          if (read_sync(ioc, &namelen, sizeof(namelen)) != sizeof(namelen)) {
> @@ -240,7 +270,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
>          }
>      } else {
>          error_setg(errp, "Unexpected reply type %" PRIx32 " expected %x",
> -                   type, NBD_REP_SERVER);
> +                   reply.type, NBD_REP_SERVER);
>          return -1;
>      }
>      return 1;
> @@ -251,24 +281,10 @@ static int nbd_receive_query_exports(QIOChannel *ioc,
>                                       const char *wantname,
>                                       Error **errp)
>  {
> -    uint64_t magic = cpu_to_be64(NBD_OPTS_MAGIC);
> -    uint32_t opt = cpu_to_be32(NBD_OPT_LIST);
> -    uint32_t length = 0;
>      bool foundExport = false;
> 
>      TRACE("Querying export list");
> -    if (write_sync(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
> -        error_setg(errp, "Failed to send list option magic");
> -        return -1;
> -    }
> -
> -    if (write_sync(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
> -        error_setg(errp, "Failed to send list option number");
> -        return -1;
> -    }
> -
> -    if (write_sync(ioc, &length, sizeof(length)) != sizeof(length)) {
> -        error_setg(errp, "Failed to send list option length");
> +    if (nbd_send_option_request(ioc, NBD_OPT_LIST, 0, NULL, errp) < 0) {
>          return -1;
>      }
> 
> @@ -314,72 +330,29 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc,
>                                          QCryptoTLSCreds *tlscreds,
>                                          const char *hostname, Error **errp)
>  {
> -    uint64_t magic = cpu_to_be64(NBD_OPTS_MAGIC);
> -    uint32_t opt = cpu_to_be32(NBD_OPT_STARTTLS);
> -    uint32_t length = 0;
> -    uint32_t type;
> +    nbd_opt_reply reply;
>      QIOChannelTLS *tioc;
>      struct NBDTLSHandshakeData data = { 0 };
> 
>      TRACE("Requesting TLS from server");
> -    if (write_sync(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
> -        error_setg(errp, "Failed to send option magic");
> -        return NULL;
> -    }
> -
> -    if (write_sync(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
> -        error_setg(errp, "Failed to send option number");
> -        return NULL;
> -    }
> -
> -    if (write_sync(ioc, &length, sizeof(length)) != sizeof(length)) {
> -        error_setg(errp, "Failed to send option length");
> -        return NULL;
> -    }
> -
> -    TRACE("Getting TLS reply from server1");
> -    if (read_sync(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
> -        error_setg(errp, "failed to read option magic");
> -        return NULL;
> -    }
> -    magic = be64_to_cpu(magic);
> -    if (magic != NBD_REP_MAGIC) {
> -        error_setg(errp, "Unexpected option magic");
> -        return NULL;
> -    }
> -    TRACE("Getting TLS reply from server2");
> -    if (read_sync(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
> -        error_setg(errp, "failed to read option");
> -        return NULL;
> -    }
> -    opt = be32_to_cpu(opt);
> -    if (opt != NBD_OPT_STARTTLS) {
> -        error_setg(errp, "Unexpected option type %" PRIx32 " expected %x",
> -                   opt, NBD_OPT_STARTTLS);
> +    if (nbd_send_option_request(ioc, NBD_OPT_STARTTLS, 0, NULL, errp) < 0) {
>          return NULL;
>      }
> 
>      TRACE("Getting TLS reply from server");
> -    if (read_sync(ioc, &type, sizeof(type)) != sizeof(type)) {
> -        error_setg(errp, "failed to read option type");
> +    if (nbd_receive_option_reply(ioc, NBD_OPT_STARTTLS, &reply, errp) < 0) {
>          return NULL;
>      }
> -    type = be32_to_cpu(type);
> -    if (type != NBD_REP_ACK) {
> +
> +    if (reply.type != NBD_REP_ACK) {
>          error_setg(errp, "Server rejected request to start TLS %" PRIx32,
> -                   type);
> +                   reply.type);
>          return NULL;
>      }
> 
> -    TRACE("Getting TLS reply from server");
> -    if (read_sync(ioc, &length, sizeof(length)) != sizeof(length)) {
> -        error_setg(errp, "failed to read option length");
> -        return NULL;
> -    }
> -    length = be32_to_cpu(length);
> -    if (length != 0) {
> +    if (reply.length != 0) {
>          error_setg(errp, "Start TLS response was not zero %" PRIu32,
> -                   length);
> +                   reply.length);
>          return NULL;
>      }
> 
> @@ -466,8 +439,6 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
> 
>      if (magic == NBD_OPTS_MAGIC) {
>          uint32_t clientflags = 0;
> -        uint32_t opt;
> -        uint32_t namesize;
>          uint16_t globalflags;
>          bool fixedNewStyle = false;
> 
> @@ -517,28 +488,13 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
>                  goto fail;
>              }
>          }
> -        /* write the export name */
> -        magic = cpu_to_be64(magic);
> -        if (write_sync(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
> -            error_setg(errp, "Failed to send export name magic");
> -            goto fail;
> -        }
> -        opt = cpu_to_be32(NBD_OPT_EXPORT_NAME);
> -        if (write_sync(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
> -            error_setg(errp, "Failed to send export name option number");
> -            goto fail;
> -        }
> -        namesize = cpu_to_be32(strlen(name));
> -        if (write_sync(ioc, &namesize, sizeof(namesize)) !=
> -            sizeof(namesize)) {
> -            error_setg(errp, "Failed to send export name length");
> -            goto fail;
> -        }
> -        if (write_sync(ioc, (char *)name, strlen(name)) != strlen(name)) {
> -            error_setg(errp, "Failed to send export name");
> +        /* write the export name request */
> +        if (nbd_send_option_request(ioc, NBD_OPT_EXPORT_NAME, -1, name,
> +                                    errp) < 0) {
>              goto fail;
>          }
> 
> +        /* Read the response */
>          if (read_sync(ioc, &s, sizeof(s)) != sizeof(s)) {
>              error_setg(errp, "Failed to read export length");
>              goto fail;
> -- 
> 2.5.5
> 
> 

Fam

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-19  4:08 ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Eric Blake
@ 2016-07-19  6:21   ` Fam Zheng
  2016-07-19 15:28     ` Eric Blake
  2016-08-18 13:50   ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 55+ messages in thread
From: Fam Zheng @ 2016-07-19  6:21 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, Kevin Wolf, pbonzini, qemu-block, Max Reitz

On Mon, 07/18 22:08, Eric Blake wrote:
> Upstream NBD protocol recently added the ability to efficiently
> write zeroes without having to send the zeroes over the wire,
> along with a flag to control whether the client wants a hole.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> v4: rebase, fix value for constant
> v3: abandon NBD_CMD_CLOSE extension, rebase to use blk_pwrite_zeroes
> ---
>  include/block/nbd.h |  8 ++++++--
>  nbd/server.c        | 42 ++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 46 insertions(+), 4 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index fc4426c..e23ef73 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -69,6 +69,7 @@ struct nbd_reply {
>  #define NBD_FLAG_SEND_FUA       (1 << 3)        /* Send FUA (Force Unit Access) */
>  #define NBD_FLAG_ROTATIONAL     (1 << 4)        /* Use elevator algorithm - rotational media */
>  #define NBD_FLAG_SEND_TRIM      (1 << 5)        /* Send TRIM (discard) */
> +#define NBD_FLAG_SEND_WRITE_ZEROES (1 << 6)     /* Send WRITE_ZEROES */
> 
>  /* New-style handshake (global) flags, sent from server to client, and
>     control what will happen during handshake phase. */
> @@ -94,7 +95,8 @@ struct nbd_reply {
>  #define NBD_REP_ERR_SHUTDOWN    NBD_REP_ERR(7)  /* Server shutting down */
> 
>  /* Request flags, sent from client to server during transmission phase */
> -#define NBD_CMD_FLAG_FUA        (1 << 0)
> +#define NBD_CMD_FLAG_FUA        (1 << 0) /* 'force unit access' during write */
> +#define NBD_CMD_FLAG_NO_HOLE    (1 << 1) /* don't punch hole on zero run */
> 
>  /* Supported request types */
>  enum {
> @@ -102,7 +104,9 @@ enum {
>      NBD_CMD_WRITE = 1,
>      NBD_CMD_DISC = 2,
>      NBD_CMD_FLUSH = 3,
> -    NBD_CMD_TRIM = 4
> +    NBD_CMD_TRIM = 4,
> +    /* 5 reserved for failed experiment NBD_CMD_CACHE */
> +    NBD_CMD_WRITE_ZEROES = 6,
>  };
> 
>  #define NBD_DEFAULT_PORT	10809
> diff --git a/nbd/server.c b/nbd/server.c
> index 689636c..3a2fecb 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -610,7 +610,8 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
>      char buf[8 + 8 + 8 + 128];
>      int rc;
>      const uint16_t myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM |
> -                              NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA);
> +                              NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA |
> +                              NBD_FLAG_SEND_WRITE_ZEROES);
>      bool oldStyle;
>      size_t len;
> 
> @@ -1126,11 +1127,17 @@ static ssize_t nbd_co_receive_request(NBDRequest *req,
>          rc = request->type == NBD_CMD_WRITE ? -ENOSPC : -EINVAL;
>          goto out;
>      }
> -    if (request->flags & ~NBD_CMD_FLAG_FUA) {
> +    if (request->flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_NO_HOLE)) {
>          LOG("unsupported flags (got 0x%x)", request->flags);
>          rc = -EINVAL;
>          goto out;
>      }
> +    if (request->type != NBD_CMD_WRITE_ZEROES &&
> +        (request->flags & NBD_CMD_FLAG_NO_HOLE)) {
> +        LOG("unexpected flags (got 0x%x)", request->flags);
> +        rc = -EINVAL;
> +        goto out;
> +    }
> 
>      rc = 0;
> 
> @@ -1235,6 +1242,37 @@ static void nbd_trip(void *opaque)
>          }
>          break;
> 
> +    case NBD_CMD_WRITE_ZEROES:
> +        TRACE("Request type is WRITE_ZEROES");
> +
> +        if (exp->nbdflags & NBD_FLAG_READ_ONLY) {
> +            TRACE("Server is read-only, return error");
> +            reply.error = EROFS;
> +            goto error_reply;
> +        }
> +
> +        TRACE("Writing to device");
> +
> +        flags = 0;
> +        if (request.flags & NBD_CMD_FLAG_FUA) {
> +            flags |= BDRV_REQ_FUA;
> +        }
> +        if (!(request.flags & NBD_CMD_FLAG_NO_HOLE)) {
> +            flags |= BDRV_REQ_MAY_UNMAP;

If I'm reading the NBD proto.md correctly, this is not enough if
NBD_CMD_FLAG_NO_HOLE is specified. We probably need to use a zeroed buffer with
blk_pwrite, or pass a new flag (BDRV_RED_NO_HOLE) to blk_pwrite_zeroes to
enforce the bdrv_driver_pwritev() branch in bdrv_co_do_pwrite_zeroes().

> +        }
> +        ret = blk_pwrite_zeroes(exp->blk, request.from + exp->dev_offset,
> +                                request.len, flags);
> +        if (ret < 0) {
> +            LOG("writing to file failed");
> +            reply.error = -ret;
> +            goto error_reply;
> +        }
> +
> +        if (nbd_co_send_reply(req, &reply, 0) < 0) {
> +            goto out;
> +        }
> +        break;
> +
>      case NBD_CMD_DISC:
>          /* unreachable, thanks to special case in nbd_co_receive_request() */
>          abort();
> -- 
> 2.5.5
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 14/14] nbd: Implement NBD_CMD_WRITE_ZEROES on client
  2016-07-19  4:08 ` [Qemu-devel] [PATCH v5 14/14] nbd: Implement NBD_CMD_WRITE_ZEROES on client Eric Blake
@ 2016-07-19  6:24   ` Fam Zheng
  2016-07-19 15:31     ` Eric Blake
  0 siblings, 1 reply; 55+ messages in thread
From: Fam Zheng @ 2016-07-19  6:24 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, Kevin Wolf, pbonzini, qemu-block, Max Reitz

On Mon, 07/18 22:08, Eric Blake wrote:
> Upstream NBD protocol recently added the ability to efficiently
> write zeroes without having to send the zeroes over the wire,
> along with a flag to control whether the client wants a hole.
> 
> The generic block code takes care of falling back to the obvious
> write of lots of zeroes if we return -ENOTSUP because the server
> does not have WRITE_ZEROES.
> 
> Ideally, since NBD_CMD_WRITE_ZEROES does not involve any data
> over the wire, we want to support transactions that are much
> larger than the normal 32M limit imposed on NBD_CMD_WRITE.  But
> the server may still have a limit smaller than UINT_MAX, so
> until experimental NBD protocol additions for advertising various
> command sizes is finalized (see [1], [2]), for now we just stick to
> the same limits as normal writes.
> 
> [1] https://github.com/yoe/nbd/blob/extension-info/doc/proto.md
> [2] https://sourceforge.net/p/nbd/mailman/message/35081223/
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> v5: enhance commit message
> v4: rebase to byte-based limits
> v3: rebase, tell block layer about our support
> ---
>  block/nbd-client.h |  2 ++
>  block/nbd-client.c | 35 +++++++++++++++++++++++++++++++++++
>  block/nbd.c        |  4 ++++
>  3 files changed, 41 insertions(+)
> 
> diff --git a/block/nbd-client.h b/block/nbd-client.h
> index 044aca4..2cfe377 100644
> --- a/block/nbd-client.h
> +++ b/block/nbd-client.h
> @@ -48,6 +48,8 @@ int nbd_client_co_pdiscard(BlockDriverState *bs, int64_t offset, int count);
>  int nbd_client_co_flush(BlockDriverState *bs);
>  int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
>                            uint64_t bytes, QEMUIOVector *qiov, int flags);
> +int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
> +                                int count, BdrvRequestFlags flags);
>  int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
>                           uint64_t bytes, QEMUIOVector *qiov, int flags);
> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index 7e9c3ec..104ba2f 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -275,6 +275,41 @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
>      return -reply.error;
>  }
> 
> +int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
> +                                int count, BdrvRequestFlags flags)
> +{
> +    ssize_t ret;
> +    NbdClientSession *client = nbd_get_client_session(bs);
> +    struct nbd_request request = {
> +        .type = NBD_CMD_WRITE_ZEROES,
> +        .from = offset,
> +        .len = count,
> +    };
> +    struct nbd_reply reply;
> +
> +    if (!(client->nbdflags & NBD_FLAG_SEND_WRITE_ZEROES)) {
> +        return -ENOTSUP;
> +    }
> +
> +    if (flags & BDRV_REQ_FUA) {
> +        assert(client->nbdflags & NBD_FLAG_SEND_FUA);
> +        request.flags |= NBD_CMD_FLAG_FUA;
> +    }
> +    if (!(flags & BDRV_REQ_MAY_UNMAP)) {

Correct me if I'm wrong, I don't think we care about BDRV_REQ_MAY_UNMAP here,
the NBD protocol can never issue an unmap request. In other words I think
NO_HOLE and MAY_UNMAP are two different things.

Fam

> +        request.flags |= NBD_CMD_FLAG_NO_HOLE;
> +    }
> +
> +    nbd_coroutine_start(client, &request);
> +    ret = nbd_co_send_request(bs, &request, NULL);
> +    if (ret < 0) {
> +        reply.error = -ret;
> +    } else {
> +        nbd_co_receive_reply(client, &request, &reply, NULL);
> +    }
> +    nbd_coroutine_end(client, &request);
> +    return -reply.error;
> +}
> +
>  int nbd_client_co_flush(BlockDriverState *bs)
>  {
>      NbdClientSession *client = nbd_get_client_session(bs);
> diff --git a/block/nbd.c b/block/nbd.c
> index 8d57220..049d1bd 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -357,6 +357,7 @@ static int nbd_co_flush(BlockDriverState *bs)
>  static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>      bs->bl.max_pdiscard = NBD_MAX_BUFFER_SIZE;
> +    bs->bl.max_pwrite_zeroes = NBD_MAX_BUFFER_SIZE;
>      bs->bl.max_transfer = NBD_MAX_BUFFER_SIZE;
>  }
> 
> @@ -440,6 +441,7 @@ static BlockDriver bdrv_nbd = {
>      .bdrv_file_open             = nbd_open,
>      .bdrv_co_preadv             = nbd_client_co_preadv,
>      .bdrv_co_pwritev            = nbd_client_co_pwritev,
> +    .bdrv_co_pwrite_zeroes      = nbd_client_co_pwrite_zeroes,
>      .bdrv_close                 = nbd_close,
>      .bdrv_co_flush_to_os        = nbd_co_flush,
>      .bdrv_co_pdiscard           = nbd_client_co_pdiscard,
> @@ -458,6 +460,7 @@ static BlockDriver bdrv_nbd_tcp = {
>      .bdrv_file_open             = nbd_open,
>      .bdrv_co_preadv             = nbd_client_co_preadv,
>      .bdrv_co_pwritev            = nbd_client_co_pwritev,
> +    .bdrv_co_pwrite_zeroes      = nbd_client_co_pwrite_zeroes,
>      .bdrv_close                 = nbd_close,
>      .bdrv_co_flush_to_os        = nbd_co_flush,
>      .bdrv_co_pdiscard           = nbd_client_co_pdiscard,
> @@ -476,6 +479,7 @@ static BlockDriver bdrv_nbd_unix = {
>      .bdrv_file_open             = nbd_open,
>      .bdrv_co_preadv             = nbd_client_co_preadv,
>      .bdrv_co_pwritev            = nbd_client_co_pwritev,
> +    .bdrv_co_pwrite_zeroes      = nbd_client_co_pwrite_zeroes,
>      .bdrv_close                 = nbd_close,
>      .bdrv_co_flush_to_os        = nbd_co_flush,
>      .bdrv_co_pdiscard           = nbd_client_co_pdiscard,
> -- 
> 2.5.5
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (13 preceding siblings ...)
  2016-07-19  4:08 ` [Qemu-devel] [PATCH v5 14/14] nbd: Implement NBD_CMD_WRITE_ZEROES on client Eric Blake
@ 2016-07-19  6:33 ` Fam Zheng
  2016-07-19  8:53 ` Paolo Bonzini
  15 siblings, 0 replies; 55+ messages in thread
From: Fam Zheng @ 2016-07-19  6:33 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, pbonzini, qemu-block

On Mon, 07/18 22:07, Eric Blake wrote:
> The upstream NBD protocol is proposing an extension for efficient
> write zeroes; having a qemu implementation will be one of the reasons
> to promote the proposal from experimental to standard:
> https://github.com/yoe/nbd/blob/extension-write-zeroes/doc/proto.md

My apologies for the last minute reviewing (I've been doing poor in processing
my backlog recently)!

The only question I have is about the relationship between NO_HOLE and
MAY_UNMAP, maybe you or Paolo can clarify that. Otherwise, the series looks
good. I've left a few other superficial questions to individual patches.

Thanks for the work!

Fam

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes
  2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
                   ` (14 preceding siblings ...)
  2016-07-19  6:33 ` [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Fam Zheng
@ 2016-07-19  8:53 ` Paolo Bonzini
  2016-07-19 15:33   ` Eric Blake
  15 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2016-07-19  8:53 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block


> The upstream NBD protocol is proposing an extension for efficient
> write zeroes; having a qemu implementation will be one of the reasons
> to promote the proposal from experimental to standard:
> https://github.com/yoe/nbd/blob/extension-write-zeroes/doc/proto.md
> 
> v4 was here:
> https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg07463.html
> 
> Prerequisites:
> v3 Auto-fragment large transactions at the block layer:
> https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg03550.html
> v2 byte-based block discard:
> https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg03592.html
> 
> Also available as a tag at:
> git fetch git://repo.or.cz/qemu/ericb.git nbd-zero-v5
> 
> changes since v4:
> - patch 2 - Rebase to latest
> - patch 6, 10 - Address review comments [Paolo]
> - patch 14 - Update commit message to address review comments [Paolo]

I'm sorry, it's too late for 2.7 now.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/14] nbd: Share common reply-sending code in server
  2016-07-19  5:10   ` Fam Zheng
@ 2016-07-19 14:52     ` Eric Blake
  2016-07-20  4:39       ` Fam Zheng
  0 siblings, 1 reply; 55+ messages in thread
From: Eric Blake @ 2016-07-19 14:52 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, pbonzini, qemu-block

[-- Attachment #1: Type: text/plain, Size: 1650 bytes --]

On 07/18/2016 11:10 PM, Fam Zheng wrote:
> On Mon, 07/18 22:07, Eric Blake wrote:
>> Rather than open-coding NBD_REP_SERVER, reuse the code we
>> already have by adding a length parameter.  Additionally,
>> the refactoring will make adding NBD_OPT_GO in a later patch
>> easier.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>
>> ---
>> v4: no change
>> v3: rebase to changes earlier in series
>> ---
>>  nbd/server.c | 48 +++++++++++++++++++++++-------------------------
>>  1 file changed, 23 insertions(+), 25 deletions(-)
>>
>> diff --git a/nbd/server.c b/nbd/server.c
>> index 85c4f5d..c8716f1 100644
>> --- a/nbd/server.c
>> +++ b/nbd/server.c
>> @@ -195,12 +195,15 @@ static ssize_t nbd_negotiate_drop_sync(QIOChannel *ioc, size_t size)
>>
>>  */
>>
>> -static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
>> +/* Send a reply header, including length, but no payload.
>> + * Return -errno to kill connection, 0 to continue negotiation. */
> 
> Not a show stopper but I'm not sure documenting the control logic of the
> outermost caller a few layers away is a good idea, the same question applies to
> functions below as well.

The documentation here accurately describes this function.  Or is your
complaint that the outermost caller is lacking documentation, and
therefore I should first do a patch that uniformly adds documentation,
before changing behavior, so that this function doesn't end up with
details while the outermost caller remains undocumented?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-19  6:21   ` Fam Zheng
@ 2016-07-19 15:28     ` Eric Blake
  2016-07-19 15:45       ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Eric Blake @ 2016-07-19 15:28 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Kevin Wolf, pbonzini, qemu-block, Max Reitz, nbd-general

[-- Attachment #1: Type: text/plain, Size: 3030 bytes --]

[adding nbd list]

On 07/19/2016 12:21 AM, Fam Zheng wrote:
> On Mon, 07/18 22:08, Eric Blake wrote:
>> Upstream NBD protocol recently added the ability to efficiently
>> write zeroes without having to send the zeroes over the wire,
>> along with a flag to control whether the client wants a hole.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>
>> ---

>> @@ -1235,6 +1242,37 @@ static void nbd_trip(void *opaque)
>>          }
>>          break;
>>
>> +    case NBD_CMD_WRITE_ZEROES:
>> +        TRACE("Request type is WRITE_ZEROES");
>> +
>> +        if (exp->nbdflags & NBD_FLAG_READ_ONLY) {
>> +            TRACE("Server is read-only, return error");
>> +            reply.error = EROFS;
>> +            goto error_reply;
>> +        }
>> +
>> +        TRACE("Writing to device");
>> +
>> +        flags = 0;
>> +        if (request.flags & NBD_CMD_FLAG_FUA) {
>> +            flags |= BDRV_REQ_FUA;
>> +        }
>> +        if (!(request.flags & NBD_CMD_FLAG_NO_HOLE)) {
>> +            flags |= BDRV_REQ_MAY_UNMAP;
> 
> If I'm reading the NBD proto.md correctly, this is not enough if
> NBD_CMD_FLAG_NO_HOLE is specified. We probably need to use a zeroed buffer with
> blk_pwrite, or pass a new flag (BDRV_RED_NO_HOLE) to blk_pwrite_zeroes to
> enforce the bdrv_driver_pwritev() branch in bdrv_co_do_pwrite_zeroes().

If that's how you read it, then my proposal to proto.md needs updating.
 I specifically wrote the proposal to be as close as possible to the
existing qemu semantics, except that we negated the sense of the bit
because we wanted to allow the bit value of 0 to allow the server the
most flexibility in performing optimizations.  The code here (and in
14/14 on the client side) is merely catering to the fact that the bit
has opposite sense in the two projects.

That is, the rules in qemu are:

MAY_UNMAP == 0 : must write zeroes
MAY_UNMAP == 1 : may optimize if supported (where reads will see 0), but
must write zeroes if not

while the rules in NBD are:

FLAG_NO_HOLE == 1 : must write zeroes
FLAG_NO_HOLE == 0 : may optimize if supported (where reads will see 0),
but must write zeroes if not

Or another way of putting it: in qemu, the ability to punch holes was
added after the fact (default of no holes being 0 due to back-compat),
where prior to its addition full allocation was the only option, and
most callers have to worry about passing MAY_UNMAP when they care about
optimal use of storage; while in NBD we want to allow the server the
freedom to have optimal usage of storage by default, but need a way to
specifically ask for full allocation.

If you think the NBD flag is poorly named, we have not yet committed to
the NBD_CMD_WRITE_EXTENSIONS documentation yet, and are free to patch
proto.md to choose a better name and/or wording to better describe what
we actually mean on the NBD side of things.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 14/14] nbd: Implement NBD_CMD_WRITE_ZEROES on client
  2016-07-19  6:24   ` Fam Zheng
@ 2016-07-19 15:31     ` Eric Blake
  0 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-07-19 15:31 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, Kevin Wolf, pbonzini, qemu-block, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 2028 bytes --]

On 07/19/2016 12:24 AM, Fam Zheng wrote:
> On Mon, 07/18 22:08, Eric Blake wrote:
>> Upstream NBD protocol recently added the ability to efficiently
>> write zeroes without having to send the zeroes over the wire,
>> along with a flag to control whether the client wants a hole.
>>

>> +int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
>> +                                int count, BdrvRequestFlags flags)
>> +{
>> +    ssize_t ret;
>> +    NbdClientSession *client = nbd_get_client_session(bs);
>> +    struct nbd_request request = {
>> +        .type = NBD_CMD_WRITE_ZEROES,
>> +        .from = offset,
>> +        .len = count,
>> +    };
>> +    struct nbd_reply reply;
>> +
>> +    if (!(client->nbdflags & NBD_FLAG_SEND_WRITE_ZEROES)) {
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (flags & BDRV_REQ_FUA) {
>> +        assert(client->nbdflags & NBD_FLAG_SEND_FUA);
>> +        request.flags |= NBD_CMD_FLAG_FUA;
>> +    }
>> +    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> 
> Correct me if I'm wrong, I don't think we care about BDRV_REQ_MAY_UNMAP here,
> the NBD protocol can never issue an unmap request. In other words I think
> NO_HOLE and MAY_UNMAP are two different things.

No. The server is (and should be) allowed to manage storage as
efficiently as it wants, and should only be required to fully allocate
storage if the client has requested that.  The NBD protocol CAN issue an
unmap request (NBD_CMD_TRIM), but we also document in the NBD protocol
that the server SHOULD be able to unmap instead of writing zeroes so
long as the result still reads as zeroes.  So we WANT to issue MAY_UNMAP
as an optimization in all cases except where the client specifically
asked for full allocation.  NO_HOLE and MAY_UNMAP are (supposed to be)
the same thing, except for being negated in sense based on what the
default value of 0 represents.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes
  2016-07-19  8:53 ` Paolo Bonzini
@ 2016-07-19 15:33   ` Eric Blake
  2016-07-19 15:41     ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Eric Blake @ 2016-07-19 15:33 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 1342 bytes --]

On 07/19/2016 02:53 AM, Paolo Bonzini wrote:
> 
>> The upstream NBD protocol is proposing an extension for efficient
>> write zeroes; having a qemu implementation will be one of the reasons
>> to promote the proposal from experimental to standard:
>> https://github.com/yoe/nbd/blob/extension-write-zeroes/doc/proto.md
>>
>> v4 was here:
>> https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg07463.html
>>
>> Prerequisites:
>> v3 Auto-fragment large transactions at the block layer:
>> https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg03550.html
>> v2 byte-based block discard:
>> https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg03592.html
>>
>> Also available as a tag at:
>> git fetch git://repo.or.cz/qemu/ericb.git nbd-zero-v5
>>
>> changes since v4:
>> - patch 2 - Rebase to latest
>> - patch 6, 10 - Address review comments [Paolo]
>> - patch 14 - Update commit message to address review comments [Paolo]
> 
> I'm sorry, it's too late for 2.7 now.

Fair enough.  I'll pull out the bug fixes (at least 1/14 is still 2.7
material) and re-submit those along with the other pending block bug
fixes that I need to do in hard freeze (iscsi with the 15M unmap/zero
granularity).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes
  2016-07-19 15:33   ` Eric Blake
@ 2016-07-19 15:41     ` Paolo Bonzini
  0 siblings, 0 replies; 55+ messages in thread
From: Paolo Bonzini @ 2016-07-19 15:41 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 439 bytes --]



On 19/07/2016 17:33, Eric Blake wrote:
>> > I'm sorry, it's too late for 2.7 now.
> Fair enough.  I'll pull out the bug fixes (at least 1/14 is still 2.7
> material) and re-submit those along with the other pending block bug
> fixes that I need to do in hard freeze (iscsi with the 15M unmap/zero
> granularity).

Sure.  To be clear, it's just that I wouldn't have had time to test and
send a pull request in time.

Paolo


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-19 15:28     ` Eric Blake
@ 2016-07-19 15:45       ` Paolo Bonzini
  2016-07-20  3:34         ` Fam Zheng
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2016-07-19 15:45 UTC (permalink / raw)
  To: Eric Blake, Fam Zheng
  Cc: qemu-devel, Kevin Wolf, qemu-block, Max Reitz, nbd-general

[-- Attachment #1: Type: text/plain, Size: 508 bytes --]



On 19/07/2016 17:28, Eric Blake wrote:
>> If I'm reading the NBD proto.md correctly, this is not enough if
>> NBD_CMD_FLAG_NO_HOLE is specified. We probably need to use a zeroed buffer with
>> blk_pwrite, or pass a new flag (BDRV_RED_NO_HOLE) to blk_pwrite_zeroes to
>> enforce the bdrv_driver_pwritev() branch in bdrv_co_do_pwrite_zeroes().

I agree with Eric's interpretation.  It's a bit weird to have the
direction inverted, but I'm not sure I see the ambiguity.  Can you explain?

Paolo


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-19 15:45       ` Paolo Bonzini
@ 2016-07-20  3:34         ` Fam Zheng
  2016-07-20  3:47           ` Eric Blake
  0 siblings, 1 reply; 55+ messages in thread
From: Fam Zheng @ 2016-07-20  3:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Eric Blake, Kevin Wolf, nbd-general, qemu-devel, qemu-block, Max Reitz

On Tue, 07/19 17:45, Paolo Bonzini wrote:
> 
> 
> On 19/07/2016 17:28, Eric Blake wrote:
> >> If I'm reading the NBD proto.md correctly, this is not enough if
> >> NBD_CMD_FLAG_NO_HOLE is specified. We probably need to use a zeroed buffer with
> >> blk_pwrite, or pass a new flag (BDRV_RED_NO_HOLE) to blk_pwrite_zeroes to
> >> enforce the bdrv_driver_pwritev() branch in bdrv_co_do_pwrite_zeroes().
> 
> I agree with Eric's interpretation.  It's a bit weird to have the
> direction inverted, but I'm not sure I see the ambiguity.  Can you explain?

Write zeroes _means_ "punch hole" on a raw file.

In block/raw-posix.c:handle_aiocb_write_zeroes():
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>     if (s->has_discard && s->has_fallocate) {
>         int ret = do_fallocate(s->fd,
>                                FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                                aiocb->aio_offset, aiocb->aio_nbytes);
>         if (ret == 0) {
>             ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
>             if (ret == 0 || ret != -ENOTSUP) {
>                 return ret;
>             }
>             s->has_fallocate = false;
>         } else if (ret != -ENOTSUP) {
>             return ret;
>         } else {
>             s->has_discard = false;
>         }
>     }
> #endif

And unmap is translated to "punch hole", too.

In block/raw-posix.c:handle_aiocb_discard():
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>         ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                            aiocb->aio_offset, aiocb->aio_nbytes);
> #endif

So I agree that NBD_CMD_FLAG_NO_HOLE is a poorly named flag, because there is
always going to be a hole event if it's set.

Fam

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-20  3:34         ` Fam Zheng
@ 2016-07-20  3:47           ` Eric Blake
  2016-07-20  4:37             ` Fam Zheng
  0 siblings, 1 reply; 55+ messages in thread
From: Eric Blake @ 2016-07-20  3:47 UTC (permalink / raw)
  To: Fam Zheng, Paolo Bonzini
  Cc: Kevin Wolf, nbd-general, qemu-devel, qemu-block, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 2498 bytes --]

On 07/19/2016 09:34 PM, Fam Zheng wrote:
> On Tue, 07/19 17:45, Paolo Bonzini wrote:
>>
>>
>> On 19/07/2016 17:28, Eric Blake wrote:
>>>> If I'm reading the NBD proto.md correctly, this is not enough if
>>>> NBD_CMD_FLAG_NO_HOLE is specified. We probably need to use a zeroed buffer with
>>>> blk_pwrite, or pass a new flag (BDRV_RED_NO_HOLE) to blk_pwrite_zeroes to
>>>> enforce the bdrv_driver_pwritev() branch in bdrv_co_do_pwrite_zeroes().
>>
>> I agree with Eric's interpretation.  It's a bit weird to have the
>> direction inverted, but I'm not sure I see the ambiguity.  Can you explain?
> 
> Write zeroes _means_ "punch hole" on a raw file.
> 
> In block/raw-posix.c:handle_aiocb_write_zeroes():
>> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>     if (s->has_discard && s->has_fallocate) {
>>         int ret = do_fallocate(s->fd,
>>                                FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>>                                aiocb->aio_offset, aiocb->aio_nbytes);
>>         if (ret == 0) {
>>             ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);

That is just implementation: punch a hole, BUT THEN reallocate it back,
so that in the end, the file is still not sparse in that region.  Or am
I reading it wrong?

But the implementation under the hood is not visible to the guest - as
long as the end result is that a guest requesting NO_HOLE ends up with a
non-sparse file, and the data reads back as all 0, the client doesn't
care whether the zeros were written byte-by-byte or sped up by punching
a hole then reallocating.

> 
> And unmap is translated to "punch hole", too.
> 
> In block/raw-posix.c:handle_aiocb_discard():
>> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>         ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>>                            aiocb->aio_offset, aiocb->aio_nbytes);
>> #endif

No, this call is different - it punches a hole, then stops.  There is no
followup do_fallocate(,0,,) to reallocate, so the file remains sparse.

> 
> So I agree that NBD_CMD_FLAG_NO_HOLE is a poorly named flag, because there is
> always going to be a hole event if it's set.

If we are punching holes even when BDRV_REQ_MAY_UNMAP is not set, that
seems like we have a bug in qemu (unless we are immediately then
reallocating so that there is no resulting hole).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-20  3:47           ` Eric Blake
@ 2016-07-20  4:37             ` Fam Zheng
  2016-07-20  7:09               ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Fam Zheng @ 2016-07-20  4:37 UTC (permalink / raw)
  To: Eric Blake
  Cc: Paolo Bonzini, Kevin Wolf, nbd-general, qemu-devel, qemu-block,
	Max Reitz

On Tue, 07/19 21:47, Eric Blake wrote:
> > In block/raw-posix.c:handle_aiocb_write_zeroes():
> >> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> >>     if (s->has_discard && s->has_fallocate) {
> >>         int ret = do_fallocate(s->fd,
> >>                                FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> >>                                aiocb->aio_offset, aiocb->aio_nbytes);
> >>         if (ret == 0) {
> >>             ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
> 
> That is just implementation: punch a hole, BUT THEN reallocate it back,
> so that in the end, the file is still not sparse in that region.  Or am
> I reading it wrong?

Yes, you are right about this, I was confused because "qemu-img map" does not
report this allocation state after zero write. (No idea why SEEK_DATA doesn't
hit the fallocate'ed area.)

Then, when it comes to qcow2, it's a bit different: after a NO_HOLE write zero
doesn't allocate anything, it only sets the bit in cluster table. In other
words, NO_HOLE only ensures allocated clusters are not deallocated, but it
doesn't ensure the population of previously unallocated area. Is that intended?

Fam

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/14] nbd: Share common reply-sending code in server
  2016-07-19 14:52     ` Eric Blake
@ 2016-07-20  4:39       ` Fam Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Fam Zheng @ 2016-07-20  4:39 UTC (permalink / raw)
  To: Eric Blake; +Cc: pbonzini, qemu-devel, qemu-block

On Tue, 07/19 08:52, Eric Blake wrote:
> On 07/18/2016 11:10 PM, Fam Zheng wrote:
> > On Mon, 07/18 22:07, Eric Blake wrote:
> >> Rather than open-coding NBD_REP_SERVER, reuse the code we
> >> already have by adding a length parameter.  Additionally,
> >> the refactoring will make adding NBD_OPT_GO in a later patch
> >> easier.
> >>
> >> Signed-off-by: Eric Blake <eblake@redhat.com>
> >>
> >> ---
> >> v4: no change
> >> v3: rebase to changes earlier in series
> >> ---
> >>  nbd/server.c | 48 +++++++++++++++++++++++-------------------------
> >>  1 file changed, 23 insertions(+), 25 deletions(-)
> >>
> >> diff --git a/nbd/server.c b/nbd/server.c
> >> index 85c4f5d..c8716f1 100644
> >> --- a/nbd/server.c
> >> +++ b/nbd/server.c
> >> @@ -195,12 +195,15 @@ static ssize_t nbd_negotiate_drop_sync(QIOChannel *ioc, size_t size)
> >>
> >>  */
> >>
> >> -static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
> >> +/* Send a reply header, including length, but no payload.
> >> + * Return -errno to kill connection, 0 to continue negotiation. */
> > 
> > Not a show stopper but I'm not sure documenting the control logic of the
> > outermost caller a few layers away is a good idea, the same question applies to
> > functions below as well.
> 
> The documentation here accurately describes this function.  Or is your
> complaint that the outermost caller is lacking documentation, and
> therefore I should first do a patch that uniformly adds documentation,
> before changing behavior, so that this function doesn't end up with
> details while the outermost caller remains undocumented?

No, I didn't like it because "kill connection" logic is actually in the caller
and not a functionality of this function. I'd say "Return -errno if an error
occured, otherwise return 0".

Fam

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-20  4:37             ` Fam Zheng
@ 2016-07-20  7:09               ` Paolo Bonzini
  2016-07-20  7:38                 ` Fam Zheng
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2016-07-20  7:09 UTC (permalink / raw)
  To: Fam Zheng, Eric Blake
  Cc: Kevin Wolf, nbd-general, qemu-block, qemu-devel, Max Reitz



On 20/07/2016 06:37, Fam Zheng wrote:
> Yes, you are right about this, I was confused because "qemu-img map" does not
> report this allocation state after zero write. (No idea why SEEK_DATA doesn't
> hit the fallocate'ed area.)

Apparently it's because it's zeroed.

$ fallocate -z -o 10485760 -l 10485760 test
$ fallocate -p -o 49152000 -l 10485760 test
$ fallocate -o 49152000 -l 10485760 test
$ fallocate -p -o 65536000 -l 10485760 test

Now we have:

- a zero area at 10240K..20480K

- an hole+allocated area at 48000K..59240K

- a hole at 64000K..74240K

$ qemu-img map test
Offset          Length          Mapped to       File
0               0xa00000        0               test    << ends at 10240K
0x1400000       0x1ae0000       0x1400000       test    << ends at 48000K
0x38e0000       0x5a0000        0x38e0000       test    << ends at 64000K
0x4880000       0x1b80000       0x4880000       test

So "qemu-img map" hides both zeroed and hole areas.  With the JSON format
we get more information:

$ qemu-img map --output=json test
[{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 10485760},
{ "start": 20971520, "length": 28180480, "depth": 0, "zero": false, "data": true, "offset": 20971520},
{ "start": 49152000, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 49152000},
{ "start": 59637760, "length": 5898240, "depth": 0, "zero": false, "data": true, "offset": 59637760},
{ "start": 65536000, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 65536000},
{ "start": 76021760, "length": 28835840, "depth": 0, "zero": false, "data": true, "offset": 76021760}]

Both zeroed and holes are reported as "zero": true, "data": false.  This
limitation stems from the fact that we cannot use FIEMAP.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-20  7:09               ` Paolo Bonzini
@ 2016-07-20  7:38                 ` Fam Zheng
  2016-07-20  8:16                   ` Paolo Bonzini
  2016-07-20  9:19                   ` [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server) Paolo Bonzini
  0 siblings, 2 replies; 55+ messages in thread
From: Fam Zheng @ 2016-07-20  7:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Eric Blake, Kevin Wolf, nbd-general, qemu-block, qemu-devel, Max Reitz

On Wed, 07/20 09:09, Paolo Bonzini wrote:
> 
> 
> On 20/07/2016 06:37, Fam Zheng wrote:
> > Yes, you are right about this, I was confused because "qemu-img map" does not
> > report this allocation state after zero write. (No idea why SEEK_DATA doesn't
> > hit the fallocate'ed area.)
> 
> Apparently it's because it's zeroed.
> 
> $ fallocate -z -o 10485760 -l 10485760 test
> $ fallocate -p -o 49152000 -l 10485760 test
> $ fallocate -o 49152000 -l 10485760 test
> $ fallocate -p -o 65536000 -l 10485760 test
> 
> Now we have:
> 
> - a zero area at 10240K..20480K
> 
> - an hole+allocated area at 48000K..59240K
> 
> - a hole at 64000K..74240K
> 
> $ qemu-img map test
> Offset          Length          Mapped to       File
> 0               0xa00000        0               test    << ends at 10240K
> 0x1400000       0x1ae0000       0x1400000       test    << ends at 48000K
> 0x38e0000       0x5a0000        0x38e0000       test    << ends at 64000K
> 0x4880000       0x1b80000       0x4880000       test

Hah? I'm apparently missing something. I can't see these entries:

> fam@ad:/var/tmp$ cat /tmp/sh
> touch test
> fallocate -z -o 10485760 -l 10485760 test
> fallocate -p -o 49152000 -l 10485760 test
> fallocate -o 49152000 -l 10485760 test
> fallocate -p -o 65536000 -l 10485760 test
> qemu-img map test
> fam@ad:/var/tmp$ sh /tmp/sh
> Offset          Length          Mapped to       File

(I'm using Fedora 24 but I've also tried RHEL 7.)

Fam

> 
> So "qemu-img map" hides both zeroed and hole areas.  With the JSON format
> we get more information:
> 
> $ qemu-img map --output=json test
> [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 10485760},
> { "start": 20971520, "length": 28180480, "depth": 0, "zero": false, "data": true, "offset": 20971520},
> { "start": 49152000, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 49152000},
> { "start": 59637760, "length": 5898240, "depth": 0, "zero": false, "data": true, "offset": 59637760},
> { "start": 65536000, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 65536000},
> { "start": 76021760, "length": 28835840, "depth": 0, "zero": false, "data": true, "offset": 76021760}]
> 
> Both zeroed and holes are reported as "zero": true, "data": false.  This
> limitation stems from the fact that we cannot use FIEMAP.
> 
> Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-20  7:38                 ` Fam Zheng
@ 2016-07-20  8:16                   ` Paolo Bonzini
  2016-07-20  9:04                     ` Fam Zheng
  2016-07-20  9:19                   ` [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server) Paolo Bonzini
  1 sibling, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2016-07-20  8:16 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Eric Blake, Kevin Wolf, nbd-general, qemu-block, qemu-devel, Max Reitz

> On Wed, 07/20 09:09, Paolo Bonzini wrote:
> > On 20/07/2016 06:37, Fam Zheng wrote:
> > > Yes, you are right about this, I was confused because "qemu-img map" does
> > > not
> > > report this allocation state after zero write. (No idea why SEEK_DATA
> > > doesn't
> > > hit the fallocate'ed area.)
> > 
> > Apparently it's because it's zeroed.
> > 
> > $ fallocate -z -o 10485760 -l 10485760 test
> > $ fallocate -p -o 49152000 -l 10485760 test
> > $ fallocate -o 49152000 -l 10485760 test
> > $ fallocate -p -o 65536000 -l 10485760 test
> > 
> > Now we have:
> > 
> > - a zero area at 10240K..20480K
> > 
> > - an hole+allocated area at 48000K..59240K
> > 
> > - a hole at 64000K..74240K
> > 
> > $ qemu-img map test
> > Offset          Length          Mapped to       File
> > 0               0xa00000        0               test    << ends at 10240K
> > 0x1400000       0x1ae0000       0x1400000       test    << ends at 48000K
> > 0x38e0000       0x5a0000        0x38e0000       test    << ends at 64000K
> > 0x4880000       0x1b80000       0x4880000       test
> 
> Hah? I'm apparently missing something. I can't see these entries:
> 
> > fam@ad:/var/tmp$ cat /tmp/sh
> > touch test
> > fallocate -z -o 10485760 -l 10485760 test
> > fallocate -p -o 49152000 -l 10485760 test
> > fallocate -o 49152000 -l 10485760 test
> > fallocate -p -o 65536000 -l 10485760 test
> > qemu-img map test
> > fam@ad:/var/tmp$ sh /tmp/sh
> > Offset          Length          Mapped to       File
> 
> (I'm using Fedora 24 but I've also tried RHEL 7.)

That was after a full allocation of a 1GB file.  I didn't copy that line
because I had the file lying around from another day--sorry.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-20  8:16                   ` Paolo Bonzini
@ 2016-07-20  9:04                     ` Fam Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Fam Zheng @ 2016-07-20  9:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Eric Blake, Kevin Wolf, nbd-general, qemu-block, qemu-devel, Max Reitz

On Wed, 07/20 04:16, Paolo Bonzini wrote:
> That was after a full allocation of a 1GB file.  I didn't copy that line
> because I had the file lying around from another day--sorry.

Makes sense now, thanks for explaning!

Fam

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-20  7:38                 ` Fam Zheng
  2016-07-20  8:16                   ` Paolo Bonzini
@ 2016-07-20  9:19                   ` Paolo Bonzini
  2016-07-20 12:30                     ` Dave Chinner
  1 sibling, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2016-07-20  9:19 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Eric Blake, Kevin Wolf, qemu-block, qemu-devel, Max Reitz,
	Lukas Czerner, Dave Chinner, P, Niels de Vos

Adding ext4 and XFS guys (Lukas and Dave respectively).  As a quick recap, the
issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which we use in
"qemu-img map".  This command prints metadata about a virtual disk image---which
in the case of a raw image amounts to detecting holes and unwritten extents.

First, it seems like SEEK_HOLE and SEEK_DATA really should be "SEEK_NONZERO" and
"SEEK_ZERO", on both ext4 and XFS.    You can see that unwritten extents are
reported by "qemu-img map" as holes:

    $ dd if=/dev/urandom of=test.img bs=1M count=100
    $ fallocate -z -o 10M -l 10M test.img
    $ du -h test.img
    $ qemu-img map --output=json test.img
    [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
    { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 10485760},
    { "start": 20971520, "length": 83886080, "depth": 0, "zero": false, "data": true, "offset": 20971520}]

On the second line, zero=true data=false identifies a hole.  The right output
would either have zero=true data=true (unwritten extent) or just

    [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data": true, "offset": 0},

since the zero flag is advisory (it doesn't try to detect zeroes beyond what the
filesystem says).

This leads to the second question, which is about FIEMAP and FIEMAP_FLAG_SYNC in
particular.  Until 2014, QEMU used FIEMAP to implement "qemu-img map".  I
resurrected that cod eand indeed it works:

    $ dd if=/dev/urandom of=test.img bs=1M count=100
    $ du -h test.img
    $ fallocate -z -o 10M -l 10M test.img
    $ ./qemu-img map --output=json test.img
    [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
    { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": true, "offset": 10485760},
    { "start": 20971520, "length": 83886080, "depth": 0, "zero": false, "data": true, "offset": 20971520}]

This time qemu-img correctly reports an unwritten extent on the second line.  It
reports correctly holes too; continuing the previous example:

    $ fallocate -p -o 20M -l 10M test.img
    $ ./qemu-img map --output=json test.img
    [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
    { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": true, "offset": 10485760},
    { "start": 20971520, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 20971520},
    { "start": 31457280, "length": 73400320, "depth": 0, "zero": false, "data": true, "offset": 31457280}]

Notice that you have data=true on the second line (unwritten extent) but
data=false (hole) on the third.

The reason why we disabled FIEMAP was a combination of a corruption and performance
issue.  The data corruption bug was at https://bugs.launchpad.net/qemu/+bug/1368815
and it was reported on Ubuntu Trusty (kernel 3.13 based on the release notes at
https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes).  We corrected that by using
FIEMAP_FLAG_SYNC, based on a similar patch to coreutils.  This turned out to be too
slow, so we dropped FIEMAP altogether.

However, today I found kernel commit 91dd8c114499 ("ext4: prevent race while walking
extent tree for fiemap", 2012-11-28) whose commit message says:

    Moreover the extent currently in delayed allocation might be allocated
    after we search the extent tree and before we search extent status tree
    delayed buffers resulting in those delayed buffers being completely
    missed, even though completely written and allocated.

This seems pretty much like our data corruption bug; it would mean that
using FIEMAP_FLAG_SYNC was only working around a bug and delayed allocations
_should_ be reported as usual by FIEMAP.

Except that the commit went in kernel 3.8 and as said above Trusty had 3.13.
So either there are other bugs, or my understanding of the commit is not correct.
So the questions for Lukas and Dave are:

1) is it expected that SEEK_HOLE skips unwritten extents?  If not, would
it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
would be similar to what SEEK_HOLE/SEEK_DATA do now?

2) for FIEMAP do we really need FIEMAP_FLAG_SYNC?  And if not, for what
filesystems and kernel releases is it really not needed?

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-20  9:19                   ` [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server) Paolo Bonzini
@ 2016-07-20 12:30                     ` Dave Chinner
  2016-07-20 13:35                       ` Niels de Vos
  2016-07-20 13:40                       ` Paolo Bonzini
  0 siblings, 2 replies; 55+ messages in thread
From: Dave Chinner @ 2016-07-20 12:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Eric Blake, Kevin Wolf, qemu-block, qemu-devel,
	Max Reitz, Lukas Czerner, P, Niels de Vos

On Wed, Jul 20, 2016 at 05:19:37AM -0400, Paolo Bonzini wrote:
> Adding ext4 and XFS guys (Lukas and Dave respectively).  As a quick recap, the
> issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which we use in
> "qemu-img map".  This command prints metadata about a virtual disk image---which
> in the case of a raw image amounts to detecting holes and unwritten extents.
> 
> First, it seems like SEEK_HOLE and SEEK_DATA really should be "SEEK_NONZERO" and
> "SEEK_ZERO", on both ext4 and XFS.    You can see that unwritten extents are
> reported by "qemu-img map" as holes:

Correctly so. seek hole/data knows nothing about the underlying
filesystem storage implementation. A "hole" is defined as a region
of zeros, and the filesystem is free call anything it kows for
certain contains zeros as a hole.

FYI, SEEK_HOLE/DATA were named after the the pre-existing Solaris
seek API that uses these definitions for hole and data....

>     $ dd if=/dev/urandom of=test.img bs=1M count=100
>     $ fallocate -z -o 10M -l 10M test.img
>     $ du -h test.img
>     $ qemu-img map --output=json test.img
>     [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
>     { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 10485760},
>     { "start": 20971520, "length": 83886080, "depth": 0, "zero": false, "data": true, "offset": 20971520}]

> 
> On the second line, zero=true data=false identifies a hole.  The right output
> would either have zero=true data=true (unwritten extent) or just
> 
>     [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data": true, "offset": 0},
>
> since the zero flag is advisory (it doesn't try to detect zeroes beyond what the
> filesystem says).

Not from SEEK_HOLE/SEEK_DATA. All it conveys is "this is the start
of a range of zeros" and "this is the start of a range of data". And
for filesystems that don't specifically implement these seek
operations, zeros are considered data. i.e. SEEK_HOLE will take you
to the end of file, SEEK_DATA returns the current position....

i.e. unwritten extents contain no data, so they are semantically
identical to holes for the purposes of seeking and hence SEEK_DATA
can skip over them.

> The reason why we disabled FIEMAP was a combination of a corruption and performance
> issue.  The data corruption bug was at https://bugs.launchpad.net/qemu/+bug/1368815
> and it was reported on Ubuntu Trusty (kernel 3.13 based on the release notes at
> https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes).  We corrected that by using
> FIEMAP_FLAG_SYNC, based on a similar patch to coreutils.  This turned out to be too
> slow, so we dropped FIEMAP altogether.

Yes, because FIEMAP output is only useful for diagnostic purposes as
it can be stale even before the syscall returns to userspace. i.e.
it can't be used in userspace for optimising file copies....

Finding regions of valid data in a file is what SEEK_HOLE/SEEK_DATA
is for, not FIEMAP. FIEMAP only reports the physical layout of the
file, now where the currently valid data in the file lies.

> However, today I found kernel commit 91dd8c114499 ("ext4: prevent race while walking
> extent tree for fiemap", 2012-11-28) whose commit message says:
> 
>     Moreover the extent currently in delayed allocation might be allocated
>     after we search the extent tree and before we search extent status tree
>     delayed buffers resulting in those delayed buffers being completely
>     missed, even though completely written and allocated.
> 
> This seems pretty much like our data corruption bug; it would mean that
> using FIEMAP_FLAG_SYNC was only working around a bug and delayed allocations
> _should_ be reported as usual by FIEMAP.

What do you think you can do with the delayed allocation regions in
the file?

Indeed, with specualtive delayed allocation, XFS can report delalloc
regions that have no actual resemblence to the layout of dirty data
in the file. SEEK_DATA will iterate the delalloc regions containing
data precisely, however.

> Except that the commit went in kernel 3.8 and as said above Trusty had 3.13.
> So either there are other bugs, or my understanding of the commit is not correct.
> So the questions for Lukas and Dave are:
> 
> 1) is it expected that SEEK_HOLE skips unwritten extents?

There are multiple answers to this, all of which are correct depending
on current context and state:

1. No - some filesystems will report clean unwritten extents as holes.

2. Yes - some filesystems will report clean unwritten extents as
data.

3.  Maybe - if there is written data in memory over the unwritten
extent on disk (i.e. hasn't been flushed to disk, it will be
considered a data region with non-zero data. (FIEMAP will still
report is as unwritten)

> If not, would
> it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
> would be similar to what SEEK_HOLE/SEEK_DATA do now?

To solve what problem? You haven't explained what problem you are
trying to solve yet.

> 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC?  And if not, for what
> filesystems and kernel releases is it really not needed?

I can't answer this question, either, because I don't know what
you want the fiemap information for.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-20 12:30                     ` Dave Chinner
@ 2016-07-20 13:35                       ` Niels de Vos
  2016-07-21 11:43                         ` Dave Chinner
  2016-07-20 13:40                       ` Paolo Bonzini
  1 sibling, 1 reply; 55+ messages in thread
From: Niels de Vos @ 2016-07-20 13:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Paolo Bonzini, Fam Zheng, Eric Blake, Kevin Wolf, qemu-block,
	qemu-devel, Max Reitz, Lukas Czerner, P

[-- Attachment #1: Type: text/plain, Size: 6395 bytes --]

On Wed, Jul 20, 2016 at 10:30:25PM +1000, Dave Chinner wrote:
> On Wed, Jul 20, 2016 at 05:19:37AM -0400, Paolo Bonzini wrote:
> > Adding ext4 and XFS guys (Lukas and Dave respectively).  As a quick recap, the
> > issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which we use in
> > "qemu-img map".  This command prints metadata about a virtual disk image---which
> > in the case of a raw image amounts to detecting holes and unwritten extents.
> > 
> > First, it seems like SEEK_HOLE and SEEK_DATA really should be "SEEK_NONZERO" and
> > "SEEK_ZERO", on both ext4 and XFS.    You can see that unwritten extents are
> > reported by "qemu-img map" as holes:
> 
> Correctly so. seek hole/data knows nothing about the underlying
> filesystem storage implementation. A "hole" is defined as a region
> of zeros, and the filesystem is free call anything it kows for
> certain contains zeros as a hole.
> 
> FYI, SEEK_HOLE/DATA were named after the the pre-existing Solaris
> seek API that uses these definitions for hole and data....
> 
> >     $ dd if=/dev/urandom of=test.img bs=1M count=100
> >     $ fallocate -z -o 10M -l 10M test.img
> >     $ du -h test.img
> >     $ qemu-img map --output=json test.img
> >     [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
> >     { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 10485760},
> >     { "start": 20971520, "length": 83886080, "depth": 0, "zero": false, "data": true, "offset": 20971520}]
> 
> > 
> > On the second line, zero=true data=false identifies a hole.  The right output
> > would either have zero=true data=true (unwritten extent) or just
> > 
> >     [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data": true, "offset": 0},
> >
> > since the zero flag is advisory (it doesn't try to detect zeroes beyond what the
> > filesystem says).
> 
> Not from SEEK_HOLE/SEEK_DATA. All it conveys is "this is the start
> of a range of zeros" and "this is the start of a range of data". And
> for filesystems that don't specifically implement these seek
> operations, zeros are considered data. i.e. SEEK_HOLE will take you
> to the end of file, SEEK_DATA returns the current position....
> 
> i.e. unwritten extents contain no data, so they are semantically
> identical to holes for the purposes of seeking and hence SEEK_DATA
> can skip over them.
> 
> > The reason why we disabled FIEMAP was a combination of a corruption and performance
> > issue.  The data corruption bug was at https://bugs.launchpad.net/qemu/+bug/1368815
> > and it was reported on Ubuntu Trusty (kernel 3.13 based on the release notes at
> > https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes).  We corrected that by using
> > FIEMAP_FLAG_SYNC, based on a similar patch to coreutils.  This turned out to be too
> > slow, so we dropped FIEMAP altogether.
> 
> Yes, because FIEMAP output is only useful for diagnostic purposes as
> it can be stale even before the syscall returns to userspace. i.e.
> it can't be used in userspace for optimising file copies....

Oh... And I was surprised to learn that "cp" does use FIEMAP and not
SEEK_HOLE/SEEK_DATA. You give a strong suggestion to fix that.
  http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/extent-scan.c#n87

> Finding regions of valid data in a file is what SEEK_HOLE/SEEK_DATA
> is for, not FIEMAP. FIEMAP only reports the physical layout of the
> file, now where the currently valid data in the file lies.
> 
> > However, today I found kernel commit 91dd8c114499 ("ext4: prevent race while walking
> > extent tree for fiemap", 2012-11-28) whose commit message says:
> > 
> >     Moreover the extent currently in delayed allocation might be allocated
> >     after we search the extent tree and before we search extent status tree
> >     delayed buffers resulting in those delayed buffers being completely
> >     missed, even though completely written and allocated.
> > 
> > This seems pretty much like our data corruption bug; it would mean that
> > using FIEMAP_FLAG_SYNC was only working around a bug and delayed allocations
> > _should_ be reported as usual by FIEMAP.
> 
> What do you think you can do with the delayed allocation regions in
> the file?
> 
> Indeed, with specualtive delayed allocation, XFS can report delalloc
> regions that have no actual resemblence to the layout of dirty data
> in the file. SEEK_DATA will iterate the delalloc regions containing
> data precisely, however.
> 
> > Except that the commit went in kernel 3.8 and as said above Trusty had 3.13.
> > So either there are other bugs, or my understanding of the commit is not correct.
> > So the questions for Lukas and Dave are:
> > 
> > 1) is it expected that SEEK_HOLE skips unwritten extents?
> 
> There are multiple answers to this, all of which are correct depending
> on current context and state:
> 
> 1. No - some filesystems will report clean unwritten extents as holes.
> 
> 2. Yes - some filesystems will report clean unwritten extents as
> data.
> 
> 3.  Maybe - if there is written data in memory over the unwritten
> extent on disk (i.e. hasn't been flushed to disk, it will be
> considered a data region with non-zero data. (FIEMAP will still
> report is as unwritten)
> 
> > If not, would
> > it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
> > would be similar to what SEEK_HOLE/SEEK_DATA do now?
> 
> To solve what problem? You haven't explained what problem you are
> trying to solve yet.

I think this is about optimizing file copy. There is a difference
between pre-allocated (but unwritten) and zero'd out extents. In the
Gluster network protocol, we could use FALLOCATE for the first, and
ZEROFILL for the second. Depending on what the underlaying filesystem
and storage support, performance would differ.

But, maybe this distinction is not important enough to optimize for?
That is not something I can judge for the QEMU use-case.

> > 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC?  And if not, for what
> > filesystems and kernel releases is it really not needed?
> 
> I can't answer this question, either, because I don't know what
> you want the fiemap information for.

Great details, thanks for sharing!

HTH,
Niels

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-20 12:30                     ` Dave Chinner
  2016-07-20 13:35                       ` Niels de Vos
@ 2016-07-20 13:40                       ` Paolo Bonzini
  2016-07-21 12:41                         ` Dave Chinner
  1 sibling, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2016-07-20 13:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fam Zheng, Eric Blake, Kevin Wolf, qemu-block, qemu-devel,
	Max Reitz, Lukas Czerner, P, Niels de Vos

> > 1) is it expected that SEEK_HOLE skips unwritten extents?
> 
> There are multiple answers to this, all of which are correct depending
> on current context and state:
> 
> 1. No - some filesystems will report clean unwritten extents as holes.
> 
> 2. Yes - some filesystems will report clean unwritten extents as data.
> 
> 3.  Maybe - if there is written data in memory over the unwritten
> extent on disk (i.e. hasn't been flushed to disk, it will be
> considered a data region with non-zero data. (FIEMAP will still
> report is as unwritten)

Ok, I thought it would return FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC
in this case (not FIEMAP_EXTENT_UNWRITTEN).

> > If not, would
> > it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
> > would be similar to what SEEK_HOLE/SEEK_DATA do now?
> 
> To solve what problem? You haven't explained what problem you are
> trying to solve yet.
> 
> > 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC?  And if not, for what
> > filesystems and kernel releases is it really not needed?
> 
> I can't answer this question, either, because I don't know what
> you want the fiemap information for.

The answer is the same no matter if we use both lseek and FIEMAP, so
I'll answer just once.  We want to do two things:

1) avoid copying zero data, to keep the copy process efficient.  For this,
SEEK_HOLE/SEEK_DATA are enough.

2) copy file contents while preserving the allocation state of the file's extents.
There can be various reasons why the user has preallocated the file (because they
don't want an ENOSPC to happen while the VM runs; on some filesystems, to
minimize cases where io_submit is very un-asynchronous; or just because someone
had a reason to do a BLKZEROOUT ioctl on the virtual disk).  We want to preserve
these while converting or otherwise moving the file around.

Preallocation can result in unwritten extents in various cases.  Of course, it is
often a simple fallocate on the host.  However, when the guest sends a BLKZEROOUT
ioctl that will also be converted by QEMU into fallocate(FALLOC_FL_ZERO_RANGE) or
ioctl(XFS_IOC_ZERO_RANGE).  In this case in fact the user could be preallocating
only part of the disk.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-20 13:35                       ` Niels de Vos
@ 2016-07-21 11:43                         ` Dave Chinner
  2016-07-21 12:31                           ` Pádraig Brady
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2016-07-21 11:43 UTC (permalink / raw)
  To: Niels de Vos
  Cc: Paolo Bonzini, Fam Zheng, Eric Blake, Kevin Wolf, qemu-block,
	qemu-devel, Max Reitz, Lukas Czerner, P

On Wed, Jul 20, 2016 at 03:35:17PM +0200, Niels de Vos wrote:
> On Wed, Jul 20, 2016 at 10:30:25PM +1000, Dave Chinner wrote:
> > On Wed, Jul 20, 2016 at 05:19:37AM -0400, Paolo Bonzini wrote:
> > > Adding ext4 and XFS guys (Lukas and Dave respectively).  As a quick recap, the
> > > issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which we use in
> > > "qemu-img map".  This command prints metadata about a virtual disk image---which
> > > in the case of a raw image amounts to detecting holes and unwritten extents.
> > > 
> > > First, it seems like SEEK_HOLE and SEEK_DATA really should be "SEEK_NONZERO" and
> > > "SEEK_ZERO", on both ext4 and XFS.    You can see that unwritten extents are
> > > reported by "qemu-img map" as holes:
> > 
> > Correctly so. seek hole/data knows nothing about the underlying
> > filesystem storage implementation. A "hole" is defined as a region
> > of zeros, and the filesystem is free call anything it kows for
> > certain contains zeros as a hole.
> > 
> > FYI, SEEK_HOLE/DATA were named after the the pre-existing Solaris
> > seek API that uses these definitions for hole and data....
> > 
> > >     $ dd if=/dev/urandom of=test.img bs=1M count=100
> > >     $ fallocate -z -o 10M -l 10M test.img
> > >     $ du -h test.img
> > >     $ qemu-img map --output=json test.img
> > >     [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
> > >     { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 10485760},
> > >     { "start": 20971520, "length": 83886080, "depth": 0, "zero": false, "data": true, "offset": 20971520}]
> > 
> > > 
> > > On the second line, zero=true data=false identifies a hole.  The right output
> > > would either have zero=true data=true (unwritten extent) or just
> > > 
> > >     [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data": true, "offset": 0},
> > >
> > > since the zero flag is advisory (it doesn't try to detect zeroes beyond what the
> > > filesystem says).
> > 
> > Not from SEEK_HOLE/SEEK_DATA. All it conveys is "this is the start
> > of a range of zeros" and "this is the start of a range of data". And
> > for filesystems that don't specifically implement these seek
> > operations, zeros are considered data. i.e. SEEK_HOLE will take you
> > to the end of file, SEEK_DATA returns the current position....
> > 
> > i.e. unwritten extents contain no data, so they are semantically
> > identical to holes for the purposes of seeking and hence SEEK_DATA
> > can skip over them.
> > 
> > > The reason why we disabled FIEMAP was a combination of a corruption and performance
> > > issue.  The data corruption bug was at https://bugs.launchpad.net/qemu/+bug/1368815
> > > and it was reported on Ubuntu Trusty (kernel 3.13 based on the release notes at
> > > https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes).  We corrected that by using
> > > FIEMAP_FLAG_SYNC, based on a similar patch to coreutils.  This turned out to be too
> > > slow, so we dropped FIEMAP altogether.
> > 
> > Yes, because FIEMAP output is only useful for diagnostic purposes as
> > it can be stale even before the syscall returns to userspace. i.e.
> > it can't be used in userspace for optimising file copies....
> 
> Oh... And I was surprised to learn that "cp" does use FIEMAP and not
> SEEK_HOLE/SEEK_DATA. You give a strong suggestion to fix that.
>   http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/extent-scan.c#n87

My understanding was that FIEMAP was disabled almost immediately
after the data corruption problems were understood, and it hasn't
been turned back on since. I don't see FIEMAP calls in strace when I
copy sparse files, even when I use --sparse=auto which is supposed
to trigger the sparse source file checks using FIEMAP.

So, yeah, while there's FIEMAP code present, that doesn't mean it's
actually used. And, yes, there are comments in coreutils commits in
2011 saying that the FIEMAP workarounds are temporary until
SEK_HOLE/DATA are supported, which were added to the kernel in 2011
soon after the 'cp-corrupts-data-with-fiemap' debacle came to light.
Five years later, and the fiemap code is stil there?

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-21 11:43                         ` Dave Chinner
@ 2016-07-21 12:31                           ` Pádraig Brady
  2016-07-21 13:15                             ` Dave Chinner
  0 siblings, 1 reply; 55+ messages in thread
From: Pádraig Brady @ 2016-07-21 12:31 UTC (permalink / raw)
  To: Dave Chinner, Niels de Vos
  Cc: Paolo Bonzini, Fam Zheng, Eric Blake, Kevin Wolf, qemu-block,
	qemu-devel, Max Reitz, Lukas Czerner

On 21/07/16 12:43, Dave Chinner wrote:
> On Wed, Jul 20, 2016 at 03:35:17PM +0200, Niels de Vos wrote:
>> On Wed, Jul 20, 2016 at 10:30:25PM +1000, Dave Chinner wrote:
>>> On Wed, Jul 20, 2016 at 05:19:37AM -0400, Paolo Bonzini wrote:
>>>> Adding ext4 and XFS guys (Lukas and Dave respectively).  As a quick recap, the
>>>> issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which we use in
>>>> "qemu-img map".  This command prints metadata about a virtual disk image---which
>>>> in the case of a raw image amounts to detecting holes and unwritten extents.
>>>>
>>>> First, it seems like SEEK_HOLE and SEEK_DATA really should be "SEEK_NONZERO" and
>>>> "SEEK_ZERO", on both ext4 and XFS.    You can see that unwritten extents are
>>>> reported by "qemu-img map" as holes:
>>>
>>> Correctly so. seek hole/data knows nothing about the underlying
>>> filesystem storage implementation. A "hole" is defined as a region
>>> of zeros, and the filesystem is free call anything it kows for
>>> certain contains zeros as a hole.
>>>
>>> FYI, SEEK_HOLE/DATA were named after the the pre-existing Solaris
>>> seek API that uses these definitions for hole and data....
>>>
>>>>     $ dd if=/dev/urandom of=test.img bs=1M count=100
>>>>     $ fallocate -z -o 10M -l 10M test.img
>>>>     $ du -h test.img
>>>>     $ qemu-img map --output=json test.img
>>>>     [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
>>>>     { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 10485760},
>>>>     { "start": 20971520, "length": 83886080, "depth": 0, "zero": false, "data": true, "offset": 20971520}]
>>>
>>>>
>>>> On the second line, zero=true data=false identifies a hole.  The right output
>>>> would either have zero=true data=true (unwritten extent) or just
>>>>
>>>>     [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data": true, "offset": 0},
>>>>
>>>> since the zero flag is advisory (it doesn't try to detect zeroes beyond what the
>>>> filesystem says).
>>>
>>> Not from SEEK_HOLE/SEEK_DATA. All it conveys is "this is the start
>>> of a range of zeros" and "this is the start of a range of data". And
>>> for filesystems that don't specifically implement these seek
>>> operations, zeros are considered data. i.e. SEEK_HOLE will take you
>>> to the end of file, SEEK_DATA returns the current position....
>>>
>>> i.e. unwritten extents contain no data, so they are semantically
>>> identical to holes for the purposes of seeking and hence SEEK_DATA
>>> can skip over them.
>>>
>>>> The reason why we disabled FIEMAP was a combination of a corruption and performance
>>>> issue.  The data corruption bug was at https://bugs.launchpad.net/qemu/+bug/1368815
>>>> and it was reported on Ubuntu Trusty (kernel 3.13 based on the release notes at
>>>> https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes).  We corrected that by using
>>>> FIEMAP_FLAG_SYNC, based on a similar patch to coreutils.  This turned out to be too
>>>> slow, so we dropped FIEMAP altogether.
>>>
>>> Yes, because FIEMAP output is only useful for diagnostic purposes as
>>> it can be stale even before the syscall returns to userspace. i.e.
>>> it can't be used in userspace for optimising file copies....
>>
>> Oh... And I was surprised to learn that "cp" does use FIEMAP and not
>> SEEK_HOLE/SEEK_DATA. You give a strong suggestion to fix that.
>>   http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/extent-scan.c#n87
> 
> My understanding was that FIEMAP was disabled almost immediately
> after the data corruption problems were understood, and it hasn't
> been turned back on since. I don't see FIEMAP calls in strace when I
> copy sparse files, even when I use --sparse=auto which is supposed
> to trigger the sparse source file checks using FIEMAP.
> 
> So, yeah, while there's FIEMAP code present, that doesn't mean it's
> actually used. And, yes, there are comments in coreutils commits in
> 2011 saying that the FIEMAP workarounds are temporary until
> SEK_HOLE/DATA are supported, which were added to the kernel in 2011
> soon after the 'cp-corrupts-data-with-fiemap' debacle came to light.
> Five years later, and the fiemap code is stil there?

Yes it's still there, and hasn't caused issues.
It's used only for sparse files:

  $ truncate -s1G t.fiemap
  $ strace -e ioctl cp t.fiemap t2.fiemap
  ioctl(3, FS_IOC_FIEMAP, 0x7ffff007c910) = 0

It's a pity fiemap is racy (on some file systems?) wrt reporting
of data not yet written out, and is thus fairly useless IMHO
without the FIEMAP_FLAG_SYNC workaround.

We _are_ planning to use SEEK_HOLE in future.

It would be nice though to have a way to distinguish zeros from holes
to efficiently/consistently maintain allocations to avoid future ENOSPC
or inefficient future allocations.

thanks,
Pádraig

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-20 13:40                       ` Paolo Bonzini
@ 2016-07-21 12:41                         ` Dave Chinner
  2016-07-21 13:01                           ` Pádraig Brady
  2016-07-21 14:23                           ` Paolo Bonzini
  0 siblings, 2 replies; 55+ messages in thread
From: Dave Chinner @ 2016-07-21 12:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Eric Blake, Kevin Wolf, qemu-block, qemu-devel,
	Max Reitz, Lukas Czerner, P, Niels de Vos

On Wed, Jul 20, 2016 at 09:40:06AM -0400, Paolo Bonzini wrote:
> > > 1) is it expected that SEEK_HOLE skips unwritten extents?
> > 
> > There are multiple answers to this, all of which are correct depending
> > on current context and state:
> > 
> > 1. No - some filesystems will report clean unwritten extents as holes.
> > 
> > 2. Yes - some filesystems will report clean unwritten extents as data.
> > 
> > 3.  Maybe - if there is written data in memory over the unwritten
> > extent on disk (i.e. hasn't been flushed to disk, it will be
> > considered a data region with non-zero data. (FIEMAP will still
> > report is as unwritten)
> 
> Ok, I thought it would return FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC
> in this case (not FIEMAP_EXTENT_UNWRITTEN).

No. FIEMAP only returns the known extent state at the given file
offset.  "delalloc" extents exist in memory, indicating the space
has already been accounted for over that offset, but the extent has
not been physically allocated. Like all other types of extents,
there may or may not be valid data over a delayed allocation extent. 

IOWs, fiemap only gives you a snapshot of extent state, not the
ranges of valid data in the file.

> > > If not, would
> > > it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
> > > would be similar to what SEEK_HOLE/SEEK_DATA do now?
> > 
> > To solve what problem? You haven't explained what problem you are
> > trying to solve yet.
> > 
> > > 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC?  And if not, for what
> > > filesystems and kernel releases is it really not needed?
> > 
> > I can't answer this question, either, because I don't know what
> > you want the fiemap information for.
> 
> The answer is the same no matter if we use both lseek and FIEMAP, so
> I'll answer just once.  We want to do two things:
> 
> 1) avoid copying zero data, to keep the copy process efficient.  For this,
> SEEK_HOLE/SEEK_DATA are enough.
> 
> 2) copy file contents while preserving the allocation state of the file's extents.

Which is /very difficult/ to do safely and reliably.

We do actually do reliable, safe, exact hole and preallocation
layout duplication with xfs_fsr, but that uses kernel provided
cookies (from XFS_IOC_BULKSTAT) to detect that data in the source
file has not changed while it was being copied before executing the
final defrag operation in the kernel (XFS_IOC_SWAPEXT) that makes
the new copy of the data user visible.

i.e. the use of fiemap to duplicate the exact layout of a file
from userspace is only posisble if you can /guarantee/ the source
file has not changed in any way during the copy operation at the
pointin time you finalise the destination data copy.

> There can be various reasons why the user has preallocated the file (because they
> don't want an ENOSPC to happen while the VM runs; on some filesystems, to
> minimize cases where io_submit is very un-asynchronous; or just because someone
> had a reason to do a BLKZEROOUT ioctl on the virtual disk).  We want to preserve
> these while converting or otherwise moving the file around.

Sure, there's many reasons for using prealloc/punch/zero. The real
difference to other file operations is that they interface with low
level filesystem structure, not the data contained within the
extents. That's what makes them problematic for duplication -
userspace cannot serialise against low level filesystem structure
modifications.

Optimising file copies safely is one of the reasons the
copy_file_range() syscall has been introduced (in 4.5). While we
haven't implemented anything special in XFS yet, it will internally
use splice to do a zero-copy data transfer from source to
destination file. Optimising for exact layout copies is precisely
the sort of thing this syscall is intended for.

It's also intended to enable applications to take advantage of
hardware acceleration of data copying (e.g. server side copies to
avoid round trips as has been implemented for NFS, or storage array
offload of data copying) when such support is provided by the kernel.

IOWs, I think you should be looking to optimise file copies by using
copy_file_range() and getting filesystems to do exactly what you
need. Using FIEMAP, fallocate and moving data through userspace
won't ever be reliable without special filesystem help (that only
exists for XFS right now), nor will it enable the application to
transparently use smart storage protocols and hardware when it is
present on user systems....

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-21 12:41                         ` Dave Chinner
@ 2016-07-21 13:01                           ` Pádraig Brady
  2016-07-21 14:23                           ` Paolo Bonzini
  1 sibling, 0 replies; 55+ messages in thread
From: Pádraig Brady @ 2016-07-21 13:01 UTC (permalink / raw)
  To: Dave Chinner, Paolo Bonzini
  Cc: Fam Zheng, Eric Blake, Kevin Wolf, qemu-block, qemu-devel,
	Max Reitz, Lukas Czerner, Niels de Vos

On 21/07/16 13:41, Dave Chinner wrote:
> On Wed, Jul 20, 2016 at 09:40:06AM -0400, Paolo Bonzini wrote:
>>>> 1) is it expected that SEEK_HOLE skips unwritten extents?
>>>
>>> There are multiple answers to this, all of which are correct depending
>>> on current context and state:
>>>
>>> 1. No - some filesystems will report clean unwritten extents as holes.
>>>
>>> 2. Yes - some filesystems will report clean unwritten extents as data.
>>>
>>> 3.  Maybe - if there is written data in memory over the unwritten
>>> extent on disk (i.e. hasn't been flushed to disk, it will be
>>> considered a data region with non-zero data. (FIEMAP will still
>>> report is as unwritten)
>>
>> Ok, I thought it would return FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC
>> in this case (not FIEMAP_EXTENT_UNWRITTEN).
> 
> No. FIEMAP only returns the known extent state at the given file
> offset.  "delalloc" extents exist in memory, indicating the space
> has already been accounted for over that offset, but the extent has
> not been physically allocated. Like all other types of extents,
> there may or may not be valid data over a delayed allocation extent. 
> 
> IOWs, fiemap only gives you a snapshot of extent state, not the
> ranges of valid data in the file.
> 
>>>> If not, would
>>>> it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
>>>> would be similar to what SEEK_HOLE/SEEK_DATA do now?
>>>
>>> To solve what problem? You haven't explained what problem you are
>>> trying to solve yet.
>>>
>>>> 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC?  And if not, for what
>>>> filesystems and kernel releases is it really not needed?
>>>
>>> I can't answer this question, either, because I don't know what
>>> you want the fiemap information for.
>>
>> The answer is the same no matter if we use both lseek and FIEMAP, so
>> I'll answer just once.  We want to do two things:
>>
>> 1) avoid copying zero data, to keep the copy process efficient.  For this,
>> SEEK_HOLE/SEEK_DATA are enough.
>>
>> 2) copy file contents while preserving the allocation state of the file's extents.
> 
> Which is /very difficult/ to do safely and reliably.
> 
> We do actually do reliable, safe, exact hole and preallocation
> layout duplication with xfs_fsr, but that uses kernel provided
> cookies (from XFS_IOC_BULKSTAT) to detect that data in the source
> file has not changed while it was being copied before executing the
> final defrag operation in the kernel (XFS_IOC_SWAPEXT) that makes
> the new copy of the data user visible.
> 
> i.e. the use of fiemap to duplicate the exact layout of a file
> from userspace is only posisble if you can /guarantee/ the source
> file has not changed in any way during the copy operation at the
> pointin time you finalise the destination data copy.
> 
>> There can be various reasons why the user has preallocated the file (because they
>> don't want an ENOSPC to happen while the VM runs; on some filesystems, to
>> minimize cases where io_submit is very un-asynchronous; or just because someone
>> had a reason to do a BLKZEROOUT ioctl on the virtual disk).  We want to preserve
>> these while converting or otherwise moving the file around.
> 
> Sure, there's many reasons for using prealloc/punch/zero. The real
> difference to other file operations is that they interface with low
> level filesystem structure, not the data contained within the
> extents. That's what makes them problematic for duplication -
> userspace cannot serialise against low level filesystem structure
> modifications.
> 
> Optimising file copies safely is one of the reasons the
> copy_file_range() syscall has been introduced (in 4.5). While we
> haven't implemented anything special in XFS yet, it will internally
> use splice to do a zero-copy data transfer from source to
> destination file. Optimising for exact layout copies is precisely
> the sort of thing this syscall is intended for.
> 
> It's also intended to enable applications to take advantage of
> hardware acceleration of data copying (e.g. server side copies to
> avoid round trips as has been implemented for NFS, or storage array
> offload of data copying) when such support is provided by the kernel.
> 
> IOWs, I think you should be looking to optimise file copies by using
> copy_file_range() and getting filesystems to do exactly what you
> need. Using FIEMAP, fallocate and moving data through userspace
> won't ever be reliable without special filesystem help (that only
> exists for XFS right now), nor will it enable the application to
> transparently use smart storage protocols and hardware when it is
> present on user systems....

Yes higher level calls are useful here and we'll consider using them in cp etc.
When I previously looked at this I noticed some implementations would
fall back to do_splice_direct() which is essentially sendfile()
and that expands holes which wouldn't be a good default.
So there may be soem need for control flags for copy_file_range()
to have it generally useful.

thanks for the info,
Pádraig.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-21 12:31                           ` Pádraig Brady
@ 2016-07-21 13:15                             ` Dave Chinner
  0 siblings, 0 replies; 55+ messages in thread
From: Dave Chinner @ 2016-07-21 13:15 UTC (permalink / raw)
  To: Pádraig Brady
  Cc: Niels de Vos, Paolo Bonzini, Fam Zheng, Eric Blake, Kevin Wolf,
	qemu-block, qemu-devel, Max Reitz, Lukas Czerner

On Thu, Jul 21, 2016 at 01:31:21PM +0100, Pádraig Brady wrote:
> On 21/07/16 12:43, Dave Chinner wrote:
> > On Wed, Jul 20, 2016 at 03:35:17PM +0200, Niels de Vos wrote:
> >> Oh... And I was surprised to learn that "cp" does use FIEMAP and not
> >> SEEK_HOLE/SEEK_DATA. You give a strong suggestion to fix that.
> >>   http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/extent-scan.c#n87
> > 
> > My understanding was that FIEMAP was disabled almost immediately
> > after the data corruption problems were understood, and it hasn't
> > been turned back on since. I don't see FIEMAP calls in strace when I
> > copy sparse files, even when I use --sparse=auto which is supposed
> > to trigger the sparse source file checks using FIEMAP.
> > 
> > So, yeah, while there's FIEMAP code present, that doesn't mean it's
> > actually used. And, yes, there are comments in coreutils commits in
> > 2011 saying that the FIEMAP workarounds are temporary until
> > SEK_HOLE/DATA are supported, which were added to the kernel in 2011
> > soon after the 'cp-corrupts-data-with-fiemap' debacle came to light.
> > Five years later, and the fiemap code is stil there?
> 
> Yes it's still there, and hasn't caused issues.
> It's used only for sparse files:
> 
>   $ truncate -s1G t.fiemap
>   $ strace -e ioctl cp t.fiemap t2.fiemap
>   ioctl(3, FS_IOC_FIEMAP, 0x7ffff007c910) = 0

So, my tests were done with coreutils 8.25, and a sparse 100M file with
a layout like this:

$ xfs_bmap -vvp testfile
testfile:
 EXT: FILE-OFFSET       BLOCK-RANGE        AG AG-OFFSET            TOTAL FLAGS
   0: [0..20479]:       71768912..71789391  3 (11216720..11237199) 20480 00000
   1: [20480..40959]:   hole                                       20480
   2: [40960..61439]:   71809872..71830351  3 (11257680..11278159) 20480 00000
   3: [61440..81919]:   71830352..71850831  3 (11278160..11298639) 20480 10000
   4: [81920..96223]:   71850832..71865135  3 (11298640..11312943) 14304 00000
   5: [96224..190471]:  60457944..60552191  2 (20089816..20184063) 94248 00000
   6: [190472..204799]: 73101864..73116191  3 (12549672..12563999) 14328 00000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width

i.e. a 10MB hole @ 10MB, a 10MB unwritten extent @ 30MB. And, yes, I
do see a fiemap call with your example above.

IOWs, it seems the "is sparse" heuristic that cp uses is not very
reliable.  That doesn't inspire confidence, and explains why none of
my cp tests on sparse files ove rthe past 5 years have ever shown
FIEMAP calls....

/me is now somewhat afraid to use cp for copies that require data
integrity.

> It's a pity fiemap is racy (on some file systems?) wrt reporting
> of data not yet written out, and is thus fairly useless IMHO
> without the FIEMAP_FLAG_SYNC workaround.

It's racy on *all filesystems* - all the FIEMAP_FLAG_SYNC flag does
is make the race window smaller. The extent map is only coherent
with the data in memory while the inode locks are held by the
kernel. The moment the inode locks are dropped by the kernel in the
syscall, the extent map held in the fiemap structure is considered
stale and no longer coherent with the extent map held on the inode.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-21 12:41                         ` Dave Chinner
  2016-07-21 13:01                           ` Pádraig Brady
@ 2016-07-21 14:23                           ` Paolo Bonzini
  2016-07-22  8:58                             ` Dave Chinner
  1 sibling, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2016-07-21 14:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fam Zheng, Eric Blake, Kevin Wolf, qemu-block, qemu-devel,
	Max Reitz, Lukas Czerner, P, Niels de Vos

> > 1) avoid copying zero data, to keep the copy process efficient.  For this,
> > SEEK_HOLE/SEEK_DATA are enough.
> > 
> > 2) copy file contents while preserving the allocation state of the file's
> > extents.
> 
> Which is /very difficult/ to do safely and reliably.
> i.e. the use of fiemap to duplicate the exact layout of a file
> from userspace is only posisble if you can /guarantee/ the source
> file has not changed in any way during the copy operation at the
> pointin time you finalise the destination data copy.

We don't do exactly that, exactly because it's messy when you have
concurrent accesses (which shouldn't be done but you never know).  When
doing a copy, we use(d to use) FIEMAP the same way as you'd use lseek,
querying one extent at a time.  If you proceed this way, all of these
can cause the same races:

- pread(ofs=10MB, len=10MB) returns all zeroes, so the 10MB..20MB is
not copied

- pread(ofs=10MB, len=10MB) returns non-zero data, so the 10MB..20MB is
copied

- lseek(SEEK_DATA, 10MB) returns 20MB, so the 10MB..20MB area is not
copied

- lseek(SEEK_HOLE, 10MB) returns 20MB, so the 10MB..20MB area is
copied

- ioctl(FIEMAP at 10MB) returns an extent starting at 20MB, so
the 10MB..20MB area is not copied

- ioctl(FIEMAP at 10MB) returns an unwritten extent at 10MB..20MB,
so the 10MB..20MB area is FALLOC_FL_ZERO_RANGE-ed on the destination

- ioctl(FIEMAP at 10MB) returns an extent at 10MB..20MB, so
the 10MB..20MB area is copied

But unfortunately in the FIEMAP case you'd need FIEMAP_FLAG_SYNC.  It
would be nice to have a way to do this kind of query without syncing
and with expressiveness that is at least close to FIEMAP's results.
We don't need need the full power of FIEMAP---as mentioned above we
only query one extent---so having a new ioctl would be fine too!
I'm putting a sample program after my signature for reference.

> IOWs, I think you should be looking to optimise file copies by using
> copy_file_range() and getting filesystems to do exactly what you
> need. Using FIEMAP, fallocate and moving data through userspace
> won't ever be reliable without special filesystem help (that only
> exists for XFS right now), nor will it enable the application to
> transparently use smart storage protocols and hardware when it is
> present on user systems....

We want to do that.  We also want to present SCSI extended copy to the
guest, and map that to copy_file_range() on the host if possible.  However,
there are going to be cases where we need to do low-level queries.  For
example, it is on my todo list to implement lseek(SEEK_HOLE/SEEK_DATA)
on /dev/sd* devices as a GET LBA STATUS SCSI command.  Dually, QEMU would
convert the GET LBA STATUS command to lseek or FIEMAP.  QEMU only does a
simple conversion of commands and results, it's up to the guest to avoid
races.

It is of course the best if the guest can use SCSI extended copy, but we
cannot guarantee that.

Paolo

/*
 * Skeleton that walks the extents of a file (e.g. to copy
 * each part while preserving the allocation structure).
 *
 * Usage: ./fiemap FILENAME
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <errno.h>

// #define UNSAFE

#define MIN(a, b) ((a) < (b) ? (a) : (b))

enum {
    BDRV_BLOCK_ZERO = 1,
    BDRV_BLOCK_DATA = 2
};

/* Probe the area at [start, start+length) and return whether it begins with a
 * hole (BDRV_BLOCK_ZERO), an unwritten extent (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)
 * or allocated data (BDRV_BLOCK_DATA).  The return value applies to the
 * [start, *next) range.  *next should be used as the "start" value in a
 * subsequent call to get_next_fiemap.  
 *
 * If successful, return the kind of data at offset start in the file.
 * If start is beyond the end of file, return -ENXIO.
 * Otherwise, return -errno.
 */
int get_next_fiemap(int fd, off_t start, off_t length, off_t *next)
{
    struct {
        struct fiemap fm;
        struct fiemap_extent fe;
    } f;

#if UNSAFE
    f.fm.fm_flags = 0; 
#else
    f.fm.fm_flags = FIEMAP_FLAG_SYNC;
#endif
    f.fm.fm_start = start;
    f.fm.fm_length = length;
    f.fm.fm_extent_count = 1;
    f.fm.fm_reserved = 0;
    if (ioctl(fd, FS_IOC_FIEMAP, &f) == -1) {
        return -errno;
    }

    if (f.fm.fm_mapped_extents == 0) {
        /*
         * No extents found, any data must be at start + length or later.
         * Check if we're at end of file, and if not, return a hole as large
         * as the size we probed; splitting a hole across multiple calls is okay.
         */
        off_t filelen = lseek(fd, 0, SEEK_END);
        if (start >= filelen) {
            return -ENXIO;
        }
        /* Pay attention to overflows. */
        *next = start + MIN(length, filelen - start);
        return BDRV_BLOCK_ZERO;
    }

    if (f.fe.fe_logical > f.fm.fm_start) {
        /*
         * Found an extent, but not at the probed point.  We're in a hole,
         * return it and we'll get to that extent on the next call.
         */
        *next = f.fe.fe_logical;
        return BDRV_BLOCK_ZERO;
    }

    /* Found an extent, and we're inside it.  */
    *next = f.fe.fe_logical + f.fe.fe_length;
    if (f.fe.fe_flags & FIEMAP_EXTENT_UNWRITTEN) {
        return BDRV_BLOCK_DATA|BDRV_BLOCK_ZERO;
    } else {
        return BDRV_BLOCK_DATA;
    }
}

int main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    off_t start = 0, next;
    off_t length = 8 * 1024 * 1024;
    int ret;
    while ((ret = get_next_fiemap(fd, start, length, &next)) >= 0) {
        printf("%-15ld %-15ld%s%s\n",
               start, next - start,
               (ret & BDRV_BLOCK_DATA) ? " data" : "",
               (ret & BDRV_BLOCK_ZERO) ? " zero" : "");
        /* ... copy or fallocate the destination here ... */
        start = next;
    }

    if (ret != -ENXIO) {
        perror("FIEMAP");
        exit(1);
    }
    exit(0);
}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-21 14:23                           ` Paolo Bonzini
@ 2016-07-22  8:58                             ` Dave Chinner
  2016-07-22 10:41                               ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2016-07-22  8:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Eric Blake, Kevin Wolf, qemu-block, qemu-devel,
	Max Reitz, Lukas Czerner, P, Niels de Vos

On Thu, Jul 21, 2016 at 10:23:48AM -0400, Paolo Bonzini wrote:
> > > 1) avoid copying zero data, to keep the copy process efficient.  For this,
> > > SEEK_HOLE/SEEK_DATA are enough.
> > > 
> > > 2) copy file contents while preserving the allocation state of the file's
> > > extents.
> > 
> > Which is /very difficult/ to do safely and reliably.
> > i.e. the use of fiemap to duplicate the exact layout of a file
> > from userspace is only posisble if you can /guarantee/ the source
> > file has not changed in any way during the copy operation at the
> > pointin time you finalise the destination data copy.
> 
> We don't do exactly that, exactly because it's messy when you have
> concurrent accesses (which shouldn't be done but you never know).

Which means you *cannot make the assumption it won't happen*.

FIEMAP is not guaranteed to tell you exactly where the data in the
file is that you need to copy is and that nothing you can do from
userspace changes that. I can't say it any clearer than that.

> When
> doing a copy, we use(d to use) FIEMAP the same way as you'd use lseek,
> querying one extent at a time.  If you proceed this way, all of these
> can cause the same races:
> 
> - pread(ofs=10MB, len=10MB) returns all zeroes, so the 10MB..20MB is
> not copied
> 
> - pread(ofs=10MB, len=10MB) returns non-zero data, so the 10MB..20MB is
> copied
> 
> - lseek(SEEK_DATA, 10MB) returns 20MB, so the 10MB..20MB area is not
> copied
> 
> - lseek(SEEK_HOLE, 10MB) returns 20MB, so the 10MB..20MB area is
> copied
> 
> - ioctl(FIEMAP at 10MB) returns an extent starting at 20MB, so
> the 10MB..20MB area is not copied

No, FIEMAP is not guaranteed to behave like this. what is returned
is filesystem dependent. Fielsystems that don't support holes will
return data extents. Filesystems that support compression might
return a compressed data extent rather than a hole. Encrypted files
might not expose holes at all, so people can't easily find known
plain text regions in the encrypted data. Filesystems could report
holes as deduplicated data, etc.  What do you do when FIEMAP returns
"OFFLINE" to indicate that the data is located elsewhere and will
need to be retrieved by the HSM operating on top of the filesystem
before layout can be determined?

All of the above are *valid* and *correct*, because the filesytem
defines what FIEMAP returns for a given file offset. just because
ext4 and XFS have mostly the same behaviour, it doesn't mean that
every other filesystem behaves the same way.

The assumptions being made about FIEMAP behaviour will only lead to
user data corruption, as they already have several times in the past.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-22  8:58                             ` Dave Chinner
@ 2016-07-22 10:41                               ` Paolo Bonzini
  2018-02-15 16:40                                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2016-07-22 10:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fam Zheng, Eric Blake, Kevin Wolf, qemu-block, qemu-devel,
	Max Reitz, Lukas Czerner, P, Niels de Vos

> > > i.e. the use of fiemap to duplicate the exact layout of a file
> > > from userspace is only posisble if you can /guarantee/ the source
> > > file has not changed in any way during the copy operation at the
> > > pointin time you finalise the destination data copy.
> > 
> > We don't do exactly that, exactly because it's messy when you have
> > concurrent accesses (which shouldn't be done but you never know).
> 
> Which means you *cannot make the assumption it won't happen*.
> FIEMAP is not guaranteed to tell you exactly where the data in the
> file is that you need to copy is and that nothing you can do from
> userspace changes that. I can't say it any clearer than that.

You've said it very clearly.  But I'm not saying "fix the damn FIEMAP", I'm
saying "this is what we need, lseek doesn't provide it, FIEMAP comes
close but really doesn't".  If the solution is to fix FIEMAP, if it's
possible at all, that'd be great.  But any other solution is okay.

Do you at least agree that it's possible to use the kind of information
in struct fiemap_extent (extent start, length and flags) in a way that is
not racy, or at least not any different from SEEK_DATA and SEEK_HOLE's
raciness?  I'm not saying that you'd get that information from FIEMAP.
It's just the kind of information that I'd like to get. 

(BTW, the documentation of FIEMAP is horrible.  It does not say at all
that FIEMAP_FLAG_SYNC is needed to return extents that match what
SEEK_HOLE/SEEK_DATA would return.  The obvious reading is that
FIEMAP_FLAG_SYNC would avoid returning FIEMAP_EXTENT_DELALLOC extents,
and that in turn would not be a problem if you don't need fe_physical.
Perhaps it would help if fiemap.txt said started with *why* would one
use FIEMAP, not just what it does).

> > When doing a copy, we use(d to use) FIEMAP the same way as you'd use lseek,
> > querying one extent at a time.  If you proceed this way, all of these
> > can cause the same races:
> > 
> > - pread(ofs=10MB, len=10MB) returns all zeroes, so the 10MB..20MB is
> > not copied
> > 
> > - pread(ofs=10MB, len=10MB) returns non-zero data, so the 10MB..20MB is
> > copied
> > 
> > - lseek(SEEK_DATA, 10MB) returns 20MB, so the 10MB..20MB area is not
> > copied
> > 
> > - lseek(SEEK_HOLE, 10MB) returns 20MB, so the 10MB..20MB area is
> > copied
> > 
> > - ioctl(FIEMAP at 10MB) returns an extent starting at 20MB, so
> > the 10MB..20MB area is not copied
> 
> No, FIEMAP is not guaranteed to behave like this. what is returned
> is filesystem dependent. Fielsystems that don't support holes will
> return data extents. Filesystems that support compression might
> return a compressed data extent rather than a hole. Encrypted files
> might not expose holes at all, so people can't easily find known
> plain text regions in the encrypted data. Filesystems could report
> holes as deduplicated data, etc.  What do you do when FIEMAP returns
> "OFFLINE" to indicate that the data is located elsewhere and will
> need to be retrieved by the HSM operating on top of the filesystem
> before layout can be determined?

lseek(SEEK_DATA) might also say you're not on a hole because the file
is compressed/encrypted/deduplicated/offline/whatnot.  So lseek is
also filesystem dependent, isn't it?  It also doesn't work on block
devices, so it's really file descriptor dependent.  That's not news.

Of course read, lseek and FIEMAP will not return exactly the same
information.  The point is that they're subject to exactly the same
races, and that struct fiemap_extent *can* be parsed conservatively.
The code I attached to the previous message does that, if it finds any
extent kind other than an unwritten extent it just treats it as data.

Based on this it would be nice to understand the reason why FIEMAP needs
FIEMAP_FLAG_SYNC to return meaningful values (the meaningful values
are there, they're just what lseek or read use), or to have a more
powerful function than just lseek(SEEK_DATA/SEEK_HOLE).  All we want is
to be able to distinguish between the three fallocate modes.

> The assumptions being made about FIEMAP behaviour will only lead to
> user data corruption, as they already have several times in the past.

Indeed, FIEMAP as is ranks just above gets() in usability.  But there's
no reason why it has to be that way.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-07-19  4:08 ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Eric Blake
  2016-07-19  6:21   ` Fam Zheng
@ 2016-08-18 13:50   ` Vladimir Sementsov-Ogievskiy
  2016-08-18 13:52     ` Paolo Bonzini
  1 sibling, 1 reply; 55+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-08-18 13:50 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: Kevin Wolf, pbonzini, qemu-block, Max Reitz

On 19.07.2016 07:08, Eric Blake wrote:
> Upstream NBD protocol recently added the ability to efficiently
> write zeroes without having to send the zeroes over the wire,
> along with a flag to control whether the client wants a hole.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
>
> ---
> v4: rebase, fix value for constant
> v3: abandon NBD_CMD_CLOSE extension, rebase to use blk_pwrite_zeroes
> ---
>   include/block/nbd.h |  8 ++++++--
>   nbd/server.c        | 42 ++++++++++++++++++++++++++++++++++++++++--
>   2 files changed, 46 insertions(+), 4 deletions(-)
>
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index fc4426c..e23ef73 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -69,6 +69,7 @@ struct nbd_reply {
>   #define NBD_FLAG_SEND_FUA       (1 << 3)        /* Send FUA (Force Unit Access) */
>   #define NBD_FLAG_ROTATIONAL     (1 << 4)        /* Use elevator algorithm - rotational media */
>   #define NBD_FLAG_SEND_TRIM      (1 << 5)        /* Send TRIM (discard) */
> +#define NBD_FLAG_SEND_WRITE_ZEROES (1 << 6)     /* Send WRITE_ZEROES */
>
>   /* New-style handshake (global) flags, sent from server to client, and
>      control what will happen during handshake phase. */
> @@ -94,7 +95,8 @@ struct nbd_reply {
>   #define NBD_REP_ERR_SHUTDOWN    NBD_REP_ERR(7)  /* Server shutting down */
>
>   /* Request flags, sent from client to server during transmission phase */
> -#define NBD_CMD_FLAG_FUA        (1 << 0)
> +#define NBD_CMD_FLAG_FUA        (1 << 0) /* 'force unit access' during write */
> +#define NBD_CMD_FLAG_NO_HOLE    (1 << 1) /* don't punch hole on zero run */
>
>   /* Supported request types */
>   enum {
> @@ -102,7 +104,9 @@ enum {
>       NBD_CMD_WRITE = 1,
>       NBD_CMD_DISC = 2,
>       NBD_CMD_FLUSH = 3,
> -    NBD_CMD_TRIM = 4
> +    NBD_CMD_TRIM = 4,
> +    /* 5 reserved for failed experiment NBD_CMD_CACHE */
> +    NBD_CMD_WRITE_ZEROES = 6,
>   };
>
>   #define NBD_DEFAULT_PORT	10809
> diff --git a/nbd/server.c b/nbd/server.c
> index 689636c..3a2fecb 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -610,7 +610,8 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData *data)
>       char buf[8 + 8 + 8 + 128];
>       int rc;
>       const uint16_t myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM |
> -                              NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA);
> +                              NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA |
> +                              NBD_FLAG_SEND_WRITE_ZEROES);
>       bool oldStyle;
>       size_t len;
>
> @@ -1126,11 +1127,17 @@ static ssize_t nbd_co_receive_request(NBDRequest *req,
>           rc = request->type == NBD_CMD_WRITE ? -ENOSPC : -EINVAL;
>           goto out;
>       }
> -    if (request->flags & ~NBD_CMD_FLAG_FUA) {
> +    if (request->flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_NO_HOLE)) {
>           LOG("unsupported flags (got 0x%x)", request->flags);
>           rc = -EINVAL;
>           goto out;
>       }
> +    if (request->type != NBD_CMD_WRITE_ZEROES &&
> +        (request->flags & NBD_CMD_FLAG_NO_HOLE)) {
> +        LOG("unexpected flags (got 0x%x)", request->flags);
> +        rc = -EINVAL;
> +        goto out;
> +    }
>
>       rc = 0;
>
> @@ -1235,6 +1242,37 @@ static void nbd_trip(void *opaque)
>           }
>           break;
>
> +    case NBD_CMD_WRITE_ZEROES:
> +        TRACE("Request type is WRITE_ZEROES");
> +
> +        if (exp->nbdflags & NBD_FLAG_READ_ONLY) {
> +            TRACE("Server is read-only, return error");
> +            reply.error = EROFS;

from nbd proto.md:

"Finally, it SHOULD return |EPERM| if it receives a write or trim 
request on a read-only export."

And EROFS is not mentioned in proto.md

(however the same bug is in NBD_CMD_WRITE case.)

> +            goto error_reply;
> +        }
> +
> +        TRACE("Writing to device");
> +
> +        flags = 0;
> +        if (request.flags & NBD_CMD_FLAG_FUA) {
> +            flags |= BDRV_REQ_FUA;
> +        }
> +        if (!(request.flags & NBD_CMD_FLAG_NO_HOLE)) {
> +            flags |= BDRV_REQ_MAY_UNMAP;
> +        }
> +        ret = blk_pwrite_zeroes(exp->blk, request.from + exp->dev_offset,
> +                                request.len, flags);
> +        if (ret < 0) {
> +            LOG("writing to file failed");
> +            reply.error = -ret;
> +            goto error_reply;
> +        }
> +
> +        if (nbd_co_send_reply(req, &reply, 0) < 0) {
> +            goto out;
> +        }
> +        break;
> +
>       case NBD_CMD_DISC:
>           /* unreachable, thanks to special case in nbd_co_receive_request() */
>           abort();


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server
  2016-08-18 13:50   ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Vladimir Sementsov-Ogievskiy
@ 2016-08-18 13:52     ` Paolo Bonzini
  0 siblings, 0 replies; 55+ messages in thread
From: Paolo Bonzini @ 2016-08-18 13:52 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz



On 18/08/2016 15:50, Vladimir Sementsov-Ogievskiy wrote:
> 
> from nbd proto.md:
> 
> "Finally, it SHOULD return |EPERM| if it receives a write or trim
> request on a read-only export."
> 
> And EROFS is not mentioned in proto.md
> 
> (however the same bug is in NBD_CMD_WRITE case.)

system_errno_to_nbd_errno takes care of fixing up the return value.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [PATCH v5 06/14] nbd: Send message along with server NBD_REP_ERR errors
  2016-07-19  5:15   ` Fam Zheng
@ 2016-10-11 15:12     ` Eric Blake
  0 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2016-10-11 15:12 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, pbonzini, qemu-block

[-- Attachment #1: Type: text/plain, Size: 1059 bytes --]

On 07/19/2016 12:15 AM, Fam Zheng wrote:
> On Mon, 07/18 22:07, Eric Blake wrote:
>>  nbd/server.c | 78 +++++++++++++++++++++++++++++++++++++++++++++---------------
>>  1 file changed, 59 insertions(+), 19 deletions(-)
>>
>> diff --git a/nbd/server.c b/nbd/server.c
>> index c8716f1..ad31c4a 100644
>> --- a/nbd/server.c
>> +++ b/nbd/server.c
>> @@ -235,6 +235,38 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
>>      return nbd_negotiate_send_rep_len(ioc, type, opt, 0);
>>  }
>>
>> +/* Send an error reply.
>> + * Return -errno to kill connection, 0 to continue negotiation. */
>> +static int GCC_FMT_ATTR(4, 5)
>> +    nbd_negotiate_send_rep_err(QIOChannel *ioc, uint32_t type,
> 
> Isn't the function name supposed to be place at col 0?

I blame emacs for that one.  I'm (finally) back to working on this
series, and will have a v6 posted soon addressing comments here and
elsewhere.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2016-07-22 10:41                               ` Paolo Bonzini
@ 2018-02-15 16:40                                 ` Vladimir Sementsov-Ogievskiy
  2018-02-15 16:42                                   ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-02-15 16:40 UTC (permalink / raw)
  To: Paolo Bonzini, Dave Chinner
  Cc: Kevin Wolf, Fam Zheng, qemu-block, P, qemu-devel, Max Reitz,
	Lukas Czerner, Niels de Vos, Dmitry Monakhov, Denis V. Lunev

Hi all.

Two years later, is there are any news on the topic?

I can't understand the following thing:

  - FIEMAP without FLAG_SYNC is unsafe
  - FIEMAP with FLAG_SYNC is safe but slow
  - so, we've dropped FIEMAP and use only lseek. So, it means that lseek 
is safe _and_ fast (at least, faster than FIEMAP with FLAG_SYNC)...

but how is it possible, if lseek and fiemap share same code in the kernel?

Do your code with

     /* Found an extent, and we're inside it.  */
     *next = f.fe.fe_logical + f.fe.fe_length;
     if (f.fe.fe_flags & FIEMAP_EXTENT_UNWRITTEN) {
         return BDRV_BLOCK_DATA|BDRV_BLOCK_ZERO;
     } else {
         return BDRV_BLOCK_DATA;
     }

provide safe block_status based on FIEMAP without FLAG_SYNC?


------
may help: link to discussion start:
http://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04641.html

22.07.2016 13:41, Paolo Bonzini wrote:
>>>> i.e. the use of fiemap to duplicate the exact layout of a file
>>>> from userspace is only posisble if you can /guarantee/ the source
>>>> file has not changed in any way during the copy operation at the
>>>> pointin time you finalise the destination data copy.
>>> We don't do exactly that, exactly because it's messy when you have
>>> concurrent accesses (which shouldn't be done but you never know).
>> Which means you *cannot make the assumption it won't happen*.
>> FIEMAP is not guaranteed to tell you exactly where the data in the
>> file is that you need to copy is and that nothing you can do from
>> userspace changes that. I can't say it any clearer than that.
> You've said it very clearly.  But I'm not saying "fix the damn FIEMAP", I'm
> saying "this is what we need, lseek doesn't provide it, FIEMAP comes
> close but really doesn't".  If the solution is to fix FIEMAP, if it's
> possible at all, that'd be great.  But any other solution is okay.
>
> Do you at least agree that it's possible to use the kind of information
> in struct fiemap_extent (extent start, length and flags) in a way that is
> not racy, or at least not any different from SEEK_DATA and SEEK_HOLE's
> raciness?  I'm not saying that you'd get that information from FIEMAP.
> It's just the kind of information that I'd like to get.
>
> (BTW, the documentation of FIEMAP is horrible.  It does not say at all
> that FIEMAP_FLAG_SYNC is needed to return extents that match what
> SEEK_HOLE/SEEK_DATA would return.  The obvious reading is that
> FIEMAP_FLAG_SYNC would avoid returning FIEMAP_EXTENT_DELALLOC extents,
> and that in turn would not be a problem if you don't need fe_physical.
> Perhaps it would help if fiemap.txt said started with *why* would one
> use FIEMAP, not just what it does).
>
>>> When doing a copy, we use(d to use) FIEMAP the same way as you'd use lseek,
>>> querying one extent at a time.  If you proceed this way, all of these
>>> can cause the same races:
>>>
>>> - pread(ofs=10MB, len=10MB) returns all zeroes, so the 10MB..20MB is
>>> not copied
>>>
>>> - pread(ofs=10MB, len=10MB) returns non-zero data, so the 10MB..20MB is
>>> copied
>>>
>>> - lseek(SEEK_DATA, 10MB) returns 20MB, so the 10MB..20MB area is not
>>> copied
>>>
>>> - lseek(SEEK_HOLE, 10MB) returns 20MB, so the 10MB..20MB area is
>>> copied
>>>
>>> - ioctl(FIEMAP at 10MB) returns an extent starting at 20MB, so
>>> the 10MB..20MB area is not copied
>> No, FIEMAP is not guaranteed to behave like this. what is returned
>> is filesystem dependent. Fielsystems that don't support holes will
>> return data extents. Filesystems that support compression might
>> return a compressed data extent rather than a hole. Encrypted files
>> might not expose holes at all, so people can't easily find known
>> plain text regions in the encrypted data. Filesystems could report
>> holes as deduplicated data, etc.  What do you do when FIEMAP returns
>> "OFFLINE" to indicate that the data is located elsewhere and will
>> need to be retrieved by the HSM operating on top of the filesystem
>> before layout can be determined?
> lseek(SEEK_DATA) might also say you're not on a hole because the file
> is compressed/encrypted/deduplicated/offline/whatnot.  So lseek is
> also filesystem dependent, isn't it?  It also doesn't work on block
> devices, so it's really file descriptor dependent.  That's not news.
>
> Of course read, lseek and FIEMAP will not return exactly the same
> information.  The point is that they're subject to exactly the same
> races, and that struct fiemap_extent *can* be parsed conservatively.
> The code I attached to the previous message does that, if it finds any
> extent kind other than an unwritten extent it just treats it as data.
>
> Based on this it would be nice to understand the reason why FIEMAP needs
> FIEMAP_FLAG_SYNC to return meaningful values (the meaningful values
> are there, they're just what lseek or read use), or to have a more
> powerful function than just lseek(SEEK_DATA/SEEK_HOLE).  All we want is
> to be able to distinguish between the three fallocate modes.
>
>> The assumptions being made about FIEMAP behaviour will only lead to
>> user data corruption, as they already have several times in the past.
> Indeed, FIEMAP as is ranks just above gets() in usability.  But there's
> no reason why it has to be that way.
>
> Paolo
>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2018-02-15 16:40                                 ` Vladimir Sementsov-Ogievskiy
@ 2018-02-15 16:42                                   ` Paolo Bonzini
  2018-04-18 14:25                                     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2018-02-15 16:42 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Dave Chinner
  Cc: Kevin Wolf, Fam Zheng, qemu-block, P, qemu-devel, Max Reitz,
	Lukas Czerner, Niels de Vos, Dmitry Monakhov, Denis V. Lunev

On 15/02/2018 17:40, Vladimir Sementsov-Ogievskiy wrote:
> Hi all.
> 
> Two years later, is there are any news on the topic?
> 
> I can't understand the following thing:
> 
>  - FIEMAP without FLAG_SYNC is unsafe
>  - FIEMAP with FLAG_SYNC is safe but slow
>  - so, we've dropped FIEMAP and use only lseek. So, it means that lseek
> is safe _and_ fast (at least, faster than FIEMAP with FLAG_SYNC)...
> 
> but how is it possible, if lseek and fiemap share same code in the kernel?

Do they?  Maybe not for all file systems.

> Do your code with
> 
>     /* Found an extent, and we're inside it.  */
>     *next = f.fe.fe_logical + f.fe.fe_length;
>     if (f.fe.fe_flags & FIEMAP_EXTENT_UNWRITTEN) {
>         return BDRV_BLOCK_DATA|BDRV_BLOCK_ZERO;
>     } else {
>         return BDRV_BLOCK_DATA;
>     }
> 
> provide safe block_status based on FIEMAP without FLAG_SYNC?

No, in fact we found data corruption with FIEMAP.

Paolo

> ------
> may help: link to discussion start:
> http://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04641.html
> 
> 22.07.2016 13:41, Paolo Bonzini wrote:
>>>>> i.e. the use of fiemap to duplicate the exact layout of a file
>>>>> from userspace is only posisble if you can /guarantee/ the source
>>>>> file has not changed in any way during the copy operation at the
>>>>> pointin time you finalise the destination data copy.
>>>> We don't do exactly that, exactly because it's messy when you have
>>>> concurrent accesses (which shouldn't be done but you never know).
>>> Which means you *cannot make the assumption it won't happen*.
>>> FIEMAP is not guaranteed to tell you exactly where the data in the
>>> file is that you need to copy is and that nothing you can do from
>>> userspace changes that. I can't say it any clearer than that.
>> You've said it very clearly.  But I'm not saying "fix the damn
>> FIEMAP", I'm
>> saying "this is what we need, lseek doesn't provide it, FIEMAP comes
>> close but really doesn't".  If the solution is to fix FIEMAP, if it's
>> possible at all, that'd be great.  But any other solution is okay.
>>
>> Do you at least agree that it's possible to use the kind of information
>> in struct fiemap_extent (extent start, length and flags) in a way that is
>> not racy, or at least not any different from SEEK_DATA and SEEK_HOLE's
>> raciness?  I'm not saying that you'd get that information from FIEMAP.
>> It's just the kind of information that I'd like to get.
>>
>> (BTW, the documentation of FIEMAP is horrible.  It does not say at all
>> that FIEMAP_FLAG_SYNC is needed to return extents that match what
>> SEEK_HOLE/SEEK_DATA would return.  The obvious reading is that
>> FIEMAP_FLAG_SYNC would avoid returning FIEMAP_EXTENT_DELALLOC extents,
>> and that in turn would not be a problem if you don't need fe_physical.
>> Perhaps it would help if fiemap.txt said started with *why* would one
>> use FIEMAP, not just what it does).
>>
>>>> When doing a copy, we use(d to use) FIEMAP the same way as you'd use
>>>> lseek,
>>>> querying one extent at a time.  If you proceed this way, all of these
>>>> can cause the same races:
>>>>
>>>> - pread(ofs=10MB, len=10MB) returns all zeroes, so the 10MB..20MB is
>>>> not copied
>>>>
>>>> - pread(ofs=10MB, len=10MB) returns non-zero data, so the 10MB..20MB is
>>>> copied
>>>>
>>>> - lseek(SEEK_DATA, 10MB) returns 20MB, so the 10MB..20MB area is not
>>>> copied
>>>>
>>>> - lseek(SEEK_HOLE, 10MB) returns 20MB, so the 10MB..20MB area is
>>>> copied
>>>>
>>>> - ioctl(FIEMAP at 10MB) returns an extent starting at 20MB, so
>>>> the 10MB..20MB area is not copied
>>> No, FIEMAP is not guaranteed to behave like this. what is returned
>>> is filesystem dependent. Fielsystems that don't support holes will
>>> return data extents. Filesystems that support compression might
>>> return a compressed data extent rather than a hole. Encrypted files
>>> might not expose holes at all, so people can't easily find known
>>> plain text regions in the encrypted data. Filesystems could report
>>> holes as deduplicated data, etc.  What do you do when FIEMAP returns
>>> "OFFLINE" to indicate that the data is located elsewhere and will
>>> need to be retrieved by the HSM operating on top of the filesystem
>>> before layout can be determined?
>> lseek(SEEK_DATA) might also say you're not on a hole because the file
>> is compressed/encrypted/deduplicated/offline/whatnot.  So lseek is
>> also filesystem dependent, isn't it?  It also doesn't work on block
>> devices, so it's really file descriptor dependent.  That's not news.
>>
>> Of course read, lseek and FIEMAP will not return exactly the same
>> information.  The point is that they're subject to exactly the same
>> races, and that struct fiemap_extent *can* be parsed conservatively.
>> The code I attached to the previous message does that, if it finds any
>> extent kind other than an unwritten extent it just treats it as data.
>>
>> Based on this it would be nice to understand the reason why FIEMAP needs
>> FIEMAP_FLAG_SYNC to return meaningful values (the meaningful values
>> are there, they're just what lseek or read use), or to have a more
>> powerful function than just lseek(SEEK_DATA/SEEK_HOLE).  All we want is
>> to be able to distinguish between the three fallocate modes.
>>
>>> The assumptions being made about FIEMAP behaviour will only lead to
>>> user data corruption, as they already have several times in the past.
>> Indeed, FIEMAP as is ranks just above gets() in usability.  But there's
>> no reason why it has to be that way.
>>
>> Paolo
>>
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
  2018-02-15 16:42                                   ` Paolo Bonzini
@ 2018-04-18 14:25                                     ` Vladimir Sementsov-Ogievskiy
  2018-04-18 14:41                                       ` [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC Eric Blake
  0 siblings, 1 reply; 55+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-04-18 14:25 UTC (permalink / raw)
  To: Paolo Bonzini, Dave Chinner
  Cc: Kevin Wolf, Fam Zheng, qemu-block, P, qemu-devel, Max Reitz,
	Lukas Czerner, Niels de Vos, Dmitry Monakhov, Denis V. Lunev

15.02.2018 19:42, Paolo Bonzini wrote:
> On 15/02/2018 17:40, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all.
>>
>> Two years later, is there are any news on the topic?
>>
>> I can't understand the following thing:
>>
>>   - FIEMAP without FLAG_SYNC is unsafe
>>   - FIEMAP with FLAG_SYNC is safe but slow
>>   - so, we've dropped FIEMAP and use only lseek. So, it means that lseek
>> is safe _and_ fast (at least, faster than FIEMAP with FLAG_SYNC)...
>>
>> but how is it possible, if lseek and fiemap share same code in the kernel?
> Do they?  Maybe not for all file systems.
>
>> Do your code with
>>
>>      /* Found an extent, and we're inside it.  */
>>      *next = f.fe.fe_logical + f.fe.fe_length;
>>      if (f.fe.fe_flags & FIEMAP_EXTENT_UNWRITTEN) {
>>          return BDRV_BLOCK_DATA|BDRV_BLOCK_ZERO;
>>      } else {
>>          return BDRV_BLOCK_DATA;
>>      }
>>
>> provide safe block_status based on FIEMAP without FLAG_SYNC?
> No, in fact we found data corruption with FIEMAP.

How to reproduce it? I've tried your code, looks like it shows all 
"data" regions even if I didn't call "sync".


> Paolo
>
>> ------
>> may help: link to discussion start:
>> http://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04641.html
>>
>> 22.07.2016 13:41, Paolo Bonzini wrote:
>>>>>> i.e. the use of fiemap to duplicate the exact layout of a file
>>>>>> from userspace is only posisble if you can /guarantee/ the source
>>>>>> file has not changed in any way during the copy operation at the
>>>>>> pointin time you finalise the destination data copy.
>>>>> We don't do exactly that, exactly because it's messy when you have
>>>>> concurrent accesses (which shouldn't be done but you never know).
>>>> Which means you *cannot make the assumption it won't happen*.
>>>> FIEMAP is not guaranteed to tell you exactly where the data in the
>>>> file is that you need to copy is and that nothing you can do from
>>>> userspace changes that. I can't say it any clearer than that.
>>> You've said it very clearly.  But I'm not saying "fix the damn
>>> FIEMAP", I'm
>>> saying "this is what we need, lseek doesn't provide it, FIEMAP comes
>>> close but really doesn't".  If the solution is to fix FIEMAP, if it's
>>> possible at all, that'd be great.  But any other solution is okay.
>>>
>>> Do you at least agree that it's possible to use the kind of information
>>> in struct fiemap_extent (extent start, length and flags) in a way that is
>>> not racy, or at least not any different from SEEK_DATA and SEEK_HOLE's
>>> raciness?  I'm not saying that you'd get that information from FIEMAP.
>>> It's just the kind of information that I'd like to get.
>>>
>>> (BTW, the documentation of FIEMAP is horrible.  It does not say at all
>>> that FIEMAP_FLAG_SYNC is needed to return extents that match what
>>> SEEK_HOLE/SEEK_DATA would return.  The obvious reading is that
>>> FIEMAP_FLAG_SYNC would avoid returning FIEMAP_EXTENT_DELALLOC extents,
>>> and that in turn would not be a problem if you don't need fe_physical.
>>> Perhaps it would help if fiemap.txt said started with *why* would one
>>> use FIEMAP, not just what it does).
>>>
>>>>> When doing a copy, we use(d to use) FIEMAP the same way as you'd use
>>>>> lseek,
>>>>> querying one extent at a time.  If you proceed this way, all of these
>>>>> can cause the same races:
>>>>>
>>>>> - pread(ofs=10MB, len=10MB) returns all zeroes, so the 10MB..20MB is
>>>>> not copied
>>>>>
>>>>> - pread(ofs=10MB, len=10MB) returns non-zero data, so the 10MB..20MB is
>>>>> copied
>>>>>
>>>>> - lseek(SEEK_DATA, 10MB) returns 20MB, so the 10MB..20MB area is not
>>>>> copied
>>>>>
>>>>> - lseek(SEEK_HOLE, 10MB) returns 20MB, so the 10MB..20MB area is
>>>>> copied
>>>>>
>>>>> - ioctl(FIEMAP at 10MB) returns an extent starting at 20MB, so
>>>>> the 10MB..20MB area is not copied
>>>> No, FIEMAP is not guaranteed to behave like this. what is returned
>>>> is filesystem dependent. Fielsystems that don't support holes will
>>>> return data extents. Filesystems that support compression might
>>>> return a compressed data extent rather than a hole. Encrypted files
>>>> might not expose holes at all, so people can't easily find known
>>>> plain text regions in the encrypted data. Filesystems could report
>>>> holes as deduplicated data, etc.  What do you do when FIEMAP returns
>>>> "OFFLINE" to indicate that the data is located elsewhere and will
>>>> need to be retrieved by the HSM operating on top of the filesystem
>>>> before layout can be determined?
>>> lseek(SEEK_DATA) might also say you're not on a hole because the file
>>> is compressed/encrypted/deduplicated/offline/whatnot.  So lseek is
>>> also filesystem dependent, isn't it?  It also doesn't work on block
>>> devices, so it's really file descriptor dependent.  That's not news.
>>>
>>> Of course read, lseek and FIEMAP will not return exactly the same
>>> information.  The point is that they're subject to exactly the same
>>> races, and that struct fiemap_extent *can* be parsed conservatively.
>>> The code I attached to the previous message does that, if it finds any
>>> extent kind other than an unwritten extent it just treats it as data.
>>>
>>> Based on this it would be nice to understand the reason why FIEMAP needs
>>> FIEMAP_FLAG_SYNC to return meaningful values (the meaningful values
>>> are there, they're just what lseek or read use), or to have a more
>>> powerful function than just lseek(SEEK_DATA/SEEK_HOLE).  All we want is
>>> to be able to distinguish between the three fallocate modes.
>>>
>>>> The assumptions being made about FIEMAP behaviour will only lead to
>>>> user data corruption, as they already have several times in the past.
>>> Indeed, FIEMAP as is ranks just above gets() in usability.  But there's
>>> no reason why it has to be that way.
>>>
>>> Paolo
>>>
>>


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC
  2018-04-18 14:25                                     ` Vladimir Sementsov-Ogievskiy
@ 2018-04-18 14:41                                       ` Eric Blake
  0 siblings, 0 replies; 55+ messages in thread
From: Eric Blake @ 2018-04-18 14:41 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, Dave Chinner
  Cc: Kevin Wolf, Fam Zheng, qemu-block, qemu-devel, Max Reitz,
	Dmitry Monakhov, Denis V. Lunev, Lukas Czerner, P, Niels de Vos

[-- Attachment #1: Type: text/plain, Size: 1858 bytes --]

On 04/18/2018 09:25 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> Do your code with
>>>
>>>      /* Found an extent, and we're inside it.  */
>>>      *next = f.fe.fe_logical + f.fe.fe_length;
>>>      if (f.fe.fe_flags & FIEMAP_EXTENT_UNWRITTEN) {
>>>          return BDRV_BLOCK_DATA|BDRV_BLOCK_ZERO;
>>>      } else {
>>>          return BDRV_BLOCK_DATA;
>>>      }
>>>
>>> provide safe block_status based on FIEMAP without FLAG_SYNC?
>> No, in fact we found data corruption with FIEMAP.
> 
> How to reproduce it? I've tried your code, looks like it shows all
> "data" regions even if I didn't call "sync".
> 

There's no easy way to reproduce unsafe data races reliably; but FIEMAP
without sync is such an unsafe data race (most of the time, you will get
the answer you expect, but under the right conditions, FIEMAP may report
the area as unallocated even though you have already called write(); if
you treat that unallocated region as BDRV_BLOCK_ZERO, rather than
read()ing it, you have corrupted data).  That's because FIEMAP only
reports what the disk has allocated, but file systems can have delayed
allocations where contents in the kernel cache are NOT yet flushed to
disk unless you use sync; but using sync kills performance.

If you want examples of FIEMAP corrupting data, look at the coreutils
archive from several years ago, where FIEMAP without sync caused
corruptions during cp. A quick search found at least this example:
https://lists.gnu.org/archive/html/bug-coreutils/2011-04/msg00023.html

For more details, see qemu commits c4875e5b and 38c4d0a, and discussion
at https://lists.gnu.org/archive/html/qemu-devel/2014-09/msg04921.html

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 619 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2018-04-18 14:41 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-19  4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 01/14] nbd: Fix bad flag detection on server Eric Blake
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 02/14] nbd: Add qemu-nbd -D for human-readable description Eric Blake
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 03/14] nbd: Limit nbdflags to 16 bits Eric Blake
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 04/14] nbd: Treat flags vs. command type as separate fields Eric Blake
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 05/14] nbd: Share common reply-sending code in server Eric Blake
2016-07-19  5:10   ` Fam Zheng
2016-07-19 14:52     ` Eric Blake
2016-07-20  4:39       ` Fam Zheng
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 06/14] nbd: Send message along with server NBD_REP_ERR errors Eric Blake
2016-07-19  5:15   ` Fam Zheng
2016-10-11 15:12     ` Eric Blake
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 07/14] nbd: Share common option-sending code in client Eric Blake
2016-07-19  5:31   ` Fam Zheng
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 08/14] nbd: Let server know when client gives up negotiation Eric Blake
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 09/14] nbd: Let client skip portions of server reply Eric Blake
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 10/14] nbd: Less allocation during NBD_OPT_LIST Eric Blake
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 11/14] nbd: Support shorter handshake Eric Blake
2016-07-19  4:07 ` [Qemu-devel] [PATCH v5 12/14] nbd: Improve server handling of shutdown requests Eric Blake
2016-07-19  4:08 ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Eric Blake
2016-07-19  6:21   ` Fam Zheng
2016-07-19 15:28     ` Eric Blake
2016-07-19 15:45       ` Paolo Bonzini
2016-07-20  3:34         ` Fam Zheng
2016-07-20  3:47           ` Eric Blake
2016-07-20  4:37             ` Fam Zheng
2016-07-20  7:09               ` Paolo Bonzini
2016-07-20  7:38                 ` Fam Zheng
2016-07-20  8:16                   ` Paolo Bonzini
2016-07-20  9:04                     ` Fam Zheng
2016-07-20  9:19                   ` [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server) Paolo Bonzini
2016-07-20 12:30                     ` Dave Chinner
2016-07-20 13:35                       ` Niels de Vos
2016-07-21 11:43                         ` Dave Chinner
2016-07-21 12:31                           ` Pádraig Brady
2016-07-21 13:15                             ` Dave Chinner
2016-07-20 13:40                       ` Paolo Bonzini
2016-07-21 12:41                         ` Dave Chinner
2016-07-21 13:01                           ` Pádraig Brady
2016-07-21 14:23                           ` Paolo Bonzini
2016-07-22  8:58                             ` Dave Chinner
2016-07-22 10:41                               ` Paolo Bonzini
2018-02-15 16:40                                 ` Vladimir Sementsov-Ogievskiy
2018-02-15 16:42                                   ` Paolo Bonzini
2018-04-18 14:25                                     ` Vladimir Sementsov-Ogievskiy
2018-04-18 14:41                                       ` [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC Eric Blake
2016-08-18 13:50   ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Vladimir Sementsov-Ogievskiy
2016-08-18 13:52     ` Paolo Bonzini
2016-07-19  4:08 ` [Qemu-devel] [PATCH v5 14/14] nbd: Implement NBD_CMD_WRITE_ZEROES on client Eric Blake
2016-07-19  6:24   ` Fam Zheng
2016-07-19 15:31     ` Eric Blake
2016-07-19  6:33 ` [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Fam Zheng
2016-07-19  8:53 ` Paolo Bonzini
2016-07-19 15:33   ` Eric Blake
2016-07-19 15:41     ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.