All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v3 0/6] block: zero writes
@ 2011-12-21 16:00 Stefan Hajnoczi
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 1/6] cutils: extract buffer_is_zero() from qemu-img.c Stefan Hajnoczi
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-12-21 16:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Marcelo Tosatti, Stefan Hajnoczi

This series adds an interface for efficient writes when data contains all
zeros.  It also takes advantage of this new interface by extending the
copy-on-read feature to perform zero-detection.

The details of efficient zero representations depend on the image format.  This
series includes a patch for the QED image format to write special "zero
clusters" that keep the image file compact.  In the future qcow2v3 could also
support an efficient zero representation.

The new BlockDriver interface is called .bdrv_co_write_zeroes() and is
optional.  If the interface is not implemented by a BlockDriver then a regular
.bdrv_co_writev() operation will be performed.  The public interface is called
bdrv_co_write_zeroes() and can be tested via the new qemu-io write -z option.

Copy-on-read is extended to detect zeroes and invoke the
.bdrv_co_write_zeroes() interface when possible.  As a result we avoid bloating
the image file if the backing file contains zeroes.

My motivation for this feature is efficient image streaming.  The destination
file must stay compact even when copying zeroes from the source file.

We now only do zero detection for copy-on-read requests, whereas previous
revisions of this patch series scanned all write requests for zeroes.  The old
behavior wasted CPU cycles in most cases but we could add a feature to
explicitly scan guest writes for zeroes in the future, if desired.

v2:
 * Introduce .bdrv_co_write_zeroes() [Kevin]
 * Perform zero detection only on copy-on-read requests [Kevin]

Stefan Hajnoczi (6):
  cutils: extract buffer_is_zero() from qemu-img.c
  block: add .bdrv_co_write_zeroes() interface
  block: perform zero-detection during copy-on-read
  qed: replace is_write with flags field
  qed: add .bdrv_co_write_zeroes() support
  qemu-io: add write -z option for bdrv_co_write_zeroes

 block.c       |   64 +++++++++++++++++++++++++----
 block.h       |    7 +++
 block/qed.c   |  125 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
 block/qed.h   |    7 +++-
 block_int.h   |    8 ++++
 cutils.c      |   34 +++++++++++++++
 qemu-common.h |    2 +
 qemu-img.c    |   46 +++------------------
 qemu-io.c     |   77 +++++++++++++++++++++++++++++++----
 trace-events  |    3 +-
 10 files changed, 300 insertions(+), 73 deletions(-)

-- 
1.7.7.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v3 1/6] cutils: extract buffer_is_zero() from qemu-img.c
  2011-12-21 16:00 [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi
@ 2011-12-21 16:00 ` Stefan Hajnoczi
  2011-12-21 16:43   ` Eric Blake
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 2/6] block: add .bdrv_co_write_zeroes() interface Stefan Hajnoczi
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-12-21 16:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Marcelo Tosatti, Stefan Hajnoczi

The qemu-img.c:is_not_zero() function checks if a buffer contains all
zeroes.  This function will come in handy for zero-detection in the
block layer, so clean it up and move it to cutils.c.

Note that the function now returns true if the buffer is all zeroes.
This avoids the double-negatives (i.e. !is_not_zero()) that the old
function can cause in callers.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 cutils.c      |   34 ++++++++++++++++++++++++++++++++++
 qemu-common.h |    2 ++
 qemu-img.c    |   46 +++++++---------------------------------------
 3 files changed, 43 insertions(+), 39 deletions(-)

diff --git a/cutils.c b/cutils.c
index 24b3fe3..c9560c3 100644
--- a/cutils.c
+++ b/cutils.c
@@ -301,6 +301,40 @@ void qemu_iovec_memset_skip(QEMUIOVector *qiov, int c, size_t count,
     }
 }
 
+/*
+ * Checks if a buffer is all zeroes
+ *
+ * Attention! The len must be a multiple of 4 * sizeof(long) due to
+ * restriction of optimizations in this function.
+ */
+bool buffer_is_zero(const void *buf, size_t len)
+{
+    /*
+     * Use long as the biggest available internal data type that fits into the
+     * CPU register and unroll the loop to smooth out the effect of memory
+     * latency.
+     */
+
+    size_t i;
+    long d0, d1, d2, d3;
+    const long * const data = buf;
+
+    len /= sizeof(long);
+
+    for (i = 0; i < len; i += 4) {
+        d0 = data[i + 0];
+        d1 = data[i + 1];
+        d2 = data[i + 2];
+        d3 = data[i + 3];
+
+        if (d0 || d1 || d2 || d3) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
 #ifndef _WIN32
 /* Sets a specific flag */
 int fcntl_setfl(int fd, int flag)
diff --git a/qemu-common.h b/qemu-common.h
index b2de015..95fa2b2 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -293,6 +293,8 @@ void qemu_iovec_memset(QEMUIOVector *qiov, int c, size_t count);
 void qemu_iovec_memset_skip(QEMUIOVector *qiov, int c, size_t count,
                             size_t skip);
 
+bool buffer_is_zero(const void *buf, size_t len);
+
 void qemu_progress_init(int enabled, float min_skip);
 void qemu_progress_end(void);
 void qemu_progress_print(float delta, int max);
diff --git a/qemu-img.c b/qemu-img.c
index 01cc0d3..c4bcf41 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -515,40 +515,6 @@ static int img_commit(int argc, char **argv)
 }
 
 /*
- * Checks whether the sector is not a zero sector.
- *
- * Attention! The len must be a multiple of 4 * sizeof(long) due to
- * restriction of optimizations in this function.
- */
-static int is_not_zero(const uint8_t *sector, int len)
-{
-    /*
-     * Use long as the biggest available internal data type that fits into the
-     * CPU register and unroll the loop to smooth out the effect of memory
-     * latency.
-     */
-
-    int i;
-    long d0, d1, d2, d3;
-    const long * const data = (const long *) sector;
-
-    len /= sizeof(long);
-
-    for(i = 0; i < len; i += 4) {
-        d0 = data[i + 0];
-        d1 = data[i + 1];
-        d2 = data[i + 2];
-        d3 = data[i + 3];
-
-        if (d0 || d1 || d2 || d3) {
-            return 1;
-        }
-    }
-
-    return 0;
-}
-
-/*
  * Returns true iff the first sector pointed to by 'buf' contains at least
  * a non-NUL byte.
  *
@@ -557,20 +523,22 @@ static int is_not_zero(const uint8_t *sector, int len)
  */
 static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum)
 {
-    int v, i;
+    bool is_zero;
+    int i;
 
     if (n <= 0) {
         *pnum = 0;
         return 0;
     }
-    v = is_not_zero(buf, 512);
+    is_zero = buffer_is_zero(buf, 512);
     for(i = 1; i < n; i++) {
         buf += 512;
-        if (v != is_not_zero(buf, 512))
+        if (is_zero != buffer_is_zero(buf, 512)) {
             break;
+        }
     }
     *pnum = i;
-    return v;
+    return !is_zero;
 }
 
 /*
@@ -955,7 +923,7 @@ static int img_convert(int argc, char **argv)
             if (n < cluster_sectors) {
                 memset(buf + n * 512, 0, cluster_size - n * 512);
             }
-            if (is_not_zero(buf, cluster_size)) {
+            if (!buffer_is_zero(buf, cluster_size)) {
                 ret = bdrv_write_compressed(out_bs, sector_num, buf,
                                             cluster_sectors);
                 if (ret != 0) {
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v3 2/6] block: add .bdrv_co_write_zeroes() interface
  2011-12-21 16:00 [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 1/6] cutils: extract buffer_is_zero() from qemu-img.c Stefan Hajnoczi
@ 2011-12-21 16:00 ` Stefan Hajnoczi
  2011-12-21 16:50   ` Christoph Hellwig
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 3/6] block: perform zero-detection during copy-on-read Stefan Hajnoczi
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-12-21 16:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Marcelo Tosatti, Stefan Hajnoczi

The ability to zero regions of an image file is a useful primitive for
higher-level features such as image streaming or zero write detection.

Image formats may support an optimized metadata representation instead
of writing zeroes into the image file.  This allows zero writes to be
potentially faster than regular write operations and also preserve
sparseness of the image file.

The .bdrv_co_write_zeroes() interface should be implemented by block
drivers that wish to provide efficient zeroing.

Note that this operation is different from the discard operation, which
may leave the contents of the region indeterminate.  That means
discarded blocks are not guaranteed to contain zeroes and may contain
junk data instead.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c      |   50 ++++++++++++++++++++++++++++++++++++++++++++------
 block.h      |    7 +++++++
 block_int.h  |    8 ++++++++
 trace-events |    1 +
 4 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index 3f072f6..5ebbd4d 100644
--- a/block.c
+++ b/block.c
@@ -64,7 +64,7 @@ static int coroutine_fn bdrv_co_writev_em(BlockDriverState *bs,
 static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
 static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov, bool write_zeroes);
 static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
                                                QEMUIOVector *qiov,
@@ -1291,7 +1291,7 @@ static void coroutine_fn bdrv_rw_co_entry(void *opaque)
                                      rwco->nb_sectors, rwco->qiov);
     } else {
         rwco->ret = bdrv_co_do_writev(rwco->bs, rwco->sector_num,
-                                      rwco->nb_sectors, rwco->qiov);
+                                      rwco->nb_sectors, rwco->qiov, false);
     }
 }
 
@@ -1608,11 +1608,37 @@ int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
     return bdrv_co_do_readv(bs, sector_num, nb_sectors, qiov);
 }
 
+static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors)
+{
+    BlockDriver *drv = bs->drv;
+    QEMUIOVector qiov;
+    struct iovec iov;
+    int ret;
+
+    /* First try the efficient write zeroes operation */
+    if (drv->bdrv_co_write_zeroes) {
+        return drv->bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
+    }
+
+    /* Fall back to bounce buffer if write zeroes is unsupported */
+    iov.iov_len  = nb_sectors * BDRV_SECTOR_SIZE;
+    iov.iov_base = qemu_blockalign(bs, iov.iov_len);
+    memset(iov.iov_base, 0, iov.iov_len);
+    qemu_iovec_init_external(&qiov, &iov, 1);
+
+    ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, &qiov);
+
+    qemu_vfree(iov.iov_base);
+    return ret;
+}
+
 /*
  * Handle a write request in coroutine context
  */
 static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+    bool write_zeroes)
 {
     BlockDriver *drv = bs->drv;
     BdrvTrackedRequest req;
@@ -1639,7 +1665,11 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
 
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
 
-    ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
+    if (write_zeroes) {
+        ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors);
+    } else {
+        ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
+    }
 
     if (bs->dirty_bitmap) {
         set_dirty_bitmap(bs, sector_num, nb_sectors, 1);
@@ -1659,7 +1689,15 @@ int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
 {
     trace_bdrv_co_writev(bs, sector_num, nb_sectors);
 
-    return bdrv_co_do_writev(bs, sector_num, nb_sectors, qiov);
+    return bdrv_co_do_writev(bs, sector_num, nb_sectors, qiov, false);
+}
+
+int coroutine_fn bdrv_co_write_zeroes(BlockDriverState *bs,
+                                      int64_t sector_num, int nb_sectors)
+{
+    trace_bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
+
+    return bdrv_co_do_writev(bs, sector_num, nb_sectors, NULL, true);
 }
 
 /**
@@ -3143,7 +3181,7 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
             acb->req.nb_sectors, acb->req.qiov);
     } else {
         acb->req.error = bdrv_co_do_writev(bs, acb->req.sector,
-            acb->req.nb_sectors, acb->req.qiov);
+            acb->req.nb_sectors, acb->req.qiov, false);
     }
 
     acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
diff --git a/block.h b/block.h
index 3bd4398..51b90c7 100644
--- a/block.h
+++ b/block.h
@@ -144,6 +144,13 @@ int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov);
 int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov);
+/*
+ * Efficiently zero a region of the disk image.  Note that this is a regular
+ * I/O request like read or write and should have a reasonable size.  This
+ * function is not suitable for zeroing the entire image in a single request.
+ */
+int coroutine_fn bdrv_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
+    int nb_sectors);
 int coroutine_fn bdrv_co_is_allocated(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, int *pnum);
 int bdrv_truncate(BlockDriverState *bs, int64_t offset);
diff --git a/block_int.h b/block_int.h
index 311bd2a..5362180 100644
--- a/block_int.h
+++ b/block_int.h
@@ -101,6 +101,14 @@ struct BlockDriver {
         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
     int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
+    /*
+     * Efficiently zero a region of the disk image.  Typically an image format
+     * would use a compact metadata representation to implement this.  This
+     * function pointer may be NULL and .bdrv_co_writev() will be called
+     * instead.
+     */
+    int coroutine_fn (*bdrv_co_write_zeroes)(BlockDriverState *bs,
+        int64_t sector_num, int nb_sectors);
     int coroutine_fn (*bdrv_co_discard)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors);
     int coroutine_fn (*bdrv_co_is_allocated)(BlockDriverState *bs,
diff --git a/trace-events b/trace-events
index 514849a..fd2d7d9 100644
--- a/trace-events
+++ b/trace-events
@@ -66,6 +66,7 @@ bdrv_aio_writev(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs
 bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
 bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
+bdrv_co_write_zeroes(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_io_em(void *bs, int64_t sector_num, int nb_sectors, int is_write, void *acb) "bs %p sector_num %"PRId64" nb_sectors %d is_write %d acb %p"
 bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t cluster_sector_num, int cluster_nb_sectors) "bs %p sector_num %"PRId64" nb_sectors %d cluster_sector_num %"PRId64" cluster_nb_sectors %d"
 
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v3 3/6] block: perform zero-detection during copy-on-read
  2011-12-21 16:00 [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 1/6] cutils: extract buffer_is_zero() from qemu-img.c Stefan Hajnoczi
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 2/6] block: add .bdrv_co_write_zeroes() interface Stefan Hajnoczi
@ 2011-12-21 16:00 ` Stefan Hajnoczi
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 4/6] qed: replace is_write with flags field Stefan Hajnoczi
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-12-21 16:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Marcelo Tosatti, Stefan Hajnoczi

Copy-on-Read populates the image file with data read from a backing
image.  In order to avoid bloating the image file when all zeroes are
read we should scan the buffer and perform an optimized zero write
operation.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 5ebbd4d..967a583 100644
--- a/block.c
+++ b/block.c
@@ -1506,6 +1506,7 @@ static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
      */
     void *bounce_buffer;
 
+    BlockDriver *drv = bs->drv;
     struct iovec iov;
     QEMUIOVector bounce_qiov;
     int64_t cluster_sector_num;
@@ -1526,14 +1527,21 @@ static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
     iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
     qemu_iovec_init_external(&bounce_qiov, &iov, 1);
 
-    ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
-                                 &bounce_qiov);
+    ret = drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
+                             &bounce_qiov);
     if (ret < 0) {
         goto err;
     }
 
-    ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
+    if (drv->bdrv_co_write_zeroes &&
+        buffer_is_zero(bounce_buffer, iov.iov_len)) {
+        ret = drv->bdrv_co_write_zeroes(bs, cluster_sector_num,
+                                        cluster_nb_sectors);
+    } else {
+        ret = drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
                                   &bounce_qiov);
+    }
+
     if (ret < 0) {
         /* It might be okay to ignore write errors for guest requests.  If this
          * is a deliberate copy-on-read then we don't want to ignore the error.
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v3 4/6] qed: replace is_write with flags field
  2011-12-21 16:00 [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 3/6] block: perform zero-detection during copy-on-read Stefan Hajnoczi
@ 2011-12-21 16:00 ` Stefan Hajnoczi
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 5/6] qed: add .bdrv_co_write_zeroes() support Stefan Hajnoczi
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-12-21 16:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Marcelo Tosatti, Stefan Hajnoczi

Per-request attributes like read/write are currently implemented as bool
fields in the QEDAIOCB struct.  This becomes unwiedly as the number of
attributes grows.  For example, the qed_aio_setup() function would have
to take multiple bool arguments and at call sites it would be hard to
distinguish the meaning of each bool.

Instead use a flags field with bitmask constants.  This will be used
when zero write support is added.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block/qed.c  |   15 ++++++++-------
 block/qed.h  |    6 +++++-
 trace-events |    2 +-
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index 8da3ebe..b66dd17 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1233,8 +1233,8 @@ static void qed_aio_next_io(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
-    QEDFindClusterFunc *io_fn =
-        acb->is_write ? qed_aio_write_data : qed_aio_read_data;
+    QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
+                                qed_aio_write_data : qed_aio_read_data;
 
     trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
 
@@ -1264,14 +1264,14 @@ static BlockDriverAIOCB *qed_aio_setup(BlockDriverState *bs,
                                        int64_t sector_num,
                                        QEMUIOVector *qiov, int nb_sectors,
                                        BlockDriverCompletionFunc *cb,
-                                       void *opaque, bool is_write)
+                                       void *opaque, int flags)
 {
     QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque);
 
     trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors,
-                         opaque, is_write);
+                        opaque, flags);
 
-    acb->is_write = is_write;
+    acb->flags = flags;
     acb->finished = NULL;
     acb->qiov = qiov;
     acb->qiov_offset = 0;
@@ -1291,7 +1291,7 @@ static BlockDriverAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
                                             BlockDriverCompletionFunc *cb,
                                             void *opaque)
 {
-    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, false);
+    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 0);
 }
 
 static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
@@ -1300,7 +1300,8 @@ static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
                                              BlockDriverCompletionFunc *cb,
                                              void *opaque)
 {
-    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, true);
+    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb,
+                         opaque, QED_AIOCB_WRITE);
 }
 
 static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
diff --git a/block/qed.h b/block/qed.h
index 62cbd3b..abed147 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -123,12 +123,16 @@ typedef struct QEDRequest {
     CachedL2Table *l2_table;
 } QEDRequest;
 
+enum {
+    QED_AIOCB_WRITE = 0x0001,       /* read or write? */
+};
+
 typedef struct QEDAIOCB {
     BlockDriverAIOCB common;
     QEMUBH *bh;
     int bh_ret;                     /* final return status for completion bh */
     QSIMPLEQ_ENTRY(QEDAIOCB) next;  /* next request */
-    bool is_write;                  /* false - read, true - write */
+    int flags;                      /* QED_AIOCB_* bits ORed together */
     bool *finished;                 /* signal for cancel completion */
     uint64_t end_pos;               /* request end on block device, in bytes */
 
diff --git a/trace-events b/trace-events
index fd2d7d9..6d063c4 100644
--- a/trace-events
+++ b/trace-events
@@ -309,7 +309,7 @@ qed_need_check_timer_cb(void *s) "s %p"
 qed_start_need_check_timer(void *s) "s %p"
 qed_cancel_need_check_timer(void *s) "s %p"
 qed_aio_complete(void *s, void *acb, int ret) "s %p acb %p ret %d"
-qed_aio_setup(void *s, void *acb, int64_t sector_num, int nb_sectors, void *opaque, int is_write) "s %p acb %p sector_num %"PRId64" nb_sectors %d opaque %p is_write %d"
+qed_aio_setup(void *s, void *acb, int64_t sector_num, int nb_sectors, void *opaque, int flags) "s %p acb %p sector_num %"PRId64" nb_sectors %d opaque %p flags %#x"
 qed_aio_next_io(void *s, void *acb, int ret, uint64_t cur_pos) "s %p acb %p ret %d cur_pos %"PRIu64
 qed_aio_read_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
 qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v3 5/6] qed: add .bdrv_co_write_zeroes() support
  2011-12-21 16:00 [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 4/6] qed: replace is_write with flags field Stefan Hajnoczi
@ 2011-12-21 16:00 ` Stefan Hajnoczi
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 6/6] qemu-io: add write -z option for bdrv_co_write_zeroes Stefan Hajnoczi
  2011-12-21 16:03 ` [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi
  6 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-12-21 16:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Marcelo Tosatti, Stefan Hajnoczi

Zero writes are a dedicated interface for writing regions of zeroes into
the image file.  If clusters are not yet allocated it is possible to use
an efficient metadata representation which keeps the image file compact
and does not store individual zero bytes.

Implementing this for the QED image format is fairly straightforward.
The only issue is that when a zero write touches an existing cluster we
have to allocate a bounce buffer and perform a regular write.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block/qed.c |  110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 block/qed.h |    1 +
 2 files changed, 103 insertions(+), 8 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index b66dd17..a041d31 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -875,6 +875,12 @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
     qemu_iovec_destroy(&acb->cur_qiov);
     qed_unref_l2_cache_entry(acb->request.l2_table);
 
+    /* Free the buffer we may have allocated for zero writes */
+    if (acb->flags & QED_AIOCB_ZERO) {
+        qemu_vfree(acb->qiov->iov[0].iov_base);
+        acb->qiov->iov[0].iov_base = NULL;
+    }
+
     /* Arrange for a bh to invoke the completion function */
     acb->bh_ret = ret;
     acb->bh = qemu_bh_new(qed_aio_complete_bh, acb);
@@ -941,9 +947,8 @@ static void qed_aio_write_l1_update(void *opaque, int ret)
 /**
  * Update L2 table with new cluster offsets and write them out
  */
-static void qed_aio_write_l2_update(void *opaque, int ret)
+static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
 {
-    QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
     bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
     int index;
@@ -959,7 +964,7 @@ static void qed_aio_write_l2_update(void *opaque, int ret)
 
     index = qed_l2_index(s, acb->cur_pos);
     qed_update_l2_table(s, acb->request.l2_table->table, index, acb->cur_nclusters,
-                         acb->cur_cluster);
+                         offset);
 
     if (need_alloc) {
         /* Write out the whole new L2 table */
@@ -976,6 +981,12 @@ err:
     qed_aio_complete(acb, ret);
 }
 
+static void qed_aio_write_l2_update_cb(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
+}
+
 /**
  * Flush new data clusters before updating the L2 table
  *
@@ -990,7 +1001,7 @@ static void qed_aio_write_flush_before_l2_update(void *opaque, int ret)
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
 
-    if (!bdrv_aio_flush(s->bs->file, qed_aio_write_l2_update, opaque)) {
+    if (!bdrv_aio_flush(s->bs->file, qed_aio_write_l2_update_cb, opaque)) {
         qed_aio_complete(acb, -EIO);
     }
 }
@@ -1019,7 +1030,7 @@ static void qed_aio_write_main(void *opaque, int ret)
         if (s->bs->backing_hd) {
             next_fn = qed_aio_write_flush_before_l2_update;
         } else {
-            next_fn = qed_aio_write_l2_update;
+            next_fn = qed_aio_write_l2_update_cb;
         }
     }
 
@@ -1081,6 +1092,18 @@ static bool qed_should_set_need_check(BDRVQEDState *s)
     return !(s->header.features & QED_F_NEED_CHECK);
 }
 
+static void qed_aio_write_zero_cluster(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+
+    qed_aio_write_l2_update(acb, 0, 1);
+}
+
 /**
  * Write new data cluster
  *
@@ -1092,6 +1115,7 @@ static bool qed_should_set_need_check(BDRVQEDState *s)
 static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 {
     BDRVQEDState *s = acb_to_s(acb);
+    BlockDriverCompletionFunc *cb;
 
     /* Cancel timer when the first allocating request comes in */
     if (QSIMPLEQ_EMPTY(&s->allocating_write_reqs)) {
@@ -1109,14 +1133,26 @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 
     acb->cur_nclusters = qed_bytes_to_clusters(s,
             qed_offset_into_cluster(s, acb->cur_pos) + len);
-    acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
     qemu_iovec_copy(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
 
+    if (acb->flags & QED_AIOCB_ZERO) {
+        /* Skip ahead if the clusters are already zero */
+        if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
+            qed_aio_next_io(acb, 0);
+            return;
+        }
+
+        cb = qed_aio_write_zero_cluster;
+    } else {
+        cb = qed_aio_write_prefill;
+        acb->cur_cluster = qed_alloc_clusters(s, acb->cur_nclusters);
+    }
+
     if (qed_should_set_need_check(s)) {
         s->header.features |= QED_F_NEED_CHECK;
-        qed_write_header(s, qed_aio_write_prefill, acb);
+        qed_write_header(s, cb, acb);
     } else {
-        qed_aio_write_prefill(acb, 0);
+        cb(acb, 0);
     }
 }
 
@@ -1131,6 +1167,16 @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
  */
 static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
 {
+    /* Allocate buffer for zero writes */
+    if (acb->flags & QED_AIOCB_ZERO) {
+        struct iovec *iov = acb->qiov->iov;
+
+        if (!iov->iov_base) {
+            iov->iov_base = qemu_blockalign(acb->common.bs, iov->iov_len);
+            memset(iov->iov_base, 0, iov->iov_len);
+        }
+    }
+
     /* Calculate the I/O vector */
     acb->cur_cluster = offset;
     qemu_iovec_copy(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
@@ -1311,6 +1357,53 @@ static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
     return bdrv_aio_flush(bs->file, cb, opaque);
 }
 
+typedef struct {
+    Coroutine *co;
+    int ret;
+    bool done;
+} QEDWriteZeroesCB;
+
+static void coroutine_fn qed_co_write_zeroes_cb(void *opaque, int ret)
+{
+    QEDWriteZeroesCB *cb = opaque;
+
+    cb->done = true;
+    cb->ret = ret;
+    if (cb->co) {
+        qemu_coroutine_enter(cb->co, NULL);
+    }
+}
+
+static int coroutine_fn bdrv_qed_co_write_zeroes(BlockDriverState *bs,
+                                                 int64_t sector_num,
+                                                 int nb_sectors)
+{
+    BlockDriverAIOCB *blockacb;
+    QEDWriteZeroesCB cb = { .done = false };
+    QEMUIOVector qiov;
+    struct iovec iov;
+
+    /* Zero writes start without an I/O buffer.  If a buffer becomes necessary
+     * then it will be allocated during request processing.
+     */
+    iov.iov_base = NULL,
+    iov.iov_len  = nb_sectors * BDRV_SECTOR_SIZE,
+
+    qemu_iovec_init_external(&qiov, &iov, 1);
+    blockacb = qed_aio_setup(bs, sector_num, &qiov, nb_sectors,
+                             qed_co_write_zeroes_cb, &cb,
+                             QED_AIOCB_WRITE | QED_AIOCB_ZERO);
+    if (!blockacb) {
+        return -EIO;
+    }
+    if (!cb.done) {
+        cb.co = qemu_coroutine_self();
+        qemu_coroutine_yield();
+    }
+    assert(cb.done);
+    return cb.ret;
+}
+
 static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset)
 {
     BDRVQEDState *s = bs->opaque;
@@ -1470,6 +1563,7 @@ static BlockDriver bdrv_qed = {
     .bdrv_aio_readv           = bdrv_qed_aio_readv,
     .bdrv_aio_writev          = bdrv_qed_aio_writev,
     .bdrv_aio_flush           = bdrv_qed_aio_flush,
+    .bdrv_co_write_zeroes     = bdrv_qed_co_write_zeroes,
     .bdrv_truncate            = bdrv_qed_truncate,
     .bdrv_getlength           = bdrv_qed_getlength,
     .bdrv_get_info            = bdrv_qed_get_info,
diff --git a/block/qed.h b/block/qed.h
index abed147..62624a1 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -125,6 +125,7 @@ typedef struct QEDRequest {
 
 enum {
     QED_AIOCB_WRITE = 0x0001,       /* read or write? */
+    QED_AIOCB_ZERO  = 0x0002,       /* zero write, used with QED_AIOCB_WRITE */
 };
 
 typedef struct QEDAIOCB {
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v3 6/6] qemu-io: add write -z option for bdrv_co_write_zeroes
  2011-12-21 16:00 [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 5/6] qed: add .bdrv_co_write_zeroes() support Stefan Hajnoczi
@ 2011-12-21 16:00 ` Stefan Hajnoczi
  2011-12-21 16:03 ` [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi
  6 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-12-21 16:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Marcelo Tosatti, Stefan Hajnoczi

Extend the qemu-io write command with the -z option to call
bdrv_co_write_zeroes().  Exposing the zero write interface from qemu-io
allows us to write tests that exercise this new block layer interface.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 qemu-io.c |   77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/qemu-io.c b/qemu-io.c
index ffa62fb..48e3393 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -218,6 +218,51 @@ static int do_pwrite(char *buf, int64_t offset, int count, int *total)
     return 1;
 }
 
+typedef struct {
+    int64_t offset;
+    int count;
+    int *total;
+    int ret;
+    bool done;
+} CoWriteZeroes;
+
+static void coroutine_fn co_write_zeroes_entry(void *opaque)
+{
+    CoWriteZeroes *data = opaque;
+
+    data->ret = bdrv_co_write_zeroes(bs, data->offset / BDRV_SECTOR_SIZE,
+                                     data->count / BDRV_SECTOR_SIZE);
+    data->done = true;
+    if (data->ret < 0) {
+        *data->total = data->ret;
+        return;
+    }
+
+    *data->total = data->count;
+}
+
+static int do_co_write_zeroes(int64_t offset, int count, int *total)
+{
+    Coroutine *co;
+    CoWriteZeroes data = {
+        .offset = offset,
+        .count  = count,
+        .total  = total,
+        .done   = false,
+    };
+
+    co = qemu_coroutine_create(co_write_zeroes_entry);
+    qemu_coroutine_enter(co, &data);
+    while (!data.done) {
+        qemu_aio_wait();
+    }
+    if (data.ret < 0) {
+        return data.ret;
+    } else {
+        return 1;
+    }
+}
+
 static int do_load_vmstate(char *buf, int64_t offset, int count, int *total)
 {
     *total = bdrv_load_vmstate(bs, (uint8_t *)buf, offset, count);
@@ -643,6 +688,7 @@ static void write_help(void)
 " -P, -- use different pattern to fill file\n"
 " -C, -- report statistics in a machine parsable format\n"
 " -q, -- quiet mode, do not show I/O statistics\n"
+" -z, -- write zeroes using bdrv_co_write_zeroes\n"
 "\n");
 }
 
@@ -654,7 +700,7 @@ static const cmdinfo_t write_cmd = {
     .cfunc      = write_f,
     .argmin     = 2,
     .argmax     = -1,
-    .args       = "[-abCpq] [-P pattern ] off len",
+    .args       = "[-bCpqz] [-P pattern ] off len",
     .oneline    = "writes a number of bytes at a specified offset",
     .help       = write_help,
 };
@@ -662,16 +708,16 @@ static const cmdinfo_t write_cmd = {
 static int write_f(int argc, char **argv)
 {
     struct timeval t1, t2;
-    int Cflag = 0, pflag = 0, qflag = 0, bflag = 0;
+    int Cflag = 0, pflag = 0, qflag = 0, bflag = 0, Pflag = 0, zflag = 0;
     int c, cnt;
-    char *buf;
+    char *buf = NULL;
     int64_t offset;
     int count;
     /* Some compilers get confused and warn if this is not initialized.  */
     int total = 0;
     int pattern = 0xcd;
 
-    while ((c = getopt(argc, argv, "bCpP:q")) != EOF) {
+    while ((c = getopt(argc, argv, "bCpP:qz")) != EOF) {
         switch (c) {
         case 'b':
             bflag = 1;
@@ -683,6 +729,7 @@ static int write_f(int argc, char **argv)
             pflag = 1;
             break;
         case 'P':
+            Pflag = 1;
             pattern = parse_pattern(optarg);
             if (pattern < 0) {
                 return 0;
@@ -691,6 +738,9 @@ static int write_f(int argc, char **argv)
         case 'q':
             qflag = 1;
             break;
+        case 'z':
+            zflag = 1;
+            break;
         default:
             return command_usage(&write_cmd);
         }
@@ -700,8 +750,13 @@ static int write_f(int argc, char **argv)
         return command_usage(&write_cmd);
     }
 
-    if (bflag && pflag) {
-        printf("-b and -p cannot be specified at the same time\n");
+    if (bflag + pflag + zflag > 1) {
+        printf("-b, -p, or -z cannot be specified at the same time\n");
+        return 0;
+    }
+
+    if (zflag && Pflag) {
+        printf("-z and -P cannot be specified at the same time\n");
         return 0;
     }
 
@@ -732,13 +787,17 @@ static int write_f(int argc, char **argv)
         }
     }
 
-    buf = qemu_io_alloc(count, pattern);
+    if (!zflag) {
+        buf = qemu_io_alloc(count, pattern);
+    }
 
     gettimeofday(&t1, NULL);
     if (pflag) {
         cnt = do_pwrite(buf, offset, count, &total);
     } else if (bflag) {
         cnt = do_save_vmstate(buf, offset, count, &total);
+    } else if (zflag) {
+        cnt = do_co_write_zeroes(offset, count, &total);
     } else {
         cnt = do_write(buf, offset, count, &total);
     }
@@ -758,7 +817,9 @@ static int write_f(int argc, char **argv)
     print_report("wrote", &t2, offset, count, total, cnt, Cflag);
 
 out:
-    qemu_io_free(buf);
+    if (!zflag) {
+        qemu_io_free(buf);
+    }
 
     return 0;
 }
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/6] block: zero writes
  2011-12-21 16:00 [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 6/6] qemu-io: add write -z option for bdrv_co_write_zeroes Stefan Hajnoczi
@ 2011-12-21 16:03 ` Stefan Hajnoczi
  6 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-12-21 16:03 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Marcelo Tosatti, qemu-devel

On Wed, Dec 21, 2011 at 4:00 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> This series adds an interface for efficient writes when data contains all
> zeros.  It also takes advantage of this new interface by extending the
> copy-on-read feature to perform zero-detection.
>
> The details of efficient zero representations depend on the image format.  This
> series includes a patch for the QED image format to write special "zero
> clusters" that keep the image file compact.  In the future qcow2v3 could also
> support an efficient zero representation.
>
> The new BlockDriver interface is called .bdrv_co_write_zeroes() and is
> optional.  If the interface is not implemented by a BlockDriver then a regular
> .bdrv_co_writev() operation will be performed.  The public interface is called
> bdrv_co_write_zeroes() and can be tested via the new qemu-io write -z option.
>
> Copy-on-read is extended to detect zeroes and invoke the
> .bdrv_co_write_zeroes() interface when possible.  As a result we avoid bloating
> the image file if the backing file contains zeroes.
>
> My motivation for this feature is efficient image streaming.  The destination
> file must stay compact even when copying zeroes from the source file.
>
> We now only do zero detection for copy-on-read requests, whereas previous
> revisions of this patch series scanned all write requests for zeroes.  The old
> behavior wasted CPU cycles in most cases but we could add a feature to
> explicitly scan guest writes for zeroes in the future, if desired.
>
> v2:

Typo:
s/v2/v3/

Stefan

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/6] cutils: extract buffer_is_zero() from qemu-img.c
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 1/6] cutils: extract buffer_is_zero() from qemu-img.c Stefan Hajnoczi
@ 2011-12-21 16:43   ` Eric Blake
  2011-12-22  7:47     ` Stefan Hajnoczi
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Blake @ 2011-12-21 16:43 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Marcelo Tosatti, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1222 bytes --]

On 12/21/2011 09:00 AM, Stefan Hajnoczi wrote:
> The qemu-img.c:is_not_zero() function checks if a buffer contains all
> zeroes.  This function will come in handy for zero-detection in the
> block layer, so clean it up and move it to cutils.c.
> 
> Note that the function now returns true if the buffer is all zeroes.
> This avoids the double-negatives (i.e. !is_not_zero()) that the old
> function can cause in callers.

Are there plans to improve the efficiency of buffer_is_zero to take
advantage of metadata about sparseness?

That is, there are cases where we can use metadata to prove a region of
a file is sparse, without having to read every byte within that region.
 Now that this series is giving QED special metadata that marks a zero
cluster, it is faster to query if that metadata exists denoting a zero
cluster than it is to read the entire cluster and check for non-zero.
Likewise, with regular files, the kernel provides lseek(SEEK_HOLE) (or
the older, lower-level, ioctl(FS_IOC_FIEMAP)); which at least GNU
coreutils is using for efficient sparse detection in source files.

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 620 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/6] block: add .bdrv_co_write_zeroes() interface
  2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 2/6] block: add .bdrv_co_write_zeroes() interface Stefan Hajnoczi
@ 2011-12-21 16:50   ` Christoph Hellwig
  2011-12-22  7:54     ` Stefan Hajnoczi
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2011-12-21 16:50 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Marcelo Tosatti, qemu-devel

On Wed, Dec 21, 2011 at 04:00:36PM +0000, Stefan Hajnoczi wrote:
> The ability to zero regions of an image file is a useful primitive for
> higher-level features such as image streaming or zero write detection.
> 
> Image formats may support an optimized metadata representation instead
> of writing zeroes into the image file.  This allows zero writes to be
> potentially faster than regular write operations and also preserve
> sparseness of the image file.
> 
> The .bdrv_co_write_zeroes() interface should be implemented by block
> drivers that wish to provide efficient zeroing.
> 
> Note that this operation is different from the discard operation, which
> may leave the contents of the region indeterminate.  That means
> discarded blocks are not guaranteed to contain zeroes and may contain
> junk data instead.

Most real life discard operations zero the data, and both the ATA and SCSI
spec allow the device to set a bit which gurantees this behaviour.  I think
we also should make these one interface, and if the caller needs it to
actually zero out the discarded blocks it should check if the discard
implementation guarantees that.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/6] cutils: extract buffer_is_zero() from qemu-img.c
  2011-12-21 16:43   ` Eric Blake
@ 2011-12-22  7:47     ` Stefan Hajnoczi
  0 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-12-22  7:47 UTC (permalink / raw)
  To: Eric Blake; +Cc: Kevin Wolf, Marcelo Tosatti, qemu-devel

On Wed, Dec 21, 2011 at 09:43:55AM -0700, Eric Blake wrote:
> On 12/21/2011 09:00 AM, Stefan Hajnoczi wrote:
> > The qemu-img.c:is_not_zero() function checks if a buffer contains all
> > zeroes.  This function will come in handy for zero-detection in the
> > block layer, so clean it up and move it to cutils.c.
> > 
> > Note that the function now returns true if the buffer is all zeroes.
> > This avoids the double-negatives (i.e. !is_not_zero()) that the old
> > function can cause in callers.
> 
> Are there plans to improve the efficiency of buffer_is_zero to take
> advantage of metadata about sparseness?
> 
> That is, there are cases where we can use metadata to prove a region of
> a file is sparse, without having to read every byte within that region.
>  Now that this series is giving QED special metadata that marks a zero
> cluster, it is faster to query if that metadata exists denoting a zero
> cluster than it is to read the entire cluster and check for non-zero.
> Likewise, with regular files, the kernel provides lseek(SEEK_HOLE) (or
> the older, lower-level, ioctl(FS_IOC_FIEMAP)); which at least GNU
> coreutils is using for efficient sparse detection in source files.

Yes, there are ways to optimize this for specific storage backends.  But
we need a code path that supports all storage systems first.  For
example, raw files over NFS or an image file over HTTP (curl).

In the case of qcow2 or QED backing files we already don't read zeroes
today.  Instead we memset the read buffer to zero and the waste CPU
cycles doing buffer_is_zero() detection.  At least this means that file
I/O (and network I/O, if using NFS) is already optimal if your backing
file is qcow2 or QED - it's just the CPU cycles that we can optimize
away.

Stefan

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/6] block: add .bdrv_co_write_zeroes() interface
  2011-12-21 16:50   ` Christoph Hellwig
@ 2011-12-22  7:54     ` Stefan Hajnoczi
  0 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2011-12-22  7:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Kevin Wolf, Marcelo Tosatti, qemu-devel

On Wed, Dec 21, 2011 at 05:50:32PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 21, 2011 at 04:00:36PM +0000, Stefan Hajnoczi wrote:
> > The ability to zero regions of an image file is a useful primitive for
> > higher-level features such as image streaming or zero write detection.
> > 
> > Image formats may support an optimized metadata representation instead
> > of writing zeroes into the image file.  This allows zero writes to be
> > potentially faster than regular write operations and also preserve
> > sparseness of the image file.
> > 
> > The .bdrv_co_write_zeroes() interface should be implemented by block
> > drivers that wish to provide efficient zeroing.
> > 
> > Note that this operation is different from the discard operation, which
> > may leave the contents of the region indeterminate.  That means
> > discarded blocks are not guaranteed to contain zeroes and may contain
> > junk data instead.
> 
> Most real life discard operations zero the data, and both the ATA and SCSI
> spec allow the device to set a bit which gurantees this behaviour.  I think
> we also should make these one interface, and if the caller needs it to
> actually zero out the discarded blocks it should check if the discard
> implementation guarantees that.

Okay, I see how that could work but still need to look into the details
of how to combine the two and check the zero/indeterminate bit coming
from ATA/SCSI.

Stefan

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-12-22  9:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-21 16:00 [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi
2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 1/6] cutils: extract buffer_is_zero() from qemu-img.c Stefan Hajnoczi
2011-12-21 16:43   ` Eric Blake
2011-12-22  7:47     ` Stefan Hajnoczi
2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 2/6] block: add .bdrv_co_write_zeroes() interface Stefan Hajnoczi
2011-12-21 16:50   ` Christoph Hellwig
2011-12-22  7:54     ` Stefan Hajnoczi
2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 3/6] block: perform zero-detection during copy-on-read Stefan Hajnoczi
2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 4/6] qed: replace is_write with flags field Stefan Hajnoczi
2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 5/6] qed: add .bdrv_co_write_zeroes() support Stefan Hajnoczi
2011-12-21 16:00 ` [Qemu-devel] [PATCH v3 6/6] qemu-io: add write -z option for bdrv_co_write_zeroes Stefan Hajnoczi
2011-12-21 16:03 ` [Qemu-devel] [PATCH v3 0/6] block: zero writes Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.