* [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes
@ 2012-03-08 17:15 Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 01/17] qemu-iotests: add a simple test for write_zeroes Paolo Bonzini
                   ` (16 more replies)
  0 siblings, 17 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

This series reworks write_zeroes and discard support in QEMU, with
the aim of enabling discard support in guests.  The problem here is
that so far the semantics of discard were left to the block drivers.
Discard could only be enabled as long as all storage had the same
characteristics, including all of them supporting it and all of them
having the same discard granularity.  Hence, nobody really bothered to
enable it.  (This is just as well, because it leaves us with fewer
backwards-compatibility issues.)

This series fixes this by specifying the semantics of discard at the
block layer level: from now on, discard will always have sector
granularity and will always zero the discarded sectors.  Given this
choice, the equivalent of Linux's discard_zeroes_data can also be
easily implemented.
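
[Editorial sketch, not code from the series: with "discard always zeroes"
semantics, the decision of whether a zero-write can be passed down as a
discard reduces to an alignment check against the driver's advertised
granularity.  Names here are illustrative; the real check appears in
patch 6.]

```c
#include <stdbool.h>
#include <stdint.h>

/* A zero-write request may be passed down as a discard only when the
 * driver reports discard_zeroes_data and the request is aligned to the
 * discard granularity (a power of two, measured in sectors). */
static bool can_use_discard(bool discard_zeroes_data, int granularity,
                            int64_t sector_num, int nb_sectors)
{
    return discard_zeroes_data &&
           granularity > 0 &&
           (sector_num & (granularity - 1)) == 0 &&
           (nb_sectors & (granularity - 1)) == 0;
}
```

Requests that fail this test fall back to an ordinary zero-filled write.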

Patches 1-4 add information on discard to bdrv_get_info, and implement
discard for QED.

Patches 5-8 replace the write_zeroes operation with discard.

Patches 9-12 turn discard on in the device models.

Patches 13-15 add discard support for files and enable hole punching
on filesystems other than XFS.

To complete thin provisioning support, patches 16 and 17 implement
is_allocated for raw using the brand new SEEK_HOLE/SEEK_DATA
interface.  With these patches and with the kernel patch at
http://patchwork.xfs.org/patch/3264/ it should be possible to do efficient
streaming with raw on XFS.

Various bits could be split out and committed separately; comments
are welcome.

Paolo Bonzini (17):
  qemu-iotests: add a simple test for write_zeroes
  qed: make write-zeroes bounce buffer smaller than a single cluster
  block: add discard properties to BlockDriverInfo
  qed: implement bdrv_aio_discard
  block: pass around qiov for write_zeroes operation
  block: use bdrv_{co,aio}_discard for write_zeroes operations
  block: make high level discard operation always zero
  block: kill the write zeroes operation
  ide: issue discard asynchronously but serialize the pieces
  ide/scsi: add discard_zeroes_data property
  ide/scsi: prepare for flipping the discard defaults
  ide/scsi: turn on discard
  block: fallback from discard to writes
  block: support FALLOC_FL_PUNCH_HOLE trimming
  raw: add get_info
  qemu-io: fix the alloc command
  raw: implement is_allocated

 block.c                    |   62 +++++++------
 block.h                    |   30 ++++---
 block/qcow2.c              |    2 +
 block/qed.c                |  126 ++++++++++++-------------
 block/raw-posix.c          |  100 +++++++++++++++++++-
 block/raw.c                |   14 +++
 block_int.h                |    8 --
 hw/ide/core.c              |   81 +++++++++++-----
 hw/ide/qdev.c              |    6 +-
 hw/pc_piix.c               |  224 ++++++++++++++++++++++++++++++++++++++++++++
 hw/scsi-disk.c             |   17 +++-
 qemu-io.c                  |   61 ++----------
 tests/qemu-iotests/031     |   73 ++++++++++++++
 tests/qemu-iotests/031.out |   29 ++++++
 tests/qemu-iotests/group   |    1 +
 15 files changed, 635 insertions(+), 199 deletions(-)
 create mode 100755 tests/qemu-iotests/031
 create mode 100644 tests/qemu-iotests/031.out

-- 
1.7.7.6

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [Qemu-devel] [RFC PATCH 01/17] qemu-iotests: add a simple test for write_zeroes
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 02/17] qed: make write-zeroes bounce buffer smaller than a single cluster Paolo Bonzini
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 tests/qemu-iotests/031     |   73 ++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/031.out |   29 +++++++++++++++++
 tests/qemu-iotests/group   |    1 +
 3 files changed, 103 insertions(+), 0 deletions(-)
 create mode 100755 tests/qemu-iotests/031
 create mode 100644 tests/qemu-iotests/031.out

diff --git a/tests/qemu-iotests/031 b/tests/qemu-iotests/031
new file mode 100755
index 0000000..9aee078
--- /dev/null
+++ b/tests/qemu-iotests/031
@@ -0,0 +1,73 @@
+#!/bin/bash
+#
+# Test aligned and misaligned write zeroes operations.
+#
+# Copyright (C) 2012 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=pbonzini@redhat.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt generic
+_supported_proto generic
+_supported_os Linux
+
+
+size=128M
+_make_test_img $size
+
+echo
+echo "== preparing image =="
+$QEMU_IO -c "write -P 0xa 0x200 0x400" $TEST_IMG | _filter_qemu_io
+$QEMU_IO -c "write -P 0xa 0x20000 0x600" $TEST_IMG | _filter_qemu_io
+$QEMU_IO -c "write -z 0x400 0x20000" $TEST_IMG | _filter_qemu_io
+
+echo
+echo "== verifying patterns (1) =="
+$QEMU_IO -c "read -P 0xa 0x200 0x200" $TEST_IMG | _filter_qemu_io
+$QEMU_IO -c "read -P 0x0 0x400 0x20000" $TEST_IMG | _filter_qemu_io
+$QEMU_IO -c "read -P 0xa 0x20400 0x200" $TEST_IMG | _filter_qemu_io
+
+echo
+echo "== rewriting zeroes =="
+$QEMU_IO -c "write -P 0xb 0x10000 0x10000" $TEST_IMG | _filter_qemu_io
+$QEMU_IO -c "write -z 0x10000 0x10000" $TEST_IMG | _filter_qemu_io
+
+echo
+echo "== verifying patterns (2) =="
+$QEMU_IO -c "read -P 0x0 0x400 0x20000" $TEST_IMG | _filter_qemu_io
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
new file mode 100644
index 0000000..3c990dd
--- /dev/null
+++ b/tests/qemu-iotests/031.out
@@ -0,0 +1,29 @@
+QA output created by 031
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 
+
+== preparing image ==
+wrote 1024/1024 bytes at offset 512
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1536/1536 bytes at offset 131072
+2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 1024
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== verifying patterns (1) ==
+read 512/512 bytes at offset 512
+512.000000 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 131072/131072 bytes at offset 1024
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 132096
+512.000000 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== rewriting zeroes ==
+wrote 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== verifying patterns (2) ==
+read 131072/131072 bytes at offset 1024
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index fcf869d..4bae393 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -37,3 +37,4 @@
 028 rw backing auto
 029 rw auto
 030 rw auto
+031 rw auto
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 02/17] qed: make write-zeroes bounce buffer smaller than a single cluster
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 01/17] qemu-iotests: add a simple test for write_zeroes Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 03/17] block: add discard properties to BlockDriverInfo Paolo Bonzini
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Currently, a write-zeroes operation can allocate memory for the whole
I/O operation if it is not aligned.  This is unnecessary, because at
most two clusters of the request can be unaligned and actually need
to be written.

This makes the write-zeroes operation slower because it proceeds
one cluster at a time, even if all clusters are currently available
and zero.  However, write-zeroes (and discard) are not fast paths.
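
[Editorial sketch of the chunking idea, not code from the patch; the
cluster size is a hypothetical constant.]

```c
#include <stdint.h>

#define CLUSTER_SIZE 65536u  /* hypothetical cluster size in bytes */

/* Cap each zero-write step at one cluster, so a single cluster-sized
 * zero buffer can serve the whole request regardless of its length. */
static uint64_t next_zero_chunk(uint64_t cur_pos, uint64_t end_pos)
{
    uint64_t len = end_pos - cur_pos;
    return len < CLUSTER_SIZE ? len : CLUSTER_SIZE;
}
```

The request advances by the returned length until cur_pos reaches
end_pos; only the head and tail chunks can be unaligned.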

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/qed.c |   68 +++++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 43 insertions(+), 25 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index a041d31..4f3d88d 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -878,7 +878,9 @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
     /* Free the buffer we may have allocated for zero writes */
     if (acb->flags & QED_AIOCB_ZERO) {
         qemu_vfree(acb->qiov->iov[0].iov_base);
-        acb->qiov->iov[0].iov_base = NULL;
+        qemu_iovec_destroy(acb->qiov);
+        g_free(acb->qiov);
+        acb->qiov = NULL;
     }
 
     /* Arrange for a bh to invoke the completion function */
@@ -1105,6 +1107,34 @@ static void qed_aio_write_zero_cluster(void *opaque, int ret)
 }
 
 /**
+ * Calculate the I/O vector
+ *
+ * @acb:        Write request
+ * @len:        Length in bytes
+ */
+static void qed_prepare_qiov(QEDAIOCB *acb, size_t len)
+{
+    /* Calculate the I/O vector */
+    if (acb->flags & QED_AIOCB_ZERO) {
+        /* Allocate buffer for zero writes */
+        if (!acb->qiov) {
+            BDRVQEDState *s = acb_to_s(acb);
+            char *base;
+
+            acb->qiov = g_malloc(sizeof(QEMUIOVector));
+            base = qemu_blockalign(s->bs, s->header.cluster_size);
+            qemu_iovec_init(acb->qiov, 1);
+            qemu_iovec_add(acb->qiov, base, s->header.cluster_size);
+            memset(base, 0, s->header.cluster_size);
+        }
+	assert(len <= acb->qiov->size);
+        qemu_iovec_add(&acb->cur_qiov, acb->qiov->iov[0].iov_base, len);
+    } else {
+        qemu_iovec_copy(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
+    }
+}
+
+/**
  * Write new data cluster
  *
  * @acb:        Write request
@@ -1133,7 +1163,7 @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
 
     acb->cur_nclusters = qed_bytes_to_clusters(s,
             qed_offset_into_cluster(s, acb->cur_pos) + len);
-    qemu_iovec_copy(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
+    qed_prepare_qiov(acb, len);
 
     if (acb->flags & QED_AIOCB_ZERO) {
         /* Skip ahead if the clusters are already zero */
@@ -1167,19 +1197,8 @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
  */
 static void qed_aio_write_inplace(QEDAIOCB *acb, uint64_t offset, size_t len)
 {
-    /* Allocate buffer for zero writes */
-    if (acb->flags & QED_AIOCB_ZERO) {
-        struct iovec *iov = acb->qiov->iov;
-
-        if (!iov->iov_base) {
-            iov->iov_base = qemu_blockalign(acb->common.bs, iov->iov_len);
-            memset(iov->iov_base, 0, iov->iov_len);
-        }
-    }
-
-    /* Calculate the I/O vector */
     acb->cur_cluster = offset;
-    qemu_iovec_copy(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
+    qed_prepare_qiov(acb, len);
 
     /* Do the actual write */
     qed_aio_write_main(acb, 0);
@@ -1279,6 +1298,7 @@ static void qed_aio_next_io(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
+    uint64_t len;
     QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
                                 qed_aio_write_data : qed_aio_read_data;
 
@@ -1300,10 +1320,14 @@ static void qed_aio_next_io(void *opaque, int ret)
         return;
     }
 
+    /* Limit buffer size when writing zeroes.  */
+    len = acb->end_pos - acb->cur_pos;
+    if (acb->flags & QED_AIOCB_ZERO) {
+        len = MIN(len, s->header.cluster_size);
+    }
+
     /* Find next cluster and start I/O */
-    qed_find_cluster(s, &acb->request,
-                      acb->cur_pos, acb->end_pos - acb->cur_pos,
-                      io_fn, acb);
+    qed_find_cluster(s, &acb->request, acb->cur_pos, len, io_fn, acb);
 }
 
 static BlockDriverAIOCB *qed_aio_setup(BlockDriverState *bs,
@@ -1324,7 +1348,7 @@ static BlockDriverAIOCB *qed_aio_setup(BlockDriverState *bs,
     acb->cur_pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE;
     acb->end_pos = acb->cur_pos + nb_sectors * BDRV_SECTOR_SIZE;
     acb->request.l2_table = NULL;
-    qemu_iovec_init(&acb->cur_qiov, qiov->niov);
+    qemu_iovec_init(&acb->cur_qiov, qiov ? qiov->niov : 1);
 
     /* Start request */
     qed_aio_next_io(acb, 0);
@@ -1380,17 +1404,11 @@ static int coroutine_fn bdrv_qed_co_write_zeroes(BlockDriverState *bs,
 {
     BlockDriverAIOCB *blockacb;
     QEDWriteZeroesCB cb = { .done = false };
-    QEMUIOVector qiov;
-    struct iovec iov;
 
     /* Zero writes start without an I/O buffer.  If a buffer becomes necessary
      * then it will be allocated during request processing.
      */
-    iov.iov_base = NULL,
-    iov.iov_len  = nb_sectors * BDRV_SECTOR_SIZE,
-
-    qemu_iovec_init_external(&qiov, &iov, 1);
-    blockacb = qed_aio_setup(bs, sector_num, &qiov, nb_sectors,
+    blockacb = qed_aio_setup(bs, sector_num, NULL, nb_sectors,
                              qed_co_write_zeroes_cb, &cb,
                              QED_AIOCB_WRITE | QED_AIOCB_ZERO);
     if (!blockacb) {
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 03/17] block: add discard properties to BlockDriverInfo
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 01/17] qemu-iotests: add a simple test for write_zeroes Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 02/17] qed: make write-zeroes bounce buffer smaller than a single cluster Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-09 16:47   ` Kevin Wolf
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 04/17] qed: implement bdrv_aio_discard Paolo Bonzini
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

In the next patches we will declare the guest's semantics for discard
to be "always zero the data".  We need to know whether the operation
needs to be emulated or can be passed down to the driver.  For this
purpose we need both the semantics of the operation and its
granularity.
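
[Editorial sketch: how a cluster-based format might derive the
granularity it reports in sectors; BDRV_SECTOR_SIZE is 512 bytes in
QEMU, the helper name is hypothetical.]

```c
#define BDRV_SECTOR_SIZE 512

/* A format that discards whole clusters reports its discard
 * granularity in sectors, as the qcow2 hunk below does; a format with
 * no cluster concept can pick a value independently. */
static int discard_granularity_sectors(int cluster_size)
{
    return cluster_size / BDRV_SECTOR_SIZE;
}
```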

The granularity may not be related to the cluster size (for example
"raw" does not have a cluster size), so add it separately.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.h       |    4 ++++
 block/qcow2.c |    2 ++
 qemu-io.c     |    5 ++++-
 3 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/block.h b/block.h
index 48d0bf3..7feda73 100644
--- a/block.h
+++ b/block.h
@@ -15,6 +15,10 @@ typedef struct BlockDriverInfo {
     int cluster_size;
     /* offset at which the VM state can be saved (0 if not possible) */
     int64_t vm_state_offset;
+    /* whether discard is guaranteed to zero bytes */
+    bool discard_zeroes_data;
+    /* discard granularity in sectors */
+    int discard_granularity;
 } BlockDriverInfo;
 
 typedef struct QEMUSnapshotInfo {
diff --git a/block/qcow2.c b/block/qcow2.c
index eb5ea48..2908484 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1256,6 +1256,8 @@ static int qcow2_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
     BDRVQcowState *s = bs->opaque;
     bdi->cluster_size = s->cluster_size;
     bdi->vm_state_offset = qcow2_vm_state_offset(s);
+    bdi->discard_zeroes_data = (bs->backing_hd == NULL);
+    bdi->discard_granularity = s->cluster_size / BDRV_SECTOR_SIZE;
     return 0;
 }
 
diff --git a/qemu-io.c b/qemu-io.c
index 3189530..8f28b6a 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -1437,7 +1437,7 @@ static const cmdinfo_t length_cmd = {
 static int info_f(int argc, char **argv)
 {
     BlockDriverInfo bdi;
-    char s1[64], s2[64];
+    char s1[64], s2[64], s3[64];
     int ret;
 
     if (bs->drv && bs->drv->format_name) {
@@ -1454,9 +1454,12 @@ static int info_f(int argc, char **argv)
 
     cvtstr(bdi.cluster_size, s1, sizeof(s1));
     cvtstr(bdi.vm_state_offset, s2, sizeof(s2));
+    cvtstr(bdi.discard_granularity * BDRV_SECTOR_SIZE, s3, sizeof(s3));
 
     printf("cluster size: %s\n", s1);
     printf("vm state offset: %s\n", s2);
+    printf("discard zeroes: %s\n", bdi.discard_zeroes_data ? "yes" : "no");
+    printf("discard granularity: %s\n", s3);
 
     return 0;
 }
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 04/17] qed: implement bdrv_aio_discard
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (2 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 03/17] block: add discard properties to BlockDriverInfo Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-09 16:31   ` Kevin Wolf
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 05/17] block: pass around qiov for write_zeroes operation Paolo Bonzini
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Move the bdrv_co_write_zeroes operation to bdrv_aio_discard, so that
the ugly coroutine loop can be eliminated.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/qed.c |   54 ++++++++++++++++++------------------------------------
 1 files changed, 18 insertions(+), 36 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index 4f3d88d..9944be5 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1374,6 +1374,20 @@ static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
                          opaque, QED_AIOCB_WRITE);
 }
 
+static BlockDriverAIOCB *bdrv_qed_aio_discard(BlockDriverState *bs,
+                                              int64_t sector_num,
+                                              int nb_sectors,
+                                              BlockDriverCompletionFunc *cb,
+                                              void *opaque)
+{
+    /* Zero writes start without an I/O buffer.  If a buffer becomes necessary
+     * then it will be allocated during request processing.
+     */
+    return qed_aio_setup(bs, sector_num, NULL, nb_sectors,
+                         cb, opaque,
+                         QED_AIOCB_WRITE | QED_AIOCB_ZERO);
+}
+
 static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
                                             BlockDriverCompletionFunc *cb,
                                             void *opaque)
@@ -1381,45 +1395,11 @@ static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
     return bdrv_aio_flush(bs->file, cb, opaque);
 }
 
-typedef struct {
-    Coroutine *co;
-    int ret;
-    bool done;
-} QEDWriteZeroesCB;
-
-static void coroutine_fn qed_co_write_zeroes_cb(void *opaque, int ret)
-{
-    QEDWriteZeroesCB *cb = opaque;
-
-    cb->done = true;
-    cb->ret = ret;
-    if (cb->co) {
-        qemu_coroutine_enter(cb->co, NULL);
-    }
-}
-
 static int coroutine_fn bdrv_qed_co_write_zeroes(BlockDriverState *bs,
                                                  int64_t sector_num,
                                                  int nb_sectors)
 {
-    BlockDriverAIOCB *blockacb;
-    QEDWriteZeroesCB cb = { .done = false };
-
-    /* Zero writes start without an I/O buffer.  If a buffer becomes necessary
-     * then it will be allocated during request processing.
-     */
-    blockacb = qed_aio_setup(bs, sector_num, NULL, nb_sectors,
-                             qed_co_write_zeroes_cb, &cb,
-                             QED_AIOCB_WRITE | QED_AIOCB_ZERO);
-    if (!blockacb) {
-        return -EIO;
-    }
-    if (!cb.done) {
-        cb.co = qemu_coroutine_self();
-        qemu_coroutine_yield();
-    }
-    assert(cb.done);
-    return cb.ret;
+    return bdrv_co_discard(bs, sector_num, nb_sectors);
 }
 
 static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset)
@@ -1457,8 +1437,9 @@ static int bdrv_qed_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
     BDRVQEDState *s = bs->opaque;
 
-    memset(bdi, 0, sizeof(*bdi));
     bdi->cluster_size = s->header.cluster_size;
+    bdi->discard_zeroes_data = true;
+    bdi->discard_granularity = 1;
     return 0;
 }
 
@@ -1581,6 +1562,7 @@ static BlockDriver bdrv_qed = {
     .bdrv_aio_readv           = bdrv_qed_aio_readv,
     .bdrv_aio_writev          = bdrv_qed_aio_writev,
     .bdrv_aio_flush           = bdrv_qed_aio_flush,
+    .bdrv_aio_discard         = bdrv_qed_aio_discard,
     .bdrv_co_write_zeroes     = bdrv_qed_co_write_zeroes,
     .bdrv_truncate            = bdrv_qed_truncate,
     .bdrv_getlength           = bdrv_qed_getlength,
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 05/17] block: pass around qiov for write_zeroes operation
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (3 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 04/17] qed: implement bdrv_aio_discard Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co,aio}_discard for write_zeroes operations Paolo Bonzini
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Avoid allocating an extra, useless bounce buffer if possible.
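
[Editorial sketch of the fallback order this patch establishes; the
enum and function names are hypothetical, not QEMU identifiers.]

```c
#include <stddef.h>

enum zero_strategy {
    ZS_DRIVER_ZEROES,   /* driver has an efficient write-zeroes op */
    ZS_WRITE_CALLER,    /* reuse the caller's already-zeroed qiov */
    ZS_BOUNCE           /* last resort: allocate and memset a buffer */
};

/* A bounce buffer is allocated only when the driver cannot zero
 * efficiently and the caller did not pass a buffer of its own. */
static enum zero_strategy pick_zero_strategy(int has_driver_zeroes,
                                             const void *caller_qiov)
{
    if (has_driver_zeroes) {
        return ZS_DRIVER_ZEROES;
    }
    if (caller_qiov) {
        return ZS_WRITE_CALLER;
    }
    return ZS_BOUNCE;
}
```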

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c |   20 +++++++++++++-------
 1 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/block.c b/block.c
index 52ffe14..e4f7782 100644
--- a/block.c
+++ b/block.c
@@ -72,6 +72,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
 static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
     BdrvRequestFlags flags);
+static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
 static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
                                                QEMUIOVector *qiov,
@@ -1676,8 +1678,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BlockDriverState *bs,
 
     if (drv->bdrv_co_write_zeroes &&
         buffer_is_zero(bounce_buffer, iov.iov_len)) {
-        ret = drv->bdrv_co_write_zeroes(bs, cluster_sector_num,
-                                        cluster_nb_sectors);
+        ret = bdrv_co_do_write_zeroes(bs, cluster_sector_num,
+                                      cluster_nb_sectors, &bounce_qiov);
     } else {
         ret = drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
                                   &bounce_qiov);
@@ -1780,10 +1782,10 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
 }
 
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
-    int64_t sector_num, int nb_sectors)
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
 {
     BlockDriver *drv = bs->drv;
-    QEMUIOVector qiov;
+    QEMUIOVector my_qiov;
     struct iovec iov;
     int ret;
 
@@ -1792,13 +1794,17 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
         return drv->bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
     }
 
+    if (qiov) {
+        return drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
+    }
+
     /* Fall back to bounce buffer if write zeroes is unsupported */
     iov.iov_len  = nb_sectors * BDRV_SECTOR_SIZE;
     iov.iov_base = qemu_blockalign(bs, iov.iov_len);
     memset(iov.iov_base, 0, iov.iov_len);
-    qemu_iovec_init_external(&qiov, &iov, 1);
+    qemu_iovec_init_external(&my_qiov, &iov, 1);
 
-    ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, &qiov);
+    ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, &my_qiov);
 
     qemu_vfree(iov.iov_base);
     return ret;
@@ -1837,7 +1843,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
 
     if (flags & BDRV_REQ_ZERO_WRITE) {
-        ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors);
+        ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors, qiov);
     } else {
         ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
     }
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co,aio}_discard for write_zeroes operations
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (4 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 05/17] block: pass around qiov for write_zeroes operation Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-09 16:37   ` Kevin Wolf
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero Paolo Bonzini
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Remove the bdrv_co_write_zeroes callback.  Instead use the discard
information from bdrv_get_info to choose between bdrv_co_discard
and a normal write.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c     |   12 +++++++++---
 block/qed.c |    8 --------
 block_int.h |    8 --------
 3 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/block.c b/block.c
index e4f7782..34db914 100644
--- a/block.c
+++ b/block.c
@@ -1656,6 +1656,7 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BlockDriverState *bs,
     int cluster_nb_sectors;
     size_t skip_bytes;
     int ret;
+    BlockDriverInfo bdi;
 
     /* Cover entire cluster so no additional backing file I/O is required when
      * allocating cluster in the image file.
@@ -1676,7 +1677,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BlockDriverState *bs,
         goto err;
     }
 
-    if (drv->bdrv_co_write_zeroes &&
+    /* If it is worthless, do not check if the buffer is zero.  */
+    if (bdrv_get_info(bs, &bdi) >= 0 && bdi.discard_zeroes_data &&
         buffer_is_zero(bounce_buffer, iov.iov_len)) {
         ret = bdrv_co_do_write_zeroes(bs, cluster_sector_num,
                                       cluster_nb_sectors, &bounce_qiov);
@@ -1785,13 +1787,17 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverInfo bdi;
     QEMUIOVector my_qiov;
     struct iovec iov;
     int ret;
 
     /* First try the efficient write zeroes operation */
-    if (drv->bdrv_co_write_zeroes) {
-        return drv->bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
+    if (bdrv_get_info(bs, &bdi) >= 0 && bdi.discard_zeroes_data &&
+        bdi.discard_granularity &&
+        (sector_num & (bdi.discard_granularity - 1)) == 0 &&
+        (nb_sectors & (bdi.discard_granularity - 1)) == 0) {
+        return bdrv_co_discard(bs, sector_num, nb_sectors);
     }
 
     if (qiov) {
diff --git a/block/qed.c b/block/qed.c
index 9944be5..33fe03f 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1395,13 +1395,6 @@ static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
     return bdrv_aio_flush(bs->file, cb, opaque);
 }
 
-static int coroutine_fn bdrv_qed_co_write_zeroes(BlockDriverState *bs,
-                                                 int64_t sector_num,
-                                                 int nb_sectors)
-{
-    return bdrv_co_discard(bs, sector_num, nb_sectors);
-}
-
 static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset)
 {
     BDRVQEDState *s = bs->opaque;
@@ -1563,7 +1556,6 @@ static BlockDriver bdrv_qed = {
     .bdrv_aio_writev          = bdrv_qed_aio_writev,
     .bdrv_aio_flush           = bdrv_qed_aio_flush,
     .bdrv_aio_discard         = bdrv_qed_aio_discard,
-    .bdrv_co_write_zeroes     = bdrv_qed_co_write_zeroes,
     .bdrv_truncate            = bdrv_qed_truncate,
     .bdrv_getlength           = bdrv_qed_getlength,
     .bdrv_get_info            = bdrv_qed_get_info,
diff --git a/block_int.h b/block_int.h
index b460c36..2e8cdfe 100644
--- a/block_int.h
+++ b/block_int.h
@@ -131,14 +131,6 @@ struct BlockDriver {
         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
     int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
-    /*
-     * Efficiently zero a region of the disk image.  Typically an image format
-     * would use a compact metadata representation to implement this.  This
-     * function pointer may be NULL and .bdrv_co_writev() will be called
-     * instead.
-     */
-    int coroutine_fn (*bdrv_co_write_zeroes)(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors);
     int coroutine_fn (*bdrv_co_discard)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors);
     int coroutine_fn (*bdrv_co_is_allocated)(BlockDriverState *bs,
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (5 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co,aio}_discard for write_zeroes operations Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 17:55   ` Avi Kivity
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 08/17] block: kill the write zeroes operation Paolo Bonzini
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

While the formats/protocols are free to implement a discard operation
that does *not* zero the data, providing a common view to the guests is
the only way to avoid configuration and migration nightmares.  (We don't
want the admin to manually set up discard_zeroes_data/discard_granularity!)

This has a couple of drawbacks:

1) QEMU effectively will not try to use discard unless discard_zeroes_data
is true.  To do this we could add a flag such as BDRV_ALWAYS_DISCARD,
which the device models could use if their discard_zeroes_data property is
set to false.  However, I haven't done this, estimating that no strictly
positive integer could represent the number of people using it.

2) It may turn a thin-provisioning operation into a full preallocation,
which defeats the purpose of thin provisioning.

3) It may cost memory for a bounce buffer that only needs to be filled
with zeroes.

While (3) can be worked around, the only way around the other two,
unfortunately, is support in the formats and protocols.  We will still
provide device options to opt out of this, but with raw and qed covered
(+ qcow2 without backing file, and qcow3 in the future) it should not
be too bad.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 34db914..30b137f 100644
--- a/block.c
+++ b/block.c
@@ -74,6 +74,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     BdrvRequestFlags flags);
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
+static int coroutine_fn bdrv_co_do_discard(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors);
 static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
                                                QEMUIOVector *qiov,
@@ -1797,7 +1799,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
         bdi.discard_granularity &&
         (sector_num & (bdi.discard_granularity - 1)) == 0 &&
         (nb_sectors & (bdi.discard_granularity - 1)) == 0) {
-        return bdrv_co_discard(bs, sector_num, nb_sectors);
+        return bdrv_co_do_discard(bs, sector_num, nb_sectors);
     }
 
     if (qiov) {
@@ -3621,8 +3623,8 @@ static void coroutine_fn bdrv_discard_co_entry(void *opaque)
     rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors);
 }
 
-int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
-                                 int nb_sectors)
+static int coroutine_fn bdrv_co_do_discard(BlockDriverState *bs,
+                                           int64_t sector_num, int nb_sectors)
 {
     if (!bs->drv) {
         return -ENOMEDIUM;
@@ -3651,6 +3653,12 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
     }
 }
 
+int coroutine_fn bdrv_co_discard(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors)
+{
+    return bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors, NULL);
+}
+
 int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
 {
     Coroutine *co;
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 08/17] block: kill the write zeroes operation
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (6 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 09/17] ide: issue discard asynchronously but serialize the pieces Paolo Bonzini
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

The write zeroes operations no longer provide anything beyond the
discard operations.  Remove them, and use discard instead in qemu-io
and the tests.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c                    |   17 +--------------
 block.h                    |   15 ++++++-------
 qemu-io.c                  |   50 +-------------------------------------------
 tests/qemu-iotests/031     |    6 ++--
 tests/qemu-iotests/031.out |    4 +-
 5 files changed, 14 insertions(+), 78 deletions(-)

diff --git a/block.c b/block.c
index 30b137f..af73497 100644
--- a/block.c
+++ b/block.c
@@ -50,7 +50,6 @@
 
 typedef enum {
     BDRV_REQ_COPY_ON_READ = 0x1,
-    BDRV_REQ_ZERO_WRITE   = 0x2,
 } BdrvRequestFlags;
 
 static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load);
@@ -1849,12 +1848,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     }
 
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
-
-    if (flags & BDRV_REQ_ZERO_WRITE) {
-        ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors, qiov);
-    } else {
-        ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
-    }
+    ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
 
     if (bs->dirty_bitmap) {
         set_dirty_bitmap(bs, sector_num, nb_sectors, 1);
@@ -1877,15 +1871,6 @@ int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
     return bdrv_co_do_writev(bs, sector_num, nb_sectors, qiov, 0);
 }
 
-int coroutine_fn bdrv_co_write_zeroes(BlockDriverState *bs,
-                                      int64_t sector_num, int nb_sectors)
-{
-    trace_bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
-
-    return bdrv_co_do_writev(bs, sector_num, nb_sectors, NULL,
-                             BDRV_REQ_ZERO_WRITE);
-}
-
 /**
  * Truncate file to 'offset' bytes (needed only for file protocols)
  */
diff --git a/block.h b/block.h
index 7feda73..ef07bb0 100644
--- a/block.h
+++ b/block.h
@@ -151,14 +151,6 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
 int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov);
-/*
- * Efficiently zero a region of the disk image.  Note that this is a regular
- * I/O request like read or write and should have a reasonable size.  This
- * function is not suitable for zeroing the entire image in a single request
- * because it may allocate memory for the entire region.
- */
-int coroutine_fn bdrv_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
-    int nb_sectors);
 int coroutine_fn bdrv_co_is_allocated(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, int *pnum);
 BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
@@ -196,6 +188,13 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                   BlockDriverCompletionFunc *cb, void *opaque);
 BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
                                  BlockDriverCompletionFunc *cb, void *opaque);
+
+/*
+ * Efficiently zero a region of the disk image.  Note that this is a regular
+ * I/O request like read or write and should have a reasonable size.  This
+ * function is not suitable for zeroing the entire image in a single request
+ * because it may allocate memory for the entire region.
+ */
 BlockDriverAIOCB *bdrv_aio_discard(BlockDriverState *bs,
                                    int64_t sector_num, int nb_sectors,
                                    BlockDriverCompletionFunc *cb, void *opaque);
diff --git a/qemu-io.c b/qemu-io.c
index 8f28b6a..bb5f7b8 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -218,51 +218,6 @@ static int do_pwrite(char *buf, int64_t offset, int count, int *total)
     return 1;
 }
 
-typedef struct {
-    int64_t offset;
-    int count;
-    int *total;
-    int ret;
-    bool done;
-} CoWriteZeroes;
-
-static void coroutine_fn co_write_zeroes_entry(void *opaque)
-{
-    CoWriteZeroes *data = opaque;
-
-    data->ret = bdrv_co_write_zeroes(bs, data->offset / BDRV_SECTOR_SIZE,
-                                     data->count / BDRV_SECTOR_SIZE);
-    data->done = true;
-    if (data->ret < 0) {
-        *data->total = data->ret;
-        return;
-    }
-
-    *data->total = data->count;
-}
-
-static int do_co_write_zeroes(int64_t offset, int count, int *total)
-{
-    Coroutine *co;
-    CoWriteZeroes data = {
-        .offset = offset,
-        .count  = count,
-        .total  = total,
-        .done   = false,
-    };
-
-    co = qemu_coroutine_create(co_write_zeroes_entry);
-    qemu_coroutine_enter(co, &data);
-    while (!data.done) {
-        qemu_aio_wait();
-    }
-    if (data.ret < 0) {
-        return data.ret;
-    } else {
-        return 1;
-    }
-}
-
 static int do_load_vmstate(char *buf, int64_t offset, int count, int *total)
 {
     *total = bdrv_load_vmstate(bs, (uint8_t *)buf, offset, count);
@@ -688,7 +643,6 @@ static void write_help(void)
 " -P, -- use different pattern to fill file\n"
 " -C, -- report statistics in a machine parsable format\n"
 " -q, -- quiet mode, do not show I/O statistics\n"
-" -z, -- write zeroes using bdrv_co_write_zeroes\n"
 "\n");
 }
 
@@ -717,7 +671,7 @@ static int write_f(int argc, char **argv)
     int total = 0;
     int pattern = 0xcd;
 
-    while ((c = getopt(argc, argv, "bCpP:qz")) != EOF) {
+    while ((c = getopt(argc, argv, "bCpP:q")) != EOF) {
         switch (c) {
         case 'b':
             bflag = 1;
@@ -796,8 +750,6 @@ static int write_f(int argc, char **argv)
         cnt = do_pwrite(buf, offset, count, &total);
     } else if (bflag) {
         cnt = do_save_vmstate(buf, offset, count, &total);
-    } else if (zflag) {
-        cnt = do_co_write_zeroes(offset, count, &total);
     } else {
         cnt = do_write(buf, offset, count, &total);
     }
diff --git a/tests/qemu-iotests/031 b/tests/qemu-iotests/031
index 9aee078..c88378e 100755
--- a/tests/qemu-iotests/031
+++ b/tests/qemu-iotests/031
@@ -1,6 +1,6 @@
 #!/bin/bash
 #
-# Test aligned and misaligned write zeroes operations.
+# Test aligned and misaligned discard operations.
 #
 # Copyright (C) 2012 Red Hat, Inc.
 #
@@ -50,7 +50,7 @@ echo
 echo "== preparing image =="
 $QEMU_IO -c "write -P 0xa 0x200 0x400" $TEST_IMG | _filter_qemu_io
 $QEMU_IO -c "write -P 0xa 0x20000 0x600" $TEST_IMG | _filter_qemu_io
-$QEMU_IO -c "write -z 0x400 0x20000" $TEST_IMG | _filter_qemu_io
+$QEMU_IO -c "discard 0x400 0x20000" $TEST_IMG | _filter_qemu_io
 
 echo
 echo "== verifying patterns (1) =="
@@ -61,7 +61,7 @@ $QEMU_IO -c "read -P 0xa 0x20400 0x200" $TEST_IMG | _filter_qemu_io
 echo
 echo "== rewriting zeroes =="
 $QEMU_IO -c "write -P 0xb 0x10000 0x10000" $TEST_IMG | _filter_qemu_io
-$QEMU_IO -c "write -z 0x10000 0x10000" $TEST_IMG | _filter_qemu_io
+$QEMU_IO -c "discard 0x10000 0x10000" $TEST_IMG | _filter_qemu_io
 
 echo
 echo "== verifying patterns (2) =="
diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
index 3c990dd..788ccfe 100644
--- a/tests/qemu-iotests/031.out
+++ b/tests/qemu-iotests/031.out
@@ -6,7 +6,7 @@ wrote 1024/1024 bytes at offset 512
 1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 1536/1536 bytes at offset 131072
 2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 131072/131072 bytes at offset 1024
+discard 131072/131072 bytes at offset 1024
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
 == verifying patterns (1) ==
@@ -20,7 +20,7 @@ read 512/512 bytes at offset 132096
 == rewriting zeroes ==
 wrote 65536/65536 bytes at offset 65536
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 65536
+discard 65536/65536 bytes at offset 65536
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
 == verifying patterns (2) ==
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 09/17] ide: issue discard asynchronously but serialize the pieces
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (7 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 08/17] block: kill the write zeroes operation Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 10/17] ide/scsi: add discard_zeroes_data property Paolo Bonzini
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Now that discard can actually issue writes, make it asynchronous.
At the same time, avoid consuming too much memory while it is
running.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/ide/core.c |   78 ++++++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 53 insertions(+), 25 deletions(-)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 4d568ac..f12470e 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -312,14 +312,26 @@ typedef struct TrimAIOCB {
     BlockDriverAIOCB common;
     QEMUBH *bh;
     int ret;
+    QEMUIOVector *qiov;
+    BlockDriverAIOCB *aiocb;
+    int i, j;
 } TrimAIOCB;
 
 static void trim_aio_cancel(BlockDriverAIOCB *acb)
 {
     TrimAIOCB *iocb = container_of(acb, TrimAIOCB, common);
 
+    /* Exit the loop in case bdrv_aio_cancel calls ide_issue_trim_cb again.  */
+    iocb->j = iocb->qiov->niov - 1;
+    iocb->i = (iocb->qiov->iov[iocb->j].iov_len / 8) - 1;
+
+    /* Tell ide_issue_trim_cb not to trigger the completion, too.  */
     qemu_bh_delete(iocb->bh);
     iocb->bh = NULL;
+
+    if (iocb->aiocb) {
+        bdrv_aio_cancel(iocb->aiocb);
+    }
     qemu_aio_release(iocb);
 }
 
@@ -336,43 +346,59 @@ static void ide_trim_bh_cb(void *opaque)
 
     qemu_bh_delete(iocb->bh);
     iocb->bh = NULL;
-
     qemu_aio_release(iocb);
 }
 
+static void ide_issue_trim_cb(void *opaque, int ret)
+{
+    TrimAIOCB *iocb = opaque;
+    if (ret >= 0) {
+        do {
+            int j = iocb->j;
+            if (++iocb->i < iocb->qiov->iov[j].iov_len / 8) {
+                int i = iocb->i;
+                uint64_t *buffer = iocb->qiov->iov[j].iov_base;
+
+                /* 6-byte LBA + 2-byte range per entry */
+                uint64_t entry = le64_to_cpu(buffer[i]);
+                uint64_t sector = entry & 0x0000ffffffffffffULL;
+                uint16_t count = entry >> 48;
+
+                if (count > 0) {
+                    iocb->aiocb = bdrv_aio_discard(iocb->common.bs,
+                                                   sector, count,
+                                                   ide_issue_trim_cb, opaque);
+                    return;
+                }
+
+                /* If count = 0, advance to next iov.  */
+            }
+
+            iocb->i = -1;
+        } while (++iocb->j < iocb->qiov->niov);
+    } else {
+        iocb->ret = ret;
+    }
+
+    iocb->aiocb = NULL;
+    if (iocb->bh) {
+        qemu_bh_schedule(iocb->bh);
+    }
+}
+
 BlockDriverAIOCB *ide_issue_trim(BlockDriverState *bs,
         int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
         BlockDriverCompletionFunc *cb, void *opaque)
 {
     TrimAIOCB *iocb;
-    int i, j, ret;
 
     iocb = qemu_aio_get(&trim_aio_pool, bs, cb, opaque);
     iocb->bh = qemu_bh_new(ide_trim_bh_cb, iocb);
     iocb->ret = 0;
-
-    for (j = 0; j < qiov->niov; j++) {
-        uint64_t *buffer = qiov->iov[j].iov_base;
-
-        for (i = 0; i < qiov->iov[j].iov_len / 8; i++) {
-            /* 6-byte LBA + 2-byte range per entry */
-            uint64_t entry = le64_to_cpu(buffer[i]);
-            uint64_t sector = entry & 0x0000ffffffffffffULL;
-            uint16_t count = entry >> 48;
-
-            if (count == 0) {
-                break;
-            }
-
-            ret = bdrv_discard(bs, sector, count);
-            if (!iocb->ret) {
-                iocb->ret = ret;
-            }
-        }
-    }
-
-    qemu_bh_schedule(iocb->bh);
-
+    iocb->qiov = qiov;
+    iocb->i = -1;
+    iocb->j = 0;
+    ide_issue_trim_cb(iocb, 0);
     return &iocb->common;
 }
 
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 10/17] ide/scsi: add discard_zeroes_data property
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (8 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 09/17] ide: issue discard asynchronously but serialize the pieces Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 18:13   ` Avi Kivity
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 11/17] ide/scsi: prepare for flipping the discard defaults Paolo Bonzini
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Now that the block layer guarantees stable semantics for discard,
we can expose them to guests.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.h        |    7 +++++--
 hw/ide/core.c  |    6 +++++-
 hw/scsi-disk.c |    5 ++++-
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/block.h b/block.h
index ef07bb0..e28321e 100644
--- a/block.h
+++ b/block.h
@@ -423,6 +423,7 @@ typedef struct BlockConf {
     uint32_t opt_io_size;
     int32_t bootindex;
     uint32_t discard_granularity;
+    uint32_t discard_zeroes_data;
 } BlockConf;
 
 static inline unsigned int get_physical_block_exp(BlockConf *conf)
@@ -444,10 +445,12 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
                        _conf.logical_block_size, 512),                  \
     DEFINE_PROP_UINT16("physical_block_size", _state,                   \
                        _conf.physical_block_size, 512),                 \
-    DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),  \
+    DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),    \
     DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),    \
     DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
-    DEFINE_PROP_UINT32("discard_granularity", _state, \
+    DEFINE_PROP_BIT("discard_zeroes_data", _state,                      \
+                    _conf.discard_zeroes_data, 0, false),               \
+    DEFINE_PROP_UINT32("discard_granularity", _state,                   \
                        _conf.discard_granularity, 0)
 
 #endif
diff --git a/hw/ide/core.c b/hw/ide/core.c
index f12470e..1530b61 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -128,7 +128,11 @@ static void ide_identify(IDEState *s)
     put_le16(p + 67, 120);
     put_le16(p + 68, 120);
     if (dev && dev->conf.discard_granularity) {
-        put_le16(p + 69, (1 << 14)); /* determinate TRIM behavior */
+        int val;
+
+        val = 0x4000;
+        val |= dev->conf.discard_zeroes_data ? 0x20 : 0;
+        put_le16(p + 69, val); /* determinate TRIM behavior */
     }
 
     if (s->ncq_queues) {
diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index add399e..f489078 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -1295,8 +1295,11 @@ static int scsi_disk_emulate_command(SCSIDiskReq *r)
             outbuf[13] = get_physical_block_exp(&s->qdev.conf);
 
             /* set TPE bit if the format supports discard */
-            if (s->qdev.conf.discard_granularity) {
+            if (s->qdev.type == TYPE_DISK && s->qdev.conf.discard_granularity) {
                 outbuf[14] = 0x80;
+                if (s->qdev.conf.discard_zeroes_data) {
+                    outbuf[14] |= 0x40;
+                }
             }
 
             /* Protection, exponent and lowest lba field left blank. */
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 11/17] ide/scsi: prepare for flipping the discard defaults
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (9 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 10/17] ide/scsi: add discard_zeroes_data property Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 12/17] ide/scsi: turn on discard Paolo Bonzini
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Let a discard_granularity equal to -1 represent "enabled with the default
granularity".  Also, Linux requires discard_granularity to be a power
of two, because it uses it in a bitwise AND; enforce this restriction
on the user.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.h        |    6 +++---
 hw/ide/qdev.c  |    6 +++++-
 hw/scsi-disk.c |   12 ++++++++++++
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/block.h b/block.h
index e28321e..aa224d3 100644
--- a/block.h
+++ b/block.h
@@ -422,7 +422,7 @@ typedef struct BlockConf {
     uint16_t min_io_size;
     uint32_t opt_io_size;
     int32_t bootindex;
-    uint32_t discard_granularity;
+    int32_t discard_granularity;
     uint32_t discard_zeroes_data;
 } BlockConf;
 
@@ -450,8 +450,8 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
     DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
     DEFINE_PROP_BIT("discard_zeroes_data", _state,                      \
                     _conf.discard_zeroes_data, 0, false),               \
-    DEFINE_PROP_UINT32("discard_granularity", _state,                   \
-                       _conf.discard_granularity, 0)
+    DEFINE_PROP_INT32("discard_granularity", _state,                    \
+                      _conf.discard_granularity, 0)
 
 #endif
 
diff --git a/hw/ide/qdev.c b/hw/ide/qdev.c
index f6a4896..5c7141f 100644
--- a/hw/ide/qdev.c
+++ b/hw/ide/qdev.c
@@ -121,8 +121,12 @@ static int ide_dev_initfn(IDEDevice *dev, IDEDriveKind kind)
     IDEState *s = bus->ifs + dev->unit;
     const char *serial;
     DriveInfo *dinfo;
+    int discard_granularity;
 
-    if (dev->conf.discard_granularity && dev->conf.discard_granularity != 512) {
+    discard_granularity = dev->conf.discard_granularity;
+    if (discard_granularity == -1) {
+        dev->conf.discard_granularity = 512;
+    } else if (discard_granularity && discard_granularity != 512) {
         error_report("discard_granularity must be 512 for ide");
         return -1;
     }
diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index f489078..00d355e 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -1577,12 +1577,24 @@ static int scsi_initfn(SCSIDevice *dev)
 {
     SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, dev);
     DriveInfo *dinfo;
+    int discard_granularity;
 
     if (!s->qdev.conf.bs) {
         error_report("drive property not set");
         return -1;
     }
 
+    discard_granularity = s->qdev.conf.discard_granularity;
+    if (discard_granularity == -1) {
+        s->qdev.conf.discard_granularity = s->qdev.conf.logical_block_size;
+    } else if (discard_granularity < s->qdev.conf.logical_block_size) {
+        error_report("scsi-block: invalid discard_granularity");
+        return -1;
+    } else if (discard_granularity & (discard_granularity - 1)) {
+        error_report("scsi-block: discard_granularity not a power of two");
+        return -1;
+    }
+
     if (!s->removable && !bdrv_is_inserted(s->qdev.conf.bs)) {
         error_report("Device needs media, but drive is empty");
         return -1;
-- 
1.7.7.6


* [Qemu-devel] [RFC PATCH 12/17] ide/scsi: turn on discard
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (10 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 11/17] ide/scsi: prepare for flipping the discard defaults Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 18:17   ` Avi Kivity
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 13/17] block: fallback from discard to writes Paolo Bonzini
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Turn on discard support in the device models by default, with
compatibility properties for older machine types.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.h      |    4 +-
 hw/pc_piix.c |  224 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+), 2 deletions(-)

diff --git a/block.h b/block.h
index aa224d3..1a8fd2d 100644
--- a/block.h
+++ b/block.h
@@ -449,9 +449,9 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
     DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),    \
     DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
     DEFINE_PROP_BIT("discard_zeroes_data", _state,                      \
-                    _conf.discard_zeroes_data, 0, false),               \
+                    _conf.discard_zeroes_data, 0, true),                \
     DEFINE_PROP_INT32("discard_granularity", _state,                    \
-                      _conf.discard_granularity, 0)
+                      _conf.discard_granularity, -1)
 
 #endif
 
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 6c5c40f..0356424 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -381,6 +381,38 @@ static QEMUMachine pc_machine_v1_0 = {
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
         {
+            .driver   = "ide-drive",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-drive",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
             .driver   = "pc-sysfw",
             .property = "rom_only",
             .value    = stringify(1),
@@ -400,6 +432,38 @@ static QEMUMachine pc_machine_v0_15 = {
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
         {
+            .driver   = "ide-drive",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-drive",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
             .driver   = "pc-sysfw",
             .property = "rom_only",
             .value    = stringify(1),
@@ -419,6 +483,38 @@ static QEMUMachine pc_machine_v0_14 = {
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
         {
+            .driver   = "ide-drive",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-drive",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
             .driver   = "qxl",
             .property = "revision",
             .value    = stringify(2),
@@ -463,6 +559,38 @@ static QEMUMachine pc_machine_v0_13 = {
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
         {
+            .driver   = "ide-drive",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-drive",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
             .driver   = "virtio-9p-pci",
             .property = "vectors",
             .value    = stringify(0),
@@ -519,6 +647,38 @@ static QEMUMachine pc_machine_v0_12 = {
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
         {
+            .driver   = "ide-drive",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-drive",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
             .driver   = "virtio-serial-pci",
             .property = "max_ports",
             .value    = stringify(1),
@@ -579,6 +739,38 @@ static QEMUMachine pc_machine_v0_11 = {
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
         {
+            .driver   = "ide-drive",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-drive",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
             .driver   = "virtio-blk-pci",
             .property = "vectors",
             .value    = stringify(0),
@@ -647,6 +839,38 @@ static QEMUMachine pc_machine_v0_10 = {
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
         {
+            .driver   = "ide-drive",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-drive",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "ide-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-disk",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_granularity",
+            .value    = stringify(0),
+	}, {
+            .driver   = "scsi-hd",
+            .property = "discard_zeroes_data",
+            .value    = "off",
+	}, {
             .driver   = "virtio-blk-pci",
             .property = "class",
             .value    = stringify(PCI_CLASS_STORAGE_OTHER),
-- 
1.7.7.6

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [Qemu-devel] [RFC PATCH 13/17] block: fallback from discard to writes
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (11 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 12/17] ide/scsi: turn on discard Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-24 15:35   ` Christoph Hellwig
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming Paolo Bonzini
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Allow discard to fail, and fall back to the write operation.  This
is needed because there's no simple way to probe for availability
of FALLOC_FL_PUNCH_HOLE.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/block.c b/block.c
index af73497..7397df6 100644
--- a/block.c
+++ b/block.c
@@ -1798,7 +1798,10 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
         bdi.discard_granularity &&
         (sector_num & (bdi.discard_granularity - 1)) == 0 &&
         (nb_sectors & (bdi.discard_granularity - 1)) == 0) {
-        return bdrv_co_do_discard(bs, sector_num, nb_sectors);
+        ret = bdrv_co_do_discard(bs, sector_num, nb_sectors);
+        if (ret != -ENOTSUP) {
+            return ret;
+        }
     }
 
     if (qiov) {
-- 
1.7.7.6

* [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (12 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 13/17] block: fallback from discard to writes Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-09  8:20   ` Chris Wedgwood
                     ` (2 more replies)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 15/17] raw: add get_info Paolo Bonzini
                   ` (2 subsequent siblings)
  16 siblings, 3 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

This will enable discard on more filesystems, including ext4.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/raw-posix.c |   29 ++++++++++++++++++++++++-----
 1 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 2d1bc13..f23d92b 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -52,6 +52,7 @@
 #include <sys/param.h>
 #include <linux/cdrom.h>
 #include <linux/fd.h>
+#include <linux/falloc.h>
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
 #include <sys/disk.h>
@@ -131,6 +132,7 @@ typedef struct BDRVRawState {
 #endif
     uint8_t *aligned_buf;
     unsigned aligned_buf_size;
+    bool has_discard : 1;
 #ifdef CONFIG_XFS
     bool is_xfs : 1;
 #endif
@@ -257,11 +259,9 @@ static int raw_open_common(BlockDriverState *bs, const char *filename,
     }
 
 #ifdef CONFIG_XFS
-    if (platform_test_xfs_fd(s->fd)) {
-        s->is_xfs = 1;
-    }
+    s->is_xfs = platform_test_xfs_fd(s->fd);
 #endif
-
+    s->has_discard = 1;
     return 0;
 
 out_free_buf:
@@ -605,15 +605,34 @@ static int xfs_discard(BDRVRawState *s, int64_t sector_num, int nb_sectors)
 static coroutine_fn int raw_co_discard(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors)
 {
-#ifdef CONFIG_XFS
     BDRVRawState *s = bs->opaque;
+    int retval;
 
+    if (s->has_discard == 0) {
+        return -ENOTSUP;
+    }
+#ifdef CONFIG_XFS
     if (s->is_xfs) {
         return xfs_discard(s, sector_num, nb_sectors);
     }
 #endif
 
+#ifdef FALLOC_FL_PUNCH_HOLE
+    retval = fallocate(s->fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
+                       sector_num << 9, (int64_t)nb_sectors << 9);
+    if (retval == -1) {
+        if (errno == ENODEV || errno == ENOSYS || errno == EOPNOTSUPP) {
+            s->has_discard = 0;
+            return -ENOTSUP;
+        } else {
+            return -errno;
+        }
+    }
     return 0;
+#else
+    s->has_discard = 0;
+    return -ENOTSUP;
+#endif
 }
 
 static QEMUOptionParameter raw_create_options[] = {
-- 
1.7.7.6

* [Qemu-devel] [RFC PATCH 15/17] raw: add get_info
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (13 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 16/17] qemu-io: fix the alloc command Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 17/17] raw: implement is_allocated Paolo Bonzini
  16 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Now that everything is in place we can declare discard support
for the file protocol (and the raw format).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/raw-posix.c |    9 +++++++++
 block/raw.c       |    6 ++++++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index f23d92b..56aabc7 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -635,6 +635,14 @@ static coroutine_fn int raw_co_discard(BlockDriverState *bs,
 #endif
 }
 
+static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    bdi->cluster_size = BDRV_SECTOR_SIZE;
+    bdi->discard_zeroes_data = true;
+    bdi->discard_granularity = 1;
+    return 0;
+}
+
 static QEMUOptionParameter raw_create_options[] = {
     {
         .name = BLOCK_OPT_SIZE,
@@ -660,6 +668,7 @@ static BlockDriver bdrv_file = {
 
     .bdrv_truncate = raw_truncate,
     .bdrv_getlength = raw_getlength,
+    .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
 
diff --git a/block/raw.c b/block/raw.c
index 1cdac0c..3618c2d 100644
--- a/block/raw.c
+++ b/block/raw.c
@@ -35,6 +35,11 @@ static int64_t raw_getlength(BlockDriverState *bs)
     return bdrv_getlength(bs->file);
 }
 
+static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    return bdrv_get_info(bs->file, bdi);
+}
+
 static int raw_truncate(BlockDriverState *bs, int64_t offset)
 {
     return bdrv_truncate(bs->file, offset);
@@ -118,6 +123,7 @@ static BlockDriver bdrv_raw = {
 
     .bdrv_probe         = raw_probe,
     .bdrv_getlength     = raw_getlength,
+    .bdrv_get_info      = raw_get_info,
     .bdrv_truncate      = raw_truncate,
 
     .bdrv_is_inserted   = raw_is_inserted,
-- 
1.7.7.6

* [Qemu-devel] [RFC PATCH 16/17] qemu-io: fix the alloc command
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (14 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 15/17] raw: add get_info Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 17/17] raw: implement is_allocated Paolo Bonzini
  16 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 qemu-io.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/qemu-io.c b/qemu-io.c
index bb5f7b8..c1bc053 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -1513,7 +1513,7 @@ out:
 
 static int alloc_f(int argc, char **argv)
 {
-    int64_t offset;
+    int64_t offset, sector_num;
     int nb_sectors, remaining;
     char s1[64];
     int num, sum_alloc;
@@ -1534,8 +1534,10 @@ static int alloc_f(int argc, char **argv)
 
     remaining = nb_sectors;
     sum_alloc = 0;
+    sector_num = offset >> 9;
     while (remaining) {
-        ret = bdrv_is_allocated(bs, offset >> 9, nb_sectors, &num);
+        ret = bdrv_is_allocated(bs, sector_num, remaining, &num);
+        sector_num += num;
         remaining -= num;
         if (ret) {
             sum_alloc += num;
-- 
1.7.7.6

* [Qemu-devel] [RFC PATCH 17/17] raw: implement is_allocated
  2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
                   ` (15 preceding siblings ...)
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 16/17] qemu-io: fix the alloc command Paolo Bonzini
@ 2012-03-08 17:15 ` Paolo Bonzini
  2012-03-24 15:42   ` Christoph Hellwig
  16 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-08 17:15 UTC (permalink / raw)
  To: qemu-devel

SEEK_DATA and SEEK_HOLE can be used to implement the is_allocated
callback for raw files.  These currently work on btrfs, with an XFS
implementation also coming soon.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/raw-posix.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/raw.c       |    8 ++++++
 2 files changed, 70 insertions(+), 0 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 56aabc7..c0b32e1 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -105,6 +105,13 @@
 #define O_DIRECT O_DSYNC
 #endif
 
+#ifndef SEEK_DATA
+#define SEEK_DATA 3
+#endif
+#ifndef SEEK_HOLE
+#define SEEK_HOLE 4
+#endif
+
 #define FTYPE_FILE   0
 #define FTYPE_CD     1
 #define FTYPE_FD     2
@@ -583,6 +590,60 @@ static int raw_create(const char *filename, QEMUOptionParameter *options)
     return result;
 }
 
+/*
+ * Returns true iff the specified sector is present in the disk image. Drivers
+ * not implementing the functionality are assumed to not support backing files,
+ * hence all their sectors are reported as allocated.
+ *
+ * If 'sector_num' is beyond the end of the disk image the return value is 0
+ * and 'pnum' is set to 0.
+ *
+ * 'pnum' is set to the number of sectors (including and immediately following
+ * the specified sector) that are known to be in the same
+ * allocated/unallocated state.
+ *
+ * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
+ * beyond the end of the disk image it will be clamped.
+ */
+static int coroutine_fn raw_co_is_allocated(BlockDriverState *bs,
+                                            int64_t sector_num,
+                                            int nb_sectors, int *pnum)
+{
+    off_t start, data, hole;
+    int ret, num;
+    BDRVRawState *s = bs->opaque;
+
+    ret = fd_open(bs);
+    if (ret < 0) {
+        return ret;
+    }
+
+    start = sector_num * BDRV_SECTOR_SIZE;
+    hole = lseek(s->fd, start, SEEK_HOLE);
+    if (hole == -1) {
+        /* -ENXIO indicates that sector_num was past the end of the file.
+         * There is a virtual hole there.  */
+        assert(errno != ENXIO);
+
+        /* Most likely EINVAL.  Assume everything is allocated.  */
+        return 1;
+    }
+
+    if (hole == start) {
+        /* On a hole.  We need another syscall to find its end.  */
+        data = lseek(s->fd, start, SEEK_DATA);
+    assert(data != -1);
+        num = (data - start) / BDRV_SECTOR_SIZE;
+    } else {
+        data = start;
+        num = (hole - start) / BDRV_SECTOR_SIZE;
+    }
+
+    /* Clamping to the file size is done by the kernel.  */
+    *pnum = MIN(nb_sectors, num);
+    return data == start;
+}
+
 #ifdef CONFIG_XFS
 static int xfs_discard(BDRVRawState *s, int64_t sector_num, int nb_sectors)
 {
@@ -661,6 +722,7 @@ static BlockDriver bdrv_file = {
     .bdrv_close = raw_close,
     .bdrv_create = raw_create,
     .bdrv_co_discard = raw_co_discard,
+    .bdrv_co_is_allocated = raw_co_is_allocated,
 
     .bdrv_aio_readv = raw_aio_readv,
     .bdrv_aio_writev = raw_aio_writev,
diff --git a/block/raw.c b/block/raw.c
index 3618c2d..e433bf2 100644
--- a/block/raw.c
+++ b/block/raw.c
@@ -30,6 +30,13 @@ static int coroutine_fn raw_co_flush(BlockDriverState *bs)
     return bdrv_co_flush(bs->file);
 }
 
+static int coroutine_fn raw_co_is_allocated(BlockDriverState *bs,
+                                            int64_t sector_num,
+                                            int nb_sectors, int *pnum)
+{
+    return bdrv_co_is_allocated(bs->file, sector_num, nb_sectors, pnum);
+}
+
 static int64_t raw_getlength(BlockDriverState *bs)
 {
     return bdrv_getlength(bs->file);
@@ -119,6 +126,7 @@ static BlockDriver bdrv_raw = {
     .bdrv_co_readv          = raw_co_readv,
     .bdrv_co_writev         = raw_co_writev,
     .bdrv_co_flush_to_disk  = raw_co_flush,
+    .bdrv_co_is_allocated   = raw_co_is_allocated,
     .bdrv_co_discard        = raw_co_discard,
 
     .bdrv_probe         = raw_probe,
-- 
1.7.7.6

* Re: [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero Paolo Bonzini
@ 2012-03-08 17:55   ` Avi Kivity
  2012-03-09 16:42     ` Kevin Wolf
  0 siblings, 1 reply; 68+ messages in thread
From: Avi Kivity @ 2012-03-08 17:55 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 03/08/2012 07:15 PM, Paolo Bonzini wrote:
> While the formats/protocols are free to implement a discard operation
> that does *not* zero the data, providing a common view to the guests is
> the only way to avoid configuration and migration nightmares.  (We don't
> want the admin to manually set up discard_zeroes_data/discard_granularity!)
>
> This has a couple of drawbacks:
>
> 1) QEMU effectively will not try to use discard unless discard_zeroes_data
> is true.  To do this we could add a flag such as BDRV_ALWAYS_DISCARD,
> which the device models could use if their discard_zeroes_data property is
> set to false.  However, I haven't done this, estimating that no strictly
> positive integer could represent the number of people using it.
>
> 2) it may turn a thin-provisioning operation into a full preallocation,
> which is not nice.
>
> 3) it may cost memory for a bounce buffer that only needs to be filled
> with zeroes.
>
> While (3) can be worked around, the only way around the other two,
> unfortunately, is support in the formats and protocols.  We will still
> provide device options to opt out of this, but with raw and qed covered
> (+ qcow2 without backing file, and qcow3 in the future) it should not
> be too bad.

Can't qcow2 with a backing file also be supported?  Zero out the first
cluster, and remember it.  The following discards can reuse this zero
cluster, as long as it hasn't been overwritten.
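The shared-zero-cluster idea sketches out roughly like this (a toy model with made-up names; qcow2's real L2 table and refcount code is far more involved): the first discard lazily allocates one all-zero cluster, later discards point their mapping at it and bump its refcount, and a guest write to a discarded cluster triggers copy-on-write so the shared cluster stays zero.

```c
#include <assert.h>
#include <string.h>

/* Toy model of the shared-zero-cluster idea; all names are hypothetical. */
#define CLUSTERS     4    /* guest-visible clusters */
#define MAX_PHYS     8    /* room for guests + zero cluster + COW copies */
#define CLUSTER_SIZE 16

static char data[MAX_PHYS][CLUSTER_SIZE]; /* "physical" clusters */
static int  l2[CLUSTERS];                 /* guest cluster -> physical */
static int  refcount[MAX_PHYS];
static int  zero_cluster = -1;            /* allocated on first discard */
static int  next_free;

static int alloc_cluster(void)
{
    assert(next_free < MAX_PHYS);
    refcount[next_free] = 1;
    return next_free++;
}

static void discard_cluster(int guest)
{
    if (zero_cluster < 0) {
        zero_cluster = alloc_cluster();
        memset(data[zero_cluster], 0, CLUSTER_SIZE);
    }
    refcount[l2[guest]]--;
    l2[guest] = zero_cluster;
    refcount[zero_cluster]++;
}

static void write_cluster(int guest, const char *buf)
{
    /* Copy-on-write: never scribble on the shared zero cluster. */
    if (l2[guest] == zero_cluster) {
        refcount[zero_cluster]--;
        l2[guest] = alloc_cluster();
    }
    memcpy(data[l2[guest]], buf, CLUSTER_SIZE);
}
```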

-- 
error compiling committee.c: too many arguments to function

* Re: [Qemu-devel] [RFC PATCH 10/17] ide/scsi: add discard_zeroes_data property
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 10/17] ide/scsi: add discard_zeroes_data property Paolo Bonzini
@ 2012-03-08 18:13   ` Avi Kivity
  2012-03-08 18:14     ` Avi Kivity
  0 siblings, 1 reply; 68+ messages in thread
From: Avi Kivity @ 2012-03-08 18:13 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 03/08/2012 07:15 PM, Paolo Bonzini wrote:
> Now that the block layer guarantees stable semantics for discard,
> we can expose them to guests.
>

How does this interact with -M something-old?

-- 
error compiling committee.c: too many arguments to function

* Re: [Qemu-devel] [RFC PATCH 10/17] ide/scsi: add discard_zeroes_data property
  2012-03-08 18:13   ` Avi Kivity
@ 2012-03-08 18:14     ` Avi Kivity
  0 siblings, 0 replies; 68+ messages in thread
From: Avi Kivity @ 2012-03-08 18:14 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 03/08/2012 08:13 PM, Avi Kivity wrote:
> On 03/08/2012 07:15 PM, Paolo Bonzini wrote:
> > Now that the block layer guarantees stable semantics for discard,
> > we can expose them to guests.
> >
>
> How does this interact with -M something-old?
>

Sorry, patch 12 addresses this.

-- 
error compiling committee.c: too many arguments to function

* Re: [Qemu-devel] [RFC PATCH 12/17] ide/scsi: turn on discard
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 12/17] ide/scsi: turn on discard Paolo Bonzini
@ 2012-03-08 18:17   ` Avi Kivity
  0 siblings, 0 replies; 68+ messages in thread
From: Avi Kivity @ 2012-03-08 18:17 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 03/08/2012 07:15 PM, Paolo Bonzini wrote:
> Turn on discard support in the device models by default, with
> compatibility properties for older machine types.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block.h      |    4 +-
>  hw/pc_piix.c |  224 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 226 insertions(+), 2 deletions(-)
>
> diff --git a/block.h b/block.h
> index aa224d3..1a8fd2d 100644
> --- a/block.h
> +++ b/block.h
> @@ -449,9 +449,9 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
>      DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),    \
>      DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
>      DEFINE_PROP_BIT("discard_zeroes_data", _state,                      \
> -                    _conf.discard_zeroes_data, 0, false),               \
> +                    _conf.discard_zeroes_data, 0, true),                \
>      DEFINE_PROP_INT32("discard_granularity", _state,                    \
> -                      _conf.discard_granularity, 0)
> +                      _conf.discard_granularity, -1)
>  
>  #endif
>  
> diff --git a/hw/pc_piix.c b/hw/pc_piix.c
> index 6c5c40f..0356424 100644
> --- a/hw/pc_piix.c
> +++ b/hw/pc_piix.c
> @@ -381,6 +381,38 @@ static QEMUMachine pc_machine_v1_0 = {
>      .max_cpus = 255,
>      .compat_props = (GlobalProperty[]) {
>          {
> +            .driver   = "ide-drive",
> +            .property = "discard_granularity",
> +            .value    = stringify(0),
> +	}, {
> +            .driver   = "ide-drive",
> +            .property = "discard_zeroes_data",
> +            .value    = "off",
> +	}, {
> +            .driver   = "ide-hd",
> +            .property = "discard_granularity",
> +            .value    = stringify(0),
> +	}, {
> +            .driver   = "ide-hd",
> +            .property = "discard_zeroes_data",
> +            .value    = "off",
> +	}, {
> +            .driver   = "scsi-disk",
> +            .property = "discard_granularity",
> +            .value    = stringify(0),
> +	}, {
> +            .driver   = "scsi-disk",
> +            .property = "discard_zeroes_data",
> +            .value    = "off",
> +	}, {
> +            .driver   = "scsi-hd",
> +            .property = "discard_granularity",
> +            .value    = stringify(0),
> +	}, {
> +            .driver   = "scsi-hd",
> +            .property = "discard_zeroes_data",
> +            .value    = "off",
> +	}, {
>              .driver   = "pc-sysfw",
>              .property = "rom_only",
>              .value    = stringify(1),
> @@ -400,6 +432,38 @@ static QEMUMachine pc_machine_v0_15 = {
>      .max_cpus = 255,
>      .compat_props = (GlobalProperty[]) {
>          {
> +            .driver   = "ide-drive",
> +            .property = "discard_granularity",
> +            .value    = stringify(0),
>

<snip>

We should define machines differentially.  0.15 is 1.0, with the
following differences: { }.

-- 
error compiling committee.c: too many arguments to function

* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming Paolo Bonzini
@ 2012-03-09  8:20   ` Chris Wedgwood
  2012-03-09  8:31     ` Paolo Bonzini
  2012-03-09 10:31   ` Stefan Hajnoczi
  2012-03-09 20:36   ` Richard Laager
  2 siblings, 1 reply; 68+ messages in thread
From: Chris Wedgwood @ 2012-03-09  8:20 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

> This will enable discard on more filesystems, including ext4.

I think this should default to off.


for preallocated files this will punch holes all over the place and
reallocation will cause fragmentation (guests will free and reallocate
blocks normally)

xfs has a mechanism to convert to unwritten extents that would avoid
that, but it's not implemented in other filesystems

* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-09  8:20   ` Chris Wedgwood
@ 2012-03-09  8:31     ` Paolo Bonzini
  2012-03-09  8:35       ` Chris Wedgwood
  0 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-09  8:31 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: qemu-devel

Il 09/03/2012 09:20, Chris Wedgwood ha scritto:
> for preallocated files this will punch holes all over the place and
> reallocation will cause fragmentation (guests will free and reallocat
> blocks normally)

SEEK_HOLE could provide a very simple heuristic to detect preallocated
files (unfortunately ext4 does not implement SEEK_HOLE yet).
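The heuristic could look roughly like this (a stand-alone sketch, not code from the series; `file_fully_allocated` is a made-up name). Note that on filesystems without native SEEK_HOLE support the kernel's generic fallback reports the whole file as data with one virtual hole at EOF, so the check degrades to "always fully allocated" there:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/* Hypothetical helper: 1 if the file has no holes (preallocated or fully
 * written), 0 if it is sparse, -1 if SEEK_HOLE is unsupported (EINVAL). */
static int file_fully_allocated(int fd)
{
    struct stat st;
    off_t hole;

    if (fstat(fd, &st) < 0) {
        return -1;
    }
    hole = lseek(fd, 0, SEEK_HOLE);
    if (hole < 0) {
        return -1;              /* old kernel or unsupported filesystem */
    }
    return hole >= st.st_size;  /* first hole is the virtual one at EOF */
}
```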

> xfs has a mechanism to convert to unwritten extents that would avoid
> that, but it's not implemented in other filesystems

Interesting, thanks!

Paolo

* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-09  8:31     ` Paolo Bonzini
@ 2012-03-09  8:35       ` Chris Wedgwood
  2012-03-09  8:40         ` Paolo Bonzini
  0 siblings, 1 reply; 68+ messages in thread
From: Chris Wedgwood @ 2012-03-09  8:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

> SEEK_HOLE could provide a very simple heuristic to detect preallocated
> files (unfortunately ext4 does not implement SEEK_HOLE yet).

SEEK_HOLE is a weird (confusing and not intuitive) API (IMO).  There is
FIEMAP or whatever it's called which seems somewhat saner.

XFS specific there is also GETBMAP[X].


Simplest of all, compare the number of blocks allocated to the file
with its length (i.e. stat.st_blocks != stat.st_size >> 9).

* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-09  8:35       ` Chris Wedgwood
@ 2012-03-09  8:40         ` Paolo Bonzini
  0 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-09  8:40 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: qemu-devel

Il 09/03/2012 09:35, Chris Wedgwood ha scritto:
>> SEEK_HOLE could provide a very simple heuristic to detect preallocated
>> files (unfortunately ext4 does not implement SEEK_HOLE yet).
> 
> SEEK_HOLE is a weird (confusing and no intuitive) API (IMO).  There is
> FIEMAP or whatever it's called which seems somewhat saner.

FIEMAP is a powertool, we need a butter knife only (cit.).

> Simplest still compare the blocks allocated by the file
> to it's length (ie. stat.st_blocks != stat.st_size>>9).

But yes, you're right about this.

Paolo

* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming Paolo Bonzini
  2012-03-09  8:20   ` Chris Wedgwood
@ 2012-03-09 10:31   ` Stefan Hajnoczi
  2012-03-09 10:43     ` Paolo Bonzini
  2012-03-09 20:36   ` Richard Laager
  2 siblings, 1 reply; 68+ messages in thread
From: Stefan Hajnoczi @ 2012-03-09 10:31 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On Thu, Mar 8, 2012 at 5:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> This will enable discard on more filesystems, including ext4.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/raw-posix.c |   29 ++++++++++++++++++++++++-----
>  1 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 2d1bc13..f23d92b 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -52,6 +52,7 @@
>  #include <sys/param.h>
>  #include <linux/cdrom.h>
>  #include <linux/fd.h>
> +#include <linux/falloc.h>
>  #endif
>  #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
>  #include <sys/disk.h>
> @@ -131,6 +132,7 @@ typedef struct BDRVRawState {
>  #endif
>     uint8_t *aligned_buf;
>     unsigned aligned_buf_size;
> +    bool has_discard : 1;
>  #ifdef CONFIG_XFS
>     bool is_xfs : 1;
>  #endif
> @@ -257,11 +259,9 @@ static int raw_open_common(BlockDriverState *bs, const char *filename,
>     }
>
>  #ifdef CONFIG_XFS
> -    if (platform_test_xfs_fd(s->fd)) {
> -        s->is_xfs = 1;
> -    }
> +    s->is_xfs = platform_test_xfs_fd(s->fd);
>  #endif
> -
> +    s->has_discard = 1;
>     return 0;
>
>  out_free_buf:
> @@ -605,15 +605,34 @@ static int xfs_discard(BDRVRawState *s, int64_t sector_num, int nb_sectors)
>  static coroutine_fn int raw_co_discard(BlockDriverState *bs,
>     int64_t sector_num, int nb_sectors)
>  {
> -#ifdef CONFIG_XFS
>     BDRVRawState *s = bs->opaque;
> +    int retval;
>
> +    if (s->has_discard == 0) {
> +        return -ENOTSUP;
> +    }
> +#ifdef CONFIG_XFS
>     if (s->is_xfs) {
>         return xfs_discard(s, sector_num, nb_sectors);
>     }
>  #endif
>
> +#ifdef FALLOC_FL_PUNCH_HOLE
> +    retval = fallocate(s->fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
> +                       sector_num << 9, (int64_t)nb_sectors << 9);

I'm concerned about introducing blocking syscalls in coroutine code
paths.  This needs to be done asynchronously.

Stefan
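The shape of what posix-aio-compat does for reads and writes, and what an asynchronous discard would need, is roughly the following (a stand-alone sketch with hypothetical names, not QEMU's actual thread-pool code): hand the blocking call to a worker thread and signal completion back to the event loop over a pipe.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* A request carrying one blocking operation and its result. */
typedef struct {
    int (*func)(void *opaque);   /* wrapper around the blocking syscall */
    void *opaque;
    int ret;
    int done_fd;                 /* write end of the completion pipe */
} BlockingReq;

static void *worker(void *arg)
{
    BlockingReq *req = arg;
    char byte = 0;
    ssize_t n;

    req->ret = req->func(req->opaque);  /* e.g. fallocate() would go here */
    /* Wake the event loop; it reads the byte and collects req->ret. */
    n = write(req->done_fd, &byte, 1);
    (void)n;
    return NULL;
}

/* Fire off the request on a detached worker thread; returns 0 on success. */
static int submit_blocking(BlockingReq *req)
{
    pthread_t tid;
    int ret = pthread_create(&tid, NULL, worker, req);

    if (ret == 0) {
        pthread_detach(tid);
    }
    return ret;
}
```

A real implementation would of course reuse a pool of workers instead of spawning a thread per request.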

* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-09 10:31   ` Stefan Hajnoczi
@ 2012-03-09 10:43     ` Paolo Bonzini
  2012-03-09 10:53       ` Stefan Hajnoczi
  0 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-09 10:43 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel

Il 09/03/2012 11:31, Stefan Hajnoczi ha scritto:
>> > +#ifdef FALLOC_FL_PUNCH_HOLE
>> > +    retval = fallocate(s->fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
>> > +                       sector_num << 9, (int64_t)nb_sectors << 9);
> I'm concerned about introducing blocking syscalls in coroutine code
> paths.  This needs to be done asynchronously.

Right; it is no worse than what is already there, except that XFS could
use paio_ioctl.  Alternatives are:

1) require a new-enough kernel and only use fallocate; return a NULL
aiocb if !has_discard and convert it to ENOTSUP.

2) extract now from my threads branch the work to generalize
posix-aio-compat into a more flexible threadpool, and move the AIO code
back to block/raw-posix.c.

Paolo

* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-09 10:43     ` Paolo Bonzini
@ 2012-03-09 10:53       ` Stefan Hajnoczi
  2012-03-09 10:57         ` Paolo Bonzini
  0 siblings, 1 reply; 68+ messages in thread
From: Stefan Hajnoczi @ 2012-03-09 10:53 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On Fri, Mar 9, 2012 at 10:43 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 09/03/2012 11:31, Stefan Hajnoczi ha scritto:
>>> > +#ifdef FALLOC_FL_PUNCH_HOLE
>>> > +    retval = fallocate(s->fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
>>> > +                       sector_num << 9, (int64_t)nb_sectors << 9);
>> I'm concerned about introducing blocking syscalls in coroutine code
>> paths.  This needs to be done asynchronously.
>
> Right; it is no worse than what is already there, except that XFS could
> use paio_ioctl.  Alternatives are:

It is worse in the sense that if you make this feature more widely
available then it deserves a production-quality implementation.

> 1) require a new-enough kernel and only use fallocate; return a NULL
> aiocb if !has_discard and convert it to ENOTSUP.
>
> 2) extract now from my threads branch the work to generalize
> posix-aio-compat into a more flexible threadpool, and move the AIO code
> back to block/raw-posix.c.

3) Add discard to posix-aio-compat.c as an AIO request?

Stefan

* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-09 10:53       ` Stefan Hajnoczi
@ 2012-03-09 10:57         ` Paolo Bonzini
  0 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-09 10:57 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel

Il 09/03/2012 11:53, Stefan Hajnoczi ha scritto:
> 
>> > 1) require a new-enough kernel and only use fallocate; return a NULL
>> > aiocb if !has_discard and convert it to ENOTSUP.
>> >
>> > 2) extract now from my threads branch the work to generalize
>> > posix-aio-compat into a more flexible threadpool, and move the AIO code
>> > back to block/raw-posix.c.
> 3) Add discard to posix-aio-compat.c as an AIO request?

That would be (1) or you would need to pass down is_xfs.

Paolo

* Re: [Qemu-devel] [RFC PATCH 04/17] qed: implement bdrv_aio_discard
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 04/17] qed: implement bdrv_aio_discard Paolo Bonzini
@ 2012-03-09 16:31   ` Kevin Wolf
  2012-03-09 17:53     ` Paolo Bonzini
  0 siblings, 1 reply; 68+ messages in thread
From: Kevin Wolf @ 2012-03-09 16:31 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

Am 08.03.2012 18:15, schrieb Paolo Bonzini:
> Move the bdrv_co_write_zeroes operation to bdrv_aio_discard, so that
> the ugly coroutine loop can be eliminated.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

But the logically correct (and therefore easier to understand) way would
be to let bdrv_aio_discard call bdrv_co_write_zeroes instead of the
other way round. Mapping write_zeroes to discard is only correct as long
as QED implements discard as zero writes instead of really discarding data.

Kevin


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations Paolo Bonzini
@ 2012-03-09 16:37   ` Kevin Wolf
  2012-03-09 18:06     ` Paolo Bonzini
  0 siblings, 1 reply; 68+ messages in thread
From: Kevin Wolf @ 2012-03-09 16:37 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 08.03.2012 18:15, Paolo Bonzini wrote:
> Remove the bdrv_co_write_zeroes callback.  Instead use the discard
> information from bdrv_get_info to choose between bdrv_co_discard
> and a normal write.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

I'm not sure if this is a good idea.

The goal of discard is to remove data from the image (or not add it if
it isn't there yet) and ideally deallocate the used clusters. The goal
of write_zeroes is to mark space as zero and explicitly allocate it for
this purpose.

From a guest point of view these are pretty similar, but from a host
perspective I'd say there's a difference.

Kevin


* Re: [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero
  2012-03-08 17:55   ` Avi Kivity
@ 2012-03-09 16:42     ` Kevin Wolf
  2012-03-12 10:42       ` Avi Kivity
  0 siblings, 1 reply; 68+ messages in thread
From: Kevin Wolf @ 2012-03-09 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Paolo Bonzini, qemu-devel

On 08.03.2012 18:55, Avi Kivity wrote:
> On 03/08/2012 07:15 PM, Paolo Bonzini wrote:
>> While the formats/protocols are free to implement a discard operation
>> that does *not* zero the data, providing a common view to the guests is
>> the only way to avoid configuration and migration nightmares.  (We don't
>> want the admin to manually set up discard_zeroes_data/discard_granularity!)
>>
>> This has a couple of drawbacks:
>>
>> 1) QEMU effectively will not try to use discard unless discard_zeroes_data
>> is true.  To do this we could add a flag such as BDRV_ALWAYS_DISCARD,
>> which the device models could use if their discard_zeroes_data property is
>> set to false.  However, I haven't done this, estimating that no strictly
>> positive integer could represent the number of people using it.
>>
>> 2) it may turn a thin-provisioning operation into a full preallocation,
>> which is not nice.
>>
>> 3) it may cost memory for a bounce buffer that only needs to be filled
>> with zeroes.
>>
>> While (3) can be worked around, the only way around the other two,
>> unfortunately, is support in the formats and protocols.  We will still
>> provide device options to opt out of this, but with raw and qed covered
>> (+ qcow2 without backing file, and qcow3 in the future) it should not
>> be too bad.
> 
> Can't qcow2 with a backing file also be supported?  Zero out the first
> cluster, and remember it.  The following discards can reuse this zero
> cluster, as long as it hasn't been overwritten.

qcow2 can't handle clusters that are referenced twice from the same L1
table. This would require a reverse lookup to adjust the QCOW_O_COPIED
flags in the L2 tables containing the other references.

Kevin


* Re: [Qemu-devel] [RFC PATCH 03/17] block: add discard properties to BlockDriverInfo
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 03/17] block: add discard properties to BlockDriverInfo Paolo Bonzini
@ 2012-03-09 16:47   ` Kevin Wolf
  0 siblings, 0 replies; 68+ messages in thread
From: Kevin Wolf @ 2012-03-09 16:47 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 08.03.2012 18:15, Paolo Bonzini wrote:
> In the next patches we will declare the guest's semantics for discard
> to be "always zero the data".  We need to know whether the operation
> need to be emulated or can be passed down.  For this purpose we need
> to know the semantics of the operation and its granularity.
> 
> The granularity may not be related to the cluster size (for example
> "raw" does not have a cluster size), so add it separately.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block.h       |    4 ++++
>  block/qcow2.c |    2 ++
>  qemu-io.c     |    5 ++++-
>  3 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/block.h b/block.h
> index 48d0bf3..7feda73 100644
> --- a/block.h
> +++ b/block.h
> @@ -15,6 +15,10 @@ typedef struct BlockDriverInfo {
>      int cluster_size;
>      /* offset at which the VM state can be saved (0 if not possible) */
>      int64_t vm_state_offset;
> +    /* whether discard is guaranteed to zero bytes */
> +    bool discard_zeroes_data;
> +    /* discard granularity in sectors */
> +    int discard_granularity;

Why not in bytes? The arbitrary 512-byte units used in the block layer
were probably not the best idea ever and we shouldn't spread them further.

Kevin


* Re: [Qemu-devel] [RFC PATCH 04/17] qed: implement bdrv_aio_discard
  2012-03-09 16:31   ` Kevin Wolf
@ 2012-03-09 17:53     ` Paolo Bonzini
  0 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-09 17:53 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel

On 09/03/2012 17:31, Kevin Wolf wrote:
>> > Move the bdrv_co_write_zeroes operation to bdrv_aio_discard, so that
>> > the ugly coroutine loop can be eliminated.
>> > 
>> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> But the logically correct (and therefore easier to understand) way would
> be to let bdrv_aio_discard call bdrv_co_write_zeroes instead of the
> other way round. Mapping write_zeroes to discard is only correct as long
> as QED implements discard as zero writes instead of really discarding data.

Well, bdrv_co_write_zeroes will disappear soon.

Paolo


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-09 16:37   ` Kevin Wolf
@ 2012-03-09 18:06     ` Paolo Bonzini
  2012-03-10 18:02       ` Richard Laager
  0 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-09 18:06 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel

On 09/03/2012 17:37, Kevin Wolf wrote:
>> > Remove the bdrv_co_write_zeroes callback.  Instead use the discard
>> > information from bdrv_get_info to choose between bdrv_co_discard
>> > and a normal write.
>> > 
>> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> I'm not sure if this a good idea.
> 
> The goal of discard is to remove data from the image (or not add it if
> it isn't there yet) and ideally deallocate the used clusters. The goal
> of write_zeroes is to mark space as zero and explicitly allocate it for
> this purpose.
> 
> From a guest point of view these are pretty similar, but from a host
> perspective I'd say there's a difference.

True.  However, we need to present a uniform view to the guests,
including the granularity, or discard can never be enabled.  The
granularity must be 512 on IDE (though it can be higher on SCSI), so
there are problems mapping block layer discard straight down to the guest.

There are basically three ways to do this:

1) we could cheat and present a discard_granularity that is smaller than
what the underlying format/protocol supports.  This is fine but forces
discard_zeroes_data to be false.  That's a pity because Linux 3.4 will
start using efficient zero write operations (WRITE SAME on SCSI, but
could be extended to UNMAP/TRIM if discard_zeroes_data is true).

2) we can make an emulated discard that always supports 512-byte
granularity and always zeroes data.  This patch series takes this route.

3) we can let the user choose between (1) and (2).  I didn't choose this
because of laziness mostly---co_write_zeroes support is not really
complete, for example there's no aio version to use in device
models---and also because I doubt anyone would really use the option.

Paolo


* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming Paolo Bonzini
  2012-03-09  8:20   ` Chris Wedgwood
  2012-03-09 10:31   ` Stefan Hajnoczi
@ 2012-03-09 20:36   ` Richard Laager
  2012-03-12  9:34     ` Paolo Bonzini
  2012-03-24 15:40     ` Christoph Hellwig
  2 siblings, 2 replies; 68+ messages in thread
From: Richard Laager @ 2012-03-09 20:36 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

I was just working on this as well, though your implementation is *far*
more complete than mine. (I was only looking at making changes to the
discard implementation in block/raw-posix.c.)

I've got several comments, which I've separated by logical topic...


-------- BLKDISCARD --------

fallocate() only supports files. In my patch, I was fstat()ing the fd
just after opening it and caching an is_device boolean. Then, when doing
discards, if !is_device, call fallocate(), else (i.e. for devices) do
the following (note: untested code):

        __uint64_t range[2];

        range[0] = sector_num << 9;
        range[1] = (__uint64_t)nb_sectors << 9;

        if (ioctl(s->fd, BLKDISCARD, &range) && errno != EOPNOTSUPP) {
             return -errno;
        }
        return 0;

Note that for BLKDISCARD, you probably do NOT want to clear has_discard
if you get EOPNOTSUPP. For example, if you have LVM across multiple
disks where some support discard and some do not, you'll get EOPNOTSUPP
for some discard requests, but not all. ext4 had a behavior where it
would stop issuing discards after one failed, but the LVM guys heavily
criticized that and asked them to get rid of that behavior. (I haven't
checked to make sure it actually happened.)

I'd imagine it's still fine to stop trying fallocate() hole punches
after the first failure; I'm not aware of any filesystem where discards
would be supported in only some ranges of a single file, nor would that
make much sense.


-------- sync requirements --------

I'm not sure if fallocate() and/or BLKDISCARD always guarantee that the
discard has made it to stable storage. If they don't, does O_DIRECT or
O_DSYNC on open() cause them to make any such guarantee? If not, should
you be calling fdatasync() or fsync() when the user has specified
cache=direct or cache=writethrough?

Note that the Illumos implementation (see below) has a flag to ask for
either behavior.


-------- non-Linux Implementations --------

FreeBSD and Illumos have ioctl()s similar to BLKDISCARD and I have
similar untested code for those as well:

/* FreeBSD */
#ifdef DIOCGDELETE
    if (s->is_device) {
        off_t range[2];

        range[0] = sector_num << 9;
        range[1] = (off_t)nb_sectors << 9;

        if (ioctl(s->fd, DIOCGDELETE, &range) != 0 && errno != ENOIOCTL)
        {
            return -errno;
        }
        return 0;
    }
#endif


/* Illumos */
#ifdef DKIOCFREE
    if (s->is_device) {
        dkioc_free_t df;

        /* TODO: Is this correct? Are the other discard ioctl()s sync or async? */
        df.df_flags = (s->open_flags & (O_DIRECT | O_DSYNC)) ?
            DF_WAIT_SYNC : 0;
        df.df_start = sector_num << 9;
        df.df_length = (diskaddr_t)nb_sectors << 9;

        if (ioctl(s->fd, DKIOCFREE, &df) != 0 && errno != ENOTTY)
        {
            return -errno;
        }
        return 0;
    }
#endif



-------- Thin vs. Thick Provisioning --------

On Fri, 2012-03-09 at 00:35 -0800, Chris Wedgwood wrote:
> Simplest still compare the blocks allocated by the file
> to its length (i.e. stat.st_blocks != stat.st_size>>9).

I thought of this as well. It covers "99%" of the cases, but there's one
case where it breaks down. Imagine I have a sparse file backing my
virtual disk. In the guest, I fill the virtual disk 100%. Then, I
restart QEMU. Now it thinks that sparse file is non-sparse and stops
issuing hole punches.

To be completely correct, I suggest the following behavior:
     1. Add a discard boolean option to the disk layer.
     2. If discard is not specified:
              * For files, detect a true/false value by comparing
                stat.st_blocks != stat.st_size>>9.
              * For devices, assume a fixed value (true?).
     3. If discard is true, issue discards.
     4. If discard is false, do not issue discards.


-------- CONFIG_XFS --------

XFS has implemented the FALLOC_FL_PUNCH_HOLE interface since it was
created. So unless you're concerned about people compiling QEMU on newer
systems with FALLOC_FL_PUNCH_HOLE and then running that binary on older
kernels, there's no case where one needs both the CONFIG_XFS code and
FALLOC_FL_PUNCH_HOLE code.

-- 
Richard



* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-09 18:06     ` Paolo Bonzini
@ 2012-03-10 18:02       ` Richard Laager
  2012-03-12 12:27         ` Paolo Bonzini
  2012-03-24 15:27         ` Christoph Hellwig
  0 siblings, 2 replies; 68+ messages in thread
From: Richard Laager @ 2012-03-10 18:02 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, qemu-devel

I believe your patch set provides these behaviors now:
      * QEMU block drivers report discard_granularity.
              * discard_granularity = 0 means no discard;
                the guest is told there's no discard support.
              * discard_granularity < 0 is undefined.
              * discard_granularity > 0 is reported to the guest as
                discard support.
      * QEMU block drivers report discard_zeros_data.
                This is passed to the guest when discard_granularity > 0.

I propose adding the following behaviors in any event:
      * If a QEMU block device reports a discard_granularity > 0, it
        must be equal to 2^n (n >= 0), or QEMU's block core will change
        it to 0. (Non-power-of-two granularities are not likely to exist
        in the real world, and this assumption greatly simplifies
        ensuring correctness.)
      * For SCSI, report an unmap_granularity to the guest as follows:
      max(logical_block_size, discard_granularity) / logical_block_size

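The two proposed rules can be sketched as follows (a hedged illustration;
the helper names are hypothetical, not QEMU APIs, and granularities are
taken in bytes):

```c
#include <stdint.h>

/* Rule 1: a granularity that is not a positive power of two is treated
 * as 0 (no discard).  (g & (g - 1)) == 0 holds exactly for powers of two. */
static int sanitize_granularity(int g)
{
    if (g <= 0 || (g & (g - 1)) != 0) {
        return 0;
    }
    return g;
}

/* Rule 2: SCSI unmap granularity reported to the guest, expressed in
 * logical blocks: max(logical_block_size, discard_granularity) /
 * logical_block_size. */
static int unmap_granularity(int logical_block_size, int discard_granularity)
{
    int g = discard_granularity > logical_block_size
          ? discard_granularity : logical_block_size;
    return g / logical_block_size;
}
```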

Regarding emulating discard_zeros_data...

I agree that when discard_zeros_data is set, we will need to write
zeroes in some cases. As you noted, IDE has a fixed granularity of one
sector. And the SCSI granularity is a hint only; guests are not
guaranteed to align to that value either. [0]

As a design concept, instead of guaranteeing that 512B zero'ing discards
are supported, I think the QEMU block layer should instead guarantee
aligned discards to QEMU block devices, emulating any misaligned
discards (or portions thereof) by writing zeroes if (and only if)
discard_zeros_data is set. When the QEMU block layer gets a discard:
      * Of the specified discard range, see if it includes an aligned
        multiple of discard granularity. If so, save that as the
        starting point of a subrange. Then find the last aligned
        multiple, if any, and pass that subrange (if start != end) down
        to the block driver's discard function.
      * If the discard really fails (i.e. returns failure and sets errno
        to something other than "not supported" or equivalent), return
        failure to the guest. For "not supported", fall through to the
        code below with the full range.
      * At this point, we have zero, one, or two subranges to handle.
      * If and only if discard_zeros_data is set, write zeros to the
        remaining subranges, if any. (This would use a lower-level
        write_zeroes call which does not attempt to use discard.) If
        this fails, return failure to the guest.
      * Return success.

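The subrange computation described above can be sketched as follows
(illustrative only; split_discard is a hypothetical name, and the
granularity is assumed, per the earlier point, to be a power of two):

```c
#include <stdint.h>

/* Split the byte range [start, end) into an aligned middle
 * [*mid_start, *mid_end) that can be passed down to the block driver's
 * discard function.  The head [start, *mid_start) and tail
 * [*mid_end, end) are the pieces that would be emulated by writing
 * zeroes when discard_zeros_data is set.  gran must be a power of two. */
static void split_discard(uint64_t start, uint64_t end, uint64_t gran,
                          uint64_t *mid_start, uint64_t *mid_end)
{
    *mid_start = (start + gran - 1) & ~(gran - 1);   /* round up   */
    *mid_end   = end & ~(gran - 1);                  /* round down */
    if (*mid_start >= *mid_end) {
        *mid_start = *mid_end = start;   /* no aligned subrange at all */
    }
}
```
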
This leaves one remaining issue: In raw-posix.c, for files (i.e. not
devices), I assume you're going to advertise discard_granularity=1 and
discard_zeros_data=1 when compiled with support for
fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually
guarantees that it zeros the data when punching holes. I haven't
verified this.

If the guest does a big discard (think mkfs) and fallocate() returns
EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
which, as you noted, will also allocate it (unless you explicitly check
for holes). This is bad. It can be avoided by not advertising
discard_zeros_data, but as you noted, that's unfortunate.

If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
going to work. This would side step these problems. You said it wasn't
possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
by extending the file by one byte and then punching that:
        char buf = 0;
        fstat(s->fd, &st);
        pwrite(s->fd, &buf, 1, st.st_size + 1);
        has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                 st.st_size + 1, 1);
        ftruncate(s->fd, st.st_size);


[0] See the last paragraph starting on page 8:
    http://mkp.net/pubs/linux-advanced-storage.pdf

-- 
Richard




* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-09 20:36   ` Richard Laager
@ 2012-03-12  9:34     ` Paolo Bonzini
  2012-03-24 15:40     ` Christoph Hellwig
  1 sibling, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-12  9:34 UTC (permalink / raw)
  To: Richard Laager; +Cc: qemu-devel

On 09/03/2012 21:36, Richard Laager wrote:
> fallocate() only supports files. In my patch, I was fstat()ing the fd
> just after opening it and caching an is_device boolean. Then, when doing
> discards, if !is_device, call fallocate(), else (i.e. for devices) do
> the following (note: untested code):
> 
>         __uint64_t range[2];
> 
>         range[0] = sector_num << 9;
>         range[1] = (__uint64_t)nb_sectors << 9;
> 
>         if (ioctl(s->fd, BLKDISCARD, &range) && errno != EOPNOTSUPP) {
>              return -errno;
>         }
>         return 0;

You can instead put this code in a separate function, hdev_discard.  The
hdev driver is used instead of the file driver when the backend is a
special file (block or character).

> -------- sync requirements --------
> 
> I'm not sure if fallocate() and/or BLKDISCARD always guarantee that the
> discard has made it to stable storage. If they don't, does O_DIRECT or
> O_DSYNC on open() cause them to make any such guarantee?

O_DIRECT shouldn't have any effect; for O_DSYNC I don't think so.

> -------- Thin vs. Thick Provisioning --------
> 
> On Fri, 2012-03-09 at 00:35 -0800, Chris Wedgwood wrote:
>> Simplest still compare the blocks allocated by the file
>> to its length (i.e. stat.st_blocks != stat.st_size>>9).
> 
> I thought of this as well. It covers "99%" of the cases, but there's one
> case where it breaks down. Imagine I have a sparse file backing my
> virtual disk. In the guest, I fill the virtual disk 100%. Then, I
> restart QEMU. Now it thinks that sparse file is non-sparse and stops
> issuing hole punches.

I would not really have any problem with that; it seems like a
relatively minor case (also because newer guests will allocate the first
partition at 1 MB, so you might still have a small hole at the beginning).

> To be completely correct, I suggest the following behavior:
>      1. Add a discard boolean option to the disk layer.
>      2. If discard is not specified:
>               * For files, detect a true/false value by comparing
>                 stat.st_blocks != stat.st_size>>9.
>               * For devices, assume a fixed value (true?).
>      3. If discard is true, issue discards.
>      4. If discard is false, do not issue discards.

The problem is, who will use this interface?

Paolo


* Re: [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero
  2012-03-09 16:42     ` Kevin Wolf
@ 2012-03-12 10:42       ` Avi Kivity
  2012-03-12 11:04         ` Kevin Wolf
  0 siblings, 1 reply; 68+ messages in thread
From: Avi Kivity @ 2012-03-12 10:42 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

On 03/09/2012 06:42 PM, Kevin Wolf wrote:
> >>
> >> While (3) can be worked around, the only way around the other two,
> >> unfortunately, is support in the formats and protocols.  We will still
> >> provide device options to opt out of this, but with raw and qed covered
> >> (+ qcow2 without backing file, and qcow3 in the future) it should not
> >> be too bad.
> > 
> > Can't qcow2 with a backing file also be supported?  Zero out the first
> > cluster, and remember it.  The following discards can reuse this zero
> > cluster, as long as it hasn't been overwritten.
>
> qcow2 can't handle clusters that are referenced twice from the same L1
> table. This would require a reverse lookup to adjust the QCOW_O_COPIED
> flags in the L2 tables containing the other references.

Don't follow, sorry.  What adjustment are you talking about?

If it's a 1->0 transition, is it mandatory to adjust the flag?  That is,
it it legal to have a refcount of exactly one, but have the flag clear?

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero
  2012-03-12 10:42       ` Avi Kivity
@ 2012-03-12 11:04         ` Kevin Wolf
  2012-03-12 12:03           ` Avi Kivity
  0 siblings, 1 reply; 68+ messages in thread
From: Kevin Wolf @ 2012-03-12 11:04 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Paolo Bonzini, qemu-devel

On 12.03.2012 11:42, Avi Kivity wrote:
> On 03/09/2012 06:42 PM, Kevin Wolf wrote:
>>>>
>>>> While (3) can be worked around, the only way around the other two,
>>>> unfortunately, is support in the formats and protocols.  We will still
>>>> provide device options to opt out of this, but with raw and qed covered
>>>> (+ qcow2 without backing file, and qcow3 in the future) it should not
>>>> be too bad.
>>>
>>> Can't qcow2 with a backing file also be supported?  Zero out the first
>>> cluster, and remember it.  The following discards can reuse this zero
>>> cluster, as long as it hasn't been overwritten.
>>
>> qcow2 can't handle clusters that are referenced twice from the same L1
>> table. This would require a reverse lookup to adjust the QCOW_O_COPIED
>> flags in the L2 tables containing the other references.
> 
> Don't follow, sorry.  What adjustment are you talking about?
> 
> If it's a 1->0 transition, is it mandatory to adjust the flag?  That is,
> is it legal to have a refcount of exactly one, but have the flag clear?

According to the spec it's illegal and qemu-img check will complain,
too. In practice, I'm not entirely sure if it will cause real corruption
or just unnecessary COWs. I believe it may be the latter.

But it's not worth the trouble anyway when we can have a real zero flag
in qcow3.

Kevin


* Re: [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero
  2012-03-12 11:04         ` Kevin Wolf
@ 2012-03-12 12:03           ` Avi Kivity
  0 siblings, 0 replies; 68+ messages in thread
From: Avi Kivity @ 2012-03-12 12:03 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

On 03/12/2012 01:04 PM, Kevin Wolf wrote:
> >>
> >> qcow2 can't handle clusters that are referenced twice from the same L1
> >> table. This would require a reverse lookup to adjust the QCOW_O_COPIED
> >> flags in the L2 tables containing the other references.
> > 
> > Don't follow, sorry.  What adjustment are you talking about?
> > 
> > If it's a 1->0 transition, is it mandatory to adjust the flag?  That is,
> > is it legal to have a refcount of exactly one, but have the flag clear?
>
> According to the spec it's illegal and qemu-img check will complain,
> too. In practice, I'm not entirely sure if it will cause real corruption
> or just unnecessary COWs. I believe it may be the latter.

We could retro-doc it but I'm not a huge fan of this practice.

> But it's not worth the trouble anyway when we can have a real zero flag
> in qcow3.

ok.  Thanks for the explanations.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-10 18:02       ` Richard Laager
@ 2012-03-12 12:27         ` Paolo Bonzini
  2012-03-12 13:04           ` Kevin Wolf
  2012-03-13 19:13           ` Richard Laager
  2012-03-24 15:27         ` Christoph Hellwig
  1 sibling, 2 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-12 12:27 UTC (permalink / raw)
  To: Richard Laager; +Cc: Kevin Wolf, qemu-devel

On 10/03/2012 19:02, Richard Laager wrote:
> I propose adding the following behaviors in any event:
>       * If a QEMU block device reports a discard_granularity > 0, it
>         must be equal to 2^n (n >= 0), or QEMU's block core will change
>         it to 0. (Non-power-of-two granularities are not likely to exist
>         in the real world, and this assumption greatly simplifies
>         ensuring correctness.)

Yeah, I was considering this to be simply a bug in the block device.

>       * For SCSI, report an unmap_granularity to the guest as follows:
>       max(logical_block_size, discard_granularity) / logical_block_size

This is more or less already in place later in the series.

> As a design concept, instead of guaranteeing that 512B zero'ing discards
> are supported, I think the QEMU block layer should instead guarantee
> aligned discards to QEMU block devices, emulating any misaligned
> discards (or portions thereof) by writing zeroes if (and only if)
> discard_zeros_data is set.

Yes, this can be done of course.  This series does not include it yet.

> This leaves one remaining issue: In raw-posix.c, for files (i.e. not
> devices), I assume you're going to advertise discard_granularity=1 and
> discard_zeros_data=1 when compiled with support for
> fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually
> guarantees that it zeros the data when punching holes.

It does, that's pretty much the definition of a hole.

> If the guest does a big discard (think mkfs) and fallocate() returns
> EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
> which, as you noted, will also allocate it (unless you explicitly check
> for holes). This is bad. It can be avoided by not advertising
> discard_zeros_data, but as you noted, that's unfortunate.

If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
be done by skipping the zero write on known holes.

This could even be done at the block layer level using bdrv_is_allocated.
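
A minimal sketch of such a hole check at the raw-posix level (assumes
Linux >= 3.1 with SEEK_DATA; on older kernels lseek fails with EINVAL and
the caller must fall back to actually writing the zeroes):

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <errno.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3   /* Linux value, in case the headers predate it */
#endif

/* Return 1 if [offset, offset + len) is entirely a hole, 0 if it
 * contains data, -1 if SEEK_DATA is not supported.  Note that this
 * moves the file offset; real code would restore it or use a dup'd fd. */
static int range_is_hole(int fd, off_t offset, off_t len)
{
    off_t data = lseek(fd, offset, SEEK_DATA);
    if (data < 0) {
        return errno == ENXIO ? 1 : -1;  /* ENXIO: no data at or past offset */
    }
    return data >= offset + len;         /* first data lies beyond our range */
}
```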

> If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> going to work. This would side step these problems. 

... and introduce others when migrating if your datacenter doesn't have
homogeneous kernel versions and/or filesystems. :(

> You said it wasn't
> possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
> by extending the file by one byte and then punching that:
>         char buf = 0;
>         fstat(s->fd, &st);
>         pwrite(s->fd, &buf, 1, st.st_size + 1);
>         has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                                  st.st_size + 1, 1);
>         ftruncate(s->fd, st.st_size);

Nice trick. :)   Yes, that could work.

Do you know if non-Linux operating systems have something similar to
BLKDISCARDZEROES?

Paolo


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-12 12:27         ` Paolo Bonzini
@ 2012-03-12 13:04           ` Kevin Wolf
  2012-03-13 19:13           ` Richard Laager
  1 sibling, 0 replies; 68+ messages in thread
From: Kevin Wolf @ 2012-03-12 13:04 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Richard Laager, qemu-devel

On 12.03.2012 13:27, Paolo Bonzini wrote:
> Il 10/03/2012 19:02, Richard Laager ha scritto:
>> I propose adding the following behaviors in any event:
>>       * If a QEMU block device reports a discard_granularity > 0, it
>>         must be equal to 2^n (n >= 0), or QEMU's block core will change
>>         it to 0. (Non-power-of-two granularities are not likely to exist
>>         in the real world, and this assumption greatly simplifies
>>         ensuring correctness.)
> 
> Yeah, I was considering this to be simply a bug in the block device.

Block driver you mean? Yes, I think we can just assert() this.

Kevin


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-12 12:27         ` Paolo Bonzini
  2012-03-12 13:04           ` Kevin Wolf
@ 2012-03-13 19:13           ` Richard Laager
  2012-03-14  7:41             ` Paolo Bonzini
  1 sibling, 1 reply; 68+ messages in thread
From: Richard Laager @ 2012-03-13 19:13 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On Mon, 2012-03-12 at 10:34 +0100, Paolo Bonzini wrote:
> To be completely correct, I suggest the following behavior:
> >      1. Add a discard boolean option to the disk layer.
> >      2. If discard is not specified:
> >               * For files, detect a true/false value by comparing
> >                 stat.st_blocks != stat.st_size>>9.
> >               * For devices, assume a fixed value (true?).
> >      3. If discard is true, issue discards.
> >      4. If discard is false, do not issue discards.
> 
> The problem is, who will use this interface?

I'm a libvirt and virt-manager user; virt-manager already differentiates
between thin and thick provisioning. So I'm envisioning passing that
information to libvirt, which would save it in a config file and use
that to set discard=true vs. discard=false when using QEMU.

On Mon, 2012-03-12 at 13:27 +0100, Paolo Bonzini wrote:
> Il 10/03/2012 19:02, Richard Laager ha scritto:
> > I propose adding the following behaviors in any event:
> >       * If a QEMU block device reports a discard_granularity > 0, it
> >         must be equal to 2^n (n >= 0), or QEMU's block core will change
> >         it to 0. (Non-power-of-two granularities are not likely to exist
> >         in the real world, and this assumption greatly simplifies
> >         ensuring correctness.)
> 
> Yeah, I was considering this to be simply a bug in the block device.
> 
> >       * For SCSI, report an unmap_granularity to the guest as follows:
> >       max(logical_block_size, discard_granularity) / logical_block_size
> 
> This is more or less already in place later in the series.

I didn't see it. Which patch number?

> > Note, I'm assuming fallocate() actually
> > guarantees that it zeros the data when punching holes.
> 
> It does, that's pretty much the definition of a hole.

Agreed. I verified this fact after sending that email. At the time, I
just wanted to be very clear on what I knew for sure vs. what I had not
yet verified.

> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
> be done by skipping the zero write on known holes.
> 
> This could even be done at the block layer level using bdrv_is_allocated.

Would we want to make all write_zeros operations check for and skip
holes, or is write_zeros different from a discard in that it SHOULD/MUST
allocate space?

> > If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> > advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> > going to work. This would sidestep these problems.
> 
> ... and introduce others when migrating if your datacenter doesn't have
> homogeneous kernel versions and/or filesystems. :(

I hadn't thought of the migration issues. Thanks for bringing that up.

Worst case, you end up doing a bunch of zero writing if and only if you
migrate from a discard_zeros_data host to one that doesn't (or doesn't
do discard at all). But this only lasts until the guest reboots
(assuming we also add a behavior of re-probing on guest reboot--or until
it shuts down if we don't or can't). As far as I can see, this is
unavoidable, though. And this is no worse than writing zeros ALL of the
time that fallocate() fails, which is the behavior of your patch series,
right?

This might be another use case for a discard option on the disk. If some
but not all of one's hosts support discard, a system administrator might
want to set discard=0 to avoid this.

> Do you know if non-Linux operating systems have something similar to
> BLKDISCARDZEROES?

As far as I know, no. The SunOS one is only on Illumos (the open source
kernel forked from the now dead OpenSolaris) and only implemented for
ZFS zvols. So currently, it's roughly equivalent to fallocate() on Linux
in that it's happening at the filesystem level. (It doesn't actually
reach the platters yet. But even if it did, that's unrelated to the
guarantees provided by ZFS.) Thus, it always zeros, so we could set
discard_zeros_data = 1 unconditionally there. I should probably run that
by the Illumos developers, though, to ensure they're comfortable with
that ioctl() guaranteeing zeroing.

I haven't looked into the FreeBSD one as much yet. Worst case, we
unconditionally set discard_zeros_data = 0.

-- 
Richard

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-13 19:13           ` Richard Laager
@ 2012-03-14  7:41             ` Paolo Bonzini
  2012-03-14 12:01               ` Kevin Wolf
  2012-03-15  0:42               ` Richard Laager
  0 siblings, 2 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-14  7:41 UTC (permalink / raw)
  To: Richard Laager; +Cc: qemu-devel

Il 13/03/2012 20:13, Richard Laager ha scritto:
>> > To be completely correct, I suggest the following behavior:
>>> > >      1. Add a discard boolean option to the disk layer.
>>> > >      2. If discard is not specified:
>>> > >               * For files, detect a true/false value by comparing
>>> > >                 stat.st_blocks != stat.st_size>>9.
>>> > >               * For devices, assume a fixed value (true?).
>>> > >      3. If discard is true, issue discards.
>>> > >      4. If discard is false, do not issue discards.
>> > 
>> > The problem is, who will use this interface?
> I'm a libvirt and virt-manager user; virt-manager already differentiates
> between thin and thick provisioning. So I'm envisioning passing that
> information to libvirt, which would save it in a config file and use
> that to set discard=true vs. discard=false when using QEMU.

Yeah, it could be set also at the pool level for libvirt.

>>> > >       * For SCSI, report an unmap_granularity to the guest as follows:
>>> > >       max(logical_block_size, discard_granularity) / logical_block_size
>> > 
>> > This is more or less already in place later in the series.
> I didn't see it. Which patch number?

Patch 11:

+    discard_granularity = s->qdev.conf.discard_granularity;
+    if (discard_granularity == -1) {
+        s->qdev.conf.discard_granularity = s->qdev.conf.logical_block_size;
+    } else if (discard_granularity < s->qdev.conf.logical_block_size) {
+        error_report("scsi-block: invalid discard_granularity");
+        return -1;
+    } else if (discard_granularity & (discard_granularity - 1)) {
+        error_report("scsi-block: discard_granularity not a power of two");
+        return -1;
+    }

> > If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
> > be done by skipping the zero write on known holes.
> > 
> > This could even be done at the block layer level using bdrv_is_allocated.
> 
> Would we want to make all write_zeros operations check for and skip
> holes, or is write_zeros different from a discard in that it SHOULD/MUST
> allocate space?

I think that's pretty much the question to answer for this patch to graduate
from the RFC state (the rest is just technicalities, so to speak).  So far,
write_zeros was intended to be an efficient operation (it avoids allocating
a cluster in qed and will do the same in qcow3, which is why I decided to
merge it with discard).

>>> > > If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
>>> > > advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
>>> > > going to work. This would sidestep these problems.
>> > 
>> > ... and introduce others when migrating if your datacenter doesn't have
>> > homogeneous kernel versions and/or filesystems. :(
> I hadn't thought of the migration issues. Thanks for bringing that up.
> 
> Worst case, you end up doing a bunch of zero writing if and only if you
> migrate from a discard_zeros_data host to one that doesn't (or doesn't
> do discard at all). But this only lasts until the guest reboots
> (assuming we also add a behavior of re-probing on guest reboot--or until
> it shuts down if we don't or can't). As far as I can see, this is
> unavoidable, though. And this is no worse than writing zeros ALL of the
> time that fallocate() fails, which is the behavior of your patch series,
> right?

It is worse in that we do not want the hardware parameters exposed to the
guest to change behind the scenes, except if you change the machine type
or if you use the default unversioned type.

> This might be another use case for a discard option on the disk. If some
> but not all of one's hosts support discard, a system administrator might
> want to set discard=0 to avoid this.

A discard option is already there (discard_granularity=0).  Libvirt could
choose to expose it even now, but it would be of little use in the absence
of support for unaligned discards.

Paolo


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-14  7:41             ` Paolo Bonzini
@ 2012-03-14 12:01               ` Kevin Wolf
  2012-03-14 12:14                 ` Paolo Bonzini
  2012-03-24 15:29                 ` Christoph Hellwig
  2012-03-15  0:42               ` Richard Laager
  1 sibling, 2 replies; 68+ messages in thread
From: Kevin Wolf @ 2012-03-14 12:01 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Richard Laager, qemu-devel

Am 14.03.2012 08:41, schrieb Paolo Bonzini:
> Il 13/03/2012 20:13, Richard Laager ha scritto:
>>> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
>>> be done by skipping the zero write on known holes.
>>>
>>> This could even be done at the block layer level using bdrv_is_allocated.
>>
>> Would we want to make all write_zeros operations check for and skip
>> holes, or is write_zeros different from a discard in that it SHOULD/MUST
>> allocate space?
> 
> I think that's pretty much the question to answer for this patch to graduate
> from the RFC state (the rest is just technicalities, so to speak).  So far,
> write_zeros was intended to be an efficient operation (it avoids allocating
> a cluster in qed and will do the same in qcow3, which is why I decided to
> merge it with discard).

Yes, for qcow3 and to some degree also for QED, setting the zero flag is
the natural implementation for both discard and write_zeros. The big
question is what happens with other formats.

Paolo mentioned a use case as a fast way for guests to write zeros, but
is it really faster than a normal write when we have to emulate it by a
bdrv_write with a temporary buffer of zeros? On the other hand we have
the cases where discard really means "I don't care about the data any
more" and emulating it by writing zeros is just a waste of resources there.

So I think we only want to advertise that discard zeroes data if we can
do it efficiently. This means that the format does support it, and that
the device is able to communicate the discard granularity (= cluster
size) to the guest OS.

Kevin


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-14 12:01               ` Kevin Wolf
@ 2012-03-14 12:14                 ` Paolo Bonzini
  2012-03-14 12:37                   ` Kevin Wolf
  2012-03-24 15:30                   ` Christoph Hellwig
  2012-03-24 15:29                 ` Christoph Hellwig
  1 sibling, 2 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-14 12:14 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Richard Laager, qemu-devel

Il 14/03/2012 13:01, Kevin Wolf ha scritto:
> Am 14.03.2012 08:41, schrieb Paolo Bonzini:
>> Il 13/03/2012 20:13, Richard Laager ha scritto:
>>>> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
>>>> be done by skipping the zero write on known holes.
>>>>
>>>> This could even be done at the block layer level using bdrv_is_allocated.
>>>
>>> Would we want to make all write_zeros operations check for and skip
>>> holes, or is write_zeros different from a discard in that it SHOULD/MUST
>>> allocate space?
>>
>> I think that's pretty much the question to answer for this patch to graduate
>> from the RFC state (the rest is just technicalities, so to speak).  So far,
>> write_zeros was intended to be an efficient operation (it avoids allocating
>> a cluster in qed and will do the same in qcow3, which is why I decided to
>> merge it with discard).
> 
> Yes, for qcow3 and to some degree also for QED, setting the zero flag is
> the natural implementation for both discard and write_zeros. The big
> question is what happens with other formats.

Also raw if used with a sparse file.

> Paolo mentioned a use case as a fast way for guests to write zeros, but
> is it really faster than a normal write when we have to emulate it by a
> bdrv_write with a temporary buffer of zeros? 

No, of course not.

> On the other hand we have
> the cases where discard really means "I don't care about the data any
> more" and emulating it by writing zeros is just a waste of resources there.
> 
> So I think we only want to advertise that discard zeroes data if we can
> do it efficiently. This means that the format does support it, and that
> the device is able to communicate the discard granularity (= cluster
> size) to the guest OS.

Note that the discard granularity is only a hint, so it's really more a
maximum suggested value than a granularity.  Outside of a cluster
boundary the format would still have to write zeros manually.

Also, Linux for example will only round the number of sectors down to
the granularity, not the start sector.  Rereading the code, for SCSI we
want to advertise a zero granularity (aka do whatever you want),
otherwise we may get only misaligned discard requests and end up writing
zeroes inefficiently all the time.

The problem is that advertising discard_zeroes_data based on the backend
calls for trouble as soon as you migrate between storage formats,
filesystems or disks.

(BTW, if the backing file allows discard and zeroes data, efficient
write-zeroes could be done in qcow2 by allocating a cluster and
discarding its contents.  It's similar to how you do preallocated metadata).

Paolo


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-14 12:14                 ` Paolo Bonzini
@ 2012-03-14 12:37                   ` Kevin Wolf
  2012-03-14 12:49                     ` Paolo Bonzini
  2012-03-24 15:30                   ` Christoph Hellwig
  1 sibling, 1 reply; 68+ messages in thread
From: Kevin Wolf @ 2012-03-14 12:37 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Richard Laager, qemu-devel

Am 14.03.2012 13:14, schrieb Paolo Bonzini:
>> Paolo mentioned a use case as a fast way for guests to write zeros, but
>> is it really faster than a normal write when we have to emulate it by a
>> bdrv_write with a temporary buffer of zeros? 
> 
> No, of course not.
> 
>> On the other hand we have
>> the cases where discard really means "I don't care about the data any
>> more" and emulating it by writing zeros is just a waste of resources there.
>>
>> So I think we only want to advertise that discard zeroes data if we can
>> do it efficiently. This means that the format does support it, and that
>> the device is able to communicate the discard granularity (= cluster
>> size) to the guest OS.
> 
> Note that the discard granularity is only a hint, so it's really more a
> maximum suggested value than a granularity.  Outside of a cluster
> boundary the format would still have to write zeros manually.

You're talking about SCSI here, I guess? This would be one case where being
able to define sane semantics for virtio-blk would have been an
advantage... I had hoped that SCSI was already sane, but if it doesn't
distinguish between "I don't care about this any more" and "I want to
have zeros here", then I'm afraid I can't call it sane any more.

We can make the conditions even stricter, i.e. allow it only if protocol
can pass through discards for unaligned requests. This wouldn't free
clusters on an image format level, but at least on a file system level.

> Also, Linux for example will only round the number of sectors down to
> the granularity, not the start sector.  Rereading the code, for SCSI we
> want to advertise a zero granularity (aka do whatever you want),
> otherwise we may get only misaligned discard requests and end up writing
> zeroes inefficiently all the time.

Does this make sense with real hardware or is it a Linux bug?

> The problem is that advertising discard_zeroes_data based on the backend
> calls for trouble as soon as you migrate between storage formats,
> filesystems or disks.

True. You would have to emulate if you migrate from a source that can
discard to zeros efficiently to a destination that can't.

In the end, I guess we'll just have to accept that we can't fix bad
semantics of ATA and SCSI, and just need to decide whether "I don't
care" or "I want to have zeros" is more common. My feeling is that "I
don't care" is the more useful operation because it can't be expressed
otherwise, but I haven't checked what guests really do.

> (BTW, if the backing file allows discard and zeroes data, efficient
> write-zeroes could be done in qcow2 by allocating a cluster and
> discarding its contents.  It's similar to how you do preallocated metadata).

Yes.

Kevin


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-14 12:37                   ` Kevin Wolf
@ 2012-03-14 12:49                     ` Paolo Bonzini
  2012-03-14 13:04                       ` Kevin Wolf
  2012-03-24 15:33                       ` Christoph Hellwig
  0 siblings, 2 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-14 12:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Richard Laager, qemu-devel

Il 14/03/2012 13:37, Kevin Wolf ha scritto:
> Am 14.03.2012 13:14, schrieb Paolo Bonzini:
>>> Paolo mentioned a use case as a fast way for guests to write zeros, but
>>> is it really faster than a normal write when we have to emulate it by a
>>> bdrv_write with a temporary buffer of zeros? 
>>
>> No, of course not.
>>
>>> On the other hand we have
>>> the cases where discard really means "I don't care about the data any
>>> more" and emulating it by writing zeros is just a waste of resources there.
>>>
>>> So I think we only want to advertise that discard zeroes data if we can
>>> do it efficiently. This means that the format does support it, and that
>>> the device is able to communicate the discard granularity (= cluster
>>> size) to the guest OS.
>>
>> Note that the discard granularity is only a hint, so it's really more a
>> maximum suggested value than a granularity.  Outside of a cluster
>> boundary the format would still have to write zeros manually.
> 
> You're talking about SCSI here, I guess? This would be one case where being
> able to define sane semantics for virtio-blk would have been an
> advantage... I had hoped that SCSI was already sane, but if it doesn't
> distinguish between "I don't care about this any more" and "I want to
> have zeros here", then I'm afraid I can't call it sane any more.

It does make the distinction.  "I don't care" is UNMAP (or WRITE
SAME(16) with the UNMAP bit set); "I want to have zeroes" is WRITE
SAME(10) or WRITE SAME(16) with an all-zero payload.

> We can make the conditions even stricter, i.e. allow it only if protocol
> can pass through discards for unaligned requests. This wouldn't free
> clusters on an image format level, but at least on a file system level.
> 
>> Also, Linux for example will only round the number of sectors down to
>> the granularity, not the start sector.  Rereading the code, for SCSI we
>> want to advertise a zero granularity (aka do whatever you want),
>> otherwise we may get only misaligned discard requests and end up writing
>> zeroes inefficiently all the time.
> 
> Does this make sense with real hardware or is it a Linux bug?

It's a bug; SCSI defines the "optimal unmap request starting LBA" to be
"(n × optimal unmap granularity) + unmap granularity alignment".

>> The problem is that advertising discard_zeroes_data based on the backend
>> calls for trouble as soon as you migrate between storage formats,
>> filesystems or disks.
> 
> True. You would have to emulate if you migrate from a source that can
> discard to zeros efficiently to a destination that can't.
> 
> In the end, I guess we'll just have to accept that we can't fix bad
> semantics of ATA and SCSI, and just need to decide whether "I don't
> care" or "I want to have zeros" is more common. My feeling is that "I
> don't care" is the more useful operation because it can't be expressed
> otherwise, but I haven't checked what guests really do.

Yeah, guests right now only use it for unused filesystem pieces, so the
"do not care" semantics are fine.  I also hoped to use discard to avoid
blowing up thin-provisioned images when streaming.  Perhaps we can use
bdrv_has_zero_init instead, and/or pass down the copy-on-read flag to
the block driver.

Anyhow, there are some patches from this series that are relatively
independent and ready for inclusion, I'll extract them and post them
separately.

Paolo


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-14 12:49                     ` Paolo Bonzini
@ 2012-03-14 13:04                       ` Kevin Wolf
  2012-03-24 15:33                       ` Christoph Hellwig
  1 sibling, 0 replies; 68+ messages in thread
From: Kevin Wolf @ 2012-03-14 13:04 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Richard Laager, qemu-devel

Am 14.03.2012 13:49, schrieb Paolo Bonzini:
> Il 14/03/2012 13:37, Kevin Wolf ha scritto:
>> Am 14.03.2012 13:14, schrieb Paolo Bonzini:
>>>> Paolo mentioned a use case as a fast way for guests to write zeros, but
>>>> is it really faster than a normal write when we have to emulate it by a
>>>> bdrv_write with a temporary buffer of zeros? 
>>>
>>> No, of course not.
>>>
>>>> On the other hand we have
>>>> the cases where discard really means "I don't care about the data any
>>>> more" and emulating it by writing zeros is just a waste of resources there.
>>>>
>>>> So I think we only want to advertise that discard zeroes data if we can
>>>> do it efficiently. This means that the format does support it, and that
>>>> the device is able to communicate the discard granularity (= cluster
>>>> size) to the guest OS.
>>>
>>> Note that the discard granularity is only a hint, so it's really more a
>>> maximum suggested value than a granularity.  Outside of a cluster
>>> boundary the format would still have to write zeros manually.
>>
>> You're talking about SCSI here, I guess? This would be one case where being
>> able to define sane semantics for virtio-blk would have been an
>> advantage... I had hoped that SCSI was already sane, but if it doesn't
>> distinguish between "I don't care about this any more" and "I want to
>> have zeros here", then I'm afraid I can't call it sane any more.
> 
> It does make the distinction.  "I don't care" is UNMAP (or WRITE
> SAME(16) with the UNMAP bit set); "I want to have zeroes" is WRITE
> SAME(10) or WRITE SAME(16) with an all-zero payload.

Good. So why are we trying to make both the same in qemu? Keeping two
different callbacks in the block drivers for two different operations
seems to make much more sense to me.

>> We can make the conditions even stricter, i.e. allow it only if protocol
>> can pass through discards for unaligned requests. This wouldn't free
>> clusters on an image format level, but at least on a file system level.
>>
>>> Also, Linux for example will only round the number of sectors down to
>>> the granularity, not the start sector.  Rereading the code, for SCSI we
>>> want to advertise a zero granularity (aka do whatever you want),
>>> otherwise we may get only misaligned discard requests and end up writing
>>> zeroes inefficiently all the time.
>>
>> Does this make sense with real hardware or is it a Linux bug?
> 
> It's a bug; SCSI defines the "optimal unmap request starting LBA" to be
> "(n × optimal unmap granularity) + unmap granularity alignment".

Okay. Needs to get fixed in Linux then, not in qemu. Means that qemu
might ignore some discard requests by broken Linux versions, but
ignoring is a valid action for "don't care".

>>> The problem is that advertising discard_zeroes_data based on the backend
>>> calls for trouble as soon as you migrate between storage formats,
>>> filesystems or disks.
>>
>> True. You would have to emulate if you migrate from a source that can
>> discard to zeros efficiently to a destination that can't.
>>
>> In the end, I guess we'll just have to accept that we can't fix bad
>> semantics of ATA and SCSI, and just need to decide whether "I don't
>> care" or "I want to have zeros" is more common. My feeling is that "I
>> don't care" is the more useful operation because it can't be expressed
>> otherwise, but I haven't checked what guests really do.
> 
> Yeah, guests right now only use it for unused filesystem pieces, so the
> "do not care" semantics are fine.  I also hoped to use discard to avoid
> blowing up thin-provisioned images when streaming.  Perhaps we can use
> bdrv_has_zero_init instead, and/or pass down the copy-on-read flag to
> the block driver.

You mean for image formats that don't support explicit zero clusters? It
only works without backing files, and I'm not sure how common this case is.

> Anyhow, there are some patches from this series that are relatively
> independent and ready for inclusion, I'll extract them and post them
> separately.

Okay.

Kevin


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-14  7:41             ` Paolo Bonzini
  2012-03-14 12:01               ` Kevin Wolf
@ 2012-03-15  0:42               ` Richard Laager
  2012-03-15  9:36                 ` Paolo Bonzini
  1 sibling, 1 reply; 68+ messages in thread
From: Richard Laager @ 2012-03-15  0:42 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel


On Wed, 2012-03-14 at 08:41 +0100, Paolo Bonzini wrote:
> Il 13/03/2012 20:13, Richard Laager ha scritto:
> >>> > >       * For SCSI, report an unmap_granularity to the guest as follows:
> >>> > >       max(logical_block_size, discard_granularity) / logical_block_size
> >> > 
> >> > This is more or less already in place later in the series.
> > I didn't see it. Which patch number?
> 
> Patch 11:

I was saying QEMU should pass the discard_granularity to the guest as
OPTIMAL UNMAP GRANULARITY. This would almost surely need to be done in
hw/scsi-disk.c, roughly around this change from your patch 10:

--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -1295,8 +1295,11 @@ static int scsi_disk_emulate_command(SCSIDiskReq *r)
             outbuf[13] = get_physical_block_exp(&s->qdev.conf);
 
             /* set TPE bit if the format supports discard */
-            if (s->qdev.conf.discard_granularity) {
+            if (s->qdev.type == TYPE_DISK && s->qdev.conf.discard_granularity) {
                 outbuf[14] = 0x80;
+                if (s->qdev.conf.discard_zeroes_data) {
+                    outbuf[14] |= 0x40;
+                }
             }
 
             /* Protection, exponent and lowest lba field left blank. */

The code from patch 11 is more along the lines of what I think QEMU
should have in the block layer:

> +    } else if (discard_granularity < s->qdev.conf.logical_block_size) {
> +        error_report("scsi-block: invalid discard_granularity");
> +        return -1;
>
> +    } else if (discard_granularity & (discard_granularity - 1)) {
> +        error_report("scsi-block: discard_granularity not a power of two");
> +        return -1;
> +    }

However, I think the first check is unnecessarily restrictive. As long
as discard_granularity is a power of two, if it's less than the block
size (which is also a power of two), the block size will always be a
multiple of discard_granularity, so there's no problem.

I'd also like to explicitly allow discard_granularity = 1, which is what
fallocate() provides.

> It is worse in that we do not want the hardware parameters exposed to the
> guest to change behind the scenes, except if you change the machine type
> or if you use the default unversioned type.

You're saying that discard_granularity and discard_zeros_data need to be
properties of the machine type? If you start with that as a requirement,
I can see why you want to always report discard_granularity=512 &
discard_zeros_data=1. But that design has many downsides. I'm not
convinced that discard_granularity and discard_zeros_data need to be
properties of the machine type. Why do you feel that's necessary? What's
the harm in those properties changing across QEMU startups (i.e. guest
boots)?

-- 
Richard



* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-15  0:42               ` Richard Laager
@ 2012-03-15  9:36                 ` Paolo Bonzini
  2012-03-16  0:47                   ` Richard Laager
  0 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-15  9:36 UTC (permalink / raw)
  To: Richard Laager; +Cc: qemu-devel

Il 15/03/2012 01:42, Richard Laager ha scritto:
>> > It is worse in that we do not want the hardware parameters exposed to the
>> > guest to change behind the scenes, except if you change the machine type
>> > or if you use the default unversioned type.
> You're saying that discard_granularity and discard_zeros_data need to be
> properties of the machine type? If you start with that as a requirement,
> I can see why you want to always report discard_granularity=512 &
> discard_zeros_data=1. But that design has many downsides. I'm not
> convinced that discard_granularity and discard_zeros_data need to be
> properties of the machine type. Why do you feel that's necessary? What's
> the harm in those properties changing across QEMU startups (i.e. guest
> boots)?

Changing across guest boots is a minor problem, but changing across
migration must be avoided at all costs.

BTW, after this discussion I think we can instead report
discard_granularity = 512 and discard_zeroes_data=0 and get most of the
benefit, at least on file-backed storage.

Paolo


* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-15  9:36                 ` Paolo Bonzini
@ 2012-03-16  0:47                   ` Richard Laager
  2012-03-16  9:34                     ` Paolo Bonzini
  0 siblings, 1 reply; 68+ messages in thread
From: Richard Laager @ 2012-03-16  0:47 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel


On Thu, 2012-03-15 at 10:36 +0100, Paolo Bonzini wrote:
> Changing across guest boots is a minor problem, but changing across
> migration must be avoided at all costs.
> 
> BTW, after this discussion I think we can instead report
> discard_granularity = 512 and discard_zeroes_data=0 and get most of the
> benefit, at least on file-backed storage.

Are you going to report that to guests all the time, or only when the
host supports discard? If you don't report it all the time, you could
still be "changing across migration". If you do report it all the time,
then you're incurring a performance penalty on systems that don't
support discard, as the guest will be sending discard requests that QEMU
has to throw away (by which point some work has already been wasted).

And either way, what you're proposing means that users with
discard_zeros_data = 1 hosts can't get the (albeit small) benefits of
that because some other QEMU user might want to do a migration across
heterogeneous storage.

The more I think about this, the more it seems like we just need to make
this configurable. Then people that, because of migration concerns, want
to advertise lowest-common-denominator storage to their guests can do so
(just like they already can and likely do advertise a
lowest-common-denominator CPU). Everyone else can get whatever their
host supports (just like they do with CPUs).

Adding configuration options also means that developers can use QEMU to
test guest discard support in various ways.

Finally, I see your proposal of advertising fixed discard support to
guests (regardless of which set of values is chosen) as out of line with
the existing QEMU precedent with CPUs as well as the design decision
used for discard in the Linux kernel (which passes the values all the
way up the stack). While this doesn't inherently mean it's a bad design,
I think QEMU should tread carefully.

-- 
Richard



* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-16  0:47                   ` Richard Laager
@ 2012-03-16  9:34                     ` Paolo Bonzini
  0 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2012-03-16  9:34 UTC (permalink / raw)
  To: Richard Laager; +Cc: qemu-devel

Il 16/03/2012 01:47, Richard Laager ha scritto:
> On Thu, 2012-03-15 at 10:36 +0100, Paolo Bonzini wrote:
>> Changing across guest boots is a minor problem, but changing across
>> migration must be avoided at all costs.
>>
>> BTW, after this discussion I think we can instead report
>> discard_granularity = 512 and discard_zeroes_data=0 and get most of the
>> benefit, at least on file-backed storage.
> 
> Are you going to report that to guests all the time, or only when the
> host supports discard? If you don't report it all the time, you could
> still be "changing across migration". If you do report it all the time,
> then you're incurring a performance penalty on systems that don't
> support discard, as the guest will be sending discard requests that QEMU
> has to throw away (by which time some work has already been wasted).

I don't think that should be that bad.  Discard requests should be
relatively rare.

> And either way, what you're proposing means that users with
> discard_zeroes_data = 1 hosts can't get the (albeit small) benefits of
> that because some other QEMU user might want to do a migration across
> heterogeneous storage.

Yes, discard_zeroes_data can be made configurable on top, and either
rejected or emulated if the storage does not support it.

> Finally, I see your proposal of advertising fixed discard support

It does not really have to be fixed, it's just a default.

Paolo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-10 18:02       ` Richard Laager
  2012-03-12 12:27         ` Paolo Bonzini
@ 2012-03-24 15:27         ` Christoph Hellwig
  2012-03-26 19:40           ` Richard Laager
  1 sibling, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2012-03-24 15:27 UTC (permalink / raw)
  To: Richard Laager; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel

On Sat, Mar 10, 2012 at 12:02:40PM -0600, Richard Laager wrote:
> If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> going to work. This would side step these problems. You said it wasn't
> possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
> by extending the file by one byte and then punching that:
>         char buf = 0;
>         fstat(s->fd, &st);
>         pwrite(s->fd, &buf, 1, st.st_size + 1);
>         has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,

There is no point in using FALLOC_FL_KEEP_SIZE together with
FALLOC_FL_PUNCH_HOLE.
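
For reference, Richard's probe idea can be fleshed out into a
self-contained sketch (a hypothetical helper, not code from the series;
unlike the quoted fragment it writes at EOF rather than one byte past it,
so no gap is left, and it truncates the file back afterwards):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Probe whether FALLOC_FL_PUNCH_HOLE works on this fd: append one byte
 * at EOF, try to punch it out, then restore the original length.
 * Returns 1 if hole punching works, 0 otherwise. */
static int probe_punch_hole(int fd)
{
    struct stat st;
    char buf = 0;
    int ok;

    if (fstat(fd, &st) < 0) {
        return 0;
    }
    if (pwrite(fd, &buf, 1, st.st_size) != 1) {
        return 0;
    }
    ok = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                   st.st_size, 1) == 0;
    if (ftruncate(fd, st.st_size) < 0) {
        ok = 0;                 /* could not restore; treat as unusable */
    }
    return ok;
}
```

Either outcome leaves the file at its original length, so the probe
would be safe to run once at open time.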

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-14 12:01               ` Kevin Wolf
  2012-03-14 12:14                 ` Paolo Bonzini
@ 2012-03-24 15:29                 ` Christoph Hellwig
  2012-03-26  9:44                   ` Daniel P. Berrange
  1 sibling, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2012-03-24 15:29 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, Richard Laager, qemu-devel

On Wed, Mar 14, 2012 at 01:01:35PM +0100, Kevin Wolf wrote:
> Paolo mentioned a use case as a fast way for guests to write zeros, but
> is it really faster than a normal write when we have to emulate it by a
> bdrv_write with a temporary buffer of zeros? On the other hand we have
> the cases where discard really means "I don't care about the data any
> more" and emulating it by writing zeros is just a waste of resources there.

On raw a real discard also will destroy the preallocation, which might
absolutely kill your performance on filesystems.

XFS provides an XFS_IOC_ZERO_RANGE ioctl (same calling convention as
XFS_IOC_RESVSP), which would provide a good backend for WRITE SAME emulation
in this context.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-14 12:14                 ` Paolo Bonzini
  2012-03-14 12:37                   ` Kevin Wolf
@ 2012-03-24 15:30                   ` Christoph Hellwig
  2012-03-26 19:40                     ` Richard Laager
  1 sibling, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2012-03-24 15:30 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Richard Laager, qemu-devel

On Wed, Mar 14, 2012 at 01:14:18PM +0100, Paolo Bonzini wrote:
> 
> Note that the discard granularity is only a hint, so it's really more a
> maximum suggested value than a granularity.  Outside of a cluster
> boundary the format would still have to write zeros manually.
> 
> Also, Linux for example will only round the number of sectors down to
> the granularity, not the start sector.

Which really is more of a bug than a feature.  The current discard
implementation in Linux is still very poor.
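
Whatever one thinks of the kernel behaviour, the arithmetic under
discussion is easy to pin down. An illustrative sketch (not code from
the series): Linux, as Paolo notes, only rounds the sector count down to
the granularity and leaves the start sector alone; a backend insisting
on full alignment would instead clip both edges, keeping only the fully
covered aligned region:

```c
#include <stdint.h>

/* Clip a discard request to the advertised granularity, keeping only
 * the fully covered aligned region and leaving the ragged edges alone. */
static void clip_discard(uint64_t start, uint64_t count, uint64_t gran,
                         uint64_t *out_start, uint64_t *out_count)
{
    uint64_t head = (gran - start % gran) % gran;   /* sectors to skip */

    if (count <= head) {
        *out_start = start;
        *out_count = 0;             /* no aligned region is covered */
        return;
    }
    *out_start = start + head;
    *out_count = (count - head) / gran * gran;
}
```

The ragged head and tail left over are exactly the pieces a format
driver would have to zero manually when discard_zeroes_data is promised.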

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-14 12:49                     ` Paolo Bonzini
  2012-03-14 13:04                       ` Kevin Wolf
@ 2012-03-24 15:33                       ` Christoph Hellwig
  1 sibling, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2012-03-24 15:33 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Richard Laager, qemu-devel

On Wed, Mar 14, 2012 at 01:49:48PM +0100, Paolo Bonzini wrote:
> It does make the distinction.  "I don't care" is UNMAP (or WRITE
> SAME(16) with the UNMAP bit set); "I want to have zeroes" is WRITE
> SAME(10) or WRITE SAME(16) with an all-zero payload.

But once the target sets the unmap zeroes data bit, even UNMAP / WRITE SAME
with unmap has to zero all blocks touched.

Also strictly speaking even the WRITE SAME with unmap requires a payload
of zeroes, even if most targets don't bother to check for this.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 13/17] block: fallback from discard to writes
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 13/17] block: fallback from discard to writes Paolo Bonzini
@ 2012-03-24 15:35   ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2012-03-24 15:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On Thu, Mar 08, 2012 at 06:15:13PM +0100, Paolo Bonzini wrote:
> Allow discard to fail, and fall back to the write operation.  This
> is needed because there's no simple way to probe for availability
> of FALLOC_FL_PUNCH_HOLE.

So you switch on advertising TRIM support in the patch before, and then
fall back to a very suboptimal implementation here?

This will make Windows 7 users very unhappy.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
  2012-03-09 20:36   ` Richard Laager
  2012-03-12  9:34     ` Paolo Bonzini
@ 2012-03-24 15:40     ` Christoph Hellwig
  1 sibling, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2012-03-24 15:40 UTC (permalink / raw)
  To: Richard Laager; +Cc: Paolo Bonzini, qemu-devel

On Fri, Mar 09, 2012 at 02:36:50PM -0600, Richard Laager wrote:
> I'm not sure if fallocate() and/or BLKDISCARD always guarantee that the
> discard has made it to stable storage. If they don't, does O_DIRECT or
> O_DSYNC on open() cause them to make any such guarantee? If not, should
> you be calling fdatasync() or fsync() when the user has specified
> cache=direct or cache=writethrough?
> 
> Note that the Illumos implementation (see below) has a flag to ask for
> either behavior.

For fallocate the current Linux behaviour is that you need an fdatasync
or the O_DSYNC flag to guarantee that it makes its way to disk.  For
XFS the historic implementation of both XFS_IOC_FREESP and hole punching
was that it always made it to disk, and for all other filesystems the
historic behaviour was that O_DSYNC was ignored, and depending on the
fs an fdatasync might or might not be able to give you a guarantee either.

For BLKDISCARD you'd strictly speaking need to issue a cache flush via
fdatasync, too, but I'm not sure if any actual devices require that.
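
In code, the rule Christoph states for fallocate comes down to pairing
the punch with an explicit flush. A sketch (hypothetical helper,
assuming Linux and a file descriptor not opened with O_DSYNC):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <unistd.h>

/* Punch a hole, then force the deallocation to stable storage with
 * fdatasync(), as required on Linux when O_DSYNC was not used. */
static int punch_hole_durable(int fd, off_t start, off_t len)
{
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  start, len) < 0) {
        return -1;              /* fs may not support hole punching */
    }
    return fdatasync(fd);       /* make the deallocation durable */
}
```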

> On Fri, 2012-03-09 at 00:35 -0800, Chris Wedgwood wrote:
> > Simplest still compare the blocks allocated by the file
> > to it's length (ie. stat.st_blocks != stat.st_size>>9).
> 
> I thought of this as well. It covers "99%" of the cases, but there's one
> case where it breaks down. Imagine I have a sparse file backing my
> virtual disk. In the guest, I fill the virtual disk 100%. Then, I
> restart QEMU. Now it thinks that sparse file is non-sparse and stops
> issuing hole punches.

Note that due to speculative preallocation in the filesystem stat.st_blocks
can easily be bigger than stat.st_size>>9, so the above comparison should
at least be a < instead of !=.
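
The amended heuristic, spelled out as a sketch (an illustrative helper,
not code from the series):

```c
#include <sys/stat.h>

/* A file "looks sparse" only when fewer 512-byte blocks are allocated
 * than its length implies.  Speculative preallocation can push
 * st_blocks above st_size >> 9, which is why the test is '<' and
 * not '!='. */
static int file_looks_sparse(const struct stat *st)
{
    return (unsigned long long)st->st_blocks <
           ((unsigned long long)st->st_size >> 9);
}
```

It still has the failure mode Richard describes in the quoted text: a
sparse file that was once filled completely stops looking sparse.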

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 17/17] raw: implement is_allocated
  2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 17/17] raw: implement is_allocated Paolo Bonzini
@ 2012-03-24 15:42   ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2012-03-24 15:42 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On Thu, Mar 08, 2012 at 06:15:17PM +0100, Paolo Bonzini wrote:
> SEEK_DATA and SEEK_HOLE can be used to implement the is_allocated
> callback for raw files.  These currently work on btrfs, with an XFS
> implementation also coming soon.

Btw - if you're interested in a bit more kernel hacking it would be cool
to implement it for block devices using GET LBA STATUS.
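
The SEEK_HOLE/SEEK_DATA interface the patch relies on can be sketched as
follows (a hypothetical helper, not the actual raw driver code; it
assumes a kernel and filesystem that implement the two whence values):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* Return 1 if 'offset' sits in a data extent, 0 if it is in a hole,
 * and -1 if the filesystem does not support SEEK_DATA. */
static int byte_is_allocated(int fd, off_t offset)
{
    off_t data = lseek(fd, offset, SEEK_DATA);

    if (data < 0) {
        return errno == ENXIO ? 0 : -1;  /* ENXIO: past the last data */
    }
    return data == offset;  /* next data starts here => allocated */
}
```

Filesystems without hole tracking simply report the whole file as one
data extent, so the helper degrades gracefully to "always allocated".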

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-24 15:29                 ` Christoph Hellwig
@ 2012-03-26  9:44                   ` Daniel P. Berrange
  2012-03-26  9:56                     ` Christoph Hellwig
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel P. Berrange @ 2012-03-26  9:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Kevin Wolf, Paolo Bonzini, Richard Laager, qemu-devel

On Sat, Mar 24, 2012 at 04:29:26PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 14, 2012 at 01:01:35PM +0100, Kevin Wolf wrote:
> > Paolo mentioned a use case as a fast way for guests to write zeros, but
> > is it really faster than a normal write when we have to emulate it by a
> > bdrv_write with a temporary buffer of zeros? On the other hand we have
> > the cases where discard really means "I don't care about the data any
> > more" and emulating it by writing zeros is just a waste of resources there.
> 
> On raw a real discard also will destroy the preallocation, which might
> absolutely kill your performance on filesystems.

This suggests that there be a new command line param to '-drive' to turn
discard support on/off, since QEMU can't reliably know if the raw file
it is given is intended to be fully pre-allocated by the mgmt app.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-26  9:44                   ` Daniel P. Berrange
@ 2012-03-26  9:56                     ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2012-03-26  9:56 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Kevin Wolf, Paolo Bonzini, Richard Laager, Christoph Hellwig, qemu-devel

On Mon, Mar 26, 2012 at 10:44:07AM +0100, Daniel P. Berrange wrote:
> This suggests that there be a new command line param to '-drive' to turn
> discard support on/off, since QEMU can't reliably know if the raw file
> it is given is intended to be fully pre-allocated by the mgmt app.

Yes.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-24 15:27         ` Christoph Hellwig
@ 2012-03-26 19:40           ` Richard Laager
  2012-03-27  9:08             ` Christoph Hellwig
  0 siblings, 1 reply; 68+ messages in thread
From: Richard Laager @ 2012-03-26 19:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 443 bytes --]

On Sat, 2012-03-24 at 16:27 +0100, Christoph Hellwig wrote:
> >         has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> 
> There is no point in using FALLOC_FL_KEEP_SIZE together with
> FALLOC_FL_PUNCH_HOLE.

It's *required*. From the man page [0], "The FALLOC_FL_PUNCH_HOLE flag
must be ORed with FALLOC_FL_KEEP_SIZE in mode;"

[0] http://man7.org/linux/man-pages/man2/fallocate.2.html

-- 
Richard

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-24 15:30                   ` Christoph Hellwig
@ 2012-03-26 19:40                     ` Richard Laager
  2012-03-27 10:20                       ` Kevin Wolf
  0 siblings, 1 reply; 68+ messages in thread
From: Richard Laager @ 2012-03-26 19:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 828 bytes --]

On Sat, 2012-03-24 at 16:30 +0100, Christoph Hellwig wrote:
> On Wed, Mar 14, 2012 at 01:14:18PM +0100, Paolo Bonzini wrote:
> > 
> > Note that the discard granularity is only a hint, so it's really more a
> > maximum suggested value than a granularity.  Outside of a cluster
> > boundary the format would still have to write zeros manually.
> > 
> > Also, Linux for example will only round the number of sectors down to
> > the granularity, not the start sector.
> 
> Which really is more of a bug than a feature.  The current discard
> implementation in Linux is still very poor.

The disk protocols do not require the granularity to be respected. It is
*only a hint*. Therefore, QEMU has to handle non-aligned discards. It
doesn't really matter what we wish was the case; this is the reality.

-- 
Richard

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-26 19:40           ` Richard Laager
@ 2012-03-27  9:08             ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2012-03-27  9:08 UTC (permalink / raw)
  To: Richard Laager; +Cc: Christoph Hellwig, qemu-devel

On Mon, Mar 26, 2012 at 02:40:47PM -0500, Richard Laager wrote:
> On Sat, 2012-03-24 at 16:27 +0100, Christoph Hellwig wrote:
> > >         has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > 
> > There is no point in using FALLOC_FL_KEEP_SIZE together with
> > FALLOC_FL_PUNCH_HOLE.
> 
> It's *required*. From the man page [0], "The FALLOC_FL_PUNCH_HOLE flag
> must be ORed with FALLOC_FL_KEEP_SIZE in mode;"
> 
> [0] http://man7.org/linux/man-pages/man2/fallocate.2.html

You're right - we check that it is set in common code before entering the
filesystem.

It's utterly pointless, but too late to change now - I should have reviewed
the patch that added it more carefully.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations
  2012-03-26 19:40                     ` Richard Laager
@ 2012-03-27 10:20                       ` Kevin Wolf
  0 siblings, 0 replies; 68+ messages in thread
From: Kevin Wolf @ 2012-03-27 10:20 UTC (permalink / raw)
  To: Richard Laager; +Cc: Christoph Hellwig, qemu-devel

Am 26.03.2012 21:40, schrieb Richard Laager:
> On Sat, 2012-03-24 at 16:30 +0100, Christoph Hellwig wrote:
>> On Wed, Mar 14, 2012 at 01:14:18PM +0100, Paolo Bonzini wrote:
>>>
>>> Note that the discard granularity is only a hint, so it's really more a
>>> maximum suggested value than a granularity.  Outside of a cluster
>>> boundary the format would still have to write zeros manually.
>>>
>>> Also, Linux for example will only round the number of sectors down to
>>> the granularity, not the start sector.
>>
>> Which really is more of a bug than a feature.  The current discard
>> implementation in Linux is still very poor.
> 
> The disk protocols do not require the granularity to be respected. It is
> *only a hint*. Therefore, QEMU has to handle non-aligned discards. It
> doesn't really matter what we wish was the case; this is the reality.

But we can do suboptimal things (like just ignoring the request if we
didn't promise to zero data, or turning it into a non-discarding zero
buffer write otherwise) if the OS knowingly uses misaligned discards.
Not our bug then.
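
Kevin's policy can be written down as a small decision function (a
sketch only; the names and the enum are hypothetical, not QEMU's actual
block-layer interface):

```c
#include <stdbool.h>
#include <stdint.h>

enum discard_action { DISCARD_PASS, DISCARD_IGNORE, DISCARD_WRITE_ZEROES };

/* Decide what to do with a possibly misaligned discard request. */
static enum discard_action classify_discard(uint64_t sector, uint64_t count,
                                            uint64_t granularity,
                                            bool zeroes_promised)
{
    if (granularity == 0 ||
        (sector % granularity == 0 && count % granularity == 0)) {
        return DISCARD_PASS;        /* aligned: real discard path */
    }
    if (!zeroes_promised) {
        return DISCARD_IGNORE;      /* no zeroing promise: legal to drop */
    }
    return DISCARD_WRITE_ZEROES;    /* emulate with a zero-buffer write */
}
```

The DISCARD_WRITE_ZEROES branch is what keeps a discard_zeroes_data
promise honest for misaligned requests.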

Kevin

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2012-03-27 10:17 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 01/17] qemu-iotests: add a simple test for write_zeroes Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 02/17] qed: make write-zeroes bounce buffer smaller than a single cluster Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 03/17] block: add discard properties to BlockDriverInfo Paolo Bonzini
2012-03-09 16:47   ` Kevin Wolf
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 04/17] qed: implement bdrv_aio_discard Paolo Bonzini
2012-03-09 16:31   ` Kevin Wolf
2012-03-09 17:53     ` Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 05/17] block: pass around qiov for write_zeroes operation Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations Paolo Bonzini
2012-03-09 16:37   ` Kevin Wolf
2012-03-09 18:06     ` Paolo Bonzini
2012-03-10 18:02       ` Richard Laager
2012-03-12 12:27         ` Paolo Bonzini
2012-03-12 13:04           ` Kevin Wolf
2012-03-13 19:13           ` Richard Laager
2012-03-14  7:41             ` Paolo Bonzini
2012-03-14 12:01               ` Kevin Wolf
2012-03-14 12:14                 ` Paolo Bonzini
2012-03-14 12:37                   ` Kevin Wolf
2012-03-14 12:49                     ` Paolo Bonzini
2012-03-14 13:04                       ` Kevin Wolf
2012-03-24 15:33                       ` Christoph Hellwig
2012-03-24 15:30                   ` Christoph Hellwig
2012-03-26 19:40                     ` Richard Laager
2012-03-27 10:20                       ` Kevin Wolf
2012-03-24 15:29                 ` Christoph Hellwig
2012-03-26  9:44                   ` Daniel P. Berrange
2012-03-26  9:56                     ` Christoph Hellwig
2012-03-15  0:42               ` Richard Laager
2012-03-15  9:36                 ` Paolo Bonzini
2012-03-16  0:47                   ` Richard Laager
2012-03-16  9:34                     ` Paolo Bonzini
2012-03-24 15:27         ` Christoph Hellwig
2012-03-26 19:40           ` Richard Laager
2012-03-27  9:08             ` Christoph Hellwig
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero Paolo Bonzini
2012-03-08 17:55   ` Avi Kivity
2012-03-09 16:42     ` Kevin Wolf
2012-03-12 10:42       ` Avi Kivity
2012-03-12 11:04         ` Kevin Wolf
2012-03-12 12:03           ` Avi Kivity
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 08/17] block: kill the write zeroes operation Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 09/17] ide: issue discard asynchronously but serialize the pieces Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 10/17] ide/scsi: add discard_zeroes_data property Paolo Bonzini
2012-03-08 18:13   ` Avi Kivity
2012-03-08 18:14     ` Avi Kivity
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 11/17] ide/scsi: prepare for flipping the discard defaults Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 12/17] ide/scsi: turn on discard Paolo Bonzini
2012-03-08 18:17   ` Avi Kivity
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 13/17] block: fallback from discard to writes Paolo Bonzini
2012-03-24 15:35   ` Christoph Hellwig
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming Paolo Bonzini
2012-03-09  8:20   ` Chris Wedgwood
2012-03-09  8:31     ` Paolo Bonzini
2012-03-09  8:35       ` Chris Wedgwood
2012-03-09  8:40         ` Paolo Bonzini
2012-03-09 10:31   ` Stefan Hajnoczi
2012-03-09 10:43     ` Paolo Bonzini
2012-03-09 10:53       ` Stefan Hajnoczi
2012-03-09 10:57         ` Paolo Bonzini
2012-03-09 20:36   ` Richard Laager
2012-03-12  9:34     ` Paolo Bonzini
2012-03-24 15:40     ` Christoph Hellwig
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 15/17] raw: add get_info Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 16/17] qemu-io: fix the alloc command Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 17/17] raw: implement is_allocated Paolo Bonzini
2012-03-24 15:42   ` Christoph Hellwig
