All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack
@ 2013-11-12 15:49 Paolo Bonzini
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 01/11] block: generalize BlockLimits handling to cover bdrv_aio_discard too Paolo Bonzini
                   ` (10 more replies)
  0 siblings, 11 replies; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

This series fixes the implementation of WRITE SAME in the SCSI emulation
layer.  On top of this, it ensures that zero writes from the guest can
be offloaded to the host device or filesystem if that's supported.
This is a relatively important part of the thin provisioning feature,
and builds on the unmap flag to write_zeroes, recently added by Peter
Lieven.  The feature works with iscsi, block device and filesystem
targets.

The block layer patches are 1-3 (adding prerequisites for using the
feature in scsi-disk) and 7-11 (which support the feature on non-iscsi
targets).

Paolo Bonzini (11):
  block: generalize BlockLimits handling to cover bdrv_aio_discard too
  block: add flags to BlockRequest
  block: add bdrv_aio_write_zeroes
  scsi-disk: catch write protection errors in UNMAP
  scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands
  scsi-disk: correctly implement WRITE SAME
  block: handle ENOTSUP from discard in generic code
  raw-posix: implement write_zeroes with MAY_UNMAP for files
  raw-posix: implement write_zeroes with MAY_UNMAP for block devices
  raw-posix: add support for write_zeroes on XFS and block devices
  qemu-iotests: 033 is fast

 block.c                  | 108 +++++++++++++++++------------
 block/raw-aio.h          |   3 +-
 block/raw-posix.c        | 173 +++++++++++++++++++++++++++++++++++++++++++++--
 hw/scsi/scsi-disk.c      | 154 ++++++++++++++++++++++++++++++++++-------
 include/block/block.h    |   4 ++
 tests/qemu-iotests/group |   2 +-
 trace-events             |   2 +
 7 files changed, 369 insertions(+), 77 deletions(-)

-- 
1.8.4.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 01/11] block: generalize BlockLimits handling to cover bdrv_aio_discard too
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  2013-11-13  6:07   ` Peter Lieven
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 02/11] block: add flags to BlockRequest Paolo Bonzini
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

bdrv_co_discard is only covering drivers which have a .bdrv_co_discard()
implementation, but not those with .bdrv_aio_discard(). Not very nice,
and easy to avoid.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c | 80 +++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 41 insertions(+), 39 deletions(-)

diff --git a/block.c b/block.c
index aba6a19..ba6872e 100644
--- a/block.c
+++ b/block.c
@@ -4281,6 +4281,8 @@ static void coroutine_fn bdrv_discard_co_entry(void *opaque)
 int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
                                  int nb_sectors)
 {
+    int max_discard;
+
     if (!bs->drv) {
         return -ENOMEDIUM;
     } else if (bdrv_check_request(bs, sector_num, nb_sectors)) {
@@ -4298,55 +4300,55 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
         return 0;
     }
 
-    if (bs->drv->bdrv_co_discard) {
-        int max_discard = bs->bl.max_discard ?
-                          bs->bl.max_discard : MAX_DISCARD_DEFAULT;
+    if (!bs->drv->bdrv_co_discard && !bs->drv->bdrv_aio_discard) {
+        return 0;
+    }
 
-        while (nb_sectors > 0) {
-            int ret;
-            int num = nb_sectors;
+    max_discard = bs->bl.max_discard ?  bs->bl.max_discard : MAX_DISCARD_DEFAULT;
+    while (nb_sectors > 0) {
+        int ret;
+        int num = nb_sectors;
 
-            /* align request */
-            if (bs->bl.discard_alignment &&
-                num >= bs->bl.discard_alignment &&
-                sector_num % bs->bl.discard_alignment) {
-                if (num > bs->bl.discard_alignment) {
-                    num = bs->bl.discard_alignment;
-                }
-                num -= sector_num % bs->bl.discard_alignment;
+        /* align request */
+        if (bs->bl.discard_alignment &&
+            num >= bs->bl.discard_alignment &&
+            sector_num % bs->bl.discard_alignment) {
+            if (num > bs->bl.discard_alignment) {
+                num = bs->bl.discard_alignment;
             }
+            num -= sector_num % bs->bl.discard_alignment;
+        }
 
-            /* limit request size */
-            if (num > max_discard) {
-                num = max_discard;
-            }
+        /* limit request size */
+        if (num > max_discard) {
+            num = max_discard;
+        }
 
+        if (bs->drv->bdrv_co_discard) {
             ret = bs->drv->bdrv_co_discard(bs, sector_num, num);
-            if (ret) {
-                return ret;
+        } else {
+            BlockDriverAIOCB *acb;
+            CoroutineIOCompletion co = {
+                .coroutine = qemu_coroutine_self(),
+            };
+
+            acb = bs->drv->bdrv_aio_discard(bs, sector_num, nb_sectors,
+                                            bdrv_co_io_em_complete, &co);
+            if (acb == NULL) {
+                return -EIO;
+            } else {
+                qemu_coroutine_yield();
+                ret = co.ret;
             }
-
-            sector_num += num;
-            nb_sectors -= num;
         }
-        return 0;
-    } else if (bs->drv->bdrv_aio_discard) {
-        BlockDriverAIOCB *acb;
-        CoroutineIOCompletion co = {
-            .coroutine = qemu_coroutine_self(),
-        };
-
-        acb = bs->drv->bdrv_aio_discard(bs, sector_num, nb_sectors,
-                                        bdrv_co_io_em_complete, &co);
-        if (acb == NULL) {
-            return -EIO;
-        } else {
-            qemu_coroutine_yield();
-            return co.ret;
+        if (ret) {
+            return ret;
         }
-    } else {
-        return 0;
+
+        sector_num += num;
+        nb_sectors -= num;
     }
+    return 0;
 }
 
 int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 02/11] block: add flags to BlockRequest
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 01/11] block: generalize BlockLimits handling to cover bdrv_aio_discard too Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 03/11] block: add bdrv_aio_write_zeroes Paolo Bonzini
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

This lets bdrv_co_do_rw receive flags, so that it can be used for
zero writes.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c               | 17 +++++++++++------
 include/block/block.h |  1 +
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index ba6872e..7ea2361 100644
--- a/block.c
+++ b/block.c
@@ -74,6 +74,7 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
                                                QEMUIOVector *qiov,
                                                int nb_sectors,
+                                               BdrvRequestFlags flags,
                                                BlockDriverCompletionFunc *cb,
                                                void *opaque,
                                                bool is_write);
@@ -3648,7 +3649,7 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
 {
     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
 
-    return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors,
+    return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors, 0,
                                  cb, opaque, false);
 }
 
@@ -3658,7 +3659,7 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
 {
     trace_bdrv_aio_writev(bs, sector_num, nb_sectors, opaque);
 
-    return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors,
+    return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors, 0,
                                  cb, opaque, true);
 }
 
@@ -3830,8 +3831,10 @@ int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
     /* Run the aio requests. */
     mcb->num_requests = num_reqs;
     for (i = 0; i < num_reqs; i++) {
-        bdrv_aio_writev(bs, reqs[i].sector, reqs[i].qiov,
-            reqs[i].nb_sectors, multiwrite_cb, mcb);
+        bdrv_co_aio_rw_vector(bs, reqs[i].sector, reqs[i].qiov,
+                              reqs[i].nb_sectors, reqs[i].flags,
+                              multiwrite_cb, mcb,
+                              true);
     }
 
     return 0;
@@ -3973,10 +3976,10 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
 
     if (!acb->is_write) {
         acb->req.error = bdrv_co_do_readv(bs, acb->req.sector,
-            acb->req.nb_sectors, acb->req.qiov, 0);
+            acb->req.nb_sectors, acb->req.qiov, acb->req.flags);
     } else {
         acb->req.error = bdrv_co_do_writev(bs, acb->req.sector,
-            acb->req.nb_sectors, acb->req.qiov, 0);
+            acb->req.nb_sectors, acb->req.qiov, acb->req.flags);
     }
 
     acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
@@ -3987,6 +3990,7 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
                                                QEMUIOVector *qiov,
                                                int nb_sectors,
+                                               BdrvRequestFlags flags,
                                                BlockDriverCompletionFunc *cb,
                                                void *opaque,
                                                bool is_write)
@@ -3998,6 +4002,7 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
     acb->req.sector = sector_num;
     acb->req.nb_sectors = nb_sectors;
     acb->req.qiov = qiov;
+    acb->req.flags = flags;
     acb->is_write = is_write;
     acb->done = NULL;
 
diff --git a/include/block/block.h b/include/block/block.h
index 4d9e67c..703d875 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -311,6 +311,7 @@ typedef struct BlockRequest {
     /* Fields to be filled by multiwrite caller */
     int64_t sector;
     int nb_sectors;
+    int flags;
     QEMUIOVector *qiov;
     BlockDriverCompletionFunc *cb;
     void *opaque;
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 03/11] block: add bdrv_aio_write_zeroes
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 01/11] block: generalize BlockLimits handling to cover bdrv_aio_discard too Paolo Bonzini
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 02/11] block: add flags to BlockRequest Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 04/11] scsi-disk: catch write protection errors in UNMAP Paolo Bonzini
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

This will be used by the SCSI layer.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c               | 11 +++++++++++
 include/block/block.h |  3 +++
 trace-events          |  1 +
 3 files changed, 15 insertions(+)

diff --git a/block.c b/block.c
index 7ea2361..f195a64 100644
--- a/block.c
+++ b/block.c
@@ -3663,6 +3663,17 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                  cb, opaque, true);
 }
 
+BlockDriverAIOCB *bdrv_aio_write_zeroes(BlockDriverState *bs,
+        int64_t sector_num, int nb_sectors, BdrvRequestFlags flags,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    trace_bdrv_aio_write_zeroes(bs, sector_num, nb_sectors, flags, opaque);
+
+    return bdrv_co_aio_rw_vector(bs, sector_num, NULL, nb_sectors,
+                                 BDRV_REQ_ZERO_WRITE | flags,
+                                 cb, opaque, true);
+}
+
 
 typedef struct MultiwriteCB {
     int error;
diff --git a/include/block/block.h b/include/block/block.h
index 703d875..4967ed2 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -216,6 +216,9 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
                const uint8_t *buf, int nb_sectors);
 int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num,
                int nb_sectors, BdrvRequestFlags flags);
+BlockDriverAIOCB *bdrv_aio_write_zeroes(BlockDriverState *bs, int64_t sector_num,
+                                        int nb_sectors, BdrvRequestFlags flags,
+                                        BlockDriverCompletionFunc *cb, void *opaque);
 int bdrv_make_zero(BlockDriverState *bs, BdrvRequestFlags flags);
 int bdrv_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov);
 int bdrv_pread(BlockDriverState *bs, int64_t offset,
diff --git a/trace-events b/trace-events
index 8695e9e..96b3974 100644
--- a/trace-events
+++ b/trace-events
@@ -60,6 +60,7 @@ bdrv_aio_discard(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs
 bdrv_aio_flush(void *bs, void *opaque) "bs %p opaque %p"
 bdrv_aio_readv(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d opaque %p"
 bdrv_aio_writev(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d opaque %p"
+bdrv_aio_write_zeroes(void *bs, int64_t sector_num, int nb_sectors, int flags, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d flags %d opaque %p"
 bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
 bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 04/11] scsi-disk: catch write protection errors in UNMAP
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (2 preceding siblings ...)
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 03/11] block: add bdrv_aio_write_zeroes Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 05/11] scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands Paolo Bonzini
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

This is the same that is already done for WRITE SAME.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/scsi/scsi-disk.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 74e6a14..4138268 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1543,6 +1543,7 @@ done:
 
 static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 {
+    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
     uint8_t *p = inbuf;
     int len = r->req.cmd.xfer;
     UnmapCBData *data;
@@ -1560,6 +1561,11 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
         goto invalid_param_len;
     }
 
+    if (bdrv_is_read_only(s->qdev.conf.bs)) {
+        scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
+        return;
+    }
+
     data = g_new0(UnmapCBData, 1);
     data->r = r;
     data->inbuf = &p[8];
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 05/11] scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (3 preceding siblings ...)
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 04/11] scsi-disk: catch write protection errors in UNMAP Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  2013-11-13 19:03   ` Eric Blake
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 06/11] scsi-disk: correctly implement WRITE SAME Paolo Bonzini
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Since we report ANC_SUP==0 in VPD page B2h, we need to return
an error (ILLEGAL REQUEST/INVALID FIELD IN CDB) for all WRITE SAME
requests with ANCHOR==1.

Inspired by a similar patch to the LIO in-kernel target.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/scsi/scsi-disk.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 4138268..7e29760 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1548,6 +1548,11 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
     int len = r->req.cmd.xfer;
     UnmapCBData *data;
 
+    /* Reject ANCHOR=1.  */
+    if (r->req.cmd.buf[1] & 0x1) {
+        goto invalid_field;
+    }
+
     if (len < 8) {
         goto invalid_param_len;
     }
@@ -1578,6 +1583,10 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 
 invalid_param_len:
     scsi_check_condition(r, SENSE_CODE(INVALID_PARAM_LEN));
+    return;
+
+invalid_field:
+    scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 }
 
 static void scsi_disk_emulate_write_data(SCSIRequest *req)
@@ -1856,8 +1865,9 @@ static int32_t scsi_disk_emulate_command(SCSIRequest *req, uint8_t *buf)
 
         /*
          * We only support WRITE SAME with the unmap bit set for now.
+	 * Reject UNMAP=0 or ANCHOR=1.
          */
-        if (!(req->cmd.buf[1] & 0x8)) {
+        if (!(req->cmd.buf[1] & 0x8) || (req->cmd.buf[1] & 0x10)) {
             goto illegal_request;
         }
 
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 06/11] scsi-disk: correctly implement WRITE SAME
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (4 preceding siblings ...)
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 05/11] scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  2013-11-13  6:18   ` Peter Lieven
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 07/11] block: handle ENOTSUP from discard in generic code Paolo Bonzini
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

The WRITE SAME command is implemented incorrectly.  WRITE SAME with the
UNMAP bit set should _not_ unmap the sectors unless the written data
matches the payload of the WRITE SAME command; currently, QEMU is not
looking at the payload at all.

Thus, fetch the data to be written from the input buffer.  If it is
all zeroes, we can use the write_zeroes call (possibly with the new
MAY_UNMAP flag).  Otherwise, do as many write cycles as needed, covering
512k at a time to avoid allocating lots of memory for the bounce
buffer.

Strictly speaking, this is still incorrect because a zero cluster should
only be written if the MAY_UNMAP flag is set.  But this is a bug in the
block layer, not here.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/scsi/scsi-disk.c | 140 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 116 insertions(+), 24 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 7e29760..cd5116c 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -41,6 +41,7 @@ do { printf("scsi-disk: " fmt , ## __VA_ARGS__); } while (0)
 #include <scsi/sg.h>
 #endif
 
+#define SCSI_WRITE_SAME_MAX         524288
 #define SCSI_DMA_BUF_SIZE           131072
 #define SCSI_MAX_INQUIRY_LEN        256
 #define SCSI_MAX_MODE_LEN           256
@@ -634,6 +635,8 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf)
             buflen = 0x40;
             memset(outbuf + 4, 0, buflen - 4);
 
+            outbuf[4] = 0x1; /* wsnz */
+
             /* optimal transfer length granularity */
             outbuf[6] = (min_io_size >> 8) & 0xff;
             outbuf[7] = min_io_size & 0xff;
@@ -1589,6 +1592,111 @@ invalid_field:
     scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 }
 
+typedef struct WriteSameCBData {
+    SCSIDiskReq *r;
+    int64_t sector;
+    int nb_sectors;
+    QEMUIOVector qiov;
+    struct iovec iov;
+} WriteSameCBData;
+
+static void scsi_write_same_complete(void *opaque, int ret)
+{
+    WriteSameCBData *data = opaque;
+    SCSIDiskReq *r = data->r;
+    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
+
+    assert(r->req.aiocb != NULL);
+    r->req.aiocb = NULL;
+    bdrv_acct_done(s->qdev.conf.bs, &r->acct);
+    if (r->req.io_canceled) {
+        goto done;
+    }
+
+    if (ret < 0) {
+        if (scsi_handle_rw_error(r, -ret)) {
+            goto done;
+        }
+    }
+
+    data->nb_sectors -= data->iov.iov_len / 512;
+    data->sector += data->iov.iov_len / 512;
+    data->iov.iov_len = MIN(data->nb_sectors * 512, data->iov.iov_len);
+    if (data->iov.iov_len) {
+        bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len, BDRV_ACCT_WRITE);
+        r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
+                                       &data->qiov, data->iov.iov_len / 512,
+                                       scsi_write_same_complete, r);
+        return;
+    }
+
+    scsi_req_complete(&r->req, GOOD);
+
+done:
+    if (!r->req.io_canceled) {
+        scsi_req_unref(&r->req);
+    }
+    g_free (data->iov.iov_base);
+    g_free (data);
+}
+
+static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
+{
+    SCSIRequest *req = &r->req;
+    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
+    uint32_t nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
+    WriteSameCBData *data;
+    uint8_t *buf;
+    int i;
+
+    /* Fail if PBDATA=1 or LBDATA=1 or ANCHOR=1.  */
+    if (nb_sectors == 0 || (req->cmd.buf[1] & 0x16)) {
+        scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
+        return;
+    }
+
+    if (bdrv_is_read_only(s->qdev.conf.bs)) {
+        scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
+        return;
+    }
+    if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
+        scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
+        return;
+    }
+
+    if (buffer_is_zero(inbuf, s->qdev.blocksize)) {
+        int flags = (req->cmd.buf[1] & 0x8) ? BDRV_REQ_MAY_UNMAP : 0;
+
+        /* The request is used as the AIO opaque value, so add a ref.  */
+        scsi_req_ref(&r->req);
+        bdrv_acct_start(s->qdev.conf.bs, &r->acct, nb_sectors * s->qdev.blocksize,
+                        BDRV_ACCT_WRITE);
+        r->req.aiocb = bdrv_aio_write_zeroes(s->qdev.conf.bs,
+                                             r->req.cmd.lba * (s->qdev.blocksize / 512),
+                                             nb_sectors * (s->qdev.blocksize / 512),
+                                             flags, scsi_aio_complete, r);
+        return;
+    }
+
+    data = g_new0(WriteSameCBData, 1);
+    data->r = r;
+    data->sector = r->req.cmd.lba * (s->qdev.blocksize / 512);
+    data->nb_sectors = nb_sectors * (s->qdev.blocksize / 512);
+    data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);
+    data->iov.iov_base = buf = g_malloc(data->iov.iov_len);
+    qemu_iovec_init_external(&data->qiov, &data->iov, 1);
+
+    for (i = 0; i < data->iov.iov_len; i += s->qdev.blocksize) {
+         memcpy(&buf[i], inbuf, s->qdev.blocksize);
+    }
+
+    scsi_req_ref(&r->req);
+    bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len, BDRV_ACCT_WRITE);
+    r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
+                                   &data->qiov, data->iov.iov_len / 512,
+                                   scsi_write_same_complete, data);
+}
+
 static void scsi_disk_emulate_write_data(SCSIRequest *req)
 {
     SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
@@ -1612,6 +1720,10 @@ static void scsi_disk_emulate_write_data(SCSIRequest *req)
         scsi_disk_emulate_unmap(r, r->iov.iov_base);
         break;
 
+    case WRITE_SAME_10:
+    case WRITE_SAME_16:
+        scsi_disk_emulate_write_same(r, r->iov.iov_base);
+        break;
     default:
         abort();
     }
@@ -1854,30 +1966,10 @@ static int32_t scsi_disk_emulate_command(SCSIRequest *req, uint8_t *buf)
         break;
     case WRITE_SAME_10:
     case WRITE_SAME_16:
-        nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
-        if (bdrv_is_read_only(s->qdev.conf.bs)) {
-            scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
-            return 0;
-        }
-        if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
-            goto illegal_lba;
-        }
-
-        /*
-         * We only support WRITE SAME with the unmap bit set for now.
-	 * Reject UNMAP=0 or ANCHOR=1.
-         */
-        if (!(req->cmd.buf[1] & 0x8) || (req->cmd.buf[1] & 0x10)) {
-            goto illegal_request;
-        }
-
-        /* The request is used as the AIO opaque value, so add a ref.  */
-        scsi_req_ref(&r->req);
-        r->req.aiocb = bdrv_aio_discard(s->qdev.conf.bs,
-                                        r->req.cmd.lba * (s->qdev.blocksize / 512),
-                                        nb_sectors * (s->qdev.blocksize / 512),
-                                        scsi_aio_complete, r);
-        return 0;
+        DPRINTF("WRITE SAME %d (len %lu)\n",
+                req->cmd.buf[0] == WRITE_SAME_10 ? 10 : 16,
+                (long)r->req.cmd.xfer);
+        break;
     default:
         DPRINTF("Unknown SCSI command (%2.2x)\n", buf[0]);
         scsi_check_condition(r, SENSE_CODE(INVALID_OPCODE));
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 07/11] block: handle ENOTSUP from discard in generic code
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (5 preceding siblings ...)
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 06/11] scsi-disk: correctly implement WRITE SAME Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 08/11] raw-posix: implement write_zeroes with MAY_UNMAP for files Paolo Bonzini
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Similar to write_zeroes, let the generic code receive a ENOTSUP for
discard operations.  Since bdrv_discard has advisory semantics,
we can just swallow the error.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c           |  2 +-
 block/raw-posix.c | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/block.c b/block.c
index f195a64..1e17853 100644
--- a/block.c
+++ b/block.c
@@ -4357,7 +4357,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
                 ret = co.ret;
             }
         }
-        if (ret) {
+        if (ret && ret != -ENOTSUP) {
             return ret;
         }
 
diff --git a/block/raw-posix.c b/block/raw-posix.c
index f6d48bb..27fe47d 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -324,10 +324,10 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif
 
-    s->has_discard = 1;
+    s->has_discard = true;
 #ifdef CONFIG_XFS
     if (platform_test_xfs_fd(s->fd)) {
-        s->is_xfs = 1;
+        s->is_xfs = true;
     }
 #endif
 
@@ -699,8 +699,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
     int ret = -EOPNOTSUPP;
     BDRVRawState *s = aiocb->bs->opaque;
 
-    if (s->has_discard == 0) {
-        return 0;
+    if (!s->has_discard) {
+        return -ENOTSUP;
     }
 
     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
@@ -735,8 +735,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 
     if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
         ret == -ENOTTY) {
-        s->has_discard = 0;
-        ret = 0;
+        s->has_discard = false;
+        ret = -ENOTSUP;
     }
     return ret;
 }
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 08/11] raw-posix: implement write_zeroes with MAY_UNMAP for files
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (6 preceding siblings ...)
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 07/11] block: handle ENOTSUP from discard in generic code Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  2013-11-13  6:27   ` Peter Lieven
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 09/11] raw-posix: implement write_zeroes with MAY_UNMAP for block devices Paolo Bonzini
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Writing zeroes to a file can be done by punching a hole if MAY_UNMAP
is set.

Note that in this case handle_aiocb_discard's ENOTSUP return code
is not ignored, but makes the block layer fall back to the generic
implementation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/raw-posix.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 trace-events      |  1 +
 2 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 27fe47d..830e109 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -142,6 +142,7 @@ typedef struct BDRVRawState {
     bool is_xfs : 1;
 #endif
     bool has_discard : 1;
+    bool discard_zeroes : 1;
 } BDRVRawState;
 
 typedef struct BDRVRawReopenState {
@@ -283,6 +284,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     Error *local_err = NULL;
     const char *filename;
     int fd, ret;
+    struct stat st;
 
     opts = qemu_opts_create_nofail(&raw_runtime_opts);
     qemu_opts_absorb_qdict(opts, options, &local_err);
@@ -325,6 +327,15 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 #endif
 
     s->has_discard = true;
+
+    if (fstat(s->fd, &st) < 0) {
+        error_setg_errno(errp, errno, "Could not stat file");
+        goto fail;
+    }
+    if (S_ISREG(st.st_mode)) {
+        s->discard_zeroes = true;
+    }
+
 #ifdef CONFIG_XFS
     if (platform_test_xfs_fd(s->fd)) {
         s->is_xfs = true;
@@ -788,6 +799,29 @@ static int aio_worker(void *arg)
     return ret;
 }
 
+static int paio_submit_co(BlockDriverState *bs, int fd,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
+        int type)
+{
+    RawPosixAIOData *acb = g_slice_new(RawPosixAIOData);
+    ThreadPool *pool;
+
+    acb->bs = bs;
+    acb->aio_type = type;
+    acb->aio_fildes = fd;
+
+    if (qiov) {
+        acb->aio_iov = qiov->iov;
+        acb->aio_niov = qiov->niov;
+    }
+    acb->aio_nbytes = nb_sectors * 512;
+    acb->aio_offset = sector_num * 512;
+
+    trace_paio_submit_co(sector_num, nb_sectors, type);
+    pool = aio_get_thread_pool(bdrv_get_aio_context(bs));
+    return thread_pool_submit_co(pool, aio_worker, acb);
+}
+
 static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
         int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
         BlockDriverCompletionFunc *cb, void *opaque, int type)
@@ -1200,6 +1234,31 @@ static coroutine_fn BlockDriverAIOCB *raw_aio_discard(BlockDriverState *bs,
                        cb, opaque, QEMU_AIO_DISCARD);
 }
 
+static int coroutine_fn raw_co_write_zeroes(
+    BlockDriverState *bs, int64_t sector_num,
+    int nb_sectors, BdrvRequestFlags flags)
+{
+    BDRVRawState *s = bs->opaque;
+
+    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+        return -ENOTSUP;
+    }
+    if (!s->discard_zeroes) {
+        return -ENOTSUP;
+    }
+    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                          QEMU_AIO_DISCARD);
+}
+
+static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    BDRVRawState *s = bs->opaque;
+
+    bdi->unallocated_blocks_are_zero = s->discard_zeroes;
+    bdi->can_write_zeroes_with_unmap = s->discard_zeroes;
+    return 0;
+}
+
 static QEMUOptionParameter raw_create_options[] = {
     {
         .name = BLOCK_OPT_SIZE,
@@ -1223,6 +1282,7 @@ static BlockDriver bdrv_file = {
     .bdrv_create = raw_create,
     .bdrv_has_zero_init = bdrv_has_zero_init_1,
     .bdrv_co_get_block_status = raw_co_get_block_status,
+    .bdrv_co_write_zeroes = raw_co_write_zeroes,
 
     .bdrv_aio_readv = raw_aio_readv,
     .bdrv_aio_writev = raw_aio_writev,
@@ -1231,6 +1291,7 @@ static BlockDriver bdrv_file = {
 
     .bdrv_truncate = raw_truncate,
     .bdrv_getlength = raw_getlength,
+    .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
 
@@ -1586,6 +1647,7 @@ static BlockDriver bdrv_host_device = {
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength	= raw_getlength,
+    .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
 
@@ -1715,7 +1777,7 @@ static BlockDriver bdrv_host_floppy = {
     .bdrv_aio_flush	= raw_aio_flush,
 
     .bdrv_truncate      = raw_truncate,
-    .bdrv_getlength      = raw_getlength,
+    .bdrv_getlength     = raw_getlength,
     .has_variable_length = true,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
diff --git a/trace-events b/trace-events
index 96b3974..995c84a 100644
--- a/trace-events
+++ b/trace-events
@@ -128,6 +128,7 @@ thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
 
 # block/raw-win32.c
 # block/raw-posix.c
+paio_submit_co(int64_t sector_num, int nb_sectors, int type) "sector_num %"PRId64" nb_sectors %d type %d"
 paio_submit(void *acb, void *opaque, int64_t sector_num, int nb_sectors, int type) "acb %p opaque %p sector_num %"PRId64" nb_sectors %d type %d"
 
 # ioport.c
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 09/11] raw-posix: implement write_zeroes with MAY_UNMAP for block devices
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (7 preceding siblings ...)
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 08/11] raw-posix: implement write_zeroes with MAY_UNMAP for files Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  2013-11-13  6:29   ` Peter Lieven
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 10/11] raw-posix: add support for write_zeroes on XFS and " Paolo Bonzini
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 11/11] qemu-iotests: 033 is fast Paolo Bonzini
  10 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

See the next commit for the description of the Linux kernel problem
that is worked around in raw_open_common.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/raw-posix.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 830e109..5cb46f1 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -335,6 +335,23 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     if (S_ISREG(st.st_mode)) {
         s->discard_zeroes = true;
     }
+#ifdef BLKDISCARDZEROES
+    if (S_ISBLK(st.st_mode)) {
+        unsigned int arg;
+        if (ioctl(s->fd, BLKDISCARDZEROES, &arg) == 0 && arg) {
+            s->discard_zeroes = true;
+        }
+    }
+#endif
+#ifdef CONFIG_LINUX
+	/* On Linux 3.10, BLKDISCARD leaves stale data in the page cache.  Do
+	 * not rely on the contents of discarded blocks unless using O_DIRECT.
+	 */
+        if (!(bs->open_flags & BDRV_O_NOCACHE)) {
+            s->discard_zeroes = false;
+        }
+#endif
+    }
 
 #ifdef CONFIG_XFS
     if (platform_test_xfs_fd(s->fd)) {
@@ -1587,6 +1604,26 @@ static coroutine_fn BlockDriverAIOCB *hdev_aio_discard(BlockDriverState *bs,
                        cb, opaque, QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
 }
 
+static coroutine_fn int hdev_co_write_zeroes(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
+{
+    BDRVRawState *s = bs->opaque;
+    int rc;
+
+    rc = fd_open(bs);
+    if (rc < 0) {
+        return rc;
+    }
+    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+        return -ENOTSUP;
+    }
+    if (!s->discard_zeroes) {
+        return -ENOTSUP;
+    }
+    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                          QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
+}
+
 static int hdev_create(const char *filename, QEMUOptionParameter *options,
                        Error **errp)
 {
@@ -1639,6 +1676,7 @@ static BlockDriver bdrv_host_device = {
     .bdrv_reopen_abort   = raw_reopen_abort,
     .bdrv_create        = hdev_create,
     .create_options     = raw_create_options,
+    .bdrv_co_write_zeroes = hdev_co_write_zeroes,
 
     .bdrv_aio_readv	= raw_aio_readv,
     .bdrv_aio_writev	= raw_aio_writev,
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 10/11] raw-posix: add support for write_zeroes on XFS and block devices
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (8 preceding siblings ...)
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 09/11] raw-posix: implement write_zeroes with MAY_UNMAP for block devices Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 11/11] qemu-iotests: 033 is fast Paolo Bonzini
  10 siblings, 0 replies; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

The code is similar to the implementation of discard and write_zeroes
with UNMAP.  There is no fallocate flag for this operation (yet?)
so we must write XFS-specific code like xfs_discard.

With an image stored on XFS, doing this in the guest:

    # sg_write_same --in /dev/zero --num=256 --lba=0 /dev/sda

will give "qemu-img map" output similar to the following:

    { "start": 0, "length": 131072, "depth": 0, "zero": true, "data": true,
      "offset": 0}

where the unwritten extents are exposed as a preallocated zero area.

The stale page cache problem can be reproduced as follows:

    # modprobe scsi-debug lbpws=1 lbprz=1
    # ./qemu-io /dev/sdXX
    qemu-io> write -P 0xcc 0 2M
    qemu-io> write -z 0 1M
    qemu-io> read -P 0x00 0 512
    Pattern verification failed at offset 0, 512 bytes
    qemu-io> read -v 0 512
    00000000:  cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
    ...

    # ./qemu-io --cache=none /dev/sdXX
    qemu-io> write -P 0xcc 0 2M
    qemu-io> write -z 0 1M
    qemu-io> read -P 0x00 0 512
    qemu-io> read -v 0 512
    00000000:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    ...

And similarly with discard instead of "write -z".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/raw-aio.h   |  3 +-
 block/raw-posix.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 75 insertions(+), 15 deletions(-)

diff --git a/block/raw-aio.h b/block/raw-aio.h
index c61f159..7ad0a8a 100644
--- a/block/raw-aio.h
+++ b/block/raw-aio.h
@@ -21,9 +21,10 @@
 #define QEMU_AIO_IOCTL        0x0004
 #define QEMU_AIO_FLUSH        0x0008
 #define QEMU_AIO_DISCARD      0x0010
+#define QEMU_AIO_WRITE_ZEROES 0x0020
 #define QEMU_AIO_TYPE_MASK \
         (QEMU_AIO_READ|QEMU_AIO_WRITE|QEMU_AIO_IOCTL|QEMU_AIO_FLUSH| \
-         QEMU_AIO_DISCARD)
+         QEMU_AIO_DISCARD|QEMU_AIO_WRITE_ZEROES)
 
 /* AIO flags */
 #define QEMU_AIO_MISALIGNED   0x1000
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 5cb46f1..6515def 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -142,6 +142,7 @@ typedef struct BDRVRawState {
     bool is_xfs : 1;
 #endif
     bool has_discard : 1;
+    bool has_write_zeroes : 1;
     bool discard_zeroes : 1;
 } BDRVRawState;
 
@@ -327,6 +328,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 #endif
 
     s->has_discard = true;
+    s->has_write_zeroes = true;
 
     if (fstat(s->fd, &st) < 0) {
         error_setg_errno(errp, errno, "Could not stat file");
@@ -335,20 +337,21 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     if (S_ISREG(st.st_mode)) {
         s->discard_zeroes = true;
     }
-#ifdef BLKDISCARDZEROES
     if (S_ISBLK(st.st_mode)) {
+#ifdef BLKDISCARDZEROES
         unsigned int arg;
         if (ioctl(s->fd, BLKDISCARDZEROES, &arg) == 0 && arg) {
             s->discard_zeroes = true;
         }
-    }
 #endif
 #ifdef CONFIG_LINUX
 	/* On Linux 3.10, BLKDISCARD leaves stale data in the page cache.  Do
 	 * not rely on the contents of discarded blocks unless using O_DIRECT.
+	 * Same for BLKZEROOUT.
 	 */
         if (!(bs->open_flags & BDRV_O_NOCACHE)) {
             s->discard_zeroes = false;
+            s->has_write_zeroes = false;
         }
 #endif
     }
@@ -704,6 +707,23 @@ static ssize_t handle_aiocb_rw(RawPosixAIOData *aiocb)
 }
 
 #ifdef CONFIG_XFS
+static int xfs_write_zeroes(BDRVRawState *s, int64_t offset, uint64_t bytes)
+{
+    struct xfs_flock64 fl;
+
+    memset(&fl, 0, sizeof(fl));
+    fl.l_whence = SEEK_SET;
+    fl.l_start = offset;
+    fl.l_len = bytes;
+
+    if (xfsctl(NULL, s->fd, XFS_IOC_ZERO_RANGE, &fl) < 0) {
+        DEBUG_BLOCK_PRINT("cannot write zero range (%s)\n", strerror(errno));
+        return -errno;
+    }
+
+    return 0;
+}
+
 static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
 {
     struct xfs_flock64 fl;
@@ -722,6 +742,42 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
 }
 #endif
 
+static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+{
+    int ret = -EOPNOTSUPP;
+    BDRVRawState *s = aiocb->bs->opaque;
+
+    if (s->has_write_zeroes == 0) {
+        return -ENOTSUP;
+    }
+
+    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
+#ifdef BLKZEROOUT
+        do {
+            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
+            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
+                return 0;
+            }
+        } while (errno == EINTR);
+
+        ret = -errno;
+#endif
+    } else {
+#ifdef CONFIG_XFS
+        if (s->is_xfs) {
+            return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+        }
+#endif
+    }
+
+    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
+        ret == -ENOTTY) {
+        s->has_write_zeroes = false;
+        ret = -ENOTSUP;
+    }
+    return ret;
+}
+
 static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
@@ -806,6 +862,9 @@ static int aio_worker(void *arg)
     case QEMU_AIO_DISCARD:
         ret = handle_aiocb_discard(aiocb);
         break;
+    case QEMU_AIO_WRITE_ZEROES:
+        ret = handle_aiocb_write_zeroes(aiocb);
+        break;
     default:
         fprintf(stderr, "invalid aio request (0x%x)\n", aiocb->aio_type);
         ret = -EINVAL;
@@ -1258,13 +1317,13 @@ static int coroutine_fn raw_co_write_zeroes(
     BDRVRawState *s = bs->opaque;
 
     if (!(flags & BDRV_REQ_MAY_UNMAP)) {
-        return -ENOTSUP;
-    }
-    if (!s->discard_zeroes) {
-        return -ENOTSUP;
+        return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                              QEMU_AIO_WRITE_ZEROES);
+    } else if (s->discard_zeroes) {
+        return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                              QEMU_AIO_DISCARD);
     }
-    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
-                          QEMU_AIO_DISCARD);
+    return -ENOTSUP;
 }
 
 static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
@@ -1615,13 +1674,13 @@ static coroutine_fn int hdev_co_write_zeroes(BlockDriverState *bs,
         return rc;
     }
     if (!(flags & BDRV_REQ_MAY_UNMAP)) {
-        return -ENOTSUP;
-    }
-    if (!s->discard_zeroes) {
-        return -ENOTSUP;
+        return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                              QEMU_AIO_WRITE_ZEROES|QEMU_AIO_BLKDEV);
+    } else if (s->discard_zeroes) {
+        return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                              QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
     }
-    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
-                          QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
+    return -ENOTSUP;
 }
 
 static int hdev_create(const char *filename, QEMUOptionParameter *options,
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PATCH 11/11] qemu-iotests: 033 is fast
  2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (9 preceding siblings ...)
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 10/11] raw-posix: add support for write_zeroes on XFS and " Paolo Bonzini
@ 2013-11-12 15:49 ` Paolo Bonzini
  10 siblings, 0 replies; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-12 15:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 tests/qemu-iotests/group | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index c57ff35..cdd4b00 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -39,7 +39,7 @@
 030 rw auto backing
 031 rw auto quick
 032 rw auto
-033 rw auto
+033 rw auto quick
 034 rw auto backing
 035 rw auto quick
 036 rw auto quick
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 01/11] block: generalize BlockLimits handling to cover bdrv_aio_discard too
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 01/11] block: generalize BlockLimits handling to cover bdrv_aio_discard too Paolo Bonzini
@ 2013-11-13  6:07   ` Peter Lieven
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Lieven @ 2013-11-13  6:07 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, qemu-devel, stefanha


Am 12.11.2013 um 16:49 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> bdrv_co_discard is only covering drivers which have a .bdrv_co_discard()
> implementation, but not those with .bdrv_aio_discard(). Not very nice,
> and easy to avoid.
> 
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> block.c | 80 +++++++++++++++++++++++++++++++++--------------------------------
> 1 file changed, 41 insertions(+), 39 deletions(-)
> 
> diff --git a/block.c b/block.c
> index aba6a19..ba6872e 100644
> --- a/block.c
> +++ b/block.c
> @@ -4281,6 +4281,8 @@ static void coroutine_fn bdrv_discard_co_entry(void *opaque)
> int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
>                                  int nb_sectors)
> {
> +    int max_discard;
> +
>     if (!bs->drv) {
>         return -ENOMEDIUM;
>     } else if (bdrv_check_request(bs, sector_num, nb_sectors)) {
> @@ -4298,55 +4300,55 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
>         return 0;
>     }
> 
> -    if (bs->drv->bdrv_co_discard) {
> -        int max_discard = bs->bl.max_discard ?
> -                          bs->bl.max_discard : MAX_DISCARD_DEFAULT;
> +    if (!bs->drv->bdrv_co_discard && !bs->drv->bdrv_aio_discard) {
> +        return 0;
> +    }
> 
> -        while (nb_sectors > 0) {
> -            int ret;
> -            int num = nb_sectors;
> +    max_discard = bs->bl.max_discard ?  bs->bl.max_discard : MAX_DISCARD_DEFAULT;
> +    while (nb_sectors > 0) {
> +        int ret;
> +        int num = nb_sectors;
> 
> -            /* align request */
> -            if (bs->bl.discard_alignment &&
> -                num >= bs->bl.discard_alignment &&
> -                sector_num % bs->bl.discard_alignment) {
> -                if (num > bs->bl.discard_alignment) {
> -                    num = bs->bl.discard_alignment;
> -                }
> -                num -= sector_num % bs->bl.discard_alignment;
> +        /* align request */
> +        if (bs->bl.discard_alignment &&
> +            num >= bs->bl.discard_alignment &&
> +            sector_num % bs->bl.discard_alignment) {
> +            if (num > bs->bl.discard_alignment) {
> +                num = bs->bl.discard_alignment;
>             }
> +            num -= sector_num % bs->bl.discard_alignment;
> +        }
> 
> -            /* limit request size */
> -            if (num > max_discard) {
> -                num = max_discard;
> -            }
> +        /* limit request size */
> +        if (num > max_discard) {
> +            num = max_discard;
> +        }
> 
> +        if (bs->drv->bdrv_co_discard) {
>             ret = bs->drv->bdrv_co_discard(bs, sector_num, num);
> -            if (ret) {
> -                return ret;
> +        } else {
> +            BlockDriverAIOCB *acb;
> +            CoroutineIOCompletion co = {
> +                .coroutine = qemu_coroutine_self(),
> +            };
> +
> +            acb = bs->drv->bdrv_aio_discard(bs, sector_num, nb_sectors,
> +                                            bdrv_co_io_em_complete, &co);
> +            if (acb == NULL) {
> +                return -EIO;
> +            } else {
> +                qemu_coroutine_yield();
> +                ret = co.ret;
>             }
> -
> -            sector_num += num;
> -            nb_sectors -= num;
>         }
> -        return 0;
> -    } else if (bs->drv->bdrv_aio_discard) {
> -        BlockDriverAIOCB *acb;
> -        CoroutineIOCompletion co = {
> -            .coroutine = qemu_coroutine_self(),
> -        };
> -
> -        acb = bs->drv->bdrv_aio_discard(bs, sector_num, nb_sectors,
> -                                        bdrv_co_io_em_complete, &co);
> -        if (acb == NULL) {
> -            return -EIO;
> -        } else {
> -            qemu_coroutine_yield();
> -            return co.ret;
> +        if (ret) {
> +            return ret;
>         }
> -    } else {
> -        return 0;
> +
> +        sector_num += num;
> +        nb_sectors -= num;
>     }
> +    return 0;
> }
> 
> int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
> -- 
> 1.8.4.2
> 
> 


Thank you.

Reviewed-By: Peter Lieven <pl@kamp.de>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 06/11] scsi-disk: correctly implement WRITE SAME
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 06/11] scsi-disk: correctly implement WRITE SAME Paolo Bonzini
@ 2013-11-13  6:18   ` Peter Lieven
  2013-11-13  9:38     ` Paolo Bonzini
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Lieven @ 2013-11-13  6:18 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, qemu-devel, stefanha


Am 12.11.2013 um 16:49 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> The WRITE SAME command is implemented incorrectly.  WRITE SAME with the
> UNMAP bit set should _not_ unmap the sectors unless the written data
> matches the payload of the WRITE SAME command; currently, QEMU is not
> looking at the payload at all.
> 
> Thus, fetch the data to be written from the input buffer.  If it is
> all zeroes, we can use the write_zeroes call (possibly with the new
> MAY_UNMAP flag).  Otherwise, do as many write cycles as needed, covering
> 512k at a time to avoid allocating lots of memory for the bounce
> buffer.

Would it make sense to add a bdrv_write_same or is the use case for
WRITE SAME with non-zero payload too rare?

And secondly would it make sense to add an optimal request size field
to the BlockLimits?

> 
> Strictly speaking, this is still incorrect because a zero cluster should
> only be written if the MAY_UNMAP flag is set.  But this is a bug in the
> block layer, not here.

Can you explain what exactly you mean?

> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> hw/scsi/scsi-disk.c | 140 +++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 116 insertions(+), 24 deletions(-)
> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 7e29760..cd5116c 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -41,6 +41,7 @@ do { printf("scsi-disk: " fmt , ## __VA_ARGS__); } while (0)
> #include <scsi/sg.h>
> #endif
> 
> +#define SCSI_WRITE_SAME_MAX         524288

I would call this SCSI_MAX_WRITE_SAME_LEN (like max_ws_len)

> #define SCSI_DMA_BUF_SIZE           131072
> #define SCSI_MAX_INQUIRY_LEN        256
> #define SCSI_MAX_MODE_LEN           256
> @@ -634,6 +635,8 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf)
>             buflen = 0x40;
>             memset(outbuf + 4, 0, buflen - 4);
> 
> +            outbuf[4] = 0x1; /* wsnz */
> +
>             /* optimal transfer length granularity */
>             outbuf[6] = (min_io_size >> 8) & 0xff;
>             outbuf[7] = min_io_size & 0xff;
> @@ -1589,6 +1592,111 @@ invalid_field:
>     scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
> }
> 
> +typedef struct WriteSameCBData {
> +    SCSIDiskReq *r;
> +    int64_t sector;
> +    int nb_sectors;
> +    QEMUIOVector qiov;
> +    struct iovec iov;
> +} WriteSameCBData;
> +
> +static void scsi_write_same_complete(void *opaque, int ret)
> +{
> +    WriteSameCBData *data = opaque;
> +    SCSIDiskReq *r = data->r;
> +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +    assert(r->req.aiocb != NULL);
> +    r->req.aiocb = NULL;
> +    bdrv_acct_done(s->qdev.conf.bs, &r->acct);
> +    if (r->req.io_canceled) {
> +        goto done;
> +    }
> +
> +    if (ret < 0) {
> +        if (scsi_handle_rw_error(r, -ret)) {
> +            goto done;
> +        }
> +    }
> +
> +    data->nb_sectors -= data->iov.iov_len / 512;
> +    data->sector += data->iov.iov_len / 512;
> +    data->iov.iov_len = MIN(data->nb_sectors * 512, data->iov.iov_len);
> +    if (data->iov.iov_len) {
> +        bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len, BDRV_ACCT_WRITE);
> +        r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
> +                                       &data->qiov, data->iov.iov_len / 512,
> +                                       scsi_write_same_complete, r);
> +        return;
> +    }
> +
> +    scsi_req_complete(&r->req, GOOD);
> +
> +done:
> +    if (!r->req.io_canceled) {
> +        scsi_req_unref(&r->req);
> +    }
> +    g_free (data->iov.iov_base);
> +    g_free (data);
> +}
> +
> +static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
> +{
> +    SCSIRequest *req = &r->req;
> +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +    uint32_t nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
> +    WriteSameCBData *data;
> +    uint8_t *buf;
> +    int i;
> +
> +    /* Fail if PBDATA=1 or LBDATA=1 or ANCHOR=1.  */
> +    if (nb_sectors == 0 || (req->cmd.buf[1] & 0x16)) {
> +        scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
> +        return;
> +    }
> +
> +    if (bdrv_is_read_only(s->qdev.conf.bs)) {
> +        scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
> +        return;
> +    }
> +    if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
> +        scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
> +        return;
> +    }
> +
> +    if (buffer_is_zero(inbuf, s->qdev.blocksize)) {
> +        int flags = (req->cmd.buf[1] & 0x8) ? BDRV_REQ_MAY_UNMAP : 0;
> +
> +        /* The request is used as the AIO opaque value, so add a ref.  */
> +        scsi_req_ref(&r->req);
> +        bdrv_acct_start(s->qdev.conf.bs, &r->acct, nb_sectors * s->qdev.blocksize,
> +                        BDRV_ACCT_WRITE);
> +        r->req.aiocb = bdrv_aio_write_zeroes(s->qdev.conf.bs,
> +                                             r->req.cmd.lba * (s->qdev.blocksize / 512),
> +                                             nb_sectors * (s->qdev.blocksize / 512),
> +                                             flags, scsi_aio_complete, r);
> +        return;
> +    }
> +
> +    data = g_new0(WriteSameCBData, 1);
> +    data->r = r;
> +    data->sector = r->req.cmd.lba * (s->qdev.blocksize / 512);
> +    data->nb_sectors = nb_sectors * (s->qdev.blocksize / 512);
> +    data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);
> +    data->iov.iov_base = buf = g_malloc(data->iov.iov_len);
> +    qemu_iovec_init_external(&data->qiov, &data->iov, 1);
> +
> +    for (i = 0; i < data->iov.iov_len; i += s->qdev.blocksize) {
> +         memcpy(&buf[i], inbuf, s->qdev.blocksize);
> +    }
> +
> +    scsi_req_ref(&r->req);
> +    bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len, BDRV_ACCT_WRITE);
> +    r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
> +                                   &data->qiov, data->iov.iov_len / 512,
> +                                   scsi_write_same_complete, data);
> +}

I asked myself many time if to use 512 or BDRV_SECTOR_SIZE. Both is used
heavily in the whole block code.

> +
> static void scsi_disk_emulate_write_data(SCSIRequest *req)
> {
>     SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
> @@ -1612,6 +1720,10 @@ static void scsi_disk_emulate_write_data(SCSIRequest *req)
>         scsi_disk_emulate_unmap(r, r->iov.iov_base);
>         break;
> 
> +    case WRITE_SAME_10:
> +    case WRITE_SAME_16:
> +        scsi_disk_emulate_write_same(r, r->iov.iov_base);
> +        break;
>     default:
>         abort();
>     }
> @@ -1854,30 +1966,10 @@ static int32_t scsi_disk_emulate_command(SCSIRequest *req, uint8_t *buf)
>         break;
>     case WRITE_SAME_10:
>     case WRITE_SAME_16:
> -        nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
> -        if (bdrv_is_read_only(s->qdev.conf.bs)) {
> -            scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
> -            return 0;
> -        }
> -        if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
> -            goto illegal_lba;
> -        }
> -
> -        /*
> -         * We only support WRITE SAME with the unmap bit set for now.
> -	 * Reject UNMAP=0 or ANCHOR=1.
> -         */
> -        if (!(req->cmd.buf[1] & 0x8) || (req->cmd.buf[1] & 0x10)) {
> -            goto illegal_request;
> -        }
> -
> -        /* The request is used as the AIO opaque value, so add a ref.  */
> -        scsi_req_ref(&r->req);
> -        r->req.aiocb = bdrv_aio_discard(s->qdev.conf.bs,
> -                                        r->req.cmd.lba * (s->qdev.blocksize / 512),
> -                                        nb_sectors * (s->qdev.blocksize / 512),
> -                                        scsi_aio_complete, r);
> -        return 0;
> +        DPRINTF("WRITE SAME %d (len %lu)\n",
> +                req->cmd.buf[0] == WRITE_SAME_10 ? 10 : 16,
> +                (long)r->req.cmd.xfer);
> +        break;
>     default:
>         DPRINTF("Unknown SCSI command (%2.2x)\n", buf[0]);
>         scsi_check_condition(r, SENSE_CODE(INVALID_OPCODE));
> -- 
> 1.8.4.2
> 
> 

Peter

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 08/11] raw-posix: implement write_zeroes with MAY_UNMAP for files
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 08/11] raw-posix: implement write_zeroes with MAY_UNMAP for files Paolo Bonzini
@ 2013-11-13  6:27   ` Peter Lieven
  2013-11-13  6:30     ` Peter Lieven
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Lieven @ 2013-11-13  6:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, qemu-devel, stefanha

[-- Attachment #1: Type: text/plain, Size: 5738 bytes --]


Am 12.11.2013 um 16:49 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Writing zeroes to a file can be done by punching a hole if MAY_UNMAP
> is set.
> 
> Note that in this case handle_aiocb_discard's ENOTSUP return code
> is not ignored, but makes the block layer fall back to the generic
> implementation.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> block/raw-posix.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> trace-events      |  1 +
> 2 files changed, 64 insertions(+), 1 deletion(-)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 27fe47d..830e109 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -142,6 +142,7 @@ typedef struct BDRVRawState {
>     bool is_xfs : 1;
> #endif
>     bool has_discard : 1;
> +    bool discard_zeroes : 1;
> } BDRVRawState;
> 
> typedef struct BDRVRawReopenState {
> @@ -283,6 +284,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>     Error *local_err = NULL;
>     const char *filename;
>     int fd, ret;
> +    struct stat st;
> 
>     opts = qemu_opts_create_nofail(&raw_runtime_opts);
>     qemu_opts_absorb_qdict(opts, options, &local_err);
> @@ -325,6 +327,15 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> #endif
> 
>     s->has_discard = true;
> +
> +    if (fstat(s->fd, &st) < 0) {
> +        error_setg_errno(errp, errno, "Could not stat file");
> +        goto fail;
> +    }
> +    if (S_ISREG(st.st_mode)) {
> +        s->discard_zeroes = true;
> +    }
> +
> #ifdef CONFIG_XFS
>     if (platform_test_xfs_fd(s->fd)) {
>         s->is_xfs = true;
> @@ -788,6 +799,29 @@ static int aio_worker(void *arg)
>     return ret;
> }
> 
> +static int paio_submit_co(BlockDriverState *bs, int fd,
> +        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
> +        int type)
> +{
> +    RawPosixAIOData *acb = g_slice_new(RawPosixAIOData);
> +    ThreadPool *pool;
> +
> +    acb->bs = bs;
> +    acb->aio_type = type;
> +    acb->aio_fildes = fd;
> +
> +    if (qiov) {
> +        acb->aio_iov = qiov->iov;
> +        acb->aio_niov = qiov->niov;
> +    }
> +    acb->aio_nbytes = nb_sectors * 512;
> +    acb->aio_offset = sector_num * 512;
> +
> +    trace_paio_submit_co(sector_num, nb_sectors, type);
> +    pool = aio_get_thread_pool(bdrv_get_aio_context(bs));
> +    return thread_pool_submit_co(pool, aio_worker, acb);
> +}
> +
> static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
>         int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
>         BlockDriverCompletionFunc *cb, void *opaque, int type)
> @@ -1200,6 +1234,31 @@ static coroutine_fn BlockDriverAIOCB *raw_aio_discard(BlockDriverState *bs,
>                        cb, opaque, QEMU_AIO_DISCARD);
> }
> 
> +static int coroutine_fn raw_co_write_zeroes(
> +    BlockDriverState *bs, int64_t sector_num,
> +    int nb_sectors, BdrvRequestFlags flags)
> +{
> +    BDRVRawState *s = bs->opaque;
> +
> +    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> +        return -ENOTSUP;
> +    }
> +    if (!s->discard_zeroes) {
> +        return -ENOTSUP;
> +    }
> +    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
> +                          QEMU_AIO_DISCARD);
> +}
> +
> +static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
> +{
> +    BDRVRawState *s = bs->opaque;
> +
> +    bdi->unallocated_blocks_are_zero = s->discard_zeroes;
> +    bdi->can_write_zeroes_with_unmap = s->discard_zeroes;

does BLKDISCARDZEROES ioctl guarantee that a device is
zero initialized or does it just guarantee that a discard may not
fail and that it reads as zeroes afterwards?

Peter

> +    return 0;
> +}
> +
> static QEMUOptionParameter raw_create_options[] = {
>     {
>         .name = BLOCK_OPT_SIZE,
> @@ -1223,6 +1282,7 @@ static BlockDriver bdrv_file = {
>     .bdrv_create = raw_create,
>     .bdrv_has_zero_init = bdrv_has_zero_init_1,
>     .bdrv_co_get_block_status = raw_co_get_block_status,
> +    .bdrv_co_write_zeroes = raw_co_write_zeroes,
> 
>     .bdrv_aio_readv = raw_aio_readv,
>     .bdrv_aio_writev = raw_aio_writev,
> @@ -1231,6 +1291,7 @@ static BlockDriver bdrv_file = {
> 
>     .bdrv_truncate = raw_truncate,
>     .bdrv_getlength = raw_getlength,
> +    .bdrv_get_info = raw_get_info,
>     .bdrv_get_allocated_file_size
>                         = raw_get_allocated_file_size,
> 
> @@ -1586,6 +1647,7 @@ static BlockDriver bdrv_host_device = {
> 
>     .bdrv_truncate      = raw_truncate,
>     .bdrv_getlength	= raw_getlength,
> +    .bdrv_get_info = raw_get_info,
>     .bdrv_get_allocated_file_size
>                         = raw_get_allocated_file_size,
> 
> @@ -1715,7 +1777,7 @@ static BlockDriver bdrv_host_floppy = {
>     .bdrv_aio_flush	= raw_aio_flush,
> 
>     .bdrv_truncate      = raw_truncate,
> -    .bdrv_getlength      = raw_getlength,
> +    .bdrv_getlength     = raw_getlength,
>     .has_variable_length = true,
>     .bdrv_get_allocated_file_size
>                         = raw_get_allocated_file_size,
> diff --git a/trace-events b/trace-events
> index 96b3974..995c84a 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -128,6 +128,7 @@ thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
> 
> # block/raw-win32.c
> # block/raw-posix.c
> +paio_submit_co(int64_t sector_num, int nb_sectors, int type) "sector_num %"PRId64" nb_sectors %d type %d"
> paio_submit(void *acb, void *opaque, int64_t sector_num, int nb_sectors, int type) "acb %p opaque %p sector_num %"PRId64" nb_sectors %d type %d"
> 
> # ioport.c
> -- 
> 1.8.4.2
> 
> 


[-- Attachment #2: Type: text/html, Size: 9170 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 09/11] raw-posix: implement write_zeroes with MAY_UNMAP for block devices
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 09/11] raw-posix: implement write_zeroes with MAY_UNMAP for block devices Paolo Bonzini
@ 2013-11-13  6:29   ` Peter Lieven
  2013-11-13  9:44     ` Paolo Bonzini
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Lieven @ 2013-11-13  6:29 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, qemu-devel, stefanha

Wouldn't it be good to add bdi->can_write_zeroes_with_unmap here as well?
This would automatically avoid full allocation when converting something to a host device
supporting BLKDISCARDZEROES.

Peter

Am 12.11.2013 um 16:49 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> See the next commit for the description of the Linux kernel problem
> that is worked around in raw_open_common.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> block/raw-posix.c | 38 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 38 insertions(+)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 830e109..5cb46f1 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -335,6 +335,23 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>     if (S_ISREG(st.st_mode)) {
>         s->discard_zeroes = true;
>     }
> +#ifdef BLKDISCARDZEROES
> +    if (S_ISBLK(st.st_mode)) {
> +        unsigned int arg;
> +        if (ioctl(s->fd, BLKDISCARDZEROES, &arg) == 0 && arg) {
> +            s->discard_zeroes = true;
> +        }
> +    }
> +#endif
> +#ifdef CONFIG_LINUX
> +	/* On Linux 3.10, BLKDISCARD leaves stale data in the page cache.  Do
> +	 * not rely on the contents of discarded blocks unless using O_DIRECT.
> +	 */
> +        if (!(bs->open_flags & BDRV_O_NOCACHE)) {
> +            s->discard_zeroes = false;
> +        }
> +#endif
> +    }
> 
> #ifdef CONFIG_XFS
>     if (platform_test_xfs_fd(s->fd)) {
> @@ -1587,6 +1604,26 @@ static coroutine_fn BlockDriverAIOCB *hdev_aio_discard(BlockDriverState *bs,
>                        cb, opaque, QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
> }
> 
> +static coroutine_fn int hdev_co_write_zeroes(BlockDriverState *bs,
> +    int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
> +{
> +    BDRVRawState *s = bs->opaque;
> +    int rc;
> +
> +    rc = fd_open(bs);
> +    if (rc < 0) {
> +        return rc;
> +    }
> +    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> +        return -ENOTSUP;
> +    }
> +    if (!s->discard_zeroes) {
> +        return -ENOTSUP;
> +    }
> +    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
> +                          QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
> +}
> +
> static int hdev_create(const char *filename, QEMUOptionParameter *options,
>                        Error **errp)
> {
> @@ -1639,6 +1676,7 @@ static BlockDriver bdrv_host_device = {
>     .bdrv_reopen_abort   = raw_reopen_abort,
>     .bdrv_create        = hdev_create,
>     .create_options     = raw_create_options,
> +    .bdrv_co_write_zeroes = hdev_co_write_zeroes,
> 
>     .bdrv_aio_readv	= raw_aio_readv,
>     .bdrv_aio_writev	= raw_aio_writev,
> -- 
> 1.8.4.2
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 08/11] raw-posix: implement write_zeroes with MAY_UNMAP for files
  2013-11-13  6:27   ` Peter Lieven
@ 2013-11-13  6:30     ` Peter Lieven
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Lieven @ 2013-11-13  6:30 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, qemu-devel, stefanha

[-- Attachment #1: Type: text/plain, Size: 6053 bytes --]


Am 13.11.2013 um 07:27 schrieb Peter Lieven <pl@kamp.de>:

> 
> Am 12.11.2013 um 16:49 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
>> Writing zeroes to a file can be done by punching a hole if MAY_UNMAP
>> is set.
>> 
>> Note that in this case handle_aiocb_discard's ENOTSUP return code
>> is not ignored, but makes the block layer fall back to the generic
>> implementation.
>> 
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>> block/raw-posix.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> trace-events      |  1 +
>> 2 files changed, 64 insertions(+), 1 deletion(-)
>> 
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 27fe47d..830e109 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -142,6 +142,7 @@ typedef struct BDRVRawState {
>>     bool is_xfs : 1;
>> #endif
>>     bool has_discard : 1;
>> +    bool discard_zeroes : 1;
>> } BDRVRawState;
>> 
>> typedef struct BDRVRawReopenState {
>> @@ -283,6 +284,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>     Error *local_err = NULL;
>>     const char *filename;
>>     int fd, ret;
>> +    struct stat st;
>> 
>>     opts = qemu_opts_create_nofail(&raw_runtime_opts);
>>     qemu_opts_absorb_qdict(opts, options, &local_err);
>> @@ -325,6 +327,15 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>> #endif
>> 
>>     s->has_discard = true;
>> +
>> +    if (fstat(s->fd, &st) < 0) {
>> +        error_setg_errno(errp, errno, "Could not stat file");
>> +        goto fail;
>> +    }
>> +    if (S_ISREG(st.st_mode)) {
>> +        s->discard_zeroes = true;
>> +    }
>> +
>> #ifdef CONFIG_XFS
>>     if (platform_test_xfs_fd(s->fd)) {
>>         s->is_xfs = true;
>> @@ -788,6 +799,29 @@ static int aio_worker(void *arg)
>>     return ret;
>> }
>> 
>> +static int paio_submit_co(BlockDriverState *bs, int fd,
>> +        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
>> +        int type)
>> +{
>> +    RawPosixAIOData *acb = g_slice_new(RawPosixAIOData);
>> +    ThreadPool *pool;
>> +
>> +    acb->bs = bs;
>> +    acb->aio_type = type;
>> +    acb->aio_fildes = fd;
>> +
>> +    if (qiov) {
>> +        acb->aio_iov = qiov->iov;
>> +        acb->aio_niov = qiov->niov;
>> +    }
>> +    acb->aio_nbytes = nb_sectors * 512;
>> +    acb->aio_offset = sector_num * 512;
>> +
>> +    trace_paio_submit_co(sector_num, nb_sectors, type);
>> +    pool = aio_get_thread_pool(bdrv_get_aio_context(bs));
>> +    return thread_pool_submit_co(pool, aio_worker, acb);
>> +}
>> +
>> static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
>>         int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
>>         BlockDriverCompletionFunc *cb, void *opaque, int type)
>> @@ -1200,6 +1234,31 @@ static coroutine_fn BlockDriverAIOCB *raw_aio_discard(BlockDriverState *bs,
>>                        cb, opaque, QEMU_AIO_DISCARD);
>> }
>> 
>> +static int coroutine_fn raw_co_write_zeroes(
>> +    BlockDriverState *bs, int64_t sector_num,
>> +    int nb_sectors, BdrvRequestFlags flags)
>> +{
>> +    BDRVRawState *s = bs->opaque;
>> +
>> +    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
>> +        return -ENOTSUP;
>> +    }
>> +    if (!s->discard_zeroes) {
>> +        return -ENOTSUP;
>> +    }
>> +    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
>> +                          QEMU_AIO_DISCARD);
>> +}
>> +
>> +static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>> +{
>> +    BDRVRawState *s = bs->opaque;
>> +
>> +    bdi->unallocated_blocks_are_zero = s->discard_zeroes;
>> +    bdi->can_write_zeroes_with_unmap = s->discard_zeroes;
> 
> does BLKDISCARDZEROES ioctl guarantee that a device is
> zero initialized or does it just guarantee that a discard may not
> fail and that it reads as zeroes afterwards?

Please ignore this. We are talking about a file here.

Peter

> 
> Peter
> 
>> +    return 0;
>> +}
>> +
>> static QEMUOptionParameter raw_create_options[] = {
>>     {
>>         .name = BLOCK_OPT_SIZE,
>> @@ -1223,6 +1282,7 @@ static BlockDriver bdrv_file = {
>>     .bdrv_create = raw_create,
>>     .bdrv_has_zero_init = bdrv_has_zero_init_1,
>>     .bdrv_co_get_block_status = raw_co_get_block_status,
>> +    .bdrv_co_write_zeroes = raw_co_write_zeroes,
>> 
>>     .bdrv_aio_readv = raw_aio_readv,
>>     .bdrv_aio_writev = raw_aio_writev,
>> @@ -1231,6 +1291,7 @@ static BlockDriver bdrv_file = {
>> 
>>     .bdrv_truncate = raw_truncate,
>>     .bdrv_getlength = raw_getlength,
>> +    .bdrv_get_info = raw_get_info,
>>     .bdrv_get_allocated_file_size
>>                         = raw_get_allocated_file_size,
>> 
>> @@ -1586,6 +1647,7 @@ static BlockDriver bdrv_host_device = {
>> 
>>     .bdrv_truncate      = raw_truncate,
>>     .bdrv_getlength	= raw_getlength,
>> +    .bdrv_get_info = raw_get_info,
>>     .bdrv_get_allocated_file_size
>>                         = raw_get_allocated_file_size,
>> 
>> @@ -1715,7 +1777,7 @@ static BlockDriver bdrv_host_floppy = {
>>     .bdrv_aio_flush	= raw_aio_flush,
>> 
>>     .bdrv_truncate      = raw_truncate,
>> -    .bdrv_getlength      = raw_getlength,
>> +    .bdrv_getlength     = raw_getlength,
>>     .has_variable_length = true,
>>     .bdrv_get_allocated_file_size
>>                         = raw_get_allocated_file_size,
>> diff --git a/trace-events b/trace-events
>> index 96b3974..995c84a 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -128,6 +128,7 @@ thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
>> 
>> # block/raw-win32.c
>> # block/raw-posix.c
>> +paio_submit_co(int64_t sector_num, int nb_sectors, int type) "sector_num %"PRId64" nb_sectors %d type %d"
>> paio_submit(void *acb, void *opaque, int64_t sector_num, int nb_sectors, int type) "acb %p opaque %p sector_num %"PRId64" nb_sectors %d type %d"
>> 
>> # ioport.c
>> -- 
>> 1.8.4.2
>> 
>> 
> 


[-- Attachment #2: Type: text/html, Size: 9817 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 06/11] scsi-disk: correctly implement WRITE SAME
  2013-11-13  6:18   ` Peter Lieven
@ 2013-11-13  9:38     ` Paolo Bonzini
  0 siblings, 0 replies; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-13  9:38 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, qemu-devel, stefanha

Il 13/11/2013 07:18, Peter Lieven ha scritto:
>> > The WRITE SAME command is implemented incorrectly.  WRITE SAME with the
>> > UNMAP bit set should _not_ unmap the sectors unless the written data
>> > matches the payload of the WRITE SAME command; currently, QEMU is not
>> > looking at the payload at all.
>> > 
>> > Thus, fetch the data to be written from the input buffer.  If it is
>> > all zeroes, we can use the write_zeroes call (possibly with the new
>> > MAY_UNMAP flag).  Otherwise, do as many write cycles as needed, covering
>> > 512k at a time to avoid allocating lots of memory for the bounce
>> > buffer.
> 
> Would it make sense to add a bdrv_write_same or is the use case for
> WRITE SAME with non-zero payload too rare?

It would, but it is definitely very rare, probably so much that we need
not care.  Linux only invokes it for zero payloads.

Also, for zero payload there are additional benefits.  First, supporting
WRITE SAME with UNMAP if the host has LBPRZ=1 or analogous.  Second,
using zero clusters in qcow2/qed/vmdk.

> And secondly would it make sense to add an optimal request size field
> to the BlockLimits?

The optimal request size is not particularly useful if it is not visible
to the guest, unfortunately.  But we cannot pass values arbitrarily to
the guest because they would change if the backing storage changed (e.g.
from NFS to local, or from raw to qcow2).

So I'm not sure who would actually use the optimal request size.

Paolo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 09/11] raw-posix: implement write_zeroes with MAY_UNMAP for block devices
  2013-11-13  6:29   ` Peter Lieven
@ 2013-11-13  9:44     ` Paolo Bonzini
  2013-11-13 14:14       ` Peter Lieven
  0 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-13  9:44 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, qemu-devel, stefanha

Il 13/11/2013 07:29, Peter Lieven ha scritto:
> Wouldn't it be good to add bdi->can_write_zeroes_with_unmap here as well?

We do:

> +    bdi->unallocated_blocks_are_zero = s->discard_zeroes;
> +    bdi->can_write_zeroes_with_unmap = s->discard_zeroes;

> This would automatically avoid full allocation when converting something to a host device
> supporting BLKDISCARDZEROES.

Yes, that's (part of) the point of this patch.

Regarding the question you posed in the previous patch:

> does BLKDISCARDZEROES ioctl guarantee that a device is
> zero initialized or does it just guarantee that a discard may not
> fail and that it reads as zeroes afterwards?

Only the latter.  ".bdrv_has_zero_init" is only present in the bdrv_file
BlockDriver.

Paolo

> Peter
> 
> Am 12.11.2013 um 16:49 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
>> See the next commit for the description of the Linux kernel problem
>> that is worked around in raw_open_common.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>> block/raw-posix.c | 38 ++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 38 insertions(+)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 830e109..5cb46f1 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -335,6 +335,23 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>     if (S_ISREG(st.st_mode)) {
>>         s->discard_zeroes = true;
>>     }
>> +#ifdef BLKDISCARDZEROES
>> +    if (S_ISBLK(st.st_mode)) {
>> +        unsigned int arg;
>> +        if (ioctl(s->fd, BLKDISCARDZEROES, &arg) == 0 && arg) {
>> +            s->discard_zeroes = true;
>> +        }
>> +    }
>> +#endif
>> +#ifdef CONFIG_LINUX
>> +	/* On Linux 3.10, BLKDISCARD leaves stale data in the page cache.  Do
>> +	 * not rely on the contents of discarded blocks unless using O_DIRECT.
>> +	 */
>> +        if (!(bs->open_flags & BDRV_O_NOCACHE)) {
>> +            s->discard_zeroes = false;
>> +        }
>> +#endif
>> +    }
>>
>> #ifdef CONFIG_XFS
>>     if (platform_test_xfs_fd(s->fd)) {
>> @@ -1587,6 +1604,26 @@ static coroutine_fn BlockDriverAIOCB *hdev_aio_discard(BlockDriverState *bs,
>>                        cb, opaque, QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
>> }
>>
>> +static coroutine_fn int hdev_co_write_zeroes(BlockDriverState *bs,
>> +    int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>> +{
>> +    BDRVRawState *s = bs->opaque;
>> +    int rc;
>> +
>> +    rc = fd_open(bs);
>> +    if (rc < 0) {
>> +        return rc;
>> +    }
>> +    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
>> +        return -ENOTSUP;
>> +    }
>> +    if (!s->discard_zeroes) {
>> +        return -ENOTSUP;
>> +    }
>> +    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
>> +                          QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
>> +}
>> +
>> static int hdev_create(const char *filename, QEMUOptionParameter *options,
>>                        Error **errp)
>> {
>> @@ -1639,6 +1676,7 @@ static BlockDriver bdrv_host_device = {
>>     .bdrv_reopen_abort   = raw_reopen_abort,
>>     .bdrv_create        = hdev_create,
>>     .create_options     = raw_create_options,
>> +    .bdrv_co_write_zeroes = hdev_co_write_zeroes,
>>
>>     .bdrv_aio_readv	= raw_aio_readv,
>>     .bdrv_aio_writev	= raw_aio_writev,
>> -- 
>> 1.8.4.2
>>
>>
> 
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 09/11] raw-posix: implement write_zeroes with MAY_UNMAP for block devices
  2013-11-13  9:44     ` Paolo Bonzini
@ 2013-11-13 14:14       ` Peter Lieven
  2013-11-13 14:39         ` Paolo Bonzini
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Lieven @ 2013-11-13 14:14 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, qemu-devel, stefanha


Am 13.11.2013 um 10:44 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 13/11/2013 07:29, Peter Lieven ha scritto:
>> Wouldn't it be good to add bdi->can_write_zeroes_with_unmap here as well?
> 
> We do:
> 
>> +    bdi->unallocated_blocks_are_zero = s->discard_zeroes;
>> +    bdi->can_write_zeroes_with_unmap = s->discard_zeroes;

Sorry, I missed this.

> 
>> This would automatically avoid full allocation when converting something to a host device
>> supporting BLKDISCARDZEROES.
> 
> Yes, that's (part of) the point of this patch.
> 
> Regarding the question you posed in the previous patch:
> 
>> does BLKDISCARDZEROES ioctl guarantee that a device is
>> zero initialized or does it just guarantee that a discard may not
>> fail and that it reads as zeroes afterwards?
> 
> Only the latter.  ".bdrv_has_zero_init" is only present in the bdrv_file
> BlockDriver.

Then bdi->unallocated_blocks_are_zero must stay 0. .bdrv_has_zero_init's
semantic is to reflect the zero status of all blocks of the device right after bdrv_create
independently of their allocation status. bdi->unallocated_blocks_are_zero
reflects the zero status of every unallocated block regardless if it was
unallocated right from the beginning or became unallocated through a discard.

Peter

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 09/11] raw-posix: implement write_zeroes with MAY_UNMAP for block devices
  2013-11-13 14:14       ` Peter Lieven
@ 2013-11-13 14:39         ` Paolo Bonzini
  2013-11-13 14:45           ` Peter Lieven
  0 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2013-11-13 14:39 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, qemu-devel, stefanha

Il 13/11/2013 15:14, Peter Lieven ha scritto:
>>> >> does BLKDISCARDZEROES ioctl guarantee that a device is
>>> >> zero initialized or does it just guarantee that a discard may not
>>> >> fail and that it reads as zeroes afterwards?
>> > 
>> > Only the latter.  ".bdrv_has_zero_init" is only present in the bdrv_file
>> > BlockDriver.
> Then bdi->unallocated_blocks_are_zero must stay 0. .bdrv_has_zero_init's
> semantic is to reflect the zero status of all blocks of the device right after bdrv_create
> independently of their allocation status. bdi->unallocated_blocks_are_zero
> reflects the zero status of every unallocated block regardless if it was
> unallocated right from the beginning or became unallocated through a discard.

What we have is:

* bdi->unallocated_blocks_are_zero returns true

* bdrv_create doesn't ensure that every block starts unallocated

* hence bdrv_has_zero_init returns false

Blocks that (for any reason) are unallocated after bdrv_create *will* be
zero if BLKDISCARDZEROES returns true.

Paolo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 09/11] raw-posix: implement write_zeroes with MAY_UNMAP for block devices
  2013-11-13 14:39         ` Paolo Bonzini
@ 2013-11-13 14:45           ` Peter Lieven
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Lieven @ 2013-11-13 14:45 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, qemu-devel, stefanha



> Am 13.11.2013 um 15:39 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
> Il 13/11/2013 15:14, Peter Lieven ha scritto:
>>>>>> does BLKDISCARDZEROES ioctl guarantee that a device is
>>>>>> zero initialized or does it just guarantee that a discard may not
>>>>>> fail and that it reads as zeroes afterwards?
>>>> 
>>>> Only the latter.  ".bdrv_has_zero_init" is only present in the bdrv_file
>>>> BlockDriver.
>> Then bdi->unallocated_blocks_are_zero must stay 0. .bdrv_has_zero_init's
>> semantic is to reflect the zero status of all blocks of the device right after bdrv_create
>> independently of their allocation status. bdi->unallocated_blocks_are_zero
>> reflects the zero status of every unallocated block regardless if it was
>> unallocated right from the beginning or became unallocated through a discard.
> 
> What we have is:
> 
> * bdi->unallocated_blocks_are_zero returns true
> 
> * bdrv_create doesn't ensure that every block starts unallocated
> 
> * hence bdrv_has_zero_init returns false
> 
> Blocks that (for any reason) are unallocated after bdrv_create *will* be
> zero if BLKDISCARDZEROES returns true.

Ok, then i misunderstood your comment. If BLKDISCARDZEROES means lbprz == 1 SCSI speaking then you are right.

Peter

> 
> Paolo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH 05/11] scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands
  2013-11-12 15:49 ` [Qemu-devel] [PATCH 05/11] scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands Paolo Bonzini
@ 2013-11-13 19:03   ` Eric Blake
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Blake @ 2013-11-13 19:03 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, pl, stefanha

[-- Attachment #1: Type: text/plain, Size: 694 bytes --]

On 11/12/2013 08:49 AM, Paolo Bonzini wrote:
> Since we report ANC_SUP==0 in VPD page B2h, we need to return
> an error (ILLEGAL REQUEST/INVALID FIELD IN CDB) for all WRITE SAME
> requests with ANCHOR==1.
> 
> Inspired by a similar patch to the LIO in-kernel target.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

> @@ -1856,8 +1865,9 @@ static int32_t scsi_disk_emulate_command(SCSIRequest *req, uint8_t *buf)
>  
>          /*
>           * We only support WRITE SAME with the unmap bit set for now.
> +	 * Reject UNMAP=0 or ANCHOR=1.

TAB damage.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2013-11-13 19:03 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-12 15:49 [Qemu-devel] [PATCH 00/11] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
2013-11-12 15:49 ` [Qemu-devel] [PATCH 01/11] block: generalize BlockLimits handling to cover bdrv_aio_discard too Paolo Bonzini
2013-11-13  6:07   ` Peter Lieven
2013-11-12 15:49 ` [Qemu-devel] [PATCH 02/11] block: add flags to BlockRequest Paolo Bonzini
2013-11-12 15:49 ` [Qemu-devel] [PATCH 03/11] block: add bdrv_aio_write_zeroes Paolo Bonzini
2013-11-12 15:49 ` [Qemu-devel] [PATCH 04/11] scsi-disk: catch write protection errors in UNMAP Paolo Bonzini
2013-11-12 15:49 ` [Qemu-devel] [PATCH 05/11] scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands Paolo Bonzini
2013-11-13 19:03   ` Eric Blake
2013-11-12 15:49 ` [Qemu-devel] [PATCH 06/11] scsi-disk: correctly implement WRITE SAME Paolo Bonzini
2013-11-13  6:18   ` Peter Lieven
2013-11-13  9:38     ` Paolo Bonzini
2013-11-12 15:49 ` [Qemu-devel] [PATCH 07/11] block: handle ENOTSUP from discard in generic code Paolo Bonzini
2013-11-12 15:49 ` [Qemu-devel] [PATCH 08/11] raw-posix: implement write_zeroes with MAY_UNMAP for files Paolo Bonzini
2013-11-13  6:27   ` Peter Lieven
2013-11-13  6:30     ` Peter Lieven
2013-11-12 15:49 ` [Qemu-devel] [PATCH 09/11] raw-posix: implement write_zeroes with MAY_UNMAP for block devices Paolo Bonzini
2013-11-13  6:29   ` Peter Lieven
2013-11-13  9:44     ` Paolo Bonzini
2013-11-13 14:14       ` Peter Lieven
2013-11-13 14:39         ` Paolo Bonzini
2013-11-13 14:45           ` Peter Lieven
2013-11-12 15:49 ` [Qemu-devel] [PATCH 10/11] raw-posix: add support for write_zeroes on XFS and " Paolo Bonzini
2013-11-12 15:49 ` [Qemu-devel] [PATCH 11/11] qemu-iotests: 033 is fast Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.